Tuesday

[Bug 2137613] [NEW] TBT call trace while connecting TBT4 monitor on TBT5 port

Public bug reported:

[Impact]

When connecting Thunderbolt devices (especially monitors like Dell
U2725QE), users see alarming kernel backtraces in dmesg during device
enumeration. While the devices eventually work after automatic
reconnection, the call traces cause user concern and can trigger
automated bug reporting tools.

Error log example:
```
[ 36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
[ 36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
[ 36.031605] CPU: 12 UID: 0 PID: 3145 Comm: pool-/usr/libex Tainted: G D W 6.18.0+ #8
[ 36.031610] RIP: 0010:tb_path_activate+0x126/0x530 [thunderbolt]
[ 36.031637] Call Trace:
[ 36.031638] <TASK>
...
```

The issue occurs when:
- Type-C connections have transient electrical issues
- Lane bonding transitions are in progress (single lane to dual lane)
- The Thunderbolt port's control channel is temporarily unavailable

The devices typically recover automatically within a few seconds and
work normally, but the kernel backtrace (tb_WARN) is generated
unnecessarily for these expected transient conditions.

Affected hardware:
- Dell U2725QE Thunderbolt monitor (USB4 device 8087:b26)
- Other Thunderbolt/USB4 devices experiencing similar transient connection issues
- AMD and Intel Thunderbolt controllers

[Fix]

Modify tb_path_activate() in drivers/thunderbolt/path.c to differentiate
between expected transient failures and actual errors:

- For -ENOTCONN errors: Use tb_warn() to log the error without generating a kernel backtrace
- For all other errors: Keep tb_WARN() to generate the full call trace for debugging

This approach aligns with the existing comment in
drivers/thunderbolt/ctl.c which states that
TB_CFG_ERROR_PORT_NOT_CONNECTED "can happen during surprise removal" and
we should "not warn" about it.

The fix does not suppress the warning message itself - users and
developers can still see the path activation failure in dmesg. It only
removes the unnecessary kernel backtrace (stack dump, register dump,
etc.) for this specific expected transient condition.

Patch:
https://lore.kernel.org/lkml/20260102031905.27416-1-acelan.kao@canonical.com/T/#u
("thunderbolt: Suppress call trace for transient -ENOTCONN errors during
path activation")


[Test Plan]

Hardware needed:
- Dell U2725QE Thunderbolt monitor or similar Thunderbolt device that exhibits transient connection issues
- System with Thunderbolt 3/4 or USB4 controller

Test steps:

1. Without the patch:
```bash
# Clear dmesg
sudo dmesg -C

# Connect Dell U2725QE or similar Thunderbolt device
# Wait 10 seconds

# Check for call traces
dmesg | grep -A 30 "path activation failed"
```

Expected: You should see a full kernel backtrace with WARNING, RIP,
Call Trace, etc.

2. With the patch:
```bash
# Clear dmesg
sudo dmesg -C

# Connect Dell U2725QE or similar Thunderbolt device
# Wait 10 seconds

# Check for warnings
dmesg | grep "path activation failed"
```

Expected: You should see a simple warning message without the backtrace:
```
thunderbolt 0000:c7:00.6: PCIe Down path activation failed (port not connected)
```

3. Verify device functionality:
```bash
# Check that Thunderbolt device is detected and working
lsusb
lspci

# For monitors, check display output works
xrandr
```

Expected: The device should be detected and functional after the
transient error.
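
As a complement to lsusb and lspci, the kernel also exposes Thunderbolt/USB4
devices through sysfs. The following is a minimal sketch, not part of the
original test plan: the function name list_tbt_devices is hypothetical, and it
assumes the standard /sys/bus/thunderbolt/devices layout with the vendor_name
and device_name attributes. The directory is a parameter so the helper can be
dry-run against a fake tree.

```bash
# Hypothetical helper: print "<sysfs id>: <vendor> <device>" for each
# Thunderbolt/USB4 device found under the given sysfs directory.
# $1: devices directory (defaults to the real sysfs path).
list_tbt_devices() {
    dir=${1:-/sys/bus/thunderbolt/devices}
    for dev in "$dir"/*; do
        # Entries without a device_name attribute (for example domain0)
        # are skipped; an empty directory is skipped the same way.
        [ -r "$dev/device_name" ] || continue
        printf '%s: %s %s\n' "$(basename "$dev")" \
            "$(cat "$dev/vendor_name" 2>/dev/null)" \
            "$(cat "$dev/device_name")"
    done
}
```

Calling list_tbt_devices with no argument after plugging in the monitor should
list it once enumeration has completed; an empty result suggests the device
did not recover.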

4. Test multiple hot-plug cycles (10 times):
- Unplug and replug the Thunderbolt device
- Verify each time that only a simple warning appears (not a full backtrace)
- Verify device works correctly after each reconnection
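
Checking each cycle by eye is error-prone, so the per-cycle check can be
scripted. The following is a minimal sketch, not part of the original test
plan: the function name classify_tbt_log and its three output labels are
hypothetical. It reads a dmesg capture on stdin and reuses the same
30-line window as the grep in step 1.

```bash
# Hypothetical helper: classify a dmesg capture read from stdin.
# Prints one of:
#   no-failure    no "path activation failed" message at all
#   backtrace     the failure is followed by a "Call Trace:" within 30 lines
#   warning-only  the failure was logged without a backtrace
classify_tbt_log() {
    log=$(cat)
    if ! printf '%s\n' "$log" | grep -q "path activation failed"; then
        echo "no-failure"
    elif printf '%s\n' "$log" | grep -A 30 "path activation failed" \
            | grep -q "Call Trace:"; then
        echo "backtrace"
    else
        echo "warning-only"
    fi
}
```

A possible per-cycle use is to run `sudo dmesg -C`, replug the device, wait
10 seconds, and then run `sudo dmesg | classify_tbt_log`. With the patch
applied, the result should be warning-only or no-failure, never backtrace.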

5. Verify genuine errors still produce backtraces:
- Test conditions that should produce other error codes (not -ENOTCONN)
- Verify those still generate tb_WARN backtraces for debugging

[Where problems could occur]

The patch modifies error reporting in the Thunderbolt path activation
code, which could affect debugging and error handling:

1. **Thunderbolt subsystem**: If there are genuine bugs that manifest as
-ENOTCONN errors (not just transient issues), developers might miss
important debugging information because the full backtrace won't be
generated. This would make it harder to diagnose actual Thunderbolt
controller bugs or firmware issues.

** Affects: linux (Ubuntu)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux-oem-6.14 (Ubuntu)
Importance: Undecided
Status: Invalid

** Affects: linux-oem-6.17 (Ubuntu)
Importance: Undecided
Status: Invalid

** Affects: linux (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux-oem-6.14 (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux (Ubuntu Questing)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux-oem-6.14 (Ubuntu Questing)
Importance: Undecided
Status: Invalid

** Affects: linux-oem-6.17 (Ubuntu Questing)
Importance: Undecided
Status: Invalid

** Affects: linux (Ubuntu Resolute)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress

** Affects: linux-oem-6.14 (Ubuntu Resolute)
Importance: Undecided
Status: Invalid

** Affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: Invalid

** Also affects: linux-oem-6.17 (Ubuntu)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.14 (Ubuntu Questing)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.17 (Ubuntu Questing)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.14 (Ubuntu Noble)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.14 (Ubuntu Resolute)
Importance: Undecided
Status: New

** Also affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: New

** Changed in: linux-oem-6.14 (Ubuntu Noble)
Status: New => In Progress

** Changed in: linux-oem-6.14 (Ubuntu Questing)
Status: New => In Progress

** Changed in: linux-oem-6.14 (Ubuntu Resolute)
Status: New => In Progress

** Changed in: linux-oem-6.14 (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)

** Changed in: linux-oem-6.14 (Ubuntu Questing)
Status: In Progress => Invalid

** Changed in: linux-oem-6.14 (Ubuntu Resolute)
Status: In Progress => Invalid

** Changed in: linux-oem-6.17 (Ubuntu Noble)
Status: New => In Progress

** Changed in: linux-oem-6.17 (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)

** Changed in: linux-oem-6.17 (Ubuntu Questing)
Status: New => Invalid

** Changed in: linux-oem-6.17 (Ubuntu Resolute)
Status: New => Invalid

** Also affects: linux (Ubuntu)
Importance: Undecided
Status: New

** Changed in: linux (Ubuntu Noble)
Status: New => In Progress

** Changed in: linux (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)

** Changed in: linux (Ubuntu Questing)
Status: New => In Progress

** Changed in: linux (Ubuntu Questing)
Assignee: (unassigned) => AceLan Kao (acelankao)

** Changed in: linux (Ubuntu Resolute)
Status: New => In Progress

** Changed in: linux (Ubuntu Resolute)
Assignee: (unassigned) => AceLan Kao (acelankao)

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2137613

Title:
TBT call trace while connecting TBT4 monitor on TBT5 port

Status in linux package in Ubuntu:
In Progress
Status in linux-oem-6.14 package in Ubuntu:
Invalid
Status in linux-oem-6.17 package in Ubuntu:
Invalid
Status in linux source package in Noble:
In Progress
Status in linux-oem-6.14 source package in Noble:
In Progress
Status in linux-oem-6.17 source package in Noble:
In Progress
Status in linux source package in Questing:
In Progress
Status in linux-oem-6.14 source package in Questing:
Invalid
Status in linux-oem-6.17 source package in Questing:
Invalid
Status in linux source package in Resolute:
In Progress
Status in linux-oem-6.14 source package in Resolute:
Invalid
Status in linux-oem-6.17 source package in Resolute:
Invalid

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2137613/+subscriptions
