** Description changed:
[Impact]
On Dell systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and under high load because of a deadlock in the mt76 driver. The kernel log shows workqueue "hogged CPU" warnings, followed by a system hang, which prevents certification testing from completing.
The issue occurs because:
1. Two workqueue functions (ps_work and mac_work) attempt to cancel each other using cancel_delayed_work_sync()
2. Under high load, both work items are queued but cannot run until a CPU becomes available
3. When CPUs become available, both work functions may run simultaneously, each trying to synchronously cancel the other, resulting in a deadlock
The call path that creates the circular dependency is:
mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
[Fix]
Replace cancel_delayed_work_sync() with cancel_delayed_work() in the mt792x_pm_power_save_work() function to eliminate the deadlock condition.
Upstream patch (submitted to linux-wireless):
https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
The non-synchronous cancel is safe here because:
- The work cancellation is part of the power-save flow, not a critical cleanup path
- Avoiding synchronous wait prevents the circular dependency that causes the deadlock
- The code becomes simpler and easier to maintain
[Test Plan]
On a Dell system with MediaTek MT7925 WiFi (or similar affected platform):
- 1. Run the Checkbox certification test suite that triggers the issue:
- $ checkbox-cli run com.canonical.certification::client-cert-desktop-24-04-automated
+ 1. Install fwts if not already available:
+ $ sudo apt-get install fwts
- 2. Specifically execute the firmware test cases that previously triggered the deadlock:
- - firmware/fwts_desktop_diagnosis
- - firmware/fwts_wakealarm.*
- - firmware/fwts_uefirtvariable.*
- - miscellanea/oops
+ 2. Monitor system logs in a separate terminal:
+ $ sudo dmesg -w
- 3. Monitor system logs for the previously observed symptoms:
- $ sudo dmesg -w
+ 3. Run the firmware test cases that previously triggered the deadlock:
- Without the fix, you would see:
- - "Message 00020080 (seq N) timeout" from mt7925e
- - "workqueue: vmstat_update hogged CPU for >10000us" warnings
- - "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
- - WARNING traces in iommu_dma_unmap_page
- - System becoming unresponsive
+ $ sudo fwts wakealarm
+ $ sudo fwts uefirtvariable
+ $ sudo fwts oops
- With the fix, these symptoms should not occur and the system should
+ Or run a comprehensive diagnostic test:
+ $ sudo fwts --log-level=high -r stdout
+
+ 4. Check for symptoms during and after the tests:
+
+ Without the fix, you would see:
+ - "Message 00020080 (seq N) timeout" from mt7925e
+ - "workqueue: vmstat_update hogged CPU for >10000us" warnings
+ - "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
+ - WARNING traces in iommu_dma_unmap_page
+ - System becoming unresponsive
+
+ With the fix, these symptoms should not occur and the system should
remain responsive.
- 4. Run extended stress testing with WiFi activity during high CPU load:
- $ stress-ng --cpu 128 --timeout 300s &
- $ ping -f <router_ip> # flood ping to generate WiFi traffic
+ 5. Run extended stress testing with WiFi activity during high CPU load:
+ $ stress-ng --cpu 128 --timeout 300s &
+ $ ping -f <router_ip> # flood ping to generate WiFi traffic
- The system should remain stable without deadlocks.
-
- 5. Verify WiFi power management still functions correctly:
- $ iw dev <interface> get power_save
- $ # Enable/disable power save and verify WiFi connectivity remains stable
- $ sudo iw dev <interface> set power_save on
- $ ping -c 100 <router_ip>
+ The system should remain stable without deadlocks.
[Where problems could occur]
This change affects the MediaTek MT792x WiFi driver's power management and workqueue interaction on systems with mt7925e and similar chipsets.
Potential issues if the non-synchronous cancel is not safe in this context:
- If there are assumptions in the code that mac_work must be fully stopped before proceeding, using non-synchronous cancel might allow mac_work to run concurrently with subsequent operations, potentially causing race conditions
- The mac_work might access hardware or data structures that ps_work assumes are quiescent after the cancel call, leading to unexpected behavior or crashes
- Power management state transitions might become inconsistent if mac_work completes after ps_work has already proceeded with its power-save operations
However, these risks are mitigated by:
- The change is intentional and authored by MediaTek engineers who maintain the driver
- The alternative (synchronous cancel) creates a known deadlock with a 60% reproduction rate
- The workqueue subsystem guarantees that a given work item does not run concurrently with itself, which rules out some (though not all) race conditions
- Similar patterns are used elsewhere in the kernel where work items need to coordinate
The impact is limited to:
- Systems with MediaTek MT792x series WiFi chipsets (mt7921, mt7925, etc.)
- Primarily affects high-load scenarios where both work items are queued simultaneously
- Does not affect other wireless drivers or systems without these chipsets
[Other Info]
Upstream submission: https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2137448
Title:
System doesn't response with mt76 call trace
Status in linux package in Ubuntu:
In Progress
Status in linux-oem-6.14 package in Ubuntu:
Invalid
Status in linux-oem-6.17 package in Ubuntu:
Invalid
Status in linux source package in Noble:
In Progress
Status in linux-oem-6.14 source package in Noble:
In Progress
Status in linux-oem-6.17 source package in Noble:
In Progress
Status in linux source package in Questing:
In Progress
Status in linux-oem-6.14 source package in Questing:
Invalid
Status in linux-oem-6.17 source package in Questing:
Invalid
Status in linux source package in Resolute:
In Progress
Status in linux-oem-6.14 source package in Resolute:
Invalid
Status in linux-oem-6.17 source package in Resolute:
Invalid
Bug description:
[Impact]
On Dell systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and under high load because of a deadlock in the mt76 driver. The kernel log shows workqueue "hogged CPU" warnings, followed by a system hang, which prevents certification testing from completing.
The issue occurs because:
1. Two workqueue functions (ps_work and mac_work) attempt to cancel each other using cancel_delayed_work_sync()
2. Under high load, both work items are queued but cannot run until a CPU becomes available
3. When CPUs become available, both work functions may run simultaneously, each trying to synchronously cancel the other, resulting in a deadlock
The call path that creates the circular dependency is:
mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
[Fix]
Replace cancel_delayed_work_sync() with cancel_delayed_work() in the mt792x_pm_power_save_work() function to eliminate the deadlock condition.
Upstream patch (submitted to linux-wireless):
https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
The non-synchronous cancel is safe here because:
- The work cancellation is part of the power-save flow, not a critical cleanup path
- Avoiding synchronous wait prevents the circular dependency that causes the deadlock
- The code becomes simpler and easier to maintain
[Test Plan]
On a Dell system with MediaTek MT7925 WiFi (or similar affected platform):
1. Install fwts if not already available:
$ sudo apt-get install fwts
2. Monitor system logs in a separate terminal:
$ sudo dmesg -w
3. Run the firmware test cases that previously triggered the deadlock:
$ sudo fwts wakealarm
$ sudo fwts uefirtvariable
$ sudo fwts oops
Or run a comprehensive diagnostic test:
$ sudo fwts --log-level=high -r stdout
4. Check for symptoms during and after the tests:
Without the fix, you would see:
- "Message 00020080 (seq N) timeout" from mt7925e
- "workqueue: vmstat_update hogged CPU for >10000us" warnings
- "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
- WARNING traces in iommu_dma_unmap_page
- System becoming unresponsive
With the fix, these symptoms should not occur and the system should
remain responsive.
5. Run extended stress testing with WiFi activity during high CPU load:
$ stress-ng --cpu 128 --timeout 300s &
$ ping -f <router_ip> # flood ping to generate WiFi traffic
The system should remain stable without deadlocks.
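The symptom check in step 4 can also be scripted against a saved copy of the kernel log, which helps when the machine hangs before the live dmesg output can be read. The following is only a minimal sketch: the function name check_mt76_symptoms and the log file name are hypothetical, and the regular expressions are assumptions derived from the symptom strings listed in step 4, so they may need adjusting to the exact messages on a given kernel.

```shell
#!/bin/sh
# Hypothetical helper: scan a saved kernel log for the symptoms from step 4.
# Returns 0 if no symptom is found, 1 if any is found, 2 if the log is unreadable.
check_mt76_symptoms() {
    log_file=$1
    found=0

    if [ ! -r "$log_file" ]; then
        echo "cannot read log file: $log_file" >&2
        return 2
    fi

    # One extended regex per symptom (assumed patterns, based on the
    # message texts quoted in the test plan).
    for pattern in \
        'Message [0-9a-fA-F]+ \(seq [0-9]+\) timeout' \
        'workqueue: .* hogged CPU for >[0-9]+us' \
        'iommu_dma_unmap_page'
    do
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are none, so the status is deliberately ignored.
        count=$(grep -E -c -- "$pattern" "$log_file" || true)
        if [ "$count" -gt 0 ]; then
            printf 'FOUND (%s): %s\n' "$count" "$pattern"
            found=1
        fi
    done

    if [ "$found" -eq 0 ]; then
        echo "No known symptoms found"
    fi
    return "$found"
}
```

A possible way to use it after a test run: save the log with "sudo dmesg > dmesg.log" and then call "check_mt76_symptoms dmesg.log". A clean log does not prove the deadlock is gone; it only shows that none of the listed messages were recorded.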
[Where problems could occur]
This change affects the MediaTek MT792x WiFi driver's power management and workqueue interaction on systems with mt7925e and similar chipsets.
Potential issues if the non-synchronous cancel is not safe in this context:
- If there are assumptions in the code that mac_work must be fully stopped before proceeding, using non-synchronous cancel might allow mac_work to run concurrently with subsequent operations, potentially causing race conditions
- The mac_work might access hardware or data structures that ps_work assumes are quiescent after the cancel call, leading to unexpected behavior or crashes
- Power management state transitions might become inconsistent if mac_work completes after ps_work has already proceeded with its power-save operations
However, these risks are mitigated by:
- The change is intentional and authored by MediaTek engineers who maintain the driver
- The alternative (synchronous cancel) creates a known deadlock with a 60% reproduction rate
- The workqueue subsystem guarantees that a given work item does not run concurrently with itself, which rules out some (though not all) race conditions
- Similar patterns are used elsewhere in the kernel where work items need to coordinate
The impact is limited to:
- Systems with MediaTek MT792x series WiFi chipsets (mt7921, mt7925, etc.)
- Primarily affects high-load scenarios where both work items are queued simultaneously
- Does not affect other wireless drivers or systems without these chipsets
[Other Info]
Upstream submission: https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2137448/+subscriptions