Public bug reported:
[Impact]
On Dell systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and under high load because of a deadlock in the mt76 driver. The kernel log shows workqueue "hogged CPU" warnings, followed by a system hang that prevents certification testing from completing.
The issue occurs because:
1. Two work items (ps_work and mac_work) each cancel the other using cancel_delayed_work_sync(), which waits for a running instance of the target to finish
2. Under high load, both work items are queued but cannot execute until CPUs are available
3. When CPUs become available, both work functions may run simultaneously, each waiting for the other to finish, resulting in a deadlock
The call path that creates the circular dependency is:
mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
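The circular wait described above can be illustrated with a small user-space model. This is only a sketch in Python, not driver code: the two threads stand in for the two work items, run_pair() and its parameters are hypothetical names, and the timeout exists only so that the model can report the hang instead of blocking forever (the kernel has no such timeout).

```python
import threading


def run_pair(ps_cancel_is_sync, wait_limit=0.5):
    """Return True if the toy model detects the circular wait."""
    both_running = threading.Barrier(2)
    finished = {"mac_work": threading.Event(), "ps_work": threading.Event()}
    stuck = []

    def work(name, other, sync):
        # Model step 3 of the report: both work items execute at the same time.
        both_running.wait()
        if sync and not finished[other].wait(timeout=wait_limit):
            # Stand-in for cancel_delayed_work_sync(): wait for the other
            # running item to finish. The timeout is an artefact of the model.
            stuck.append(name)
        # With sync=False (stand-in for cancel_delayed_work()) execution
        # reaches this point without waiting for the other item.
        finished[name].set()

    threads = [
        threading.Thread(target=work, args=("mac_work", "ps_work", True)),
        threading.Thread(target=work, args=("ps_work", "mac_work", ps_cancel_is_sync)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bool(stuck)


if __name__ == "__main__":
    print("both synchronous:", run_pair(True))
    print("ps_work non-synchronous:", run_pair(False))
```

When both sides wait synchronously, neither can finish until the other does, so the model should report a stall. When only mac_work waits, ps_work finishes without blocking and the wait in mac_work is released.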
[Fix]
Replace cancel_delayed_work_sync() with cancel_delayed_work() in the mt792x_pm_power_save_work() function to eliminate the deadlock condition.
Upstream patch (submitted to linux-wireless):
https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
The non-synchronous cancel is considered acceptable here because:
- The work cancellation is part of the power-save flow, not a critical cleanup path
- Avoiding synchronous wait prevents the circular dependency that causes the deadlock
- The code becomes simpler and easier to maintain
[Test Plan]
On a Dell system with MediaTek MT7925 WiFi (or similar affected platform):
1. Run the Checkbox certification test suite that triggers the issue:
$ checkbox-cli run com.canonical.certification::client-cert-desktop-24-04-automated
2. Specifically execute the firmware test cases that previously triggered the deadlock:
- firmware/fwts_desktop_diagnosis
- firmware/fwts_wakealarm.*
- firmware/fwts_uefirtvariable.*
- miscellanea/oops
3. Monitor system logs for the previously observed symptoms:
$ sudo dmesg -w
Without the fix, you would see:
- "Message 00020080 (seq N) timeout" from mt7925e
- "workqueue: vmstat_update hogged CPU for >10000us" warnings
- "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
- WARNING traces in iommu_dma_unmap_page
- System becoming unresponsive
With the fix, these symptoms should not occur and the system should
remain responsive.
4. Run extended stress testing with WiFi activity during high CPU load:
$ stress-ng --cpu 128 --timeout 300s &
$ ping -f <router_ip> # flood ping to generate WiFi traffic
The system should remain stable without deadlocks.
5. Verify WiFi power management still functions correctly:
$ iw dev <interface> get power_save
$ # Enable/disable power save and verify WiFi connectivity remains stable
$ sudo iw dev <interface> set power_save on
$ ping -c 100 <router_ip>
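The log check in step 3 can be automated. The following Python sketch counts lines that match the symptom strings quoted in this report. The regular expressions are assumptions derived from those strings (real kernel messages carry timestamps and extra fields), and scan_log() and SYMPTOMS are illustrative names.

```python
import re

# Patterns derived from the symptom strings listed in step 3 (assumed formats).
SYMPTOMS = {
    "mcu_timeout": re.compile(r"Message 00020080 \(seq \d+\) timeout"),
    "workqueue_hog": re.compile(r"workqueue: \w+ hogged CPU for >\d+us"),
    "iommu_warning": re.compile(r"iommu_dma_unmap_page"),
}


def scan_log(lines):
    """Count how many lines match each known symptom pattern."""
    hits = {name: 0 for name in SYMPTOMS}
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits
```

For example, the output of "sudo dmesg" could be passed to scan_log() line by line; on a kernel with the fix, all counts should stay at zero for the duration of the test run.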
[Where problems could occur]
This change affects the MediaTek MT792x WiFi driver's power management and workqueue interaction on systems with mt7925e and similar chipsets.
Potential issues if the non-synchronous cancel is not safe in this context:
- If there are assumptions in the code that mac_work must be fully stopped before proceeding, using non-synchronous cancel might allow mac_work to run concurrently with subsequent operations, potentially causing race conditions
- The mac_work might access hardware or data structures that ps_work assumes are quiescent after the cancel call, leading to unexpected behavior or crashes
- Power management state transitions might become inconsistent if mac_work completes after ps_work has already proceeded with its power-save operations
However, these risks are mitigated by:
- The change is intentional and authored by MediaTek engineers who maintain the driver
- The alternative (synchronous cancel) causes a known deadlock with a reproduction rate of about 60%
- The workqueue subsystem guarantees that a given work item never runs concurrently with itself, which narrows (but does not eliminate) the window for races
- Similar patterns are used elsewhere in the kernel where work items need to coordinate
The impact is limited to:
- Systems with MediaTek MT792x series WiFi chipsets (mt7921, mt7925, etc.)
- High-load scenarios in which both work items are queued simultaneously
Other wireless drivers and systems without these chipsets are not affected.
[Other Info]
Upstream submission: https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
** Affects: linux (Ubuntu)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux-oem-6.14 (Ubuntu)
Importance: Undecided
Status: Invalid
** Affects: linux-oem-6.17 (Ubuntu)
Importance: Undecided
Status: Invalid
** Affects: linux (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux-oem-6.14 (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux (Ubuntu Questing)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux-oem-6.14 (Ubuntu Questing)
Importance: Undecided
Status: Invalid
** Affects: linux-oem-6.17 (Ubuntu Questing)
Importance: Undecided
Status: Invalid
** Affects: linux (Ubuntu Resolute)
Importance: Undecided
Assignee: AceLan Kao (acelankao)
Status: In Progress
** Affects: linux-oem-6.14 (Ubuntu Resolute)
Importance: Undecided
Status: Invalid
** Affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: Invalid
** Also affects: linux-oem-6.14 (Ubuntu)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Questing)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.14 (Ubuntu Questing)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Questing)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.14 (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Noble)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.14 (Ubuntu Noble)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided
Status: New
** Changed in: linux-oem-6.14 (Ubuntu Questing)
Status: New => Invalid
** Changed in: linux-oem-6.14 (Ubuntu Resolute)
Status: New => Invalid
** Changed in: linux-oem-6.17 (Ubuntu Questing)
Status: New => Invalid
** Changed in: linux-oem-6.17 (Ubuntu Resolute)
Status: New => Invalid
** Changed in: linux (Ubuntu Noble)
Status: New => In Progress
** Changed in: linux (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)
** Changed in: linux (Ubuntu Questing)
Status: New => In Progress
** Changed in: linux (Ubuntu Questing)
Assignee: (unassigned) => AceLan Kao (acelankao)
** Changed in: linux (Ubuntu Resolute)
Status: New => In Progress
** Changed in: linux (Ubuntu Resolute)
Assignee: (unassigned) => AceLan Kao (acelankao)
** Changed in: linux-oem-6.14 (Ubuntu Noble)
Status: New => In Progress
** Changed in: linux-oem-6.14 (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)
** Changed in: linux-oem-6.17 (Ubuntu Noble)
Status: New => In Progress
** Changed in: linux-oem-6.17 (Ubuntu Noble)
Assignee: (unassigned) => AceLan Kao (acelankao)
** Description changed:
[Impact]
- On Dell Brickyard systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and high-load situations due to a deadlock in the mt76 driver. The system shows "workqueue hogging CPU" messages followed by system hang, preventing completion of certification testing.
+ On Dell systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and high-load situations due to a deadlock in the mt76 driver. The system shows "workqueue hogging CPU" messages followed by system hang, preventing completion of certification testing.
The issue occurs because:
1. Two workqueue functions (ps_work and mac_work) attempt to cancel each other using cancel_delayed_work_sync()
2. In high-load situations, both works get queued but cannot execute until CPUs are available
3. When CPUs become available, both work functions may run simultaneously, each trying to synchronously cancel the other, resulting in a deadlock
The call path that creates the circular dependency is:
- mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
- mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
+ mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
+ mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
This affects Dell Pro Max Tower T4/T6 systems during firmware tests
(FWTS) and other certification testing, with a failure rate of 3/5
attempts, making it a critical issue for platform certification.
[Fix]
Replace cancel_delayed_work_sync() with cancel_delayed_work() in the mt792x_pm_power_save_work() function to eliminate the deadlock condition.
Upstream commit (submitted to linux-wireless):
https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
The non-synchronous cancel is safe here because:
- The work cancellation is part of the power-save flow, not a critical cleanup path
- Avoiding synchronous wait prevents the circular dependency that causes the deadlock
- The code becomes simpler and easier to maintain
[Test Plan]
On a Dell Brickyard system with MediaTek MT7925 WiFi (or similar affected platform):
1. Run the Checkbox certification test suite that triggers the issue:
- $ checkbox-cli run com.canonical.certification::client-cert-desktop-24-04-automated
+ $ checkbox-cli run com.canonical.certification::client-cert-desktop-24-04-automated
2. Specifically execute the firmware test cases that previously triggered the deadlock:
- - firmware/fwts_desktop_diagnosis
- - firmware/fwts_wakealarm.*
- - firmware/fwts_uefirtvariable.*
- - miscellanea/oops
+ - firmware/fwts_desktop_diagnosis
+ - firmware/fwts_wakealarm.*
+ - firmware/fwts_uefirtvariable.*
+ - miscellanea/oops
3. Monitor system logs for the previously observed symptoms:
- $ sudo dmesg -w
+ $ sudo dmesg -w
- Without the fix, you would see:
- - "Message 00020080 (seq N) timeout" from mt7925e
- - "workqueue: vmstat_update hogged CPU for >10000us" warnings
- - "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
- - WARNING traces in iommu_dma_unmap_page
- - System becoming unresponsive
+ Without the fix, you would see:
+ - "Message 00020080 (seq N) timeout" from mt7925e
+ - "workqueue: vmstat_update hogged CPU for >10000us" warnings
+ - "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
+ - WARNING traces in iommu_dma_unmap_page
+ - System becoming unresponsive
- With the fix, these symptoms should not occur and the system should
+ With the fix, these symptoms should not occur and the system should
remain responsive.
4. Run extended stress testing with WiFi activity during high CPU load:
- $ stress-ng --cpu 128 --timeout 300s &
- $ ping -f <router_ip> # flood ping to generate WiFi traffic
+ $ stress-ng --cpu 128 --timeout 300s &
+ $ ping -f <router_ip> # flood ping to generate WiFi traffic
- The system should remain stable without deadlocks.
+ The system should remain stable without deadlocks.
5. Verify WiFi power management still functions correctly:
- $ iw dev <interface> get power_save
- $ # Enable/disable power save and verify WiFi connectivity remains stable
- $ sudo iw dev <interface> set power_save on
- $ ping -c 100 <router_ip>
+ $ iw dev <interface> get power_save
+ $ # Enable/disable power save and verify WiFi connectivity remains stable
+ $ sudo iw dev <interface> set power_save on
+ $ ping -c 100 <router_ip>
[Where problems could occur]
This change affects the MediaTek MT792x WiFi driver's power management and workqueue interaction on systems with mt7925e and similar chipsets.
Potential issues if the non-synchronous cancel is not safe in this context:
- If there are assumptions in the code that mac_work must be fully stopped before proceeding, using non-synchronous cancel might allow mac_work to run concurrently with subsequent operations, potentially causing race conditions
- The mac_work might access hardware or data structures that ps_work assumes are quiescent after the cancel call, leading to unexpected behavior or crashes
- Power management state transitions might become inconsistent if mac_work completes after ps_work has already proceeded with its power-save operations
However, these risks are mitigated by:
- The change is intentional and authored by MediaTek engineers who maintain the driver
- The alternative (synchronous cancel) creates a known deadlock issue with 60% reproduction rate
- The workqueue subsystem provides inherent protection against most race conditions
- Similar patterns are used elsewhere in the kernel where work items need to coordinate
The impact is limited to:
- Systems with MediaTek MT792x series WiFi chipsets (mt7921, mt7925, etc.)
- Primarily affects high-load scenarios where both work items are queued simultaneously
- Does not affect other wireless drivers or systems without these chipsets
[Other Info]
Upstream submission: https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
** Description changed:
[Impact]
On Dell systems with MediaTek MT7925 WiFi cards (mt7925e driver), the system becomes unresponsive during firmware testing and high-load situations due to a deadlock in the mt76 driver. The system shows "workqueue hogging CPU" messages followed by system hang, preventing completion of certification testing.
The issue occurs because:
1. Two workqueue functions (ps_work and mac_work) attempt to cancel each other using cancel_delayed_work_sync()
2. In high-load situations, both works get queued but cannot execute until CPUs are available
3. When CPUs become available, both work functions may run simultaneously, each trying to synchronously cancel the other, resulting in a deadlock
The call path that creates the circular dependency is:
mt792x_mac_work() -> ... -> cancel_delayed_work_sync(&pm->ps_work);
mt792x_pm_power_save_work() -> cancel_delayed_work_sync(&mphy->mac_work);
-
- This affects Dell Pro Max Tower T4/T6 systems during firmware tests
- (FWTS) and other certification testing, with a failure rate of 3/5
- attempts, making it a critical issue for platform certification.
[Fix]
Replace cancel_delayed_work_sync() with cancel_delayed_work() in the mt792x_pm_power_save_work() function to eliminate the deadlock condition.
Upstream commit (submitted to linux-wireless):
https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
The non-synchronous cancel is safe here because:
- The work cancellation is part of the power-save flow, not a critical cleanup path
- Avoiding synchronous wait prevents the circular dependency that causes the deadlock
- The code becomes simpler and easier to maintain
[Test Plan]
- On a Dell Brickyard system with MediaTek MT7925 WiFi (or similar affected platform):
+ On a Dell system with MediaTek MT7925 WiFi (or similar affected platform):
1. Run the Checkbox certification test suite that triggers the issue:
$ checkbox-cli run com.canonical.certification::client-cert-desktop-24-04-automated
2. Specifically execute the firmware test cases that previously triggered the deadlock:
- firmware/fwts_desktop_diagnosis
- firmware/fwts_wakealarm.*
- firmware/fwts_uefirtvariable.*
- miscellanea/oops
3. Monitor system logs for the previously observed symptoms:
$ sudo dmesg -w
Without the fix, you would see:
- "Message 00020080 (seq N) timeout" from mt7925e
- "workqueue: vmstat_update hogged CPU for >10000us" warnings
- "workqueue: psi_avgs_work hogged CPU for >10000us" warnings
- WARNING traces in iommu_dma_unmap_page
- System becoming unresponsive
With the fix, these symptoms should not occur and the system should
remain responsive.
4. Run extended stress testing with WiFi activity during high CPU load:
$ stress-ng --cpu 128 --timeout 300s &
$ ping -f <router_ip> # flood ping to generate WiFi traffic
The system should remain stable without deadlocks.
5. Verify WiFi power management still functions correctly:
$ iw dev <interface> get power_save
$ # Enable/disable power save and verify WiFi connectivity remains stable
$ sudo iw dev <interface> set power_save on
$ ping -c 100 <router_ip>
[Where problems could occur]
This change affects the MediaTek MT792x WiFi driver's power management and workqueue interaction on systems with mt7925e and similar chipsets.
Potential issues if the non-synchronous cancel is not safe in this context:
- If there are assumptions in the code that mac_work must be fully stopped before proceeding, using non-synchronous cancel might allow mac_work to run concurrently with subsequent operations, potentially causing race conditions
- The mac_work might access hardware or data structures that ps_work assumes are quiescent after the cancel call, leading to unexpected behavior or crashes
- Power management state transitions might become inconsistent if mac_work completes after ps_work has already proceeded with its power-save operations
However, these risks are mitigated by:
- The change is intentional and authored by MediaTek engineers who maintain the driver
- The alternative (synchronous cancel) creates a known deadlock issue with 60% reproduction rate
- The workqueue subsystem provides inherent protection against most race conditions
- Similar patterns are used elsewhere in the kernel where work items need to coordinate
The impact is limited to:
- Systems with MediaTek MT792x series WiFi chipsets (mt7921, mt7925, etc.)
- Primarily affects high-load scenarios where both work items are queued simultaneously
- Does not affect other wireless drivers or systems without these chipsets
[Other Info]
Upstream submission: https://patchwork.kernel.org/project/linux-wireless/patch/20251215122231.3180648-1-leon.yen@mediatek.com/
--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2137448
Title:
System doesn't respond with mt76 call trace
Status in linux package in Ubuntu:
In Progress
Status in linux-oem-6.14 package in Ubuntu:
Invalid
Status in linux-oem-6.17 package in Ubuntu:
Invalid
Status in linux source package in Noble:
In Progress
Status in linux-oem-6.14 source package in Noble:
In Progress
Status in linux-oem-6.17 source package in Noble:
In Progress
Status in linux source package in Questing:
In Progress
Status in linux-oem-6.14 source package in Questing:
Invalid
Status in linux-oem-6.17 source package in Questing:
Invalid
Status in linux source package in Resolute:
In Progress
Status in linux-oem-6.14 source package in Resolute:
Invalid
Status in linux-oem-6.17 source package in Resolute:
Invalid
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2137448/+subscriptions