Tuesday

[Bug 2137755] Re: System hangs during stress-ng stack test

SRU for Q - https://lists.ubuntu.com/archives/kernel-team/2026-April/166868.html

-- 
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2137755

Title: System hangs during stress-ng stack test

Status in HWE Next: New
Status in linux package in Ubuntu: Invalid
Status in linux-oem-6.17 package in Ubuntu: Invalid
Status in linux source package in Noble: Invalid
Status in linux-oem-6.17 source package in Noble: Fix Released
Status in linux source package in Questing: In Progress
Status in linux-oem-6.17 source package in Questing: Invalid

Bug description:

[Impact]

The stress-ng memory stress test fails with a stack stressor timeout on Dell systems (CID: 202511-38062) running kernel 6.17.0-1007-oem. The stack stressor, which creates heavy memory pressure and swap activity, consistently times out after running for the expected duration.

The issue occurs because the swap allocator uses an incorrect index when retrying swap cache reclaim after encountering a race condition. During heavy memory pressure (such as that generated by the stack stressor), the allocator reclaims cached swap slots while scanning. If it finds a folio that has already been removed from the swap cache due to a race, it retries, but the retry uses the wrong index, which can lead to:

1. Reclaiming irrelevant swap folios instead of the intended ones
2. Inefficient swap reclaim behavior under memory pressure
3. Performance degradation that causes stress tests to time out

Affected hardware: Dell systems (CID: 202511-38062) with high core count and memory configurations
Failure rate: 100% (2/2 test runs failed)

[Fix]

Upstream commit a733d8de7f1cc ("mm, swap: fix swap cache index error when retrying reclaim") fixes the swap cache index handling. The fix makes two key changes:

1. Makes the `entry` variable const to prevent incorrect reassignment
2. Uses `folio->swap` directly when updating the offset after retrying, instead of using the stale `entry` variable

This ensures that when the allocator retries after a race condition, it uses the correct swap cache index from the locked folio, preventing reclaim of irrelevant folios.

The patch is upstream in mainline kernel v6.18 and was reviewed by multiple memory management maintainers.

Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com
Fixes: fae859550531 ("mm, swap: avoid reclaiming irrelevant swap cache")

[Test Plan]

On affected Dell systems (CID: 202511-38062) or similar systems with high core count and memory:

1. Install a kernel with the fix

2. Run the stress test:

```
# Run stress-ng with stack stressor
stress-ng --aggressive --verify --oom-avoid-bytes 10% --timeout 920 --stack 8
```

3. Monitor the test execution:
   - The test should complete within the expected 920 second timeout
   - Check that stress-ng reports "successful run completed" for the stack stressor

   Without the patch:
   - The stress-ng stack stressor times out and is forcefully terminated
   - The system may hang if the stress-ng process cannot be killed

   With the patch:
   - The stress-ng stack stressor completes within the timeout period

4. Optionally verify swap activity during the test:

```
# Monitor swap usage
watch -n 1 'free -h && cat /proc/swaps'
```

   Swap should be actively used and reclaimed without unusual delays.

[Where problems could occur]

The changes affect the swap file subsystem's reclaim logic in mm/swapfile.c, specifically the __try_to_reclaim_swap() function. If the fix introduces incorrect behavior:

1. **Incorrect folio identification**: If `folio->swap` doesn't properly reflect the current state after locking, the code might still reclaim the wrong folio. However, this is unlikely since the folio is locked and the swap entry is validated before use.

2. **Performance regression**: The change from using a cached `entry` value to dereferencing `folio->swap` multiple times could theoretically impact performance. However, this should be negligible, as the additional dereferences only occur in the retry path (the race condition case), which is not the common case.

3. **Const qualifier issues**: Making `entry` const prevents reassignment. If there were other code paths that relied on reassigning `entry` (not visible in the upstream patch), compilation would fail. However, the upstream kernel has this change merged and tested.

4. **Backport conflicts**: The backport required manual resolution because the target branch still has an `address_space` variable that was removed upstream. If the resolution was incorrect, swap cache lookups could fail. However, the resolution preserves the `address_space` variable while applying the const qualifier and `folio->swap` usage as intended.

The impact is limited to swap reclaim behavior under memory pressure. The fix makes the code more correct by ensuring the right swap slots are reclaimed during races, which should improve rather than degrade stability.

To manage notifications about this bug go to:
https://bugs.launchpad.net/hwe-next/+bug/2137755/+subscriptions
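
The pass/fail criteria in steps 2 and 3 of the test plan could be scripted roughly as follows. This is only a sketch and not part of the bug report: the helper names `check_stack_log` and `run_stack_test`, the log file name, and the `GRACE` margin are hypothetical. It also assumes GNU coreutils `timeout` is available and that stress-ng exits with status 0 on success. The success string is the one quoted in the test plan.

```shell
#!/bin/sh
# Sketch only: helper names, log path, and GRACE are illustrative assumptions.

# Return 0 if the log contains the success message named in the test plan.
check_stack_log() {
    grep -q "successful run completed" "$1"
}

# Run the stack stressor from the test plan under a hard deadline, so that a
# run which overshoots its own 920 s timeout is reported as a failure.
run_stack_test() {
    log=${1:-stress-ng-stack.log}   # assumed log location
    grace=${GRACE:-120}             # assumed extra seconds before giving up

    # stress-ng logs to stderr by default, so capture both streams.
    timeout --signal=KILL "$((920 + grace))" \
        stress-ng --aggressive --verify --oom-avoid-bytes 10% \
                  --timeout 920 --stack 8 >"$log" 2>&1
    status=$?

    if [ "$status" -eq 0 ] && check_stack_log "$log"; then
        echo "PASS"
    else
        echo "FAIL (exit status $status)"
        return 1
    fi
}
```

After sourcing the file, `run_stack_test /tmp/stack.log` would run the test once and print PASS or FAIL. A wrapper like this cannot detect a full system hang, so on an unpatched kernel the console or a second session may still be needed to observe the failure.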
