Wednesday

[Bug 2158106] [NEW] [Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system hangs

Public bug reported:

[SRU Justification]

[Impact]

Systems on Jammy running high-throughput DMA workloads experience soft lockups and RCU stalls in fq_flush_timeout, which result in system hangs.

The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when both are full, the primary "loaded" magazine is pushed to a global depot (a fixed-size array of 32 magazines per size-bin). When the depot is also full, the overflow magazine is freed via iova_magazine_free_pfns(), which acquires iova_rbtree_lock and performs up to 128 rbtree lookups and removals while holding it.

The problem manifests through the flush-queue timer. Every 10ms, fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(), all recycled IOVAs are funnelled into the timer CPU's magazines. Once those magazines and the shared depot are full, every subsequent overflow triggers the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations under iova_rbtree_lock, all within the same softirq:

  fq_flush_timeout (timer softirq on CPU X)
    iova_domain_flush
      for_each_possible_cpu(cpu):
        fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
          free_iova_fast
            __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
              if depot_size >= 32:
                iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)

The RCU stall trace from an affected system on 5.15.0-117 confirms this exact path with reliable stack frames:

  native_queued_spin_lock_slowpath+0x2c/0x40
  _raw_spin_lock_irqsave+0x3d/0x50
  iova_magazine_free_pfns.part.0+0x20/0xd0
  free_iova_fast+0x219/0x290
  fq_ring_free+0xa8/0x170
  fq_flush_timeout+0x74/0xc0
  call_timer_fn
  run_timer_softirq
  __do_softirq

[Fix]

Backport upstream commits, adapted for the 5.15 codebase:

1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
2. 233045378dbb ("iommu/iova: Manage the depot list size")

Cherry-pick upstream commit:

3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")

Patch 1 replaces the fixed-size depot array with an unbounded singly-linked list. Magazines are always pushed to the depot regardless of size. As a result, the overflow path and its inline call to iova_magazine_free_pfns are eliminated from __iova_rcache_insert.

Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a delayed_work (background workqueue) that trims the depot when it exceeds num_online_cpus() magazines. This reclaim runs in process context, which is preemptible and sleepable, and therefore, cannot cause soft lockups.

Patch 3 fixes a kmemleak false positive introduced by patch 1.

Adaptations made for 5.15 backport:

- Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h because in 5.15, struct iova_rcache is defined in the header (upstream moved it into iova.c in a prior refactoring series not present in 5.15).
- The rcache init function in 5.15 is init_iova_rcaches() (static void, called unconditionally from init_iova_domain) rather than upstream's iova_domain_init_rcaches() (exported, returns int with error cleanup). The backport preserves the 5.15 function signature and error handling pattern.
- 5.15 uses top-of-function variable declarations rather than upstream's C99 in-loop declarations.
- The core logic (depot linked-list, overflow elimination, background worker) is identical between upstream and the backport.

[Test Plan]

TODO

[Where problems could occur]

Regression risk is low as changes in patches 1 and 2 are confined to the IOVA rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No changes have been made to IOVA allocation or free semantics from the caller's perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover, the fix is already available on Noble and Resolute, where it has been thoroughly tested.

[Other Info]

Similar issues have been reported in [0], [1], and [2]. The fix has already been integrated into Noble and subsequent releases. Backporting this fix ensures stability for users of the 5.15 kernel.

[0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
[1] - https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
[2] - https://access.redhat.com/solutions/7031930

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Undecided
     Assignee: Munir Siddiqui (munirsid)
         Status: In Progress

** Affects: linux (Ubuntu Noble)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Resolute)
     Importance: Undecided
         Status: Fix Released

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Jammy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Noble)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Resolute)
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Munir Siddiqui (munirsid)

--
You received this bug notification because you are subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2158106
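The depot overflow described under [Impact] and the behaviour change described under [Fix] can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the names struct mag, struct rcache, rcache_insert(), depot_trim() and run_model() are hypothetical, locking and the rbtree are not modelled, and the "inline free" counter merely stands in for the PFNs that iova_magazine_free_pfns() would have to remove under iova_rbtree_lock. The trim loop is also a simplification; how the real background worker paces itself is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define MAG_SIZE  128   /* PFNs per magazine, as stated in the report */
#define MAX_DEPOT  32   /* fixed depot capacity in the 5.15 layout */

struct mag {                         /* hypothetical stand-in for a magazine */
	unsigned long size;
	unsigned long pfns[MAG_SIZE];
	struct mag *next;                /* link for the list-based depot */
};

struct rcache {                      /* hypothetical: one CPU plus one depot */
	struct mag *loaded, *prev;       /* the two per-CPU magazines */
	struct mag *depot;               /* depot modelled as a singly-linked list */
	unsigned long depot_size;
	int bounded;                     /* 1: old 32-magazine cap, 0: unbounded */
	unsigned long inline_frees;      /* PFNs that took the modelled slow path */
};

struct stats {
	unsigned long inline_frees;      /* PFNs freed inline during inserts */
	unsigned long depot_size;        /* depot length before trimming */
	unsigned long trimmed;           /* magazines reclaimed by the trim step */
};

static struct mag *mag_alloc(void)
{
	struct mag *m = calloc(1, sizeof(*m));

	if (!m)
		abort();
	return m;
}

/* Recycle one PFN into the cache, overflowing to the depot when needed. */
static void rcache_insert(struct rcache *rc, unsigned long pfn)
{
	if (rc->loaded->size == MAG_SIZE && rc->prev->size < MAG_SIZE) {
		struct mag *tmp = rc->loaded;    /* swap loaded and prev */

		rc->loaded = rc->prev;
		rc->prev = tmp;
	} else if (rc->loaded->size == MAG_SIZE) {
		struct mag *full = rc->loaded;

		rc->loaded = mag_alloc();
		if (rc->bounded && rc->depot_size >= MAX_DEPOT) {
			/* Old behaviour: the whole magazine is freed inline. */
			rc->inline_frees += full->size;
			free(full);
		} else {
			/* List depot: always push, no inline free. */
			full->next = rc->depot;
			rc->depot = full;
			rc->depot_size++;
		}
	}
	assert(rc->loaded->size < MAG_SIZE);
	rc->loaded->pfns[rc->loaded->size++] = pfn;
}

/* Stand-in for the background reclaim: shrink the depot to `limit`. */
static unsigned long depot_trim(struct rcache *rc, unsigned long limit)
{
	unsigned long trimmed = 0;

	while (rc->depot_size > limit) {
		struct mag *m = rc->depot;

		rc->depot = m->next;
		rc->depot_size--;
		free(m);
		trimmed++;
	}
	return trimmed;
}

/* Insert n PFNs, then trim to `limit` (a stand-in for num_online_cpus()). */
static struct stats run_model(int bounded, unsigned long n, unsigned long limit)
{
	struct rcache rc = { .loaded = mag_alloc(), .prev = mag_alloc(),
			     .bounded = bounded };
	struct stats st;
	unsigned long i;

	for (i = 0; i < n; i++)
		rcache_insert(&rc, i);
	st.inline_frees = rc.inline_frees;
	st.depot_size = rc.depot_size;
	st.trimmed = depot_trim(&rc, limit);
	depot_trim(&rc, 0);              /* release what is left of the model */
	free(rc.loaded);
	free(rc.prev);
	return st;
}
```

In this model, run_model(1, 5377, 8) reports 1152 PFNs freed inline once the 32-magazine depot is full, whereas run_model(0, 5377, 8) reports none: all 41 overflow magazines land on the list and the excess is left to the separate trim step. The limit of 8 is an arbitrary example value.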
