Sunday

[Bug 2154194] Re: [Jammy] Priority inversion problem in epoll for rt kernel

** Description changed:

  [SRU Justification]

  [Impact]

  The current epoll implementation in the 5.15 kernel utilizes a reader-writer
  lock (rwlock_t) to protect the ready event list. While this allows multiple
  producers to concurrently add items, it introduces a scheduling priority
  inversion vulnerability.

  If a high-priority consumer (such as a real-time thread calling epoll_wait)
  is blocked waiting for the exclusive write lock, it can be indefinitely
  stalled by a low-priority producer holding the read lock. This results in
  non-deterministic system stalls and latency spikes.

  [Fix]

  Cherry-pick upstream commit:
  0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock")

  The fix involves replacing rwlock_t with spinlock_t, and removing the
  now-redundant lockless helper functions (list_add_tail_lockless and
  chain_epi_lockless). This ensures that under real-time configurations, priority
- inheritance works correctly across the epoll subsystem, eliminating the
- priority inversion problem.
+ inheritance works correctly across the epoll subsystem.

  [Test Plan]

  This is a priority inversion race condition, so it is highly non-deterministic
- and cannot be triggered on command. This is why it is not feasible to provide a
- reliable reproduction script.
+ and impractical to trigger on command. This is why it is not feasible to
+ provide a reliable reproduction script.

  Therefore, validation relies on verifying that the replacement locking
  mechanism functions correctly, introduces no regressions, and scales safely
  under synthetic load.

- There is a test kernel available in the following PPA:
- https://launchpad.net/~munirsid/+archive/ubuntu/lp2154194
+ Validation was performed on a 2-core/4GB RAM x86 VM running the test kernel in
+ the following PPA: https://launchpad.net/~munirsid/+archive/ubuntu/lp2154194.
+
+ As mentioned in the upstream commit, we ran `perf bench epoll wait` with 4
+ threads (-t 4) and 10 iterations (-r 10). By configuring 4 threads on a 2-core
+ VM, we intentionally overcommit the CPUs to force heavy context-switching and
+ lock preemption in order to stress-test the new spinlock boundaries under
+ contention.
+
+ Observed Results:
+
+ Before patch (5.15.0-179-generic #189-Ubuntu):
+ $ perf bench epoll wait -t 4 -r 10
+ [thread 0] fdmap: 0x556599b44e80 ... 0x556599b44f7c [ 281994 ops/sec ]
+ [thread 1] fdmap: 0x556599b451a0 ... 0x556599b4529c [ 279775 ops/sec ]
+ [thread 2] fdmap: 0x556599b45420 ... 0x556599b4551c [ 267177 ops/sec ]
+ [thread 3] fdmap: 0x556599b456a0 ... 0x556599b4579c [ 270819 ops/sec ]
+ Averaged 274941 operations/sec (+- 1.29%), total secs = 10
+
+ After patch (5.15.0-183-generic #193+TEST427638v20260525b1-Ubuntu):
+ $ perf bench epoll wait -t 4 -r 10
+ [thread 0] fdmap: 0x55a665734e80 ... 0x55a665734f7c [ 291941 ops/sec ]
+ [thread 1] fdmap: 0x55a6657351a0 ... 0x55a66573529c [ 306480 ops/sec ]
+ [thread 2] fdmap: 0x55a665735420 ... 0x55a66573551c [ 286868 ops/sec ]
+ [thread 3] fdmap: 0x55a6657356a0 ... 0x55a66573579c [ 312054 ops/sec ]
+ Averaged 299335 operations/sec (+- 1.98%), total secs = 10
+
+ Consistent with the upstream commit description for x86, we observed per-thread
+ throughput improve across all 4 threads, with ~8.9% average improvement in
+ throughput.
+
+ No regression was observed and the logs showed no lockups, RCU stalls, or
+ kernel warnings across multiple iterations.

  [Where Problems Could Occur]

  There could be a performance degradation with some synthetic workloads on the
  GA kernel as seen in the upstream commit description [0]. In artificial
  benchmarks where hundreds of threads continuously spam epoll events,
  throughput can drop due to serialization around the new spinlock.

  However, testing with realistic workloads (via perf bench epoll wait) actually
- demonstrates a performance improvement on x86 architectures.
+ demonstrates a performance improvement on x86 architectures, as mentioned in
+ the upstream commit, and demonstrated in the Test Plan section above.

  The regression potential for real-world production environments is low, as
  typical workloads do not exhibit continuous, uninterrupted event-spamming
- behavior. Moreover, the fix is strictly isolated to fs/eventpoll.c and alters
- no external kernel APIs.
+ behavior. Moreover, the fix is strictly isolated to fs/eventpoll.c and is
+ already available on Noble and Resolute, where it has been thoroughly tested.

  [Other Info]

  Similar issues have been reported in [1] and [2]. This bug was addressed
- upstream [0] and has already been integrated into Noble and subsequent
+ upstream [0] and the fix has already been integrated into Noble and subsequent
  releases. Backporting this fix ensures stability for users of the 5.15
  real-time kernel.

  [0] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd
  [1] - https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/
  [2] - https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/

https://bugs.launchpad.net/bugs/2154194

Title:
  [Jammy] Priority inversion problem in epoll for rt kernel

Status in linux package in Ubuntu:
  New
Status in linux source package in Jammy:
  New
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Resolute:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]

  The current epoll implementation in the 5.15 kernel utilizes a reader-writer
  lock (rwlock_t) to protect the ready event list. While this allows multiple
  producers to concurrently add items, it introduces a scheduling priority
  inversion vulnerability.

  If a high-priority consumer (such as a real-time thread calling epoll_wait)
  is blocked waiting for the exclusive write lock, it can be indefinitely
  stalled by a low-priority producer holding the read lock. This results in
  non-deterministic system stalls and latency spikes.

  [Fix]

  Cherry-pick upstream commit:
  0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock")

  The fix involves replacing rwlock_t with spinlock_t, and removing the
  now-redundant lockless helper functions (list_add_tail_lockless and
  chain_epi_lockless). This ensures that under real-time configurations,
  priority inheritance works correctly across the epoll subsystem.

  [Test Plan]

  This is a priority inversion race condition, so it is highly
  non-deterministic and impractical to trigger on command. This is why it is
  not feasible to provide a reliable reproduction script.

  Therefore, validation relies on verifying that the replacement locking
  mechanism functions correctly, introduces no regressions, and scales safely
  under synthetic load.

  Validation was performed on a 2-core/4GB RAM x86 VM running the test kernel
  in the following PPA:
  https://launchpad.net/~munirsid/+archive/ubuntu/lp2154194.

  As mentioned in the upstream commit, we ran `perf bench epoll wait` with 4
  threads (-t 4) and 10 iterations (-r 10). By configuring 4 threads on a
  2-core VM, we intentionally overcommit the CPUs to force heavy
  context-switching and lock preemption in order to stress-test the new
  spinlock boundaries under contention.

  Observed Results:

  Before patch (5.15.0-179-generic #189-Ubuntu):
  $ perf bench epoll wait -t 4 -r 10
  [thread 0] fdmap: 0x556599b44e80 ... 0x556599b44f7c [ 281994 ops/sec ]
  [thread 1] fdmap: 0x556599b451a0 ... 0x556599b4529c [ 279775 ops/sec ]
  [thread 2] fdmap: 0x556599b45420 ... 0x556599b4551c [ 267177 ops/sec ]
  [thread 3] fdmap: 0x556599b456a0 ... 0x556599b4579c [ 270819 ops/sec ]
  Averaged 274941 operations/sec (+- 1.29%), total secs = 10

  After patch (5.15.0-183-generic #193+TEST427638v20260525b1-Ubuntu):
  $ perf bench epoll wait -t 4 -r 10
  [thread 0] fdmap: 0x55a665734e80 ... 0x55a665734f7c [ 291941 ops/sec ]
  [thread 1] fdmap: 0x55a6657351a0 ... 0x55a66573529c [ 306480 ops/sec ]
  [thread 2] fdmap: 0x55a665735420 ... 0x55a66573551c [ 286868 ops/sec ]
  [thread 3] fdmap: 0x55a6657356a0 ... 0x55a66573579c [ 312054 ops/sec ]
  Averaged 299335 operations/sec (+- 1.98%), total secs = 10

  Consistent with the upstream commit description for x86, we observed
  per-thread throughput improve across all 4 threads, with ~8.9% average
  improvement in throughput.

  No regression was observed and the logs showed no lockups, RCU stalls, or
  kernel warnings across multiple iterations.

  [Where Problems Could Occur]

  There could be a performance degradation with some synthetic workloads on
  the GA kernel as seen in the upstream commit description [0]. In artificial
  benchmarks where hundreds of threads continuously spam epoll events,
  throughput can drop due to serialization around the new spinlock.

  However, testing with realistic workloads (via perf bench epoll wait)
  actually demonstrates a performance improvement on x86 architectures, as
  mentioned in the upstream commit, and demonstrated in the Test Plan section
  above.

  The regression potential for real-world production environments is low, as
  typical workloads do not exhibit continuous, uninterrupted event-spamming
  behavior. Moreover, the fix is strictly isolated to fs/eventpoll.c and is
  already available on Noble and Resolute, where it has been thoroughly
  tested.

  [Other Info]

  Similar issues have been reported in [1] and [2]. This bug was addressed
  upstream [0] and the fix has already been integrated into Noble and
  subsequent releases. Backporting this fix ensures stability for users of the
  5.15 real-time kernel.

  [0] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd
  [1] - https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/
  [2] - https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/
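
The ~8.9% average improvement quoted in the Test Plan can be re-derived from the per-thread figures in the two `perf bench epoll wait` runs. The following is a minimal sketch in Python, not part of the bug report; the variable and function names are illustrative, and the numbers are copied from the reported output. Pairing thread N before with thread N after is positional only, since the threads in the two runs are unrelated, so the per-thread percentages are indicative at best.

```python
# Per-thread throughput in ops/sec, copied from the perf bench output
# reported in the Test Plan (before and after the patch).
before = [281994, 279775, 267177, 270819]
after = [291941, 306480, 286868, 312054]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Mean throughput per run; perf bench prints these truncated to integers.
avg_before = sum(before) / len(before)
avg_after = sum(after) / len(after)

# Overall improvement of the averaged throughput.
overall = pct_change(avg_before, avg_after)

# Positional per-thread comparison (illustrative only, see note above).
per_thread = [pct_change(b, a) for b, a in zip(before, after)]

print(f"average before: {avg_before:.0f} ops/sec")
print(f"average after:  {avg_after:.0f} ops/sec")
print(f"overall change: {overall:.1f}%")
for i, p in enumerate(per_thread):
    print(f"thread {i}: {p:+.1f}%")
```

Note that the reported run-to-run variation (+- 1.29% and +- 1.98%) is small compared with the overall change, which is what makes the comparison meaningful on a single pair of runs.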
