
[Bug 2089318] Re: kernel hard lockup in cgroups during eBPF workload

The review has been sent to the mailing list:
https://lists.ubuntu.com/archives/kernel-team/2025-January/156160.html

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2089318

Title:
kernel hard lockup in cgroups during eBPF workload

Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Jammy:
In Progress

Bug description:

SRU Justification:

[Impact]

There is a kernel hard lockup in which all CPUs get stuck acquiring an
already-locked spinlock (css_set_lock) within the cgroup subsystem
while the user is running a certain eBPF program.

This has been hit on focal 5.15 backport kernels.
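
As an illustration of the failure mode (a hand-written sketch, not the
actual cgroup code or the eBPF reproducer), the pattern is a spinlock
taken with spin_lock_irq() whose owner never releases it: every other
CPU that then contends on it spins with interrupts disabled until the
NMI watchdog declares a hard lockup. Do not load this on a machine you
care about; it deadlocks by design:

```
/*
 * Sketch only: a minimal kernel module reproducing the generic
 * "hard lockup on a never-released spinlock" pattern described above.
 * demo_lock is a stand-in for css_set_lock; this is not kernel code.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);	/* stand-in for css_set_lock */

static int __init lockup_demo_init(void)
{
	spin_lock_irq(&demo_lock);	/* lock taken, local IRQs now off */
	/*
	 * Simulated bug: the matching spin_unlock_irq() never runs, so
	 * the second acquisition below spins forever with interrupts
	 * disabled -- exactly the state the NMI watchdog panics on.
	 */
	spin_lock_irq(&demo_lock);	/* never returns: hard lockup */
	return 0;
}

static void __exit lockup_demo_exit(void)
{
}

module_init(lockup_demo_init);
module_exit(lockup_demo_exit);
MODULE_LICENSE("GPL");
```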

[Fix]

The bug was introduced in 5.15 via commit 74e4b956eb1c and fixed in
5.16 via commit 46307fd6e27a.


[Test Plan]

The fix has been tested by LP user maxwolffe, who opened the ticket
and helped with the investigation.


[Where problems could occur]

Because the cgroup code changed between 5.15 and 5.16, the fix was
not a direct cherry-pick and required some adjustment. This could
cause issues in the future if further backports to the 5.15 kernel
touch the same area of the cgroup code.


Original bug description follows
----------------------------------
Hi friends,

We hit a kernel hard lockup where all CPUs are stuck acquiring an
already-locked spinlock (css_set_lock) within the cgroup subsystem.
Below are the call stacks from a memory dump of a two-core system
running Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs
on Azure and GCP too. It happens non-deterministically (less than 1%
of the time) and can occur at any point during VM execution. We
suspect it's a deadlock triggered by some race condition, but we
don't know for sure.

```
PID: 21079 TASK: ffff91fdcd1dc000 CPU: 0 COMMAND: "sh"
 #0 [fffffe7127850cb8] machine_kexec at ffffffffadc92680
 #1 [fffffe7127850d18] __crash_kexec at ffffffffadda0b9f
 #2 [fffffe7127850de0] panic at ffffffffae8f56be
 #3 [fffffe7127850e70] unknown_nmi_error.cold at ffffffffae8eb4c8
 #4 [fffffe7127850e90] default_do_nmi at ffffffffae99c639
 #5 [fffffe7127850eb8] exc_nmi at ffffffffae99c7db
 #6 [fffffe7127850ef0] end_repeat_nmi at ffffffffaea017f3
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: ffffffffadd40eff RSP: ffffa1f68589fc60 RFLAGS: 00000002 (interrupt disabled!!)
    RAX: 0000000000000001 RBX: ffffffffb0ea5804 RCX: ffff91fb597c8980
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
    RBP: ffffa1f68589fc88 R8: 0000000000005259 R9: 00000000597c8980
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffa1f68589fdf8
    R13: ffff91fdcd1d8000 R14: 0000000000004100 R15: ffff91fdcd1d8000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
 #7 [ffffa1f68589fc60] native_queued_spin_lock_slowpath at ffffffffadd40eff
 #8 [ffffa1f68589fc90] _raw_spin_lock_irq at ffffffffae9af19a
 #9 [ffffa1f68589fca0] cgroup_can_fork at ffffffffaddb0de8
#10 [ffffa1f68589fce8] copy_process at ffffffffadcc1938
#11 [ffffa1f68589fcf0] filemap_map_pages at ffffffffadeb68db
#12 [ffffa1f68589fdf0] __x64_sys_vfork at ffffffffadcc2a20
#13 [ffffa1f68589fe70] x64_sys_call at ffffffffadc068a9
#14 [ffffa1f68589fe80] do_syscall_64 at ffffffffae99a9e4
#15 [ffffa1f68589fec0] exit_to_user_mode_prepare at ffffffffadd725ad
#16 [ffffa1f68589ff00] irqentry_exit_to_user_mode at ffffffffae99f43e
#17 [ffffa1f68589ff10] irqentry_exit at ffffffffae99f46d
#18 [ffffa1f68589ff18] clear_bhb_loop at ffffffffaea018c5
#19 [ffffa1f68589ff28] clear_bhb_loop at ffffffffaea018c5
#20 [ffffa1f68589ff38] clear_bhb_loop at ffffffffaea018c5
#21 [ffffa1f68589ff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
    RIP: 00007fddfa4cebcc RSP: 00007fffaa741990 RFLAGS: 00000202
    RAX: ffffffffffffffda RBX: 000055ea66750428 RCX: 00007fddfa4cebcc
    RDX: 0000000000000000 RSI: 00007fffaa7419c0 RDI: 000055ea663c8866
    RBP: 0000000000000003 R8: 00007fffaa7419c0 R9: 000055ea667505f0
    R10: 0000000000000008 R11: 0000000000000202 R12: 00007fffaa7419c0
    R13: 00007fffaa741ae0 R14: 0000000000000000 R15: 000055ea663de810
    ORIG_RAX: 000000000000003a CS: 0033 SS: 002b

PID: 20304 TASK: ffff91fb05440000 CPU: 1 COMMAND: "Writer:Driver>C"
 #0 [fffffe6c293d3e10] crash_nmi_callback at ffffffffadc81ec0
 #1 [fffffe6c293d3e48] nmi_handle at ffffffffadc49b03
 #2 [fffffe6c293d3e90] default_do_nmi at ffffffffae99c5a5
 #3 [fffffe6c293d3eb8] exc_nmi at ffffffffae99c7db
 #4 [fffffe6c293d3ef0] end_repeat_nmi at ffffffffaea017f3
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: ffffffffadd40eff RSP: ffffa1f6853afd00 RFLAGS: 00000002 (interrupt disabled!!)
    RAX: 0000000000000001 RBX: ffffffffb0ea5804 RCX: ffff91fa1d0aee00
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
    RBP: ffffa1f6853afd28 R8: 000000000000525a R9: 000000001d0aee00
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffa1f6853afe98
    R13: ffff91fd8eeea000 R14: 00000000003d0f00 R15: ffff91fd8eeea000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
 #5 [ffffa1f6853afd00] native_queued_spin_lock_slowpath at ffffffffadd40eff
 #6 [ffffa1f6853afd30] _raw_spin_lock_irq at ffffffffae9af19a
 #7 [ffffa1f6853afd40] cgroup_can_fork at ffffffffaddb0de8
 #8 [ffffa1f6853afd88] copy_process at ffffffffadcc1938
 #9 [ffffa1f6853afe20] kernel_clone at ffffffffadcc262d
#10 [ffffa1f6853afe90] __do_sys_clone at ffffffffadcc2a9d
#11 [ffffa1f6853aff10] __x64_sys_clone at ffffffffadcc2ae5
#12 [ffffa1f6853aff20] x64_sys_call at ffffffffadc05579
#13 [ffffa1f6853aff30] do_syscall_64 at ffffffffae99a9e4
#14 [ffffa1f6853aff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
    RIP: 00007f0d8bcac9f6 RSP: 00007f0cfabfcc38 RFLAGS: 00000206
    RAX: ffffffffffffffda RBX: 00007f0cfabfcc90 RCX: 00007f0d8bcac9f6
    RDX: 00007f0ced3ff910 RSI: 00007f0ced3feef0 RDI: 00000000003d0f00
    RBP: ffffffffffffff80 R8: 00007f0ced3ff640 R9: 00007f0ced3ff640
    R10: 00007f0ced3ff910 R11: 0000000000000206 R12: 00007f0ced3ff640
    R13: 0000000000000016 R14: 00007f0d8bc1b7d0 R15: 00007f0cfabfcdf0
    ORIG_RAX: 0000000000000038 CS: 0033 SS: 002b
```
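
Both stacks show the same picture: each CPU entered
native_queued_spin_lock_slowpath() from _raw_spin_lock_irq() in
cgroup_can_fork() on the fork path, consistent with css_set_lock
being held and never released. The "(interrupt disabled!!)"
annotation can be read straight off the saved flags: bit 9 of RFLAGS
is IF, the interrupt-enable flag. A small user-space sketch of that
decoding, using values copied from the frames above:

```
/* Decode IF (bit 9 of RFLAGS); IF=0 means local interrupts disabled. */
#include <stdio.h>

int main(void)
{
	unsigned long kernel_rflags = 0x00000002; /* both spinning CPUs */
	unsigned long user_rflags = 0x00000202;   /* CPU 0 user-mode frame */

	printf("kernel frames: IF=%lu (interrupts disabled)\n",
	       (kernel_rflags >> 9) & 1UL);
	printf("user frame:    IF=%lu (interrupts enabled)\n",
	       (user_rflags >> 9) & 1UL);
	return 0;
}
```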

Environment

```
$ uname -a
Linux ip-172-31-16-171 5.15.0-1072-aws #78~20.04.1-Ubuntu SMP Wed Oct 9 15:30:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
stepping : 6
microcode : 0xd0003e8
cpu MHz : 2900.036
cache size : 55296 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs mmio_stale_data eibrs_pbrsb gds bhi
bogomips : 5800.07
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
```

We see this very infrequently, but have experienced it on a variety
of instance types: at least r6i.large, r6i.xlarge, and r6i.2xlarge.

Thanks!

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions
