Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu)
       Status: New => Confirmed

--
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2148538

Title:
  Kernel lockup on 6.17.0-1017-oem Lenovo P14s gen 6 AMD Ryzen AI 7 pro 350

Status in linux package in Ubuntu:
  Confirmed
Status in linux-oem-6.17 package in Ubuntu:
  Confirmed
Status in linux source package in Noble:
  Invalid
Status in linux-oem-6.17 source package in Noble:
  In Progress
Status in linux source package in Questing:
  In Progress
Status in linux-oem-6.17 source package in Questing:
  Invalid
Status in linux source package in Resolute:
  Invalid
Status in linux-oem-6.17 source package in Resolute:
  Invalid

Bug description:
  I have a Lenovo ThinkPad P14s Gen 6 AMD Ryzen AI 7 pro 350

  I have installed
  - 24.04 LTS
  - 6.17.0-1017-oem
  - linux-image-6.17.0-1017-oem 6.17.0-1017.17

  Regularly (at this point, seems to happen once per day), the kernel locks up (actually, let me amend that; I just noticed that the key combination I thought would go into terminal mode doesn't do that, so I don't know if the kernel has locked up or gnome has frozen).

  I haven't been able to identify particular workloads that are causing this behavior. At any given instance, I am running firefox, brave, mattermost, vs code, multiple terminal sessions, obsidian, perhaps some qemu vms (no GUI).

  Looking at the journal, after one recent lockup, I saw the following:

  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  workqueue: amdgpu_tlb_fence_work [amdgpu] hogged CPU for >13333us 35 times, consider switching to WQ_UNBOUND
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  INFO: task kworker/7:0:37823 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/7:0 state:D stack:0 pid:37823 tgid:37823 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/15:0:39916 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/15:0 state:D stack:0 pid:39916 tgid:39916 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/12:4:40454 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/12:4 state:D stack:0 pid:40454 tgid:40454 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/5:4:40750 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/5:4 state:D stack:0 pid:40750 tgid:40750 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/15:3:40825 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/15:3 state:D stack:0 pid:40825 tgid:40825 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/10:8:40844 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/10:8 state:D stack:0 pid:40844 tgid:40844 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/10:9:40845 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/10:9 state:D stack:0 pid:40845 tgid:40845 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/4:2:41419 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/4:2 state:D stack:0 pid:41419 tgid:41419 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/10:0:41524 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/10:0 state:D stack:0 pid:41524 tgid:41524 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? _raw_spin_lock_irqsave+0xe/0x20
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  INFO: task kworker/4:0:42116 blocked for more than 122 seconds.
  Tainted: P O 6.17.0-1017-oem #17-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/4:0 state:D stack:0 pid:42116 tgid:42116 ppid:2 task_flags:0x4208060 flags:0x00004000
  Workqueue: events amdgpu_tlb_fence_work [amdgpu]
  Call Trace:
   <TASK>
   __schedule+0x30d/0x7a0
   schedule+0x27/0x90
   schedule_timeout+0x104/0x110
   dma_fence_default_wait+0x1f0/0x250
   ? __pfx_dma_fence_default_wait_cb+0x10/0x10
   dma_fence_wait_timeout+0x13a/0x170
   amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
   process_one_work+0x18e/0x3e0
   worker_thread+0x2e3/0x420
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x10a/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x121/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  Supervising 9 threads of 6 processes of 1 users.
  Supervising 9 threads of 6 processes of 1 users.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
  kauditd_printk_skb: 4 callbacks suppressed

  SRU Justification:

  [ Impact ]

  On Lenovo ThinkPad P14s Gen 6 systems with AMD Ryzen AI 7 PRO 350, the system can lock up under graphical workloads. Kernel logs show the AMDGPU MES ring buffer becoming full and hung task reports involving amdgpu_tlb_fence_work.

  The issue is caused by commit f3854e04b708 ("drm/amdgpu: attach tlb fence to the PTs update"), which attaches TLB fences too broadly and can flood KIQ/MES TLB invalidation work.

  [ Fix ]

  Backport upstream fixes:
  - d967509651601 ("drm/amdgpu: make sure userqs are enabled in userq IOCTLs")
  - 9163fe4d790f ("Revert "drm/amdgpu: don't attach the tlb fence for SI"")
  - e9f58ff991dd ("drm/amdgpu: rework how we handle TLB fences")

  [ Test Plan ]

  1. Boot the affected Lenovo P14s Gen 6 AMD Ryzen AI 7 PRO 350 system.
  2. Exercise the graphical workload that previously reproduced the lockup.
  3. Verify the system remains responsive and dmesg does not show MES ring buffer full or amdgpu_tlb_fence_work hung task messages.
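
  The log check in step 3 could be automated along the following lines. This is only a sketch, not part of the SRU: the function names `scan_log` and `is_clean` are hypothetical, and the patterns are assumed from the journal excerpt quoted in this report, so they may need adjusting if the message wording differs on another kernel.

```python
import re

# Signatures copied from the journal excerpt in this bug report.
# Assumption: the fixed kernel would emit none of these under load.
MES_FULL = re.compile(r"amdgpu: MES ring buffer is full")
HUNG_TASK = re.compile(r"INFO: task \S+ blocked for more than \d+ seconds")
TLB_WORK = re.compile(r"Workqueue: events amdgpu_tlb_fence_work")


def scan_log(lines):
    """Count lockup-related messages in an iterable of kernel log lines."""
    counts = {"mes_ring_full": 0, "hung_tasks": 0, "tlb_fence_work": 0}
    for line in lines:
        if MES_FULL.search(line):
            counts["mes_ring_full"] += 1
        if HUNG_TASK.search(line):
            counts["hung_tasks"] += 1
        if TLB_WORK.search(line):
            counts["tlb_fence_work"] += 1
    return counts


def is_clean(counts):
    """True when none of the lockup signatures were seen."""
    return not any(counts.values())
```

  One way to use it is to save the kernel log with `journalctl -k -b > kernel.log`, then call `scan_log(open("kernel.log"))` and confirm that `is_clean` returns True on the result.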

  A test kernel with this fix series ran for 12 hours on the affected system without reproducing the lockup (LP #2148538 comment #29).

  [ Where problems could occur ]

  Regressions could appear as GPU hangs, failed VM updates, TLB flush failures, or user queue/KFD workload failures on AMD GPUs.

  [ Other Info ]

  Noble linux is not affected (offending commit absent).
  Resolute linux is not affected (fixes already present).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2148538/+subscriptions