Thursday

[Bug 2068738] Re: AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Hey,
As of now, the update to 5.15.0-112 is still being offered in the update manager, which means that with automatic updates a computer might fail from one boot to the next. Would it be possible to pull it from there?
Best, and thanks for all the work,
Anton

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2068738

Title:
AMD GPUs fail with null pointer dereference when IOMMU enabled,
leading to black screen

Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Jammy:
In Progress

Bug description:
BugLink: https://bugs.launchpad.net/bugs/2068738

[Impact]

On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
enabled, the system fails to boot correctly, and all users see is a
black screen.

This is caused by a null pointer dereference that occurs when the IOMMU
is enabled after the device has been initialised. The order should be
the other way around: IOMMU first, then device initialisation.

AMD-Vi: AMD IOMMUv2 loaded and initialized
...
amdgpu: Topology: Add APU node [0x15d8:0x1002]
kfd kfd: amdgpu: added device 1002:15d8
kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
...
amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
...
BUG: kernel NULL pointer dereference, address: 000000000000013c
...
CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
...
RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
...
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x10
? show_trace_log_lvl+0x28e/0x2ea
? show_trace_log_lvl+0x28e/0x2ea
? dm_hw_fini+0x23/0x30 [amdgpu]
? show_regs.part.0+0x23/0x29
? __die_body.cold+0x8/0xd
? __die+0x2b/0x37
? page_fault_oops+0x13b/0x170
? srso_return_thunk+0x5/0x10
? do_user_addr_fault+0x321/0x670
? srso_return_thunk+0x5/0x10
? __free_pages_ok+0x34a/0x4f0
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x27/0x30
? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
dm_hw_fini+0x23/0x30 [amdgpu]
amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
local_pci_probe+0x4b/0x90
? srso_return_thunk+0x5/0x10
pci_device_probe+0x119/0x200
really_probe+0x222/0x420
__driver_probe_device+0xe8/0x140
driver_probe_device+0x23/0xc0
__driver_attach+0xf7/0x1f0
? __device_attach_driver+0x140/0x140
bus_for_each_dev+0x7f/0xd0
driver_attach+0x1e/0x30
bus_add_driver+0x148/0x220
? srso_return_thunk+0x5/0x10
driver_register+0x95/0x100
__pci_register_driver+0x68/0x70
amdgpu_init+0x7c/0x1000 [amdgpu]
? 0xffffffffc0e0b000
do_one_initcall+0x49/0x1e0
? srso_return_thunk+0x5/0x10
? kmem_cache_alloc_trace+0x19e/0x2e0
do_init_module+0x52/0x260
load_module+0xb45/0xbe0
__do_sys_finit_module+0xbf/0x120
__x64_sys_finit_module+0x18/0x20
x64_sys_call+0x1ac3/0x1fa0
do_syscall_64+0x56/0xb0
...
entry_SYSCALL_64_after_hwframe+0x67/0xd1
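
For anyone triaging a captured boot log, the excerpt above can be reduced to three signature lines. The following is a minimal sketch, not an official diagnostic: the function names are hypothetical, and treating these three lines together as sufficient evidence of this bug is an assumption.

```python
import re

# Patterns copied from the log excerpt in this bug description.
# Assumption: all three appearing in one log indicates this bug.
SIGNATURES = {
    "iommu_resume_failed": re.compile(
        r"kfd kfd: amdgpu: Failed to resume IOMMU for device"
    ),
    "gpu_init_failed": re.compile(r"amdgpu: Fatal error during GPU init"),
    "null_deref_in_dm_fini": re.compile(r"RIP: \S+:amdgpu_dm_fini\+"),
}


def match_signatures(log_text):
    """Return the names of the signature patterns found in log_text."""
    found = set()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return found


def looks_like_lp2068738(log_text):
    """True only if every signature line is present in the log."""
    return match_signatures(log_text) == set(SIGNATURES)
```

The log text could come from a saved journal or dmesg capture of the failed boot; a partial match (for example only the "Fatal error" line) is deliberately not treated as a hit.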

A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub,
and reboot.
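
As an illustration of the workaround, here is a minimal sketch of the text edit involved. The helper name is hypothetical, and it assumes GRUB_CMDLINE_LINUX_DEFAULT is a single double-quoted line. It only transforms the file contents as a string; writing the result back, running update-grub and rebooting are still manual steps.

```python
import re


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Assumes the variable is set on one line with a double-quoted value.
    The parameter is not added again if it is already present.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = pattern.subn(repl, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text
```

For example, a line reading GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" would become GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off" when called with "amd_iommu=off".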

[Fix]

The regression was caused by the following commit that landed in
5.15.0-112-generic, and 5.15.150 upstream:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

The fix is to revert this patch, as it was not supposed to be
backported to 5.15 stable.

The mailing list discussion with AMD developers is:

https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/

The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so we
are sending it as an Ubuntu SAUCE patch. If the upstream status
changes, we can NAK and resend.

[Testcase]

You need a system with an AMD Picasso/Raven 2 device. It will likely
be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
2 device is affected.

Install the kernel and boot. Make sure full modesetting is enabled.
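
One way to confirm that a test boot really exercises the affected path is to check that neither workaround parameter is still on the kernel command line. This is a minimal sketch with a hypothetical helper name; it assumes the string passed in is the content of /proc/cmdline on the booted system.

```python
# Parameters named as workarounds in this bug description.
WORKAROUND_PARAMS = ("nomodeset", "amd_iommu=off")


def active_workarounds(cmdline):
    """Return the workaround parameters present in a kernel command line."""
    tokens = cmdline.split()
    return [param for param in WORKAROUND_PARAMS if param in tokens]
```

An empty result means modesetting and the IOMMU were not disabled via these parameters, so a successful boot is a meaningful test result.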

There is a test kernel available in the ppa below:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

If you install the test kernel, your system should boot successfully.

[Where problems could occur]

We are reverting a problematic patch and going back to how it was
before 5.15.0-112-generic. This should not cause any issues for users.

If a regression were to occur, users can add "nomodeset" or
"amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and
reboot, or pin their kernel to a working one.

The impact of a regression would be high, as users' displays could be
blank.

[Other Info]

User reports:
https://forums.linuxmint.com/viewtopic.php?t=421484
https://forums.linuxmint.com/viewtopic.php?t=421441
https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
https://bugs.launchpad.net/bugs/2068812

As bizarre as it is, this commit was actually originally included in
5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

It seems to have caused issues back then too, and was removed in the
following fixups, in 5.16-rc1:

commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
Author: James Zhu <James.Zhu@amd.com>
Date: Tue Nov 2 21:33:50 2021 -0400
Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c

commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
Author: shaoyunl <shaoyun.liu@amd.com>
Date: Fri Nov 5 12:34:14 2021 -0400
Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d

I'm not exactly in favor of rewriting history twice, so I think we
should just revert the upstream stable patch and move on.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
