Re: [Bug 2068738] Re: AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Hi Matthew, I have an Acer Aspire 5 running Linux Mint 21.2 Victoria
(base: Ubuntu 22.04 jammy) with an AMD Ryzen 7 3700U with Radeon Vega.
I have been getting the black screen on boot since installing updates
about a week ago. Will there be an update that comes through the normal
Update Manager? I don't have programming skills like many Linux users,
so that would be much appreciated. Thank you!

On Thu, Jun 13, 2024 at 8:25 PM Matthew Ruffell <2068738@bugs.launchpad.net>
wrote:

> Hi everyone,
>
> An update:
>
> Greg KH has picked up the patch and added it to upstream stable now:
>
> https://lore.kernel.org/amd-gfx/2024061223-suitable-handler-b6f2@gregkh/
> https://lore.kernel.org/amd-gfx/2024061239-rehydrate-flyable-343e@gregkh/
>
> I suppose we can drop the UBUNTU: SAUCE tags.
>
> I talked to Stefan Bader on the Kernel Team. His current feeling is that
> they might respin the -generic kernels before the release of the current
> cycle (2024.06.10 as per https://kernel.ubuntu.com/) but they are still
> unsure. They might see what else comes up this cycle before they decide.
>
> I'll follow up with the Kernel Team in a couple days.
>
> Thanks,
> Matthew

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2068738

Title:
AMD GPUs fail with null pointer dereference when IOMMU enabled,
leading to black screen

Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Jammy:
In Progress

Bug description:
BugLink: https://bugs.launchpad.net/bugs/2068738

[Impact]

On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
enabled, the system fails to boot correctly, and all users see is a
black screen.

This is caused by a null pointer dereference when enabling the IOMMU
after the device has been initialised. The IOMMU should be enabled
before the device is initialised, not after.

AMD-Vi: AMD IOMMUv2 loaded and initialized
...
amdgpu: Topology: Add APU node [0x15d8:0x1002]
kfd kfd: amdgpu: added device 1002:15d8
kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
...
amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
...
BUG: kernel NULL pointer dereference, address: 000000000000013c
...
CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
...
RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
...
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x10
? show_trace_log_lvl+0x28e/0x2ea
? show_trace_log_lvl+0x28e/0x2ea
? dm_hw_fini+0x23/0x30 [amdgpu]
? show_regs.part.0+0x23/0x29
? __die_body.cold+0x8/0xd
? __die+0x2b/0x37
? page_fault_oops+0x13b/0x170
? srso_return_thunk+0x5/0x10
? do_user_addr_fault+0x321/0x670
? srso_return_thunk+0x5/0x10
? __free_pages_ok+0x34a/0x4f0
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x27/0x30
? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
dm_hw_fini+0x23/0x30 [amdgpu]
amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
local_pci_probe+0x4b/0x90
? srso_return_thunk+0x5/0x10
pci_device_probe+0x119/0x200
really_probe+0x222/0x420
__driver_probe_device+0xe8/0x140
driver_probe_device+0x23/0xc0
__driver_attach+0xf7/0x1f0
? __device_attach_driver+0x140/0x140
bus_for_each_dev+0x7f/0xd0
driver_attach+0x1e/0x30
bus_add_driver+0x148/0x220
? srso_return_thunk+0x5/0x10
driver_register+0x95/0x100
__pci_register_driver+0x68/0x70
amdgpu_init+0x7c/0x1000 [amdgpu]
? 0xffffffffc0e0b000
do_one_initcall+0x49/0x1e0
? srso_return_thunk+0x5/0x10
? kmem_cache_alloc_trace+0x19e/0x2e0
do_init_module+0x52/0x260
load_module+0xb45/0xbe0
__do_sys_finit_module+0xbf/0x120
__x64_sys_finit_module+0x18/0x20
x64_sys_call+0x1ac3/0x1fa0
do_syscall_64+0x56/0xb0
...
entry_SYSCALL_64_after_hwframe+0x67/0xd1
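
For anyone unsure whether their black screen is this bug, here is a minimal
Python sketch that checks a saved kernel log for the three key messages from
the excerpt above. The function names are made up for illustration, and
treating these three substrings as the signature of this bug is an
assumption, not something confirmed by the Kernel Team.

```python
import re

# Signature lines copied from the log excerpt above. Matching on these
# substrings is an assumption about what identifies this particular failure.
SIGNATURES = {
    "iommu_resume_failed": re.compile(
        r"kfd kfd: amdgpu: Failed to resume IOMMU for device"
    ),
    "gpu_init_failed": re.compile(r"amdgpu: Fatal error during GPU init"),
    "null_deref": re.compile(r"BUG: kernel NULL pointer dereference"),
}


def find_signatures(log_text):
    """Return the sorted names of the signature lines present in log_text."""
    found = set()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return sorted(found)


def looks_like_this_bug(log_text):
    """True only if all three signature lines appear somewhere in the log."""
    return set(find_signatures(log_text)) == set(SIGNATURES)
```

One way to get the input is to boot with the workaround or an older kernel
and save the previous boot's kernel messages with "journalctl -k -b -1",
which only works if the journal is kept across reboots.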

A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub,
and reboot.
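
As a rough illustration of the workaround, the Python sketch below appends a
parameter to the GRUB_CMDLINE_LINUX_DEFAULT line of a GRUB defaults file. It
only transforms text and writes nothing to disk. The function name is
hypothetical, and it assumes the variable sits on one line in double quotes,
as in a stock /etc/default/grub on Ubuntu.

```python
import re


def add_kernel_param(grub_config_text, param):
    """Return grub_config_text with param added to GRUB_CMDLINE_LINUX_DEFAULT.

    The text is returned unchanged if the parameter is already present or
    if no double-quoted GRUB_CMDLINE_LINUX_DEFAULT line is found.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def replace(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    # count=1: only the first matching line is rewritten.
    return pattern.sub(replace, grub_config_text, count=1)
```

For example, a line reading GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" would
become GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off". The edited
file still has to be saved as root, followed by "sudo update-grub" and a
reboot.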

[Fix]

The regression was caused by the following commit that landed in
5.15.0-112-generic, and 5.15.150 upstream:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

The fix is to revert this patch, as it was not supposed to be
backported to 5.15 stable.
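
To check whether a machine is running the Ubuntu kernel named above, a small
Python sketch follows. It compares a "uname -r" style release string against
5.15.0-112, the only Ubuntu kernel version this report names. The helper
names are hypothetical, and matching every flavour with that ABI number
(not just -generic) is an assumption.

```python
import platform
import re

# The only affected Ubuntu kernel version named in this report. Other builds
# may also carry the commit; that is not covered here.
REPORTED_VERSION = "5.15.0-112"


def is_reported_kernel(release):
    """True if release (as printed by `uname -r`) is a 5.15.0-112 kernel."""
    match = re.match(r"^(\d+\.\d+\.\d+-\d+)(?:-|$)", release)
    return bool(match) and match.group(1) == REPORTED_VERSION


def running_kernel_is_reported():
    """Apply the same check to the kernel this script is running on."""
    return is_reported_kernel(platform.release())
```

A result of False only means the running kernel is not 5.15.0-112. It says
nothing about whether that kernel contains the revert.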

The mailing list discussion with AMD developers is:

https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/

The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
I am sending it as an Ubuntu SAUCE patch. If the upstream status
changes, we can NAK and resend.

[Testcase]

You need a system with an AMD Picasso/Raven 2 device. It will likely
be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
2 device is affected.

Install the kernel and boot. Make sure full modesetting is enabled.

There is a test kernel available in the ppa below:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

If you install the test kernel, your system should boot successfully.

[Where problems could occur]

We are reverting a problematic patch and going back to how it was
before 5.15.0-112-generic. This should not cause any issues for users.

If a regression were to occur, users can add "nomodeset" or
"amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their
kernel to a working one.

The impact of a regression would be high, as users' displays could be
blank.

[Other Info]

User reports:
https://forums.linuxmint.com/viewtopic.php?t=421484
https://forums.linuxmint.com/viewtopic.php?t=421441
https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
https://bugs.launchpad.net/bugs/2068812

As bizarre as it is, this commit was actually originally included in
5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

It seems to have caused issues back then too, and was removed in the
following fixups, in 5.16-rc1:

commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
Author: James Zhu <James.Zhu@amd.com>
Date: Tue Nov 2 21:33:50 2021 -0400
Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c

commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
Author: shaoyunl <shaoyun.liu@amd.com>
Date: Fri Nov 5 12:34:14 2021 -0400
Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d

I'm not exactly in favor of rewriting history twice, so I think we
should just revert the upstream stable patch and move on.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
