Sunday

[Bug 2144577] Re: BUG: kernel NULL pointer dereference in amdgpu

I tested Ubuntu 25.10 with kernel 6.17.0-24-generic and it does boot on my PC.

** Tags removed: verification-needed-questing-linux
** Tags added: verification-done-questing-linux

--
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2144577

Title:
  BUG: kernel NULL pointer dereference in amdgpu

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Questing:
  In Progress
Status in linux source package in Resolute:
  Fix Released

Bug description:

SRU Justification

[Impact]

System freezes during boot on machines with AMD Southern Islands (SI) GPUs using the amdgpu driver.

The amdgpu driver calls flush_gpu_tlb_pasid() in a workqueue, but on SI hardware this function pointer is NULL. The kernel hits a NULL pointer dereference in amdgpu_gmc_flush_gpu_tlb_pasid() and crashes.

Error log:

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
kernel: RIP: 0010:0x0
kernel: Call Trace:
kernel: amdgpu_gmc_flush_gpu_tlb_pasid+0xfd/0x480 [amdgpu]
kernel: amdgpu_tlb_fence_work+0x77/0x110 [amdgpu]

Hits every boot on affected hardware. Regression from 6.17.0-14 to 6.17.0-19.

[Fix]

Two patches fix this together:

1. f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
   Adds a NULL check for flush_gpu_tlb_pasid before calling it.
   Upstream in v7.0-rc1.

2. e3a6eff92bbd ("drm/amdgpu: Fix validating flush_gpu_tlb_pasid()")
   Fixes the first patch: the early return skipped the unlock, causing a deadlock. Changes the bare return to a goto that unlocks first.
   Upstream in v7.0-rc1.
   Fixes: f4db9913e4d3

[Test Plan]

On a machine with an AMD SI GPU (Tahiti, Pitcairn, Verde, Oland, Hainan) booted with amdgpu.si_support=1:

$ sudo reboot

Without patches: kernel NULL pointer dereference during boot, system freezes.
With patches: system boots normally, no crash or error in dmesg.

Check dmesg after boot:

$ dmesg | grep -i "BUG\|NULL pointer\|amdgpu"

Without patches: "BUG: kernel NULL pointer dereference" present.
With patches: no BUG or NULL pointer lines.

[Where problems could occur]

Could break TLB flushing on amdgpu. If the NULL check gates too broadly, TLB flushes could be skipped on GPUs that do have flush_gpu_tlb_pasid. This would cause stale TLB entries and GPU page faults or rendering corruption.

The unlock path change in the second patch touches the reset/lock logic in amdgpu_gmc_flush_gpu_tlb_pasid(). A wrong goto target could leave the reset domain lock held, deadlocking the GPU.

[Other Info]

Both patches are upstream in v7.0-rc1.

===========================================================

Ubuntu 25.10 with kernel 6.17.0-19-generic doesn't boot on my PC. It freezes on the boot screen, and the kernel logs show a bug:

kernel: Linux version 6.17.0-19-generic (buildd@lcy02-amd64-084) (x86_64-linux-gnu-gcc (Ubuntu 15.2.0-4ubuntu4) 15.2.0, GNU ld (GNU Binutils for Ubuntu) 2.45) #19-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 6 14:02:58 UTC 2026 (Ubuntu 6.17.0-19.19-generic 6.17.13)
kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.17.0-19-generic root=UUID=354e3c09-bfde-4e47-850f-fe872a882ae5 ro quiet splash radeon.si_support=0 amdgpu.si_support=1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M vt.handoff=7
# ...
kernel: [drm] Initialized amdgpu 3.64.0 for 0000:01:00.0 on minor 1
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: #PF: supervisor instruction fetch in kernel mode
kernel: #PF: error_code(0x0010) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: Oops: 0010 [#1] SMP PTI
kernel: CPU: 3 UID: 0 PID: 109 Comm: kworker/3:1 Not tainted 6.17.0-19-generic #19-Ubuntu PREEMPT(voluntary)
kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro3, BIOS P1.10 04/10/2012
kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
kernel: RIP: 0010:0x0
kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
kernel: RSP: 0018:ffffce560061fdb0 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000001
kernel: RDX: 0000000000000002 RSI: 0000000000008000 RDI: ffff8a4a6d180000
kernel: RBP: ffffce560061fe08 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000000 R14: ffff8a4a6d180000 R15: 0000000000000000
kernel: FS: 0000000000000000(0000) GS:ffff8a4da87ff000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffffffffd6 CR3: 00000003de040002 CR4: 00000000001726f0
kernel: Call Trace:
kernel: <TASK>
kernel: amdgpu_gmc_flush_gpu_tlb_pasid+0xfd/0x480 [amdgpu]
kernel: amdgpu_tlb_fence_work+0x77/0x110 [amdgpu]
kernel: process_one_work+0x18e/0x370
kernel: worker_thread+0x317/0x450
kernel: ? _raw_spin_lock_irqsave+0xe/0x20
kernel: ? __pfx_worker_thread+0x10/0x10
kernel: kthread+0x10b/0x220
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x134/0x150
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1a/0x30
kernel: </TASK>
kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_mac xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat binfmt_misc nf_tables amdgpu(+) usblp intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel amdxcp at24 mei_hdcp mei_pxp kvm snd_hda_codec_atihdmi drm_panel_backlight_quirks gpu_sched irqbypass snd_hda_codec_hdmi drm_buddy snd_hda_codec_alc662 rapl btusb snd_hda_codec_realtek_lib intel_cstate snd_hda_codec_generic radeon btrtl snd_hda_intel btintel i2c_i801 btbcm snd_hda_codec btmtk i2c_smbus drm_ttm_helper i2c_mux ttm bluetooth snd_seq_midi snd_hda_core snd_seq_midi_event drm_exec snd_intel_dspcfg snd_rawmidi drm_suballoc_helper snd_intel_sdw_acpi drm_display_helper lpc_ich snd_hwdep snd_seq snd_pcm snd_seq_device cec snd_timer rc_core snd i2c_algo_bit soundcore mei_me mei intel_smartconnect joydev
kernel: input_leds mac_hid sch_fq_codel msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 dm_crypt wacom uas usb_storage hid_generic usbhid hid r8169 polyval_clmulni ghash_clmulni_intel psmouse ahci realtek serio_raw libahci video wmi aesni_intel
kernel: CR2: 0000000000000000
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:0x0
kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
kernel: RSP: 0018:ffffce560061fdb0 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000001
kernel: RDX: 0000000000000002 RSI: 0000000000008000 RDI: ffff8a4a6d180000
kernel: RBP: ffffce560061fe08 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000000 R14: ffff8a4a6d180000 R15: 0000000000000000
kernel: FS: 0000000000000000(0000) GS:ffff8a4da87ff000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffffffffd6 CR3: 00000003de040002 CR4: 00000000001726f0
kernel: note: kworker/3:1[109] exited with irqs disabled
kernel: loop50: detected capacity change from 0 to 8
kernel: fbcon: amdgpudrmfb (fb0) is primary device
kernel: fbcon: Deferring console take-over
kernel: amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
kernel: NET: Registered PF_QIPCRTR protocol family
kernel: sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 >
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: #PF: supervisor instruction fetch in kernel mode
kernel: #PF: error_code(0x0010) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: Oops: 0010 [#2] SMP PTI
kernel: CPU: 1 UID: 0 PID: 91 Comm: kworker/1:1 Tainted: G D 6.17.0-19-generic #19-Ubuntu PREEMPT(voluntary)
kernel: Tainted: [D]=DIE
kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro3, BIOS P1.10 04/10/2012
kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
kernel: RIP: 0010:0x0
kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
kernel: RSP: 0000:ffffce5600477db0 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: 0000000000008001 RCX: 0000000000000001
kernel: RDX: 0000000000000002 RSI: 0000000000008001 RDI: ffff8a4a6d180000
kernel: RBP: ffffce5600477e08 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000000 R14: ffff8a4a6d180000 R15: 0000000000000000
kernel: FS: 0000000000000000(0000) GS:ffff8a4da86ff000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffffffffd6 CR3: 0000000101242006 CR4: 00000000001726f0
kernel: Call Trace:
kernel: <TASK>
kernel: amdgpu_gmc_flush_gpu_tlb_pasid+0xfd/0x480 [amdgpu]
kernel: amdgpu_tlb_fence_work+0x77/0x110 [amdgpu]
kernel: process_one_work+0x18e/0x370
kernel: worker_thread+0x317/0x450
kernel: ? _raw_spin_lock_irqsave+0xe/0x20
kernel: ? __pfx_worker_thread+0x10/0x10
kernel: kthread+0x10b/0x220
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x134/0x150
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1a/0x30
kernel: </TASK>
kernel: Modules linked in: qrtr rfcomm cmac algif_hash algif_skcipher af_alg bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_mac xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat binfmt_misc nf_tables amdgpu usblp intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel amdxcp at24 mei_hdcp mei_pxp kvm snd_hda_codec_atihdmi drm_panel_backlight_quirks gpu_sched irqbypass snd_hda_codec_hdmi drm_buddy snd_hda_codec_alc662 rapl btusb snd_hda_codec_realtek_lib intel_cstate snd_hda_codec_generic radeon btrtl snd_hda_intel btintel i2c_i801 btbcm snd_hda_codec btmtk i2c_smbus drm_ttm_helper i2c_mux ttm bluetooth snd_seq_midi snd_hda_core snd_seq_midi_event drm_exec snd_intel_dspcfg snd_rawmidi drm_suballoc_helper snd_intel_sdw_acpi drm_display_helper lpc_ich snd_hwdep snd_seq snd_pcm snd_seq_device cec snd_timer rc_core snd i2c_algo_bit soundcore mei_me mei intel_smartconnect joydev
kernel: input_leds mac_hid sch_fq_codel msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 dm_crypt wacom uas usb_storage hid_generic usbhid hid r8169 polyval_clmulni ghash_clmulni_intel psmouse ahci realtek serio_raw libahci video wmi aesni_intel
kernel: CR2: 0000000000000000
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:0x0
kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
kernel: RSP: 0018:ffffce560061fdb0 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000001
kernel: RDX: 0000000000000002 RSI: 0000000000008000 RDI: ffff8a4a6d180000
kernel: RBP: ffffce560061fe08 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: 0000000000000000 R14: ffff8a4a6d180000 R15: 0000000000000000
kernel: FS: 0000000000000000(0000) GS:ffff8a4da86ff000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffffffffd6 CR3: 0000000101242006 CR4: 00000000001726f0
kernel: note: kworker/1:1[91] exited with irqs disabled

The previous kernel 6.17.0-14-generic boots without any issues.

I'll try to attach the required information using `apport-collect -p linux BUG#`, but it'll be collected after successfully booting with 6.17.0-14, whereas the bug occurs with 6.17.0-19.

---
ProblemType: Bug
ApportVersion: 2.33.1-0ubuntu3
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: mateusz 3017 F.... wireplumber
 /dev/snd/controlC1: mateusz 3017 F.... wireplumber
 /dev/snd/seq: mateusz 2999 F.... pipewire
CasperMD5CheckResult: unknown
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 25.10
InstallationDate: Installed on 2020-10-14 (1979 days ago)
InstallationMedia: Ubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
MachineType: To Be Filled By O.E.M. To Be Filled By O.E.M.
Package: linux (not installed)
ProcEnviron:
 LANG=pl_PL.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-6.17.0-14-generic root=UUID=354e3c09-bfde-4e47-850f-fe872a882ae5 ro quiet splash radeon.si_support=0 amdgpu.si_support=1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M vt.handoff=7
ProcVersionSignature: Ubuntu 6.17.0-14.14-generic 6.17.9
RelatedPackageVersions:
 firmware-sof N/A
 linux-firmware 20250901.git993ff19b-0ubuntu1.9
RfKill:
 0: hci0: Bluetooth
  Soft blocked: yes
  Hard blocked: no
Tags: questing
Uname: Linux 6.17.0-14-generic x86_64
UpgradeStatus: Upgraded to questing on 2026-01-10 (65 days ago)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 04/10/2012
dmi.bios.release: 4.6
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: P1.10
dmi.board.name: Z77 Pro3
dmi.board.vendor: ASRock
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrP1.10:bd04/10/2012:br4.6:svnToBeFilledByO.E.M.:pnToBeFilledByO.E.M.:pvrToBeFilledByO.E.M.:rvnASRock:rnZ77Pro3:rvr:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:skuToBeFilledByO.E.M.:
dmi.product.family: To Be Filled By O.E.M.
dmi.product.name: To Be Filled By O.E.M.
dmi.product.sku: To Be Filled By O.E.M.
dmi.product.version: To Be Filled By O.E.M.
dmi.sys.vendor: To Be Filled By O.E.M.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2144577/+subscriptions
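
Editorial illustration (not part of the bug report): the [Fix] section describes two steps, a NULL check on an optional callback and then replacing a bare return with a goto so the lock is released. The user-space C sketch below models only that control-flow pattern. Every name in it (fake_dev, fake_gmc_funcs, fake_lock, fake_unlock, the -1 return value) is a hypothetical stand-in and is not taken from the amdgpu source; the report does not say what the real function returns when the callback is missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu types. */
struct fake_dev;

struct fake_gmc_funcs {
    /* Optional callback: left NULL on hardware that lacks it. */
    void (*flush_gpu_tlb_pasid)(struct fake_dev *dev, unsigned int pasid);
};

struct fake_dev {
    const struct fake_gmc_funcs *funcs;
    bool lock_held;       /* models the reset domain lock */
    unsigned int flushes; /* number of flushes actually performed */
};

static void fake_lock(struct fake_dev *dev)
{
    assert(!dev->lock_held); /* taking it twice would be the deadlock */
    dev->lock_held = true;
}

static void fake_unlock(struct fake_dev *dev)
{
    assert(dev->lock_held);
    dev->lock_held = false;
}

/* Example callback for hardware that does implement the flush. */
static void fake_flush(struct fake_dev *dev, unsigned int pasid)
{
    (void)pasid;
    dev->flushes++;
}

/* Shape of the first patch alone: the NULL check avoids the crash,
 * but the bare return leaves the lock held. */
static int flush_pasid_early_return(struct fake_dev *dev, unsigned int pasid)
{
    fake_lock(dev);
    if (!dev->funcs->flush_gpu_tlb_pasid)
        return -1; /* illustrative value; lock is still held here */
    dev->funcs->flush_gpu_tlb_pasid(dev, pasid);
    fake_unlock(dev);
    return 0;
}

/* Shape of both patches together: the NULL check jumps to a single
 * exit path that always releases the lock. */
static int flush_pasid_goto_unlock(struct fake_dev *dev, unsigned int pasid)
{
    int ret = 0;

    fake_lock(dev);
    if (!dev->funcs->flush_gpu_tlb_pasid) {
        ret = -1; /* illustrative value */
        goto unlock;
    }
    dev->funcs->flush_gpu_tlb_pasid(dev, pasid);
unlock:
    fake_unlock(dev);
    return ret;
}
```

With a device whose callback is NULL, flush_pasid_early_return() returns with lock_held still true, while flush_pasid_goto_unlock() returns with it cleared; with fake_flush installed, the second variant performs one flush and releases the lock.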
