Update after additional testing and off-host netconsole capture. The original report was filed while the freezes were silent in the local journal, and the best visible clue was the NVIDIA `__warn_thunk` warning during `nvidia_init_module`. I have since reproduced the hard-freeze symptom on both kernel versions and captured the final kernel messages from two later freezes using netconsole to a second host. New findings: - The system also hard-froze on `7.0.0-22-generic`, so `7.0.0-22` is not a clean workaround. - The NVIDIA warning from the original report remains unexplained: `Unpatched return thunk in use. This should not happen!` - However, the final messages captured immediately before two later freezes are xHCI resume failures on AMD USB controllers, not NVIDIA NVRM/Xid messages. Crash 1: - Failed boot: 2026-06-28 15:12:35 to 2026-06-28 16:37:45 - Kernel: `7.0.0-22-generic` - Local journal stopped at 16:37:45. - Netconsole captured these final lines at 16:37:46: ```text xhci_hcd 0000:12:00.3: Controller not ready at resume -19 xhci_hcd 0000:12:00.3: PCI post-resume error -19! xhci_hcd 0000:12:00.3: HC died; cleaning up Crash 2: - Failed boot: 2026-06-28 16:41:15 to 2026-06-28 18:51:33 - Kernel: 7.0.0-27-generic - Local journal stopped at 18:51:33. - Netconsole captured these final lines at 18:52:54: 2026-06-28T18:52:54-0400 xhci_hcd 0000:12:00.4: Controller not ready at resume -19 2026-06-28T18:52:54-0400 xhci_hcd 0000:12:00.4: PCI post-resume error -19! 2026-06-28T18:52:54-0400 xhci_hcd 0000:12:00.4: HC died; cleaning up Relevant PCI devices: 12:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6] 12:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7] 0e:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01) Current interpretation: - The strongest final-event evidence now points to AMD CPU-side xHCI runtime-resume failure, possibly involving runtime PM, firmware/AGESA, or platform PCIe power management. - I am not claiming the NVIDIA __warn_thunk warning is unrelated; it remains unresolved. But netconsole did not capture NVIDIA NVRM/Xid output at the freeze points it caught. - Because the same symptom reproduced on both 7.0.0-22-generic and 7.0.0-27-generic, this may not be specific to 7.0.0-27 despite the original title. Mitigation currently under test: I disabled runtime PM for all PCI xhci_hcd controllers for the current boot by setting each controller's power/control to on. Observed state after applying that mitigation: 0000:0e:00.0 control=on runtime=active 0000:10:00.0 control=on runtime=active 0000:12:00.3 control=on runtime=active 0000:12:00.4 control=on runtime=active 0000:13:00.0 control=on runtime=active If the system remains stable with xHCI runtime PM disabled, that will further support the xHCI runtime-resume hypothesis. If it freezes again, netconsole is still running and should capture whether the final failure remains xHCI-related or moves to another subsystem. Would welcome any other theories or logging or testing recommendations, this seems hard to reproduce but it is leading to frequent issues. ** Summary changed: - 7.0.0-27-generic hard freezes with NVIDIA 595 open module; __warn_thunk warning in nvidia_init_module + Hard freezes on 7.0.0-22 and -27 with X670E/Ryzen; AMD xHCI controller not ready at resume (-19) -- You received this bug notification because you are subscribed to linux in Ubuntu. Matching subscriptions: Bgg, Bmail, Nb https://bugs.launchpad.net/bugs/2158539 Title: Hard freezes on 7.0.0-22 and -27 with X670E/Ryzen; AMD xHCI controller not ready at resume (-19) Status in linux package in Ubuntu: New Bug description: After unattended-upgrade installed linux-image-7.0.0-27-generic on 2026-06-27, this system began hard-freezing with no clean shutdown and no panic in the journal. The prior kernel, 7.0.0-22-generic, is currently being tested as a workaround. Hardware: - ASUS ROG STRIX X670E-A GAMING WIFI, BIOS 3603 03/09/2026 - AMD Ryzen platform - Dual NVIDIA GeForce RTX 4090 - GNOME Wayland session Kernel/driver: - Bad kernel: 7.0.0-27-generic - Previously stable kernel: 7.0.0-22-generic - NVIDIA driver: 595.71.05, nvidia-driver-595-open - Kernel cmdline: pcie_aspm=off quiet splash Timeline: - 2026-06-27 06:14-06:15: unattended-upgrade installed 7.0.0-27 - 2026-06-27 08:06:35: first post-update boot froze; journal stopped abruptly - 2026-06-27 16:13:40: second boot froze; journal stopped abruptly - No systemd shutdown records for the failed boots Relevant warning on affected boot: Unpatched return thunk in use. This should not happen! WARNING: arch/x86/kernel/cpu/bugs.c:3736 at __warn_thunk Call trace includes: warn_thunk_thunk nvidia_init_module+0x29/0x740 [nvidia] Negative evidence: - No OOM, kernel panic, MCE, NVMe I/O error, SMART media error, thermal trip, or watchdog trace found. - NVMe SMART passed: media errors 0, error log entries 0. - Temperatures after reboot were not alarming. A later, separate GNOME Shell crash after reboot produced repeated: NVRM: VM: invalid mmap context ProblemType: Bug DistroRelease: Ubuntu 26.04 Package: linux-image-7.0.0-27-generic 7.0.0-27.27 ProcVersionSignature: Ubuntu 7.0.0-22.22-generic 7.0.0 Uname: Linux 7.0.0-22-generic x86_64 ApportVersion: 2.34.0-0ubuntu2 Architecture: amd64 CRDA: N/A CasperMD5CheckResult: pass CurrentDesktop: ubuntu:GNOME Date: Sat Jun 27 17:12:43 2026 InstallationDate: Installed on 2024-07-15 (713 days ago) InstallationMedia: Ubuntu 24.04 LTS "Noble Numbat" - Release amd64 (20240424) MachineType: ASUS System Product Name ProcEnviron: LANG=en_US.UTF-8 PATH=(custom, no user) SHELL=/usr/bin/zsh TERM=xterm-256color XDG_RUNTIME_DIR=<set> ProcFB: 0 nvidia-drmdrmfb 1 nvidia-drmdrmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-7.0.0-22-generic root=UUID=5e50224a-2fb6-4607-84e3-9d99dee45bcb ro pcie_aspm=off quiet splash PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon. SourcePackage: linux UpgradeStatus: Upgraded to resolute on 2026-04-24 (65 days ago) dmi.bios.date: 03/09/2026 dmi.bios.release: 36.3 dmi.bios.vendor: American Megatrends Inc. dmi.bios.version: 3603 dmi.board.asset.tag: Default string dmi.board.name: ROG STRIX X670E-A GAMING WIFI dmi.board.vendor: ASUSTeK COMPUTER INC. dmi.board.version: Rev 1.xx dmi.chassis.asset.tag: Default string dmi.chassis.type: 3 dmi.chassis.vendor: Default string dmi.chassis.version: Default string dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3603:bd03/09/2026:br36.3:svnASUS:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGSTRIXX670E-AGAMINGWIFI:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:pfaTobefilledbyO.E.M.: dmi.product.family: To be filled by O.E.M. dmi.product.name: System Product Name dmi.product.sku: SKU dmi.product.version: System Version dmi.sys.vendor: ASUS To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2158539/+subscriptions
Комментариев нет:
Отправить комментарий