Follow-up, 2026-07-04: mitigation held for 5 days; an accidental revert reproduced the freeze

After my last comment I disabled PCI runtime power management on the xHCI controllers by setting each xhci_hcd controller's power/control to "on" (i.e. runtime PM off: the controllers never runtime-suspend, so there is no suspend -> resume cycle to fail). With that mitigation in effect the system ran stable for ~5 days under its normal workload, the longest clean stretch since this bug was filed.

I accidentally reverted the mitigation today. I was moving some files, which left the systemd unit that applies the fix at boot without its target; on the next boot the unit failed and the controllers fell back to power/control=auto (runtime PM re-enabled). Shortly after that revert, a USB hotplug (attaching an Android device for adb), i.e. a controller resume/enumerate event on a CPU-side xHCI controller [1022:15b6]/[1022:15b7], triggered the hard freeze again.

This does not mean the right global fix is to abandon power management, but it does localize the fault to xHCI runtime suspend -> resume on these specific AMD Raphael/Granite Ridge CPU-root-complex controllers, which fail to come back after "Controller not ready at resume -19". The right fix is probably a PCI/xHCI quirk for these device IDs (disabling runtime suspend or D3cold for the affected controllers), similar to the existing XHCI_* quirks for other vendors, though I am not sure.

Happy to test whatever would help narrow it down.

--
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2158539

Title:
Hard freezes on 7.0.0-22 and -27 with X670E/Ryzen; AMD xHCI controller not ready at resume (-19)

Status in linux package in Ubuntu:
New

Bug description:
After unattended-upgrade installed linux-image-7.0.0-27-generic on 2026-06-27, this system began hard-freezing with no clean shutdown and no panic in the journal.
The prior kernel, 7.0.0-22-generic, is currently being tested as a workaround.

Hardware:
- ASUS ROG STRIX X670E-A GAMING WIFI, BIOS 3603 03/09/2026
- AMD Ryzen platform
- Dual NVIDIA GeForce RTX 4090
- GNOME Wayland session

Kernel/driver:
- Bad kernel: 7.0.0-27-generic
- Previously stable kernel: 7.0.0-22-generic
- NVIDIA driver: 595.71.05, nvidia-driver-595-open
- Kernel cmdline: pcie_aspm=off quiet splash

Timeline:
- 2026-06-27 06:14-06:15: unattended-upgrade installed 7.0.0-27
- 2026-06-27 08:06:35: first post-update boot froze; journal stopped abruptly
- 2026-06-27 16:13:40: second boot froze; journal stopped abruptly
- No systemd shutdown records for the failed boots

Relevant warning on affected boot:
  Unpatched return thunk in use. This should not happen!
  WARNING: arch/x86/kernel/cpu/bugs.c:3736 at __warn_thunk
Call trace includes:
  warn_thunk_thunk
  nvidia_init_module+0x29/0x740 [nvidia]

Negative evidence:
- No OOM, kernel panic, MCE, NVMe I/O error, SMART media error, thermal trip, or watchdog trace found.
- NVMe SMART passed: media errors 0, error log entries 0.
- Temperatures after reboot were not alarming.

A later, separate GNOME Shell crash after reboot produced repeated:
  NVRM: VM: invalid mmap context

ProblemType: Bug
DistroRelease: Ubuntu 26.04
Package: linux-image-7.0.0-27-generic 7.0.0-27.27
ProcVersionSignature: Ubuntu 7.0.0-22.22-generic 7.0.0
Uname: Linux 7.0.0-22-generic x86_64
ApportVersion: 2.34.0-0ubuntu2
Architecture: amd64
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Sat Jun 27 17:12:43 2026
InstallationDate: Installed on 2024-07-15 (713 days ago)
InstallationMedia: Ubuntu 24.04 LTS "Noble Numbat" - Release amd64 (20240424)
MachineType: ASUS System Product Name
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/usr/bin/zsh
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcFB:
 0 nvidia-drmdrmfb
 1 nvidia-drmdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-7.0.0-22-generic root=UUID=5e50224a-2fb6-4607-84e3-9d99dee45bcb ro pcie_aspm=off quiet splash
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
SourcePackage: linux
UpgradeStatus: Upgraded to resolute on 2026-04-24 (65 days ago)
dmi.bios.date: 03/09/2026
dmi.bios.release: 36.3
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 3603
dmi.board.asset.tag: Default string
dmi.board.name: ROG STRIX X670E-A GAMING WIFI
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3603:bd03/09/2026:br36.3:svnASUS:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGSTRIXX670E-AGAMINGWIFI:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:pfaTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: ASUS

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2158539/+subscriptions
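
For anyone who wants to try the mitigation from the follow-up comment (power/control set to "on" for every xhci_hcd controller), here is a minimal Python sketch. It is not the unit or script used on the affected machine. It assumes the controllers are bound to the PCI driver directory /sys/bus/pci/drivers/xhci_hcd, and the function names (xhci_control_files, runtime_pm_states, disable_runtime_pm) are made up for illustration. Writing to sysfs needs root, and the setting does not persist across reboots.

```python
import re
from pathlib import Path

# Assumption: devices bound to the xHCI PCI driver appear under this directory.
XHCI_DRIVER_DIR = Path("/sys/bus/pci/drivers/xhci_hcd")
# Entries named like 0000:0e:00.3 are devices; bind/unbind/etc. are skipped.
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def xhci_control_files(driver_dir=XHCI_DRIVER_DIR):
    """Return the power/control file of every device bound to xhci_hcd."""
    files = []
    for entry in sorted(Path(driver_dir).iterdir()):
        control = entry / "power" / "control"
        if PCI_ADDR.match(entry.name) and control.is_file():
            files.append(control)
    return files


def runtime_pm_states(driver_dir=XHCI_DRIVER_DIR):
    """Map PCI address -> current power/control value ("on" or "auto")."""
    return {f.parent.parent.name: f.read_text().strip()
            for f in xhci_control_files(driver_dir)}


def disable_runtime_pm(driver_dir=XHCI_DRIVER_DIR):
    """Write "on" to each controller's power/control.

    Returns the PCI addresses whose value was actually changed.
    """
    changed = []
    for f in xhci_control_files(driver_dir):
        if f.read_text().strip() != "on":
            f.write_text("on\n")
            changed.append(f.parent.parent.name)
    return changed
```

Calling runtime_pm_states() after a boot shows whether any controller has fallen back to "auto", which is the state in which the freeze reproduced. Calling disable_runtime_pm() as root reapplies the mitigation. The driver_dir parameter exists only so the functions can be pointed at a test directory.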