Thursday

[Bug 2158993] [NEW] System hard-freezes after failed suspend (Xwayland freeze timeout) with nvidia-driver-595-open — deadlock in nvidia_modeset/console

Public bug reported:

# Bug Report: System hard-freezes after failed suspend (Xwayland freeze timeout) with NVIDIA 595-open driver (deadlock between nvidia_modeset and console/fbcon subsystem)

## Summary

On a desktop system with an NVIDIA RTX 4070 Ti (nvidia-driver-595-open, 595.71.05) running Ubuntu 26.04 "resolute" on kernel 7.0.0-27-generic, attempting to suspend (`mem_sleep_default=deep`, S3) intermittently fails as follows:

1. `Freezing user space processes` fails after the default 20s timeout because a userspace task (observed: `Xwayland`, correlated with an active Steam client) refuses to freeze in time.
2. The kernel aborts the suspend and hands the display back to the framebuffer console, logging `fbcon: Taking over console`.
3. During this fallback, the NVIDIA driver's internal memory-management code (`NVRM: GPU0 nvAssertFailedNoLog`, `kern_bus_gv100.c:388`, `mmu_walk*.c`) begins emitting a continuous stream of assertion failures, one burst approximately every 30 seconds, indicating that GPU virtual-address-space mapping has entered a broken state.
4. A kernel worker thread (`kworker/0:1`, workqueue `fbcon_register_existing_fbs`) becomes stuck for 245+ seconds inside `nvkms_ioctl_from_kapi` / `GetDynamicDisplayInfo`, waiting on an rwsem held (apparently) by the NVIDIA driver.
5. `systemd-sleep`, attempting `pm_restore_console` as part of aborting the suspend, blocks indefinitely on `console_lock`, which the kernel explicitly reports as "likely last held by task kworker/0:1:11", i.e. the same stuck worker from step 4.
6. This is a genuine circular-wait deadlock between the console/VT subsystem and the NVIDIA kernel module, not a simple slow device.
7. The system does not always crash immediately after this deadlock. It can continue running for several hours in a visibly degraded state (other services such as `cups.service` and `fwupd-refresh.service` begin entering restart loops, killed repeatedly with SIGKILL and never exiting cleanly) before eventually becoming totally unresponsive and requiring a hard power-button reset.

No further kernel log entries are written between the last responsive log line and the forced reboot, consistent with a full system lockup rather than a clean panic.

This has now been observed and diagnosed across three separate incidents over the space of about 48 hours, all sharing the same signature (failed freeze → NVRM assertion cascade → eventual hard freeze requiring power-cycle).

## System Information

- **Ubuntu release:** 26.04 "resolute" (resolute-updates, resolute-security)
- **Kernel:** 7.0.0-27-generic
- **CPU:** AMD Ryzen 7 5700X3D
- **GPU:** NVIDIA RTX 4070 Ti (PCI ID 10DE:2782, subsystem 1462:5132)
- **NVIDIA driver package:** nvidia-driver-595-open, version 595.71.05-0ubuntu0.26.04.1 (nvidia-driver-590-open is also installed but not active/default)
- **Display stack:** GNOME on Wayland (Xwayland for X11 apps)
- **Kernel command line (GRUB_CMDLINE_LINUX_DEFAULT):**
  ```
  quiet splash nvidia-drm.modeset=1 mem_sleep_default=deep usbcore.autosuspend=-1
  ```
- **`/sys/power/mem_sleep`:** `s2idle [deep]` (deep in use)

## Workarounds already tried

1. **Switched sleep mode from `deep` (S3) to `s2idle`**: did **not** prevent the freeze-timeout/deadlock pattern; the underlying trigger (Xwayland refusing to freeze) is independent of the ACPI sleep mode.
2. **Added `/etc/modprobe.d/nvidia-suspend-fix.conf`:**
   ```
   options nvidia NVreg_PreserveVideoMemoryAllocations=1
   options nvidia NVreg_TemporaryFilePath=/var/tmp
   options nvidia NVreg_UseKernelSuspendNotifiers=0
   ```
   Confirmed active via `/proc/driver/nvidia/params` after reboot. This reduced the frequency of memory-corruption symptoms somewhat but **did not eliminate** the underlying freeze-timeout → NVRM assertion → deadlock sequence; it recurred on a subsequent night with this configuration active.

## Steps to Reproduce (best current understanding)

1. Have an X11/Xwayland-backed application under load at the moment the system is asked to suspend. In the two clearest captures, a Steam client was active and had logged `CSteamEngine::BMainLoop appears to have stalled > 15 seconds` at the same timestamp as the freeze failure.
2. Trigger suspend (automatic idle suspend via `systemd-logind`/GNOME power settings, or manual).
3. The kernel begins `Freezing user space processes`; if Xwayland does not freeze within 20s, the kernel aborts the freeze and logs `Freezing user space processes failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0)`.
4. From this point, NVRM assertion-failure spam begins, and eventually the fbcon/console_lock deadlock described above can occur.
5. The system may continue running in a degraded state for hours before a full hard freeze.

## Expected Behavior

If a task refuses to freeze in time, the kernel should cleanly abort the suspend and return the system to a fully functional state, without leaving the NVIDIA driver or the console subsystem in a corrupted/deadlocked state.

## Actual Behavior

The abort path itself deadlocks: `nvidia_modeset`'s internal locking (exercised via the `fbcon_register_existing_fbs` workqueue triggered by `fbcon: Taking over console`) contends with `systemd-sleep`'s own console-restore path (`pm_restore_console` → `console_lock`), and the two can end up blocking on each other. The system may appear to "recover" superficially but is left in a state that leads to cascading failures (other systemd services entering unkillable SIGKILL-retry loops) and, eventually, a full unrecoverable freeze.
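The check that the modprobe workaround is active (described under "Workarounds already tried") can be scripted. The following is a minimal sketch, not part of the original report: it assumes `/proc/driver/nvidia/params` lists one `Name: value` pair per line with the `NVreg_` prefix dropped and string values possibly quoted. The function names `parse_nvidia_params` and `workaround_mismatches` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path

# Values taken from the modprobe options in this report.
# Assumption: the proc file shows the option names without the NVreg_ prefix.
EXPECTED = {
    "PreserveVideoMemoryAllocations": "1",
    "TemporaryFilePath": "/var/tmp",
    "UseKernelSuspendNotifiers": "0",
}


def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict, stripping optional quotes."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            params[name.strip()] = value.strip().strip('"')
    return params


def workaround_mismatches(params, expected=EXPECTED):
    """Return {name: (expected, actual)} for options that are missing or differ."""
    return {
        name: (want, params.get(name))
        for name, want in expected.items()
        if params.get(name) != want
    }


if __name__ == "__main__":
    proc_file = Path("/proc/driver/nvidia/params")
    if proc_file.exists():
        # An empty dict means all three options are active as configured.
        print(workaround_mismatches(parse_nvidia_params(proc_file.read_text())))
```

An empty result would indicate that the three options are in effect; any entry in the result names an option whose live value differs from the configured one.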
## Log Evidence (key excerpts, kernel 7.0.0-27-generic)

```
kernel: PM: suspend entry (deep)
kernel: Filesystems sync: 0.353 seconds
kernel: Freezing user space processes
kernel: Freezing user space processes failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
kernel: fbcon: Taking over console
kernel: task:Xwayland state:R running task stack:0 pid:5165 tgid:5165 ppid:4743
kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NULL != pIter->pMap @ virt_mem_allocator_gm107.c:2024
kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: progress == entryIndexHi - entryIndexLo + 1 @ mmu_walk_map.c:170
kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:541
kernel: NVRM: GPU0 mmuWalkMap: Failed to map VA Range 0x2f000000 to 0x2f1fffff. Status = 0x00000040
kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (pKernelBus->pReadToFlush != NULL || pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL) @ kern_bus_gv100.c:388
[... this block repeats roughly every 30 seconds for several minutes ...]
kernel: INFO: task kworker/0:1:11 blocked for more than 245 seconds.
kernel: Tainted: G O 7.0.0-27-generic #27-Ubuntu
kernel: task:kworker/0:1 state:D stack:0 pid:11 tgid:11 ppid:2
kernel: Workqueue: events fbcon_register_existing_fbs
kernel: Call Trace:
kernel: rwsem_down_read_slowpath+...
kernel: down_read+0x48/0xd0
kernel: nvkms_ioctl_from_kapi+0xdc/0xf0 [nvidia_modeset]
kernel: GetDynamicDisplayInfo+0x9c/0x190 [nvidia_modeset]
...
kernel: INFO: task systemd-sleep:35518 blocked for more than 245 seconds.
kernel: task:systemd-sleep state:D stack:0 pid:35518 tgid:35518 ppid:1
kernel: Call Trace:
kernel: down+0x5e/0x80
kernel: console_lock+0x2f/0x50
kernel: vt_move_to_console+0x19/0xb0
kernel: pm_restore_console+0x4d/0x70
kernel: enter_state+0x120/0x610
kernel: pm_suspend+0x49/0x90
kernel: INFO: task systemd-sleep:35518 blocked on a semaphore likely last held by task kworker/0:1:11
```

Later the same night, unrelated services begin failing identically (repeated SIGKILL, never exiting):

```
systemd[1]: cups.service: start operation timed out. Terminating.
systemd[1]: cups.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: cups.service: Killing process 35995 (9) with signal SIGKILL.
systemd[1]: cups.service: Processes still around after SIGKILL. Ignoring.
[cycle repeats ~35 times over 4+ hours]
```

No further journal entries follow the last cycle; the machine was unresponsive and required a hard power-button reset.
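To check other journal excerpts for the same signature as the log evidence above, the pattern can be scanned for mechanically. This is a rough sketch under stated assumptions rather than a definitive tool: the regular expressions are derived only from the message formats quoted in this report and may need adjusting for other kernel or driver versions, and `scan_incident` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re

# Patterns derived from the log lines quoted in this report (assumption:
# other kernel/driver versions may word these messages differently).
FREEZE_FAIL = re.compile(r"Freezing user space processes failed after ([\d.]+) seconds")
NVRM_ASSERT = re.compile(r"NVRM: GPU\d+ nvAssertFailedNoLog: Assertion failed: .* @ (\S+):(\d+)")
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
LOCK_HOLDER = re.compile(
    r"INFO: task (\S+):(\d+) blocked on a semaphore likely last held by task (\S+):(\d+)"
)


def scan_incident(lines):
    """Summarise the failed-freeze signature found in an iterable of log lines."""
    summary = {"freeze_timeout_s": None, "nvrm_asserts": {}, "hung_tasks": [], "lock_edges": []}
    for line in lines:
        if m := FREEZE_FAIL.search(line):
            summary["freeze_timeout_s"] = float(m.group(1))
        elif m := NVRM_ASSERT.search(line):
            site = f"{m.group(1)}:{m.group(2)}"  # source file and line of the assertion
            summary["nvrm_asserts"][site] = summary["nvrm_asserts"].get(site, 0) + 1
        elif m := HUNG_TASK.search(line):
            summary["hung_tasks"].append(m.group(1))
        elif m := LOCK_HOLDER.search(line):
            # (waiting task, task reported as the likely last lock holder)
            summary["lock_edges"].append((m.group(1), m.group(3)))
    return summary


# A few lines copied from the excerpt in this report.
SAMPLE = [
    "kernel: Freezing user space processes failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):",
    "kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:541",
    "kernel: INFO: task kworker/0:1:11 blocked for more than 245 seconds.",
    "kernel: INFO: task systemd-sleep:35518 blocked for more than 245 seconds.",
    "kernel: INFO: task systemd-sleep:35518 blocked on a semaphore likely last held by task kworker/0:1:11",
]

if __name__ == "__main__":
    print(scan_incident(SAMPLE))
```

Fed with the text of a saved `journalctl` excerpt (one line per list element), the summary shows whether a freeze timeout, NVRM assertion sites, hung tasks, and a reported lock holder all appear in the same capture.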
## Note on driver versions checked

At the time of filing, NVIDIA's production branch had advanced to 595.84 (released 2026-06-17), one release ahead of the 595.71.05 installed here. I checked whether 595.84 was available as a packaged driver for Ubuntu 26.04 "resolute" before filing, to rule out that this was already fixed:

- `resolute-updates` / `resolute-security` (official Ubuntu archive): only 595.71.05
- `ppa:graphics-drivers/ppa`: only 595.71.05 for `resolute`
- NVIDIA's own CUDA apt repository (`developer.download.nvidia.com/compute/cuda/repos/ubuntu2604`): does not carry a `nvidia-driver-595-open` package matching 595.84 either (only unrelated tooling from the newer 610.x feature branch, e.g. `nvidia-settings`/`libxnvctrl0` 610.43.02)

595.84 is therefore only available as NVIDIA's `.run` installer for this Ubuntu release at present, which was intentionally not used here to avoid DKMS/Secure Boot conflicts with the distro-packaged driver. This report is filed against 595.71.05; it is not yet known whether 595.84 resolves the issue.

The 595.84 changelog lists "Fixed a bug that could cause suspend and resume to fail on systems with runtime D3 (RTD3) power management enabled." I checked whether this applies here:

```
$ cat /proc/driver/nvidia/gpus/*/power
Runtime D3 status: Disabled by default
...
```

RTD3 is disabled by default on this system (single desktop GPU with a directly-attached display, no hybrid/Optimus setup), so this specific changelog entry likely does not describe the same bug. The deadlock documented above appears unrelated to RTD3 and is filed as a distinct issue.

## Possibly related upstream reports

- Ubuntu Launchpad bug **#2149963** (package `linux`): RTX 50-series + nvidia-open 595/580 on kernel 7.0.0-14-generic, Ubuntu 26.04 "resolute": s2idle resume never completes after lid-close, requires hard reset. Same kernel/driver generation, same distro release; different GPU generation and different sleep mode (s2idle there vs deep here), but the same overall "suspend/resume path never returns, only a hard reset recovers" symptom.
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2149963

## Attachments to include when filing

- Output of `ubuntu-bug linux` (for current package/version/apport metadata). Note: the current `dmesg` will NOT contain the incident logs, since a hard reset clears the kernel ring buffer.
- Full `journalctl` excerpts spanning each incident (attach as separate `.txt` files), specifically the windows from each `PM: suspend entry` through the last log line before the gap indicating the hard reset.
- Output of `cat /proc/driver/nvidia/params | grep -iE 'Preserve|Kernel|Temp'` showing the modprobe workaround is active.
- `nvidia-bug-report.sh` output (run `sudo nvidia-bug-report.sh`, attach the resulting `nvidia-bug-report.log.gz`) if it can be captured after a fresh incident before rebooting away the state.

ProblemType: Bug
DistroRelease: Ubuntu 26.04
Package: linux-image-7.0.0-27-generic 7.0.0-27.27
ProcVersionSignature: Ubuntu 7.0.0-27.27-generic 7.0.6
Uname: Linux 7.0.0-27-generic x86_64
ApportVersion: 2.34.0-0ubuntu2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC2: beniu 4376 F.... wireplumber
 /dev/snd/controlC1: beniu 4376 F.... wireplumber
 /dev/snd/controlC0: beniu 4376 F.... wireplumber
 /dev/snd/seq: beniu 4356 F.... pipewire
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Thu Jul 2 19:32:21 2026
InstallationDate: Installed on 2026-02-21 (131 days ago)
InstallationMedia: Ubuntu 24.04.4 LTS "Noble Numbat" - Release amd64 (20260210)
MachineType: System manufacturer System Product Name
ProcEnviron:
 LANG=pl_PL.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcFB: 0 nvidia-drmdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-7.0.0-27-generic root=UUID=625d6275-b0c0-4dac-b793-5b840ac8fbf9 ro quiet splash nvidia-drm.modeset=1 mem_sleep_default=deep usbcore.autosuspend=-1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
RfKill:
 2: hci0: Bluetooth
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/03/2025
dmi.bios.release: 5.17
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 5901
dmi.board.asset.tag: Default string
dmi.board.name: ROG STRIX B450-F GAMING
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr5901:bd10/03/2025:br5.17:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGSTRIXB450-FGAMING:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:pfaTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug resolute wayland-session

** Patch added: "journalctl log from freeze boot"
   https://bugs.launchpad.net/bugs/2158993/+attachment/5979947/+files/log_hardreset2.txt

-- 
You received this bug notification because you are subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2158993
