Thursday

[Bug 2155222] Re: [hyperv] Ensure MMIO Mapping is Correct for Kexec / kdump kernel on Azure v6 Instance Types

** Tags added: kernel-daily-bug

-- 
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2155222

Title:
  [hyperv] Ensure MMIO Mapping is Correct for Kexec / kdump kernel on Azure v6 Instance Types

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Jammy:
  In Progress
Status in linux source package in Noble:
  In Progress
Status in linux source package in Questing:
  In Progress
Status in linux source package in Resolute:
  In Progress

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2155222

  [Impact]

  Jammy VMs running on "Gen2" v6 instance types on Azure fail to collect a kdump with both the 5.15 and 6.8 HWE kernel, yet kdump succeeds for 6.8 onward on noble onward.

  Even stranger, it succeeds on jammy with secureboot enabled, and fails with secureboot disabled.

  The difference between jammy and noble onward can be explained with userspace tools, as kdump-tools uses -c (--kexec-syscall) by default, and changes to -s (--kexec-file-syscall) when secureboot is enabled.

  Noble onward works due to using -a (--kexec-syscall-auto) by default, which defaults to -s. Noble will fail when using -c instead.

  From man kexec:

  -s (--kexec-file-syscall)
        Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.

  -c (--kexec-syscall)
        Specify that the old KEXEC_LOAD syscall should be used exclusively (the default).

  -a (--kexec-syscall-auto)
        Try the new KEXEC_FILE_LOAD syscall first and when it is not supported or the kernel does not understand the supplied image fall back to the old KEXEC_LOAD interface.

        There is no one single interface that always works.

        KEXEC_FILE_LOAD is required on systems that use locked-down secure boot to verify the kernel signature. KEXEC_LOAD may be also disabled in the kernel configuration.
        KEXEC_LOAD is required for some kernel image formats and on architectures that do not implement KEXEC_FILE_LOAD.

  Regardless, the issue is actually a hyperv subsystem issue in the kernel.

  When the kexec / kdump kernel boots, vmbus_reserve_fb() fails to reserve the framebuffer MMIO range because a Gen2 VM's screen.lfb_base is zero. This causes an MMIO conflict between hyperv-drm and pci-hyperv: when pci-hyperv's hv_allocate_config_window() calls vmbus_allocate_mmio() to get an MMIO range, it usually gets a 32-bit MMIO range that overlaps with the framebuffer MMIO range. Later, hv_pci_enter_d0() fails with the error message "PCI Pass-through VSP failed D0 Entry with status", since the host thinks that PCI devices must not use MMIO space that the host has assigned to the framebuffer.

  This is especially an issue if pci-hyperv is built-in and hyperv-drm is built as a module. Consequently, the kdump/kexec kernel fails to detect PCI devices via pci-hyperv, and may fail to mount the root file system, which may reside on an NVMe disk.

  The end result is that capturing a kdump fails when -c (--kexec-syscall) is used, which is the default on jammy.

  [Fix]

  This is currently queued up in the hyperv maintainer tree, in the hyperv-fixes branch:

  commit 016a25e4b0df4d77e7c258edee4aaf982e4ee809 hyperv
  From: Dexuan Cui <decui@microsoft.com>
  Date: Thu, 7 May 2026 14:28:38 -0700
  Subject: Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-fixes&id=016a25e4b0df4d77e7c258edee4aaf982e4ee809

  This is expected to make the 7.2 merge window.

  This fix is required for hyperv users, and is mostly relevant for -azure users only, but I am still requesting this for -generic to ensure that anyone using -generic on Azure can still kexec, and to make it easier to bisect -generic on Azure in the future.

  [Testcase]

  This needs to be tested on Azure on both v5 and v6 instance types.
  The issue occurs with v6 instance types, but we need to ensure we do not cause a regression with v5 instance types.

  For each series you are testing, create a VM with the following instance types:

  - Standard_D4ads_v5
  - Standard_D4ads_v6

  For the image type, you need to select "Gen2" images:

  - "Ubuntu Server 22.04 LTS - x64 Gen2"
  - "Ubuntu Server 24.04 LTS - x64 Gen2"
  - "Ubuntu Server 25.10 - x64 Gen 2"
  - "Ubuntu Server 26.04 LTS - x64 Gen 2"

  If you are going to test with -c (--kexec-syscall), secureboot needs to be disabled, and you can do this with:

  - Under Security type, select "Configure security features"
  - uncheck "Enable Secure Boot". Save.

  Create the VM.

  Log in, and install kdump-tools:

  $ sudo apt update
  $ sudo apt install kdump-tools

  Say yes to each prompt.

  $ sudo vim /etc/default/grub.d/kdump-tools.cfg

  Change crashkernel=512M-:192M from 192M to 1G, save, exit.

  $ sudo vim /etc/kernel/postinst.d/kdump-tools

  Change dep to most, save, exit.

  $ sudo update-grub
  $ sudo reboot

  Verify that the cmdline has crashkernel set to 1G memory:

  $ cat /proc/cmdline
  $ kdump-config show

  On the Azure Web Interface, select "Serial Console" for the VM, and watch the serial console.

  $ sudo sysctl -w kernel.sysrq=1
  $ sudo su
  $ echo c > /proc/sysrq-trigger

  Watch the kernel panic and reboot into the crash kernel.

  On failure:

  The kexec kernel gets stuck, and writes these messages to dmesg.
  [ 1.157729] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: PCI VMBus probing: Using version 0x10004
  [ 1.167427] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: Retrying D0 Entry
  [ 1.173231] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: PCI Pass-through VSP failed D0 Entry with status c000000d
  [ 1.181091] hv_vmbus: probe failed for device 7ad35d50-c05b-47ab-b3a0-56a9a845852b (-71)
  [ 1.186890] hv_pci: probe of 7ad35d50-c05b-47ab-b3a0-56a9a845852b failed with error -71
  [ 1.194422] hv_pci 00000001-7870-47b5-b203-907d12ca697e: PCI VMBus probing: Using version 0x10004
  [ 1.202172] hv_pci 00000001-7870-47b5-b203-907d12ca697e: Retrying D0 Entry
  [ 1.207877] hv_pci 00000001-7870-47b5-b203-907d12ca697e: PCI Pass-through VSP failed D0 Entry with status c000000d

  The kexec kernel gives up, and reboots. No kdump is generated. /var/crash will be empty.

  On success:

  The kdump is collected, and saved to /var/crash, and will be present on next boot.

  There are test kernels available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf425760-test

  If you install the test kernel and reboot, kdump will work correctly on v6 instance types.

  [Where problems could occur]

  This changes how vmbus_reserve_fb() reserves MMIO space for the framebuffer, and if a regression were to occur, it could prevent the pci-hyperv and hyperv-drm drivers from claiming the correct MMIO ranges. This could show as instances failing to start, or failing to kexec / collect a kdump with the crashkernel.

  This fix works on both amd64 and arm64 instance types, as well as with 32-bit and 64-bit PCI buses.
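  When running the testcase across several series and instance types, it may help to check a saved serial console log for the failure signature rather than reading it by eye. The following is a minimal sketch, not part of the fix or of kdump-tools: the function name find_d0_failures and the SAMPLE excerpt are illustrative, and the pattern is taken only from the dmesg lines quoted in the testcase.

```python
import re

# Failure signature copied from the dmesg excerpt in the testcase.
# The device is a 36-character VMBus GUID; the status is printed as hex.
D0_FAILURE = re.compile(
    r"hv_pci (?P<dev>[0-9a-f-]{36}): PCI Pass-through VSP failed D0 Entry "
    r"with status (?P<status>[0-9a-f]+)"
)


def find_d0_failures(log_text):
    """Return (device GUID, status) pairs for each failed D0 entry in the log."""
    return [(m.group("dev"), m.group("status"))
            for m in D0_FAILURE.finditer(log_text)]


# Hypothetical saved console output, trimmed from the log in the testcase.
SAMPLE = """\
[ 1.167427] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: Retrying D0 Entry
[ 1.173231] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: PCI Pass-through VSP failed D0 Entry with status c000000d
[ 1.186890] hv_pci: probe of 7ad35d50-c05b-47ab-b3a0-56a9a845852b failed with error -71
"""

if __name__ == "__main__":
    for dev, status in find_d0_failures(SAMPLE):
        print(f"{dev}: D0 entry failed, status 0x{status}")
```

  An empty result only means this particular message was not logged; whether the kdump succeeded is still decided by the presence of a crash dump in /var/crash on the next boot.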
  [Other info]

  Upstream mailing list threads:

  Abandoned Patch:
  V1: https://lore.kernel.org/linux-hyperv/20260122020337.94967-1-decui@microsoft.com/
  V2: https://lore.kernel.org/linux-hyperv/20260402234313.2490779-1-decui@microsoft.com/

  Current Patch:
  V1: https://lore.kernel.org/linux-hyperv/20260416183529.838321-1-decui@microsoft.com/
  V2: https://lore.kernel.org/linux-hyperv/20260505004846.193441-1-decui@microsoft.com/
  V3: https://lore.kernel.org/linux-hyperv/20260507212838.448891-1-decui@microsoft.com/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2155222/+subscriptions
