[SOLVED] Error No. ... : [Bug 2067862] Re: Removing legacy virtio-pci devices causes kernel panic

Hi Dong,

Thanks for trying the test kernel and letting me know it works, and for
the help with the testcase.

I have submitted the patch to the Ubuntu Kernel Team mailing list:

Cover letter:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151550.html
Patch:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151551.html

The next step is for it to be reviewed by senior members of the Kernel Team.
If it gets accepted, it will likely be in the 2024.07.08 SRU cycle as per
https://kernel.ubuntu.com/.

I will write back once the patch has been reviewed by the kernel team.

Thanks,
Matthew

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2067862

Title:
Removing legacy virtio-pci devices causes kernel panic

Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Noble:
In Progress

Bug description:
BugLink: https://bugs.launchpad.net/bugs/2067862

[Impact]

If you detach a legacy virtio-pci device from a current Noble system,
it triggers a NULL pointer dereference and panics the system. This
is an issue if you force Noble to use legacy virtio-pci devices, or
run Noble on very old hypervisors that only support legacy virtio-pci
devices, e.g. Trusty and older.

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
CPU: 2 PID: 358 Comm: kworker/u8:3 Kdump: loaded Not tainted 6.8.0-31-generic #31-Ubuntu
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:0x0
...
Call Trace:
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
vp_del_vqs+0x6e/0x2a0
remove_vq_common+0x166/0x1a0
virtnet_remove+0x61/0x80
virtio_dev_remove+0x3f/0xc0
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
bus_remove_device+0xcb/0x140
device_del+0x161/0x3e0
? pci_bus_generic_read_dev_vendor_id+0x2c/0x1a0
device_unregister+0x17/0x60
unregister_virtio_device+0x16/0x40
virtio_pci_remove+0x43/0xa0
pci_device_remove+0x36/0xb0
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
pci_stop_bus_device+0x7a/0xb0
pci_stop_and_remove_bus_device+0x12/0x30
disable_slot+0x4f/0xa0
acpiphp_disable_and_eject_slot+0x1c/0xa0
hotplug_event+0x11b/0x280
? __pfx_acpiphp_hotplug_notify+0x10/0x10
acpiphp_hotplug_notify+0x27/0x70
acpi_device_hotplug+0xb6/0x300
acpi_hotplug_work_fn+0x1e/0x40
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? _raw_spin_lock_irqsave+0xe/0x20
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

The issue was introduced in:

commit fd27ef6b44bec26915c5b2b22c13856d9f0ba17a
Author: Feng Liu <feliu@nvidia.com>
Date: Tue Dec 19 11:32:40 2023 +0200
Subject: virtio-pci: Introduce admin virtqueue
Link: https://github.com/torvalds/linux/commit/fd27ef6b44bec26915c5b2b22c13856d9f0ba17a

Modern virtio-pci devices are not affected. For a legacy virtio device,
the is_avq function pointer in its virtio_pci_device structure is never
assigned, so vp_del_vqs() dereferences a NULL pointer when it evaluates
if (vp_dev->is_avq(vdev, vq->index)).

There is no workaround. If you are affected, the only option for the
time being is to avoid detaching devices.

[Fix]

This was fixed in 6.9-rc1 by:

commit c8fae27d141a32a1624d0d0d5419d94252824498
From: Li Zhang <zhanglikernel@gmail.com>
Date: Sat, 16 Mar 2024 13:25:54 +0800
Subject: virtio-pci: Check if is_avq is NULL
Link: https://github.com/torvalds/linux/commit/c8fae27d141a32a1624d0d0d5419d94252824498

This is a clean cherry pick to noble. The commit just adds a basic
NULL pointer check before it dereferences the pointer.
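
As a rough illustration of the pattern, here is a small userspace model
of the guarded call. It is only a sketch: struct model_vp_dev,
model_modern_is_avq() and model_del_vqs() are hypothetical stand-ins
written for this example, not the kernel's structures or the upstream
diff. The model assumes that only the modern path assigns is_avq, as
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct virtio_pci_device. */
struct model_vp_dev {
    /* Assigned for modern devices only; stays NULL for legacy devices. */
    bool (*is_avq)(const struct model_vp_dev *dev, unsigned int index);
    unsigned int admin_vq_index; /* which queue is the admin virtqueue */
    unsigned int nvqs;           /* total number of virtqueues */
};

/* Modern-device callback: true if 'index' is the admin virtqueue. */
static bool model_modern_is_avq(const struct model_vp_dev *dev,
                                unsigned int index)
{
    return index == dev->admin_vq_index;
}

/*
 * Model of the teardown loop: returns how many queues would be deleted.
 * The "dev->is_avq &&" test short-circuits, so the indirect call is
 * never made through a NULL pointer on a legacy device.
 */
static unsigned int model_del_vqs(const struct model_vp_dev *dev)
{
    unsigned int deleted = 0;

    assert(dev != NULL);
    for (unsigned int i = 0; i < dev->nvqs; i++) {
        if (dev->is_avq && dev->is_avq(dev, i))
            continue; /* skip the admin virtqueue */
        deleted++;
    }
    return deleted;
}
```

With is_avq left NULL (the legacy case), model_del_vqs() walks every
queue without calling through the pointer. Dropping the "dev->is_avq &&"
test from the model reproduces the kind of call through a NULL function
pointer that the trace above shows as RIP: 0010:0x0.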

[Testcase]

Start a fresh Noble VM.

Edit the grub kernel command line:

1) sudo vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="virtio_pci.force_legacy=1"
2) sudo update-grub
3) sudo reboot
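
After the reboot, it may be worth confirming that the running kernel
actually picked up the parameter before moving on. The following is a
minimal sketch in C, assuming the standard /proc/cmdline file is
readable inside the VM. The helper names cmdline_has_token() and
running_kernel_forces_legacy() are made up for this example, and the
match is an exact, space-delimited token match only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if 'cmdline' contains 'token' as a whole space-delimited word. */
bool cmdline_has_token(const char *cmdline, const char *token)
{
    size_t len;
    const char *p;

    assert(cmdline != NULL && token != NULL);
    len = strlen(token);
    if (len == 0)
        return false;

    p = cmdline;
    while ((p = strstr(p, token)) != NULL) {
        bool starts = (p == cmdline) || (p[-1] == ' ');
        char end = p[len];
        bool ends = (end == '\0' || end == ' ' || end == '\n');

        if (starts && ends)
            return true;
        p++; /* keep searching after a partial match */
    }
    return false;
}

/* Reads /proc/cmdline and looks for the parameter set in step 1. */
bool running_kernel_forces_legacy(void)
{
    char buf[4096];
    bool found = false;
    FILE *f = fopen("/proc/cmdline", "r");

    if (f == NULL)
        return false; /* treat an unreadable file as "not set" */
    if (fgets(buf, sizeof(buf), f) != NULL)
        found = cmdline_has_token(buf, "virtio_pci.force_legacy=1");
    fclose(f);
    return found;
}
```

If running_kernel_forces_legacy() returns false, the grub change most
likely did not take effect, and the detach step in the testcase would
be exercising modern virtio-pci devices instead.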

Outside the VM, on the host:

$ qemu-img create -f qcow2 /root/share-device.qcow2 2G
$ cat >> share-device.xml << EOF
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='writeback' io='threads'/>
<source file='/root/share-device.qcow2'/>
<target dev='vdc' bus='virtio'/>
</disk>
EOF
$ sudo -s
# virsh attach-device noble-test share-device.xml --config --live
# virsh detach-device noble-test share-device.xml --config --live

A kernel panic should occur.

There is a test kernel available in:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2067862-test

If you install it, the panic should no longer occur.

[Where problems could occur]

We are adding a basic null pointer check right before the pointer is
about to be used, which is quite low risk.

If a regression were to occur, it would only affect VMs using legacy
virtio-pci devices, which is not the default. It would potentially
have large impacts on fleets of very old hypervisors running trusty,
precise or lucid, but that is very unlikely in this day and age.

[Other Info]

Upstream mailing list discussion and author testcase:
https://lore.kernel.org/kvm/CACGkMEs1t-ipP7TasHkKNKd=peVEES6Xdw1zSsJkb-bc9Etx9Q@mail.gmail.com/T/#m167335bf7ab09b12fec3bdc5d46a30bc2e26cac7

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2067862/+subscriptions
