[Bug 2076406] Comment bridged from LTC Bugzilla

------- Comment From sthoufee@in.ibm.com 2024-10-16 02:18 EDT-------
The test team will conclude and validate the bug as soon as possible and will then update the bug.

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2076406

Title:
L2 Guest migration: continuously dumping while running NFS guest
migration

Status in The Ubuntu-power-systems project:
Fix Committed
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Noble:
Fix Committed
Status in linux source package in Oracular:
Fix Released

Bug description:
SRU Justification:

[ Impact ]

* While doing ISST testing it turned out that a 2nd level (KVM)
guest (aka VM) continuously dumped when running an NFS
guest migration.

[ Test Plan ]

* Set up two IBM Power10 systems (with firmware 1060, which offers
support for KVM) with Ubuntu Server 24.04 for ppc64el.

* Set up qemu/KVM on both of these systems to allow guest migration.

* Set up a KVM guest and place its disk on an NFS volume.

* Now initiate a guest migration.

* Without the two patches the initiator system will start to dump.

* Since this setup requires a special firmware level,
the verification will be done by the IBM Power team.
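
A minimal sketch of how the "start to dump" condition in the test plan could be checked automatically, assuming the kernel log of the initiator system has been captured as text. The pattern is taken from the soft-lockup line quoted in the problem description of this report; the function name `find_soft_lockups` and the sample text are hypothetical.

```python
import re

# Pattern based on the soft-lockup message quoted in this bug report.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!"
)

def find_soft_lockups(log_text):
    """Return a (cpu, seconds) tuple for every soft-lockup line in log_text."""
    return [(int(cpu), int(secs)) for cpu, secs in SOFT_LOCKUP.findall(log_text)]

# Hypothetical captured log, reusing values from the report.
sample = (
    "[79205.171835] Call Trace:\n"
    "watchdog: BUG: soft lockup - CPU#14 stuck for 223s!\n"
)
print(find_soft_lockups(sample))
```

An empty list after a completed migration would suggest that the dump no longer occurs; it does not replace the verification by the IBM Power team.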

[ Where problems could occur ]

* Although the patch set looks huge,
the patches themselves are relatively small and not very invasive,
and I would consider them mainly as fixes.

* kvmppc_set_one_reg_hv() wrongly gets the value for MMCR3
instead of setting it.

* kvmppc_get_one_reg_hv() for SDAR wrongly returns
the SIAR instead of the SDAR - which is quite traceable.

* Then a one-reg interface for the DEXCR register (KVM_REG_PPC_DEXCR)
is introduced. Issues can occur here if the initialization
or the code in the case statement is wrong.
A fix was added to keep the nested guest DEXCR in sync.
The guest state element defined for DEXCR already existed
but was not taken into account - this is fixed now (DEXCR GSID).
If the initialization or the code in the case statement is wrong,
this can harm the guest state.
The guest state may get out of sync.

* Another one-reg identifier, KVM_REG_PPC_HASHKEYR, was introduced.
It is used to read and set the virtual HASHKEYR
for the guest during enter/exit.
Again, the initialization and the case statement code are critical.
Code was added to keep the nested guest HASHKEYR in sync.
Again, the guest state element defined for HASHKEYR already existed
but was not taken into account, which is fixed now (HASHKEYR GSID).
If the initialization or the code in the case statement is wrong,
this can harm the guest state.
This can harm the L2 guest during enter or exit.

* Yet another one-reg identifier, KVM_REG_PPC_HASHPKEYR, was introduced.
It is used to read and set the virtual HASHPKEYR
for the guest during enter/exit.
And again, the guest state element defined for HASHPKEYR
already existed but was ignored, which is now fixed (HASHPKEYR GSID).
If the initialization or the code in the case statement is wrong,
this can harm the guest state.
This can harm the L2 guest during enter or exit.
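
To illustrate why the MMCR3 and SDAR mix-ups described above matter for migration, here is a toy model in Python. It is a sketch under stated assumptions, not the kernel code: `VcpuModel`, `get_one_reg`, `set_one_reg`, `migrate` and the `buggy` switch are hypothetical names, and migration is simplified to "read every register on the source, write it on the destination".

```python
from dataclasses import dataclass, field

REGS = ("MMCR3", "SIAR", "SDAR")

@dataclass
class VcpuModel:
    """Toy stand-in for per-vCPU register state (not the kernel's struct)."""
    regs: dict = field(default_factory=lambda: {name: 0 for name in REGS})

def get_one_reg(vcpu, reg_id, buggy=False):
    if buggy and reg_id == "SDAR":
        return vcpu.regs["SIAR"]  # models the bug: SIAR returned for SDAR
    return vcpu.regs[reg_id]

def set_one_reg(vcpu, reg_id, value, buggy=False):
    if buggy and reg_id == "MMCR3":
        _ = vcpu.regs["MMCR3"]    # models the bug: value is read, never stored
        return
    vcpu.regs[reg_id] = value

def migrate(source, buggy=False):
    """Copy all registers to a fresh vCPU through the get/set interface."""
    dest = VcpuModel()
    for reg_id in REGS:
        set_one_reg(dest, reg_id, get_one_reg(source, reg_id, buggy), buggy)
    return dest

def mismatches(source, dest):
    """Names of registers whose values differ after the round trip."""
    return sorted(r for r in REGS if source.regs[r] != dest.regs[r])

src = VcpuModel({"MMCR3": 0x11, "SIAR": 0x22, "SDAR": 0x33})
print(mismatches(src, migrate(src, buggy=True)))   # ['MMCR3', 'SDAR']
print(mismatches(src, migrate(src, buggy=False)))  # []
```

In this model the buggy path silently loses MMCR3 and overwrites SDAR with SIAR on the destination, which is the kind of out-of-sync guest state the bullets above warn about.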

[ Other Info ]

* Since (nested) KVM support is new on P10,
this does not affect older Power generations
(P9 is the only other hw generation that is supported by 24.04,
but it only supports native virtualization).

* Both patches have been accepted upstream since v6.11(-rc1),
hence will be in oracular,
and are also tagged upstream as stable updates.

* Since the required firmware FW1060 is relatively new,
we can assume that not many users have run into this issue yet.
__________

== Comment: #0 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2024-08-09 03:50:24 ==
+++ This bug was initially created as a clone of Bug #206737 +++

---Problem Description---
L2 Guest migration: evelp2g4[L2]: while running NFS guest migration continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed]

---uname output---
NA

Machine Type = NA

Contact Information = NA

[79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[79205.163834] NIP: c0000000002bb7a4 LR: c0000000002bb750 CTR: c0000000000d192c
[79205.163929] REGS: c0000003871cf1b0 TRAP: 0900 Tainted: G L
[79205.165041] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44042222 XER: 20040004
[79205.165266] CFAR: 0000000000000000 IRQMASK: 0
               GPR00: c0000000002bbc58 c0000003871cf450 c0000000020ded00 0000000000000009
               GPR04: 0000000000000009 0000000000000009 0000000000000080 0000000000000200
               GPR08: 00000000000001ff 0000000000000001 c000000740f57ee0 0000000044048222
               GPR12: c0000000000d192c c000000743ddc980 0000000000000000 0000000000000000
               GPR16: 0000000000000000 c00000000d86e200 0000000000000001 0000000000000001
               GPR20: 000000000000000c c000000003d06188 c0000000000ac4d0 c00000000a374e00
               GPR24: c000000003d06840 0000000000000000 c000000741193188 c000000741193188
               GPR28: c000000741193180 c000000003d06840 0000000000000048 0000000000000009
[79205.171660] NIP [c0000000002bb7a4] smp_call_function_many_cond+0x1e0/0x738
[79205.171752] LR [c0000000002bb750] smp_call_function_many_cond+0x18c/0x738
[79205.171835] Call Trace:
[79205.171869] [c0000003871cf450] [c0000000002bbc58] smp_call_function_many_cond+0x694/0x738 (unreliable)
[79205.171986] [c0000003871cf520] [c0000000000ac4d0] radix__tlb_flush+0x4c/0x140
[79205.173636] [c0000003871cf560] [c00000000052e900] tlb_finish_mmu+0x130/0x1f0
[79205.173754] [c0000003871cf590] [c00000000052a280] exit_mmap+0x1cc/0x574
[79205.173848] [c0000003871cf6c0] [c00000000016ec9c] __mmput+0x54/0x1d4
[79205.173939] [c0000003871cf6f0] [c0000000006385c4] begin_new_exec+0x6dc/0xefc
[79205.174037] [c0000003871cf780] [c0000000006edea8] load_elf_binary+0x4c8/0x1a50
[79205.174136] [c0000003871cf880] [c0000000006361c8] bprm_execve+0x2b4/0x7a0
[79205.174219] [c0000003871cf950] [c000000000637988] do_execveat_common+0x1c0/0x2d8
[79205.174316] [c0000003871cf9f0] [c000000000638e38] sys_execve+0x54/0x6c
[79205.174399] [c0000003871cfa20] [c00000000002fec8] system_call_exception+0x168/0x310
[79205.174497] [c0000003871cfe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[79205.176245] --- interrupt: 3000 at 0x7fff95b10b08
[79205.176326] NIP: 00007fff95b10b08 LR: 00007fff95b10b08 CTR: 0000000000000000
[79205.176438] REGS: c0000003871cfe80 TRAP: 3000 Tainted: G L (
[79205.176558] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48044424 XER: 00000000
[79205.176686] IRQMASK: 0
               GPR00: 000000000000000b 00007fffe6919aa0 00007fff95c47c00 0000000152598c80
               GPR04: 00007fffe6919bf8 00000001525db6e0 ffffffffffffffff 00007fffe6919a20
               GPR08: 0000000152598c88 0000000000000000 0000000000000000 0000000000000000
               GPR12: 0000000000000000 00007fff969a4220 0000000152585570 0000000000000000
               GPR16: 00007fffe6919c48 0000000000000570 0000000152598c80 0000000000000000
               GPR20: 0000000000000000 0000000000009998 000000015259a450 0000000152586460
               GPR24: 00000001525bca90 00007fffe6919e48 0000000000000000 00000001525db6e0
               GPR28: 0000000117e98448 00000001525d0b00 0000000000000000 0000000000100000
[79205.177505] NIP [00007fff95b10b08] 0x7fff95b10b08
[79205.177578] LR [00007fff95b10b08] 0x7fff95b10b08
[79205.177649] --- interrupt: 3000

Steps to reproduce: Install the build on NFS storage guest kernel
6.8.10-300

Start the HTX workload - mdt.less

Start the NFS guest migration between the L2 hosts.

Source L2 host : evelp2
Target L2 host : rinlp1

migration command : virsh migrate --live --domain $vm_name
qemu+ssh://$target_host/system --verbose --undefinesource --persistent
--timeout 120
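
For scripted reproduction, the migration command above can be assembled as an argument list. This is a minimal sketch: the helper name `build_migrate_cmd` is hypothetical, the flags are exactly those quoted in the command, and the guest and host names are the ones given in this report. The command is only printed here, not executed.

```python
import shlex

def build_migrate_cmd(vm_name, target_host, timeout_s=120):
    """Build the virsh live-migration argv using the flags from the report."""
    return [
        "virsh", "migrate", "--live",
        "--domain", vm_name,
        f"qemu+ssh://{target_host}/system",
        "--verbose", "--undefinesource", "--persistent",
        "--timeout", str(timeout_s),
    ]

# Guest and target host names as listed in this report.
cmd = build_migrate_cmd("evelp2g4", "rinlp1")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` on the source L2 host would start the migration; building it as a list avoids shell quoting problems with guest names.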

Share the same NFS storage between two hosts [here /kvm_pool]
10.33.4.52:/kvm_pool nfs4 650G 304G 347G 47% /kvm_pool

Test running : HTX

Guest state : up

---------------------------------------------------------------------------

L2 guest Config:

(1) Problem on Guest: evelp2g4

(2) PHYP/ Processor Type: KVM/P10/Everest

(3) Rootvg Filesystem: EXT4

(5) Network Bridge: Macvtap

(6) IO Disk Type/Driver: qemu-img/ qcow2

(7) Install Disk Type: Single

---------------------------------------------------------------------------

L1 host details :

MDC mode : off

(1) PHYP/ Processor Type: KVM/P10/Everest

(2) CEC Name: evelp2

(3) Rootvg Filesystem: xfs

(5) Network Interface: Dedicated Network

(6) IO Type: NVME

(8) Multipath Enabled: no

(9) Install Disk Type: Single

(10) MMU: RPT

The kernel patches are at
https://lore.kernel.org/kvm/D1SLOYCQGIQ6.17Y5C9XJDHX33@gmail.com/T/#t

Qemu patches are at
https://lore.kernel.org/qemu-devel/171760304518.1127.12881297254648658843.stgit@ad1b393f0e09/

powerpc/topic/ppc-kvm.

[1/8] KVM: PPC: Book3S HV: Fix the set_one_reg for MMCR3
https://git.kernel.org/powerpc/c/f9ca6a10be20479d526f27316cc32cfd1785ed39
[2/8] KVM: PPC: Book3S HV: Fix the get_one_reg of SDAR
https://git.kernel.org/powerpc/c/009f6f42c67e9de737d6d3d199f92b21a8cb9622
[3/8] KVM: PPC: Book3S HV: Add one-reg interface for DEXCR register
https://git.kernel.org/powerpc/c/1a1e6865f516696adcf6e94f286c7a0f84d78df3
[4/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest DEXCR in sync
https://git.kernel.org/powerpc/c/2d6be3ca3276ab30fb14f285d400461a718d45e7
[5/8] KVM: PPC: Book3S HV: Add one-reg interface for HASHKEYR register
https://git.kernel.org/powerpc/c/e9eb790b25577a15d3f450ed585c59048e4e6c44
[6/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest HASHKEYR in sync
https://git.kernel.org/powerpc/c/1e97c1eb785fe2dc863c2bd570030d6fcf4b5e5b
[7/8] KVM: PPC: Book3S HV: Add one-reg interface for HASHPKEYR register
https://git.kernel.org/powerpc/c/9a0d2f4995ddde3022c54e43f9ece4f71f76f6e8
[8/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest HASHPKEYR in sync
https://git.kernel.org/powerpc/c/0b65365f3fa95c2c5e2094739151a05cabb3c48a

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076406/+subscriptions
