Wednesday

[Bug 1797367] Re: Ubuntu 18.04.1 - [s390x] Kernel panic while stressing network bonding

** Description changed:

- == Comment: #0 - Athira Rajeev
+ == SRU Justification ==
+
+ While running a series of network stress tests on a bond device on Ubuntu 18.04.1 with kernel 4.15.0-36.39,
+ a kernel panic is observed (it also occurs on non-bond devices).
+ This looks like a race between disabling a qeth device and accessing debugfs.
+ The problem is critical: it repeatedly leads to a crash, sooner or later.
+
+ == Fix ==
+
+ e19e5be8b4ca ("s390/qeth: sanitize strings in debug messages")
+
+ pre-reqs:
+ 750b162 ("s390/qeth: reduce hard-coded access to ccw channels")
+ d857e11 ("s390/qeth: remove outdated portname debug msg")
+ 9d0a58f ("s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]")
+ 8174aa8 ("s390/qeth: consolidate qeth MAC address helpers")
+ 4641b02 ("s390/qeth: don't keep track of MAC address's cast type")
+
+ == Regression Potential ==
+
+ Low, because:
+
+ - the change is limited to s390x
+ - and further limited to the qeth driver
+ - it patches a problem identified during testing
+ - the fix was tested by IBM before submission
+
+ == Test Case ==
+
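+ Run:
+ #!/bin/bash
+ # Collect debug data with dbginfo.sh while enabling and disabling the qeth device.
+ # <qeth device> is a placeholder for the device to toggle.
+ var=0
+ while :
+ do
+     var=$((var + 1))
+     echo "DBG count is $var"
+     mkdir -p /tmp/DBGINFO
+     dbginfo.sh -d /tmp/DBGINFO
+     rm -rf /tmp/DBGINFO*
+     echo "chzdev now is $var"
+     chzdev -e <qeth device>
+     chzdev -d <qeth device>
+ done
+ On average, a crash happens in fewer than 20 cycles (usually fewer than 10).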
+
+ __________
+
+ == Comment: #0 - Athira Rajeev
---Problem Description---
While running a series of stress tests for network bonding on Ubuntu 18.04.1 with kernel 4.15.0-36.39, a kernel panic is observed.
Two instances of the panic were experienced with the same test procedure, one of which indicates a kernel BUG.
-
- Contact Information = Athira Rajeev <atrajeev@in.ibm.com>, Waiki Wright < waiki@us.ibm.com >
-
+
+ Contact Information = Athira Rajeev <atrajeev@in.ibm.com>, Waiki Wright
+ < waiki@us.ibm.com >
+
---uname output---
#39-Ubuntu SMP Mon Sep 24 16:13:24 UTC 2018 4.15.0-36.39
-
- Machine Type = This issue is observed on z13 system
- ---Debugger---
- A debugger was configured,
-
+
+ Machine Type = This issue is observed on z13 system
+  ---Debugger---
+ A debugger was configured,
+
---Steps to Reproduce---
- This happens while running stress tests for network bonding. kernel memory exposure attempt is detected and the BUG() is called from the code snippet: mm/usercopy.c:72
- dump was configured and crash dump is available.
- Results of few crash commands like bt, log are added in Attachment
+ This happens while running stress tests for network bonding. kernel memory exposure attempt is detected and the BUG() is called from the code snippet: mm/usercopy.c:72
+ dump was configured and crash dump is available.
+ Results of few crash commands like bt, log are added in Attachment

Relevant part of dmesg pointing to kernel BUG

<<>>
[14746.977364] kernel BUG at /build/linux-PABIrW/linux-4.15.0/mm/usercopy.c:72!
[14746.977377] illegal operation: 0001 ilc:1 [#1] SMP
[14746.977378] Modules linked in: macsec vsock_diag vsock sctp_diag sctp dccp_diag dccp tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding binfmt_misc qeth_l3 8021q garp mrp stp llc xt_tcpudp qeth_l2 nf_conntrack_ipv6 nf_defrag_ipv6 scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng ghash_s390 prng sha512_s390 sha256_s390 sha1_s390 sha_common chsc_sch eadm_sch nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32_vx_s390 qeth ccwgroup ip6table_filter ip6_tables vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio iptable_filter sch_fq_codel ip_tables x_tables aes_s390 des_s390 des_generic dm_crypt dm_service_time dm_multipath zfcp scsi_transport_fc qdio dasd_eckd_mod dasd_mod btrfs xor zstd_compress raid6_pq zlib_deflate
[14746.977401] CPU: 1 PID: 20905 Comm: dump2tar Tainted: G OE 4.15.0-36-generic #39-Ubuntu
[14746.977403] Hardware name: IBM 3906 M02 757 (LPAR)
[14746.977404] Krnl PSW : 000000000f2d230d 000000006abe14d5 (__check_object_size+0x15a/0x1e0)
[14746.977408] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[14746.977410] Krnl GPRS: 0000000000000002 0000000000e95334 0000000000000064 00000001e6518828
[14746.977412] 000000000037cc8e 0000000000000000 0000000000a9577c 0000000000000000
[14746.977413] 000000000000647b 00000001d8c120a8 0000000000000001 0000000000008088
[14746.977433] 00000001d8c0a020 000000000090da38 000000000037cc8e 000000016fdfbcd0
[14746.977440] Krnl Code: 000000000037cc82: c0200038ef69 larl %r2,a9ab54
- 000000000037cc88: c0e5fff32838 brasl %r14,1e1cf8
- #000000000037cc8e: a7f40001 brc 15,37cc90
- >000000000037cc92: e330d0080004 lg %r3,8(%r13)
- 000000000037cc98: e320d0000004 lg %r2,0(%r13)
- 000000000037cc9e: ecc2001a4065 clgrj %r12,%r2,4,37ccd2
- 000000000037cca4: b9040013 lgr %r1,%r3
- 000000000037cca8: ec31ff868064 cgrj %r3,%r1,8,37cbb4
+                           000000000037cc88: c0e5fff32838 brasl %r14,1e1cf8
+                          #000000000037cc8e: a7f40001 brc 15,37cc90
+                          >000000000037cc92: e330d0080004 lg %r3,8(%r13)
+                           000000000037cc98: e320d0000004 lg %r2,0(%r13)
+                           000000000037cc9e: ecc2001a4065 clgrj %r12,%r2,4,37ccd2
+                           000000000037cca4: b9040013 lgr %r1,%r3
+                           000000000037cca8: ec31ff868064 cgrj %r3,%r1,8,37cbb4
[14746.977458] Call Trace:
[14746.977460] ([<000000000037cc8e>] __check_object_size+0x156/0x1e0)
[14746.977462] [<000000000010ac40>] debug_output+0x150/0x2f8
[14746.977464] [<00000000004e02c0>] full_proxy_read+0x80/0xe0
[14746.977466] [<0000000000382592>] vfs_read+0x8a/0x150
[14746.977467] [<0000000000382b2e>] SyS_read+0x66/0xe0
[14746.977469] [<00000000008e3c94>] system_call+0xd8/0x2c8
[14746.977470] Last Breaking-Event-Address:
[14746.977472] [<000000000037cc8e>] __check_object_size+0x156/0x1e0
[14746.977473]
<<>>

- Adding one more occurrence of panic_on_oops below which appears to correlate to above .
-
+ Adding one more occurrence of panic_on_oops below which appears to
+ correlate to above .
+
Stack trace output:
Available traces added below
-
+
Oops output:
- [ 2140.467261] 8021q: adding VLAN 0 to HW filter on device bond0
+  [ 2140.467261] 8021q: adding VLAN 0 to HW filter on device bond0
[ 2140.467979] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2140.471609] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2140.471610] 8021q: adding VLAN 0 to HW filter on device bond0
[ 2140.472797] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2143.278986] Unable to handle kernel pointer dereference in virtual kernel address space
[ 2143.278991] Failing address: 7379732f6b657000 TEID: 7379732f6b657803
[ 2143.278993] Fault in home space mode while using kernel ASCE.
[ 2143.278996] AS:0000000000ea0007 R3:0000000000000024
[ 2143.279052] Oops: 0038 ilc:3 [#1] SMP
[ 2143.279055] Modules linked in: bonding 8021q garp mrp stp llc qeth_l3 binfmt_misc macsec vsock_diag vsock sctp_diag sctp dccp_diag dccp tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag xt_tcpudp qeth_l2 nf_conntrack_ipv6 nf_defrag_ipv6 scsi_dh_rdac scsi_dh_emc scsi_dh_alua nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32_vx_s390 ghash_s390 prng sha512_s390 sha256_s390 sha1_s390 sha_common chsc_sch eadm_sch ip6table_filter ip6_tables qeth ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio iptable_filter sch_fq_codel ip_tables x_tables aes_s390 des_s390 des_generic dm_crypt dm_service_time dm_multipath zfcp scsi_transport_fc qdio dasd_eckd_mod dasd_mod btrfs xor zstd_compress raid6_pq zlib_deflate
[ 2143.279099] CPU: 16 PID: 172270 Comm: dump2tar Tainted: G OE 4.15.0-36-generic #39-Ubuntu
[ 2143.279100] Hardware name: IBM 2964 NC9 7A5 (LPAR)
[ 2143.279102] Krnl PSW : 00000000d3630b5f 00000000af8614fc (debug_output+0x188/0x2f8)
[ 2143.279108] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 2143.279110] Krnl GPRS: 0000000000010000 ffffffff000002d8 7379732f6b65726e 00000001db91a020
[ 2143.279112] 0000000000000000 0000000000ea4ac8 00000001db91a020 00000000000009d2
[ 2143.279135] 0000000000000fe5 00000001000ff9ed 00000000000009d2 00000000000009d2
[ 2143.279137] 00000001db91a000 00000001db91a020 000000000010ac54 00000001d16cbd30
[ 2143.279146] Krnl Code: 000000000010ac68: 5810c010 l %r1,16(%r12)
- 000000000010ac6c: ec180063ff7e cij %r1,-1,8,10ad32
- #000000000010ac72: e320c8280004 lg %r2,2088(%r12)
- >000000000010ac78: e33020300002 ltg %r3,48(%r2)
- 000000000010ac7e: a784008f brc 8,10ad9c
- 000000000010ac82: 5a102028 a %r1,40(%r2)
- 000000000010ac86: 5010c010 st %r1,16(%r12)
- 000000000010ac8a: a7391000 lghi %r3,4096
+                           000000000010ac6c: ec180063ff7e cij %r1,-1,8,10ad32
+                          #000000000010ac72: e320c8280004 lg %r2,2088(%r12)
+                          >000000000010ac78: e33020300002 ltg %r3,48(%r2)
+                           000000000010ac7e: a784008f brc 8,10ad9c
+                           000000000010ac82: 5a102028 a %r1,40(%r2)
+                           000000000010ac86: 5010c010 st %r1,16(%r12)
+                           000000000010ac8a: a7391000 lghi %r3,4096
[ 2143.279167] Call Trace:
[ 2143.279169] ([<000000000010ac40>] debug_output+0x150/0x2f8)
[ 2143.279172] [<00000000004e02c4>] full_proxy_read+0x84/0xe0
[ 2143.279175] [<0000000000382592>] vfs_read+0x8a/0x150
[ 2143.279177] [<0000000000382b2e>] SyS_read+0x66/0xe0
[ 2143.279180] [<00000000008e3c98>] system_call+0xdc/0x2c8
[ 2143.279182] Last Breaking-Event-Address:
[ 2143.279184] [<00000000008e7614>] __s390_indirect_jump_r14+0x0/0xc
[ 2143.279185]
[ 2143.279187] Kernel panic - not syncing: Fatal exception: panic_on_oops
-
+
System Dump Location:
- kdump was configured and crash dump is available. since crash dump is huge to be added as bugzilla attachment, results of few crash commands like bt, log will be added in Attachment
-
+  kdump was configured and crash dump is available. since crash dump is huge to be added as bugzilla attachment, results of few crash commands like bt, log will be added in Attachment

- == Comment: #5 - Athira Rajeev
+ == Comment: #5 - Athira Rajeev
Hi,

Since the crash dump was too large to add as a Bugzilla attachment, the
results of a few crash commands such as bt and log were added in the
attachment. Please let me know where to upload the dump files, if they are required.

Thanks
Athira

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/1797367

Title:
Ubuntu 18.04.1 - [s390x] Kernel panic while stressing network bonding

Status in Ubuntu on IBM z Systems:
In Progress
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Bionic:
In Progress

Bug description:
== SRU Justification ==

While running a series of network stress tests on a bond device on Ubuntu 18.04.1 with kernel 4.15.0-36.39,
a kernel panic is observed (it also occurs on non-bond devices).
This looks like a race between disabling a qeth device and accessing debugfs.
The problem is critical: it repeatedly leads to a crash, sooner or later.

== Fix ==

e19e5be8b4ca ("s390/qeth: sanitize strings in debug messages")

pre-reqs:
750b162 ("s390/qeth: reduce hard-coded access to ccw channels")
d857e11 ("s390/qeth: remove outdated portname debug msg")
9d0a58f ("s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]")
8174aa8 ("s390/qeth: consolidate qeth MAC address helpers")
4641b02 ("s390/qeth: don't keep track of MAC address's cast type")

== Regression Potential ==

Low, because:

- the change is limited to s390x
- and further limited to the qeth driver
- it patches a problem identified during testing
- the fix was tested by IBM before submission

== Test Case ==

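Run:
#!/bin/bash
# Collect debug data with dbginfo.sh while enabling and disabling the qeth device.
# <qeth device> is a placeholder for the device to toggle.
var=0
while :
do
    var=$((var + 1))
    echo "DBG count is $var"
    mkdir -p /tmp/DBGINFO
    dbginfo.sh -d /tmp/DBGINFO
    rm -rf /tmp/DBGINFO*
    echo "chzdev now is $var"
    chzdev -e <qeth device>
    chzdev -d <qeth device>
done
On average, a crash happens in fewer than 20 cycles (usually fewer than 10).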

__________

== Comment: #0 - Athira Rajeev
---Problem Description---
While running a series of stress tests for network bonding on Ubuntu 18.04.1 with kernel 4.15.0-36.39, a kernel panic is observed.
Two instances of the panic were experienced with the same test procedure, one of which indicates a kernel BUG.

Contact Information = Athira Rajeev <atrajeev@in.ibm.com>, Waiki
Wright < waiki@us.ibm.com >

---uname output---
#39-Ubuntu SMP Mon Sep 24 16:13:24 UTC 2018 4.15.0-36.39

Machine Type = This issue is observed on z13 system
 ---Debugger---
A debugger was configured,

---Steps to Reproduce---
This happens while running stress tests for network bonding. A kernel memory exposure attempt is detected and BUG() is called from mm/usercopy.c:72.
A dump was configured and a crash dump is available.
Results of a few crash commands such as bt and log are added in the attachment.

Relevant part of dmesg pointing to kernel BUG

<<>>
[14746.977364] kernel BUG at /build/linux-PABIrW/linux-4.15.0/mm/usercopy.c:72!
[14746.977377] illegal operation: 0001 ilc:1 [#1] SMP
[14746.977378] Modules linked in: macsec vsock_diag vsock sctp_diag sctp dccp_diag dccp tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding binfmt_misc qeth_l3 8021q garp mrp stp llc xt_tcpudp qeth_l2 nf_conntrack_ipv6 nf_defrag_ipv6 scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng ghash_s390 prng sha512_s390 sha256_s390 sha1_s390 sha_common chsc_sch eadm_sch nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32_vx_s390 qeth ccwgroup ip6table_filter ip6_tables vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio iptable_filter sch_fq_codel ip_tables x_tables aes_s390 des_s390 des_generic dm_crypt dm_service_time dm_multipath zfcp scsi_transport_fc qdio dasd_eckd_mod dasd_mod btrfs xor zstd_compress raid6_pq zlib_deflate
[14746.977401] CPU: 1 PID: 20905 Comm: dump2tar Tainted: G OE 4.15.0-36-generic #39-Ubuntu
[14746.977403] Hardware name: IBM 3906 M02 757 (LPAR)
[14746.977404] Krnl PSW : 000000000f2d230d 000000006abe14d5 (__check_object_size+0x15a/0x1e0)
[14746.977408] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[14746.977410] Krnl GPRS: 0000000000000002 0000000000e95334 0000000000000064 00000001e6518828
[14746.977412] 000000000037cc8e 0000000000000000 0000000000a9577c 0000000000000000
[14746.977413] 000000000000647b 00000001d8c120a8 0000000000000001 0000000000008088
[14746.977433] 00000001d8c0a020 000000000090da38 000000000037cc8e 000000016fdfbcd0
[14746.977440] Krnl Code: 000000000037cc82: c0200038ef69 larl %r2,a9ab54
                          000000000037cc88: c0e5fff32838 brasl %r14,1e1cf8
                         #000000000037cc8e: a7f40001 brc 15,37cc90
                         >000000000037cc92: e330d0080004 lg %r3,8(%r13)
                          000000000037cc98: e320d0000004 lg %r2,0(%r13)
                          000000000037cc9e: ecc2001a4065 clgrj %r12,%r2,4,37ccd2
                          000000000037cca4: b9040013 lgr %r1,%r3
                          000000000037cca8: ec31ff868064 cgrj %r3,%r1,8,37cbb4
[14746.977458] Call Trace:
[14746.977460] ([<000000000037cc8e>] __check_object_size+0x156/0x1e0)
[14746.977462] [<000000000010ac40>] debug_output+0x150/0x2f8
[14746.977464] [<00000000004e02c0>] full_proxy_read+0x80/0xe0
[14746.977466] [<0000000000382592>] vfs_read+0x8a/0x150
[14746.977467] [<0000000000382b2e>] SyS_read+0x66/0xe0
[14746.977469] [<00000000008e3c94>] system_call+0xd8/0x2c8
[14746.977470] Last Breaking-Event-Address:
[14746.977472] [<000000000037cc8e>] __check_object_size+0x156/0x1e0
[14746.977473]
<<>>
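
To compare this trace with the second panic reported in this bug, it can help to reduce the Call Trace lines to bare function names. The following is a minimal sketch, not part of the original report; the helper name trace_symbols is made up for illustration, and the input lines are copied from the dmesg excerpt above.

```shell
#!/bin/bash
# Print the function name from every "[<address>] symbol+0xoff/0xlen" line on stdin.
# Lines without that pattern (for example "Call Trace:") are skipped.
trace_symbols() {
    sed -n 's/.*\[<[0-9a-f]*>\] \([A-Za-z0-9_]*\)+0x.*/\1/p'
}

trace_symbols <<'EOF'
[14746.977458] Call Trace:
[14746.977460] ([<000000000037cc8e>] __check_object_size+0x156/0x1e0)
[14746.977462] [<000000000010ac40>] debug_output+0x150/0x2f8
[14746.977464] [<00000000004e02c0>] full_proxy_read+0x80/0xe0
[14746.977466] [<0000000000382592>] vfs_read+0x8a/0x150
[14746.977467] [<0000000000382b2e>] SyS_read+0x66/0xe0
[14746.977469] [<00000000008e3c94>] system_call+0xd8/0x2c8
EOF
```

Running the same helper over both traces makes it easy to see that each one reaches debug_output through full_proxy_read, vfs_read and SyS_read, that is, through a read() on a debugfs file.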

Adding one more occurrence, this time ending in panic_on_oops, which appears to
correlate with the one above.

Stack trace output:
Available traces added below

Oops output:
[ 2140.467261] 8021q: adding VLAN 0 to HW filter on device bond0
[ 2140.467979] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2140.471609] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2140.471610] 8021q: adding VLAN 0 to HW filter on device bond0
[ 2140.472797] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 2143.278986] Unable to handle kernel pointer dereference in virtual kernel address space
[ 2143.278991] Failing address: 7379732f6b657000 TEID: 7379732f6b657803
[ 2143.278993] Fault in home space mode while using kernel ASCE.
[ 2143.278996] AS:0000000000ea0007 R3:0000000000000024
[ 2143.279052] Oops: 0038 ilc:3 [#1] SMP
[ 2143.279055] Modules linked in: bonding 8021q garp mrp stp llc qeth_l3 binfmt_misc macsec vsock_diag vsock sctp_diag sctp dccp_diag dccp tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag xt_tcpudp qeth_l2 nf_conntrack_ipv6 nf_defrag_ipv6 scsi_dh_rdac scsi_dh_emc scsi_dh_alua nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32_vx_s390 ghash_s390 prng sha512_s390 sha256_s390 sha1_s390 sha_common chsc_sch eadm_sch ip6table_filter ip6_tables qeth ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio iptable_filter sch_fq_codel ip_tables x_tables aes_s390 des_s390 des_generic dm_crypt dm_service_time dm_multipath zfcp scsi_transport_fc qdio dasd_eckd_mod dasd_mod btrfs xor zstd_compress raid6_pq zlib_deflate
[ 2143.279099] CPU: 16 PID: 172270 Comm: dump2tar Tainted: G OE 4.15.0-36-generic #39-Ubuntu
[ 2143.279100] Hardware name: IBM 2964 NC9 7A5 (LPAR)
[ 2143.279102] Krnl PSW : 00000000d3630b5f 00000000af8614fc (debug_output+0x188/0x2f8)
[ 2143.279108] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 2143.279110] Krnl GPRS: 0000000000010000 ffffffff000002d8 7379732f6b65726e 00000001db91a020
[ 2143.279112] 0000000000000000 0000000000ea4ac8 00000001db91a020 00000000000009d2
[ 2143.279135] 0000000000000fe5 00000001000ff9ed 00000000000009d2 00000000000009d2
[ 2143.279137] 00000001db91a000 00000001db91a020 000000000010ac54 00000001d16cbd30
[ 2143.279146] Krnl Code: 000000000010ac68: 5810c010 l %r1,16(%r12)
                          000000000010ac6c: ec180063ff7e cij %r1,-1,8,10ad32
                         #000000000010ac72: e320c8280004 lg %r2,2088(%r12)
                         >000000000010ac78: e33020300002 ltg %r3,48(%r2)
                          000000000010ac7e: a784008f brc 8,10ad9c
                          000000000010ac82: 5a102028 a %r1,40(%r2)
                          000000000010ac86: 5010c010 st %r1,16(%r12)
                          000000000010ac8a: a7391000 lghi %r3,4096
[ 2143.279167] Call Trace:
[ 2143.279169] ([<000000000010ac40>] debug_output+0x150/0x2f8)
[ 2143.279172] [<00000000004e02c4>] full_proxy_read+0x84/0xe0
[ 2143.279175] [<0000000000382592>] vfs_read+0x8a/0x150
[ 2143.279177] [<0000000000382b2e>] SyS_read+0x66/0xe0
[ 2143.279180] [<00000000008e3c98>] system_call+0xdc/0x2c8
[ 2143.279182] Last Breaking-Event-Address:
[ 2143.279184] [<00000000008e7614>] __s390_indirect_jump_r14+0x0/0xc
[ 2143.279185]
[ 2143.279187] Kernel panic - not syncing: Fatal exception: panic_on_oops
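
One detail in this oops is worth a closer look. The faulting instruction, ltg %r3,48(%r2), uses %r2 as its base register, and %r2 (the third value in the first Krnl GPRS row) holds 7379732f6b65726e, which lies in the same page as the reported failing address. Every byte of that value is printable ASCII. A minimal sketch for decoding such a value in bash follows; the helper name hex_to_ascii is made up for illustration and is not part of the original report.

```shell
#!/bin/bash
# Decode a string of hex digits (two per byte) into the ASCII characters it encodes.
# Intended for printable bytes only; NUL bytes cannot be held in a bash variable.
hex_to_ascii() {
    local hex=$1 out="" i
    for ((i = 0; i < ${#hex}; i += 2)); do
        out+=$(printf "\\x${hex:i:2}")
    done
    printf '%s\n' "$out"
}

hex_to_ascii 7379732f6b65726e   # prints "sys/kern"
```

The register decodes to "sys/kern", which suggests that the pointer was overwritten with string data rather than holding a valid address. That would fit the string handling fix named in the Fix section, although only the crash dump could confirm it.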

System Dump Location:
kdump was configured and a crash dump is available. Since the crash dump is too large to add as a Bugzilla attachment, the results of a few crash commands such as bt and log will be added in the attachment.

== Comment: #5 - Athira Rajeev
Hi,

Since the crash dump was too large to add as a Bugzilla attachment, the
results of a few crash commands such as bt and log were added in the attachment.
Please let me know where to upload the dump files, if they are required.

Thanks
Athira

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1797367/+subscriptions
