Wednesday

[Bug 1772675] Re: Intel i40e PF reset due to incorrect MDD detection (continues...again...)

@terryh-orcas,

If you are able to reproduce the problem relatively quickly and easily, then I suggest testing different kernel versions, up to the latest upstream, to find out whether, and in which version, a newer i40e kernel driver fixes it. You can get upstream kernel debs here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D

If you can narrow down the kernel to a specific short range (i.e. kernel
X definitely fails, kernel Y never fails), I can review the upstream
i40e driver for specific changes to backport.
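Narrowing the range can be done as a bisection over an ordered list of
kernel versions. The following is only a minimal sketch under stated
assumptions: the version strings are made up for illustration, `fails`
is a hypothetical callback standing in for a manual test run on that
kernel, and the search assumes that once a kernel stops failing, all
newer ones in the list also do not fail.

```python
from typing import Callable, Sequence


def first_fixed_kernel(versions: Sequence[str],
                       fails: Callable[[str], bool]) -> int:
    """Return the index of the first kernel that does not fail.

    Assumes versions are ordered oldest to newest, versions[0] is known
    to fail, and versions[-1] is known not to fail.
    """
    lo, hi = 0, len(versions) - 1  # lo: known bad, hi: known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(versions[mid]):
            lo = mid  # still bad: the fix landed later
        else:
            hi = mid  # good: the fix landed here or earlier
    return hi


# Hypothetical version list and test outcome, for illustration only.
versions = ["4.4", "4.15", "4.18", "4.19", "4.20"]
known_bad = {"4.4", "4.15", "4.18"}
idx = first_fixed_kernel(versions, lambda v: v in known_bad)
print(versions[idx - 1], "fails;", versions[idx], "does not")
```

Because each test run may take days, halving the range at every step
keeps the number of kernels to test small.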

If you can't reproduce it easily/quickly, there is another debugging
method that involves modifying undocumented i40e registers. See bug
1723127 comment 10 for details. If you try that method, you should
attempt it with the latest kernel you can reproduce the problem with.
As I don't have the chipset specifications, if you do reproduce it this
way and can isolate the problem to a specific register/bit, I'll have to
take that info back to Intel to ask them for clarification. Also note
that there are 2 registers, and each bit of each register has to be
tested individually, so this method can take a very long time if each
reproduction takes you a long time.

Unfortunately, as has been mentioned in this and past bugs, the MDD
event is generated by the i40e firmware and there is no documented way
to tell what the i40e kernel driver did that the firmware didn't like
(assuming it was something the driver did, and not an external or
firmware issue). Intel does regularly update their upstream i40e driver
with fixes for MDD firmware/driver bugs, so this will likely only be
fixed by a patch coming from Intel upstream, which we would then need to
backport to our older stable Ubuntu kernel(s).

Sorry I can't help more.

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/1772675

Title:
Intel i40e PF reset due to incorrect MDD detection
(continues...again...)

Status in linux package in Ubuntu:
Incomplete
Status in linux source package in Xenial:
Incomplete
Status in linux source package in Bionic:
Incomplete
Status in linux source package in Cosmic:
Incomplete

Bug description:
[impact]

The i40e driver sometimes triggers a "malicious device" (MDD) event
that the firmware detects, and the firmware responds by resetting the
NIC. The reset causes at least a temporary interruption in network
traffic, which can cause further problems, e.g. if the interface is
in a bond.

[fix]

The fix for this is currently unknown. As the "MDD event" is
generated by the i40e firmware, and is completely undocumented, there
is no way to tell what the i40e driver did to cause the MDD event.

[test case]

The bug is unfortunately very difficult to reproduce, but as shown in
the comments on this (and previous) bugs, some users of the i40e have
traffic that can consistently reproduce the problem (although it
usually takes on the order of days, or longer). A reproduction is
easy to detect, as the network traffic will be interrupted and the
system logs will contain a message like:

i40e 0000:02:00.1: TX driver issue detected, PF reset issued
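
Since a reproduction can take days, it may help to scan the logs for
this message automatically. The following is a minimal sketch, not part
of the original report: the function name is hypothetical, and the
pattern is taken only from the sample line above, so other kernel
versions may word the message differently.

```python
import re
from typing import Iterable, List

# Pattern based solely on the sample log line in this bug description.
MDD_RESET = re.compile(
    r"i40e (\S+): TX driver issue detected, PF reset issued"
)


def find_mdd_resets(lines: Iterable[str]) -> List[str]:
    """Return the PCI address from every log line reporting a PF reset."""
    hits = []
    for line in lines:
        match = MDD_RESET.search(line)
        if match:
            hits.append(match.group(1))
    return hits


sample = [
    "kernel: i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
    "kernel: unrelated message",
]
print(find_mdd_resets(sample))
```

The same function can be fed the lines of a saved kernel log file
opened with `open()`.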

[regression potential]

Unknown, since the specific fix is unknown.

[original description]

This is a continuation of bug 1713553 and then bug 1723127; a patch
was added in each of those bugs to attempt to fix this. The patches
may have helped reduce the issue, but based on further reports they
appear not to have fixed it.

See bug 1713553 and bug 1723127 for more details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+subscriptions
