Sunday

[Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Terry,

We've had a lot of discussion over this bug. It does not have
a reliable reproducer, and I have not yet received any acks
on testing of the above.

Our thinking was that it was still better to patch it, since
it has been seen with the mainline driver as well and we'd like
to avoid a recurrence.

The fix does need to be available in the official Xenial
bits (rather than in a temporary test kernel provided via
our PPA, for instance).

FWIW, here are the boards in question:
enum board_idx {
	BCM57301,
	BCM57417_NPAR,
	BCM58700,
	BCM57311,
	BCM57312,
	BCM57402,
	BCM57402_NPAR,
	BCM57407,
	BCM57412,
	BCM57414,
	BCM57416,
	BCM57417,
	BCM57412_NPAR,
	BCM57314,
	BCM57417_SFP,
	BCM57416_SFP,
	BCM57404_NPAR,
	BCM57406_NPAR,
	BCM57407_SFP,
	BCM57407_NPAR,
	BCM57414_NPAR,
	BCM57416_NPAR,
	BCM57452,
	BCM57454,
	NETXTREME_E_VF,
	NETXTREME_C_VF,
};

Per conversation with Brad and Jay, it was agreed that patching
only the bnxt_en_bpo driver with this fix was the way to go,
despite the lack of a reproducer, rather than pulling in an
entirely new driver from Broadcom, which was also considered.


The FW version the issue was hit on:
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03

But it might be best to test with the latest available
firmware (214.0.166/1.9.2 pkg 21.40.16.6 or later).
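For anyone scripting the "or later" check, here is a minimal sketch in C.
The helper names parse_pkg_version() and compare_pkg_version() are
hypothetical, and it assumes (not confirmed by Broadcom) that the four
dotted fields after "pkg" in the firmware-version string can be compared
numerically, field by field:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the four dotted fields that follow "pkg "
 * in a firmware-version string such as
 * "20.8.163/1.8.4 pkg 20.08.04.03". Returns 1 on success, 0 otherwise. */
int parse_pkg_version(const char *fw, int v[4])
{
    const char *p = strstr(fw, "pkg ");

    if (p == NULL)
        return 0;
    /* %d always parses base 10, so "08" is read as 8, not as octal. */
    return sscanf(p + 4, "%d.%d.%d.%d", &v[0], &v[1], &v[2], &v[3]) == 4;
}

/* Assumption: a "later" package has a numerically larger field at the
 * first position where the two versions differ.
 * Returns <0, 0 or >0, in the style of strcmp(). */
int compare_pkg_version(const int a[4], const int b[4])
{
    for (int i = 0; i < 4; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```

Feeding it the two strings above should rank the firmware the issue was
hit on as older than the suggested one.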

Not sure if that helps? Let me know if I can address anything
else.

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/1814095

Title:
bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
Confirmed
Status in linux source package in Xenial:
In Progress

Bug description:
[Impact]

The bnxt_en_bpo driver hit TX timeouts, causing network stalls and
failures to send data and heartbeat packets.

The following 25Gb Broadcom NIC error was seen once on Xenial
running the 4.4.0-141-generic kernel on an amd64 host
under moderate-to-heavy network traffic:

* The bnxt_en_bpo driver froze on a "TX timed out" error
  and triggered the Netdev Watchdog timer under load.

* From kernel log:
  "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
  See attached kern.log excerpt file for full excerpt of error log.

* Release = Xenial
  Kernel = 4.4.0-141-generic #167
  eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:

  "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"

  driver: bnxt_en_bpo
  version: 1.8.1
  source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other failures
  on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver
  pulled in to support newer Broadcom HW (specific boards),
  while the bnxt_en module continues to support the older
  HW. The current Linux upstream driver does not compile
  easily with the 4.4 kernel (too many changes).

* This fix, already present in the upstream bnxt_en driver, is a
  likely solution:
   "bnxt_en: Fix TX timeout during netpoll"
   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

  This fix has not been applied to the bnxt_en_bpo driver
  version, but review of the code indicates that it is
  susceptible to the bug, and the fix would be reasonable.
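
Since there is no reproducer, spotting the event in collected logs is the
main way to tell whether it has recurred. A minimal sketch in C, assuming
the watchdog line always has the shape quoted from the kernel log above
("NETDEV WATCHDOG: <ifname> (<driver>): transmit queue <n> timed out").
The struct and the function parse_watchdog_line() are hypothetical names,
not part of the driver or of any kernel tool:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one watchdog TX timeout found in a log line. */
struct watchdog_event {
    char ifname[32];   /* interface name, e.g. "eno2d1" */
    char driver[32];   /* driver name, e.g. "bnxt_en_bpo" */
    int  queue;        /* index of the transmit queue that timed out */
};

/* Returns 1 and fills *ev if the line carries a NETDEV WATCHDOG message
 * with an interface, a driver and a queue number; returns 0 otherwise.
 * Any syslog prefix before the marker is skipped. The trailing
 * "timed out" text is not verified, because sscanf() only counts
 * converted fields. */
int parse_watchdog_line(const char *line, struct watchdog_event *ev)
{
    static const char marker[] = "NETDEV WATCHDOG: ";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return 0;
    p += strlen(marker);

    return sscanf(p, "%31s (%31[^)]): transmit queue %d",
                  ev->ifname, ev->driver, &ev->queue) == 3;
}
```

Checking the driver field against "bnxt_en_bpo" would separate hits on
the imported driver from timeouts reported by other NICs on the host.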

[Test Case]

* Unfortunately, this is not easy to reproduce. Also, it is only seen
on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
driver.

[Regression Potential]

* The patch is restricted to the bpo driver, with very constrained
scope - just the newest Broadcom NICs being used by the Xenial 4.4
kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
in-tree fixed driver).

* The patch is very small, and the backport is minimal and simple.

* The fix has been running in the in-tree driver in upstream mainline
as well as in the Ubuntu Linux in-tree driver. Although the Broadcom
driver has a lot of lower-level code that is different, this piece is
still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions
