[Pegasus + DRIVE OS v5.1.0.0-13431798] Network interface disabled during massive local data transfers

Hi All,
I ran massive data transfers locally (via 10.42.0.28/29 or an external switch, e.g. 192.168.x.x) between TegraA and TegraB, and all network interfaces randomly became disabled and could not be recovered (i.e., a reboot was needed).


[ 23.336663] eqos 2490000.ether_qos eth0: Link is Up - 1Gbps/Full - flow control off
[ 23.336728] br0: port 1(eth0) entered blocking state
[ 23.336736] br0: port 1(eth0) entered forwarding state
[ 23.337075] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready

[ 25.021042] br0: port 2(hv0) entered blocking state
[ 25.021048] br0: port 2(hv0) entered disabled state
[ 25.072508] device hv0 entered promiscuous mode
[ 25.074171] br0: port 2(hv0) entered blocking state
[ 25.074179] br0: port 2(hv0) entered forwarding state
[ 25.135601] br0: port 3(hv1) entered blocking state
[ 25.135608] br0: port 3(hv1) entered disabled state
[ 25.189785] device hv1 entered promiscuous mode
[ 25.191737] br0: port 3(hv1) entered blocking state
[ 25.191742] br0: port 3(hv1) entered forwarding state

[ 738.735008] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0
[ 738.737992] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0
[ 738.741253] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0
network_crash.dmesg.log (77.3 KB)
network_crash.syslog.log (983 KB)

@SteveNV and @SivaRamaKrishna,

Because this network disconnection blocks our development using TegraA and TegraB simultaneously, I have created nv-online issue #2591049 for this sighting. I will try to find more test cases to reproduce the issue so it is easier to understand. Sorry for the inconvenience!

Thanks!

Gary

Dear Garywang,

Thank you for filing a bug.
We will check it and update via the bug. Thanks.

Hi garywang, could you provide the steps to reproduce this issue? Thanks!

Hi,

I am facing the exact same symptom with Xavier.
Is there any progress on it?
Thanks

Dear yutaka.takaoka,

Could you please let us know the info below and provide the steps to reproduce this issue? Thanks.

Drive Software version :
Aurix FW version :

Repro step :

I got the same problem on Xavier, and Google brought me here. Indeed, the problem happens under heavy network load. I don’t think it will be easy to reproduce without mirroring the whole setup (a 3D scanner with more than a hundred network cameras). Does anyone have an idea for a fix?

Dear atanas,
Are you able to reproduce this issue? Could you share the steps?

Yes, the problem is reproducible on my setup, but it takes too long, and I assume it is triggered by our usage pattern, i.e. it is difficult to share any particular steps you could follow to reproduce it.

It seems the problem is a bug in the Ethernet driver: eqos_start_xmit somehow freezes the network.
Our application also sometimes consumes too much memory, which could be a trigger as well.

Fortunately:

  1. We have a workaround: we added a PCI network adapter with 4x 1-gigabit ports and will not use the built-in interface for our project until the issue is properly fixed.

  2. I’m able to debug this myself, or to help you debug the issue on our setup.

  3. The L4T kernel was built with CONFIG_DYNAMIC_DEBUG, which lets me capture additional debug messages in the kernel log that could help.
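For reference, this is roughly how the extra driver messages can be turned on via the dynamic-debug interface. A hedged sketch: it needs root, CONFIG_DYNAMIC_DEBUG=y, and debugfs mounted, and the `*eqos*` file glob is an assumption based on the driver's source file names, so adjust it to your kernel tree.

```shell
# Enable pr_debug() output for the eqos driver source files (the "*eqos*"
# glob is an assumption; check your kernel tree for the exact file names).
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
echo 'file *eqos* +p' > /sys/kernel/debug/dynamic_debug/control
# then watch the extra driver messages while applying load:
#   dmesg -w | grep -i eqos
```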

I didn’t have time to dig really deep yet. So far I know that the code that prints the

"eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0"

message seems correct: it detects that the ring buffer is full and raises the error. The problem is probably somewhere in how the error is handled afterwards. I would like to print a stack trace when it happens, but I don’t have the time to build a custom kernel right now. Something that could shed more light is this log I just got on my Xavier box:

...
64 bytes from 8.8.8.8: icmp_seq=249 ttl=45 time=25.5 ms
64 bytes from 8.8.8.8: icmp_seq=250 ttl=45 time=29.1 ms
64 bytes from 8.8.8.8: icmp_seq=251 ttl=45 time=25.7 ms
64 bytes from 8.8.8.8: icmp_seq=252 ttl=45 time=25.7 ms
64 bytes from 8.8.8.8: icmp_seq=253 ttl=45 time=25.6 ms
[73382.985026] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0

ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
...
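The repeated "No buffer space available" errors are sendmsg() failing with ENOBUFS once the TX ring stays full and the queues behind it back up; the errno-to-string mapping can be confirmed from userspace:

```shell
# "No buffer space available" is strerror(ENOBUFS), which sendmsg() returns
# when the socket/qdisc buffers fill up behind a stalled TX ring.
python3 -c 'import errno, os; print(errno.ENOBUFS, os.strerror(errno.ENOBUFS))'
```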

Dear atanas,
Could you share the Drive OS version?

We don’t use NVIDIA Drive. Our product is a 3D scanner for the fashion industry ( www.treedys.com ), and we are releasing a new version of the scanner control hardware based on Jetson Xavier. So far this is the only issue we haven’t solved; since we have a workaround the priority is not high for us, but it would be good to solve it so we can simplify the system and use the built-in Xavier Ethernet interface.

Hope that this will give you enough information about the setup:

nvidia@scanner:~$ uname -a
Linux scanner 4.9.140-tegra #2 SMP PREEMPT Thu Sep 5 13:12:33 CEST 2019 aarch64 aarch64 aarch64 GNU/Linux

The system was set up with SDK Manager:

Currently installed version: 0.9.14 (rev.1) - 4961
JetPack 4.2.2 (rev.1)

The issue is easily reproducible if you install a web server and try to download several 1 GB zip files from it at the same time, i.e. it happens when the interface sends data, not when it receives. In our use case we collect hundreds of 4K pictures from IP cameras connected via an additional PCIe Ethernet board with 4x 1000 Mbit ports and then download them as one huge .zip file (without compression, just stored in one file), and the issue happens nearly every time.
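A scaled-down sketch of that reproduction pattern: serve a file over HTTP and fetch several copies of it concurrently, so the interface is busy sending. On the affected boards the same pattern with multi-GB zip files and a remote client is what triggered the stall; the file size, port range, and paths here are placeholders, and the loopback transfer obviously won't exercise the eqos hardware path by itself.

```shell
# Generate a payload, serve it over HTTP, and download four copies in
# parallel (stand-in for the parallel multi-GB zip fetches described above).
set -e
workdir=$(mktemp -d)
port=$((20000 + $$ % 10000))

# 1 MiB stand-in for the multi-GB zip files
dd if=/dev/urandom of="$workdir/blob.bin" bs=1024 count=1024 2>/dev/null

# throwaway HTTP server rooted in the work directory
( cd "$workdir" && exec python3 -m http.server "$port" ) >/dev/null 2>&1 &
srv=$!
sleep 1

# four concurrent downloads; collect only the downloader PIDs to wait on
pids=""
for i in 1 2 3 4; do
    python3 -c "import urllib.request as u; u.urlretrieve('http://127.0.0.1:$port/blob.bin', '$workdir/out.$i')" &
    pids="$pids $!"
done
wait $pids
kill "$srv" 2>/dev/null || true

cmp "$workdir/blob.bin" "$workdir/out.1" && echo "transfers completed intact"
```

Run the client side from a second machine (pointing at the board's real IP instead of 127.0.0.1) to actually push traffic through the built-in interface.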

From a brief source-code review of the ways the ring buffer can fill up, it seems the error is correctly raised; then somewhere in the upper layers something goes wrong and freezes the ability to send data on that interface. Bringing the interface down and up again unfreezes it without problem.
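Since a down/up cycle unfreezes the interface, a crude watchdog can serve as a stopgap until the driver is fixed. A hedged sketch, assuming the stall always leaves the "TX ring full" signature in the kernel log; the interface name and polling interval are placeholders, and the live loop is guarded behind WATCHDOG_LIVE=1 so the snippet is safe to source as-is:

```shell
#!/bin/sh
# Stopgap watchdog: when the eqos TX-ring-full signature shows up in dmesg,
# bounce the interface, which the thread reports unfreezes it.
IFACE="${IFACE:-eth0}"   # assumption: built-in interface is eth0

needs_bounce() {
    # succeeds if the stall signature is present on stdin
    grep -q "eqos_start_xmit(): TX ring full"
}

bounce_iface() {
    ip link set "$IFACE" down
    sleep 1
    ip link set "$IFACE" up
}

# Only poll for real when explicitly enabled (needs root).
if [ "${WATCHDOG_LIVE:-0}" = "1" ]; then
    while true; do
        if dmesg | tail -n 50 | needs_bounce; then
            bounce_iface
        fi
        sleep 5
    done
fi
```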

As I said, I’m able to help you debug the issue on my setup if you cannot reproduce it, but I don’t have the time to dig deeply into the network stack to figure out myself why it happens.

Dear atanas,
Could you please post your issue in the Jetson forum? This forum is intended for the Drive platform.

No problem!

Blame Google for pointing me here when I was searching for the symptoms.