Sorry, this is really long :P
I’m just adding some observations…not necessarily anything in particular or in any order. Not all tests seem consistent. There is no real conclusion in this, but there are a lot of tests which may be surprising. It shows a few cases where the obvious possibilities turn out not to be the problem.
My TX1 is running a fully updated L4T R28.1.
I am running as root with “sudo -s”.
Some of my testing leads to questions I can’t answer. To start with the most basic test possible, I ran a flood ping (“sudo ping -f wherever”) for 30 seconds while jetson_clocks.sh was at full speed…first from host to Jetson, then from Jetson to host. Both resulted in no loss and right around 50000 packets. This does not involve TCP or UDP (it’s ICMP) and is much closer to a test of the physical layer (and ARP) working correctly (if this weren’t correct, TCP and UDP would both inherit a faulty environment). No error, drop, overrun, collision, etc., ever occurs from flood ping. This tends to place any issue in the higher level protocol stacks (hardware drivers work at lower levels on CPU0, software drivers implement stacks on any CPU core…stacks are limited by the throughput of data feeding them or being consumed).
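For reference, the pattern was something like this (the address is an example from my setup; “-w 30” is iputils ping’s deadline option, stopping the run after 30 seconds):
# ping -f -w 30 192.168.2.30
…run once from the host toward the Jetson, then again from the Jetson toward the host’s address.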
I see that the “-b 1G” argument to iperf3 is not actually listed as valid in the man page, but “iperf3 --help” does show it. I tried iperf3 with “-b 1X” just to see if it showed an error, and it does not (I consider it a bug that an invalid argument is not an error). This calls into question whether the 1G bitrate is really behaving as expected. I don’t have a network analyzer so I couldn’t say. Probably 1G is supported…but then again, perhaps it is supported only on arm64 or only on x86_64. I don’t know.
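To reproduce the non-error, something like this runs without any complaint about the bogus suffix (a server has to be listening somewhere…127.0.0.1 with a local “iperf3 -s” works):
# iperf3 -c 127.0.0.1 -b 1X -t 5
…what rate it actually applies in that case I can’t say.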
So I did something not yet done in order to isolate where the limitations come from: I ran iperf3 as both server and client on the TX1 (which also guarantees the “-b 1G” behavior is identical for client and server mode…no arm64/x86_64 difference is possible), using the localhost address 127.0.0.1. This avoids going through the Realtek driver and hardware, but still uses the protocol stacks (keep in mind ping doesn’t care much about protocol stacks, while UDP and TCP do). I had a throughput of approximately 999 Mbits/sec with no retries (this also suggests “-b 1G” works as expected). The loopback interface can bypass CPU0 since no hardware drivers are involved. I’d say the protocol stacks (iperf3 uses both UDP and TCP) and the purely software side are at full performance (at least when limited to 1G speed…cutting out hardware reaches the theoretical maximum). I tend to favor saying there is an issue with either the Realtek driver or with the time the Realtek driver has available to run (the implied limitation is the latency before a hardware IRQ begins service, or the time used during service of the IRQ). It’s hard to know without profiling, and I have no way to hardware profile.
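The loopback run was just the same pair of commands as shown further down, pointed at the TX1 itself, roughly:
# iperf3 -s -p 12345
# iperf3 -c 127.0.0.1 -p 12345 -t 60 -i 10 -b 1G
…with the server in one terminal (or backgrounded) and the client in another.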
I then ran a flood ping on the TX1 to 127.0.0.1 for 30 seconds (a ping going to itself without touching the NIC). Approximately 900000 packets were serviced without loss. The network software, when not going through the network hardware, is about 18 times faster.
Next I ran a 30 second flood ping to the address of the local NIC on the TX1 (both send and receive serviced by the same NIC and driver…for me this is 192.168.2.30). I actually got about 910000 packets…slightly better throughput…and this was without jetson_clocks.sh. With jetson_clocks.sh the throughput did not seem to change. Unless the network software is doing something smart and not actually routing through the NIC hardware (and as far as I know Linux is that smart…locally addressed traffic goes through the internal “local” routing table rather than out the wire), this also implies that the driver, when running, does what it should when not talking to the protocol stacks (I don’t consider the work of ICMP significant enough to compare to a TCP stack). Perhaps it is the throughput between the Realtek driver and the protocol stack which is bottlenecked.
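Those two local runs were just:
# ping -f -w 30 127.0.0.1
# ping -f -w 30 192.168.2.30
…the second address being whatever “ifconfig eth0” reports on your unit.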
Each of the following are between x86_64 host and TX1:
Here are some client/server side commands used (I reverse which side’s address is involved if I reverse roles):
sudo iperf3 -c 192.168.2.2 -p 12345 -t 60 -i 10 -b 1G -R
sudo iperf3 -s -p 12345
No jetson_clocks.sh, server on host:
I see no retries and roughly 492 Mbits/sec.
With jetson_clocks.sh, server on host:
I see no retries and roughly 492 Mbits/sec.
Implies: jetson_clocks.sh makes no difference on speed, and no retries needed either way.
No jetson_clocks.sh, server on TX1:
I see lots of retries, and roughly 639 Mbits/sec.
With jetson_clocks.sh, server on TX1:
I see lots of retries, and roughly 652 Mbits/sec.
Implies: Marginal throughput improvement. Retries did not significantly change.
ifconfig errors:
When testing is done I see no errors, drops, overruns, etc., on the TX1 side. I see a very large number of dropped RX packets on the host, but no outright errors.
Note that a dropped packet is correct behavior for UDP during congestion, or just from sending faster than the packets can be consumed (this isn’t a software error per se, but it is a weak link in the chain if something is bottlenecking). TCP can also have dropped packets, but it retries. That doesn’t mean nothing is wrong, but it does mean that, within its abilities, the network behaves as it should if the retries were a case of congestion. iperf3 is essentially trying to congest the network and measure that congestion.
I rebooted the TX1 and re-ran both client and server sides while monitoring ifconfig. No jetson_clocks.sh was used. I got no drops on the TX1.
I used jetson_clocks.sh on the TX1. I re-ran (no reboot) both client and server side again on the TX1. Still, the TX1 does not show any drops. Apparently it is only the host side which is seeing RX drops.
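For the monitoring itself nothing fancy was needed…compare the counters before and after each run, or watch them live:
# ifconfig eth0
# watch -n 1 ifconfig eth0
…the RX/TX lines show the errors/dropped/overruns counts directly.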
To bring things together with something from the real world I decided to copy data over the network via netcat. Netcat simply sends as fast as it can and receives as fast as it can. I didn’t want disk read or write speed to be the limit, so I’m using other sources and destinations (the receive side discards into “/dev/null”, and I verify the read side’s rate first).
If you run this command it will read the rootfs partition and redirect it to “/dev/null” and show a time measurement:
# time dd if=/dev/mmcblk0p1 bs=512 > /dev/null
29859840+0 records in
29859840+0 records out
15288238080 bytes (15 GB, 14 GiB) copied, 70.6154 s, 216 MB/s
real 1m10.619s
user 0m6.728s
sys 0m28.024s
…216 MB/s (1728 Mbit/s). This exceeds gigabit. The important thing to know is that this partition contains 15288238080 bytes.
To simplify this:
# time cat /dev/mmcblk0p1 > /dev/null
real 1m9.582s
user 0m0.024s
sys 0m8.576s
…cutting out dd shows 219715416 bytes/s, or about 220 MB/s (roughly 1758 Mbit/s…also exceeding gigabit). So we know that however we read the raw mmcblk0p1 we get enough throughput to exceed gigabit.
To use netcat to read from port 12345 I do this (it saves into “/dev/null”…in other words, it just discards the bytes):
nc -p 12345 -l > /dev/null
…restart this after each send completes.
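If restarting it by hand gets old, a loop works too (a sketch, using the same traditional netcat syntax):
# while true; do nc -p 12345 -l > /dev/null; done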
To send mmcblk0p1 over port 12345 without touching the Realtek NIC I use:
# time nc -q 0 127.0.0.1 12345 < /dev/mmcblk0p1
real 1m10.492s
user 0m3.196s
sys 0m38.904s
…this is only one second longer than without netcat. Everything associated with networking, when purely in software, is quite good.
Now let’s do this again using the NIC’s address (192.168.2.30 is the NIC for me):
# time nc -q 0 192.168.2.30 12345 < /dev/mmcblk0p1
real 1m10.976s
user 0m2.960s
sys 0m39.224s
…this appears to add almost no overhead when running through the NIC’s address. Once again though, I do not know for certain whether the kernel is optimizing when it knows the traffic is local (see the note above about the “local” routing table…the Realtek hardware itself was probably never exercised here).
So I’ll do this between host and TX1 where I send from TX1 to host (adjust addresses and where the listener runs as required):
# time nc -q 0 192.168.2.2 12345 < /dev/mmcblk0p1
real 3m16.854s
user 0m2.912s
sys 0m47.676s
…clearly, talking to the outside world has a dramatic penalty even when the two are directly connected on the same switch. The actual throughput here is approximately 77684136 bytes/s (around 78 MB/s, or 621 Mbit/s).
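For completeness, the listener on the host is the same idea, though the exact syntax depends on which netcat flavor is installed:
# nc -l -p 12345 > /dev/null
# nc -l 12345 > /dev/null
(traditional netcat wants “-p” together with “-l”; the BSD/nmap flavors take the bare port).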
It happens that there is another reason why I used mmcblk0p1: my host already has the same bytes in the file system.img.raw. So I can copy the same number of bytes back to the Jetson in the reverse direction. Keep in mind that the first time you read a file on a system with lots of RAM the file may be cached, and a second read would be faster. Regardless, the rate with or without cache will far exceed gigabit, so it should be a good repeatable test.
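If you want to rule the page cache out entirely between runs, the kernel has a knob for that (run as root on the sending side):
# sync; echo 3 > /proc/sys/vm/drop_caches
…though, as noted, the read rate exceeds gigabit either way, so caching shouldn’t change the conclusion.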
So I run the listener on the Jetson this time, and send system.img.raw to the Jetson (I don’t use “-q 0” on the host because its Fedora netcat doesn’t have that option):
# time nc 192.168.2.30 12345 < system.img.raw
real 4m7.859s
user 0m9.581s
sys 0m52.392s
…clearly the TX1 receives slower than it sends when a remote host is involved (15288238080 bytes in 247.859 s works out to roughly 61.7 MB/s, or about 493 Mbit/s). The loss of throughput is real. The problem is that when doing all of this directly on the Jetson the same loss of throughput is not seen. So the problem isn’t the Realtek driver by itself, nor the TCP stack by itself, nor how the driver is running. Something else is getting in the way…perhaps an interaction between two parts of the software which does not show up when testing each part individually. As one example, ARP and other negotiations go on with a remote host which do not go on against localhost or the local NIC’s address.
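If ARP really were part of it, that should be visible in the neighbor cache on each end:
# arp -n
# ip neigh show
…a single stable entry for the peer would suggest ARP resolves once and isn’t being constantly renegotiated.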
In no case did the ifconfig on eth0 of the Jetson ever show any drops or errors of any kind. I suspect the previously seen drops were from UDP. So now I’ll force UDP.
Listening on the TX1:
# nc -p 12345 -l -u > /dev/null
Sending on the host to the TX1:
# time nc -u 192.168.2.30 12345 < system.img.raw
real 3m44.611s
user 0m8.875s
sys 0m49.094s
…this works out to about 68.1 MB/s, or roughly 545 Mbit/s. Sending from host to Jetson is slower than the other direction, but it isn’t as dramatic as what shows up under iperf3.
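UDP losses won’t show up as retries anywhere, so after a run like this the receiver’s UDP statistics are the place to look:
# netstat -su
…the “packet receive errors” and “receive buffer errors” counts under Udp would indicate the TX1 dropping datagrams it couldn’t consume fast enough (exact wording varies with the net-tools version).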
I believe someone needs to throw a network analyzer between the outside host and the Jetson and run either netcat or iperf3 to see where the inefficiencies are. It gets too complicated without this, and there is no clear single cause. Perhaps it is something simple like MTU/MRU behavior, or an interaction between two things which is only an issue when they occur simultaneously, yet not an issue one at a time.
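Short of real analyzer hardware, a software capture on both ends might at least show retransmission patterns or MTU/fragmentation behavior (tcpdump should be available or installable on both sides):
# tcpdump -i eth0 -w /tmp/capture.pcap host 192.168.2.2
…then look at the capture in wireshark for retransmits, window sizes, and packet sizes. It isn’t as trustworthy as a hardware tap, since the capture competes for the same CPU, but it is a start.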