TX1: UDP send severely CPU-limited?

Hi

I’m wondering if anyone else has seen this: the board can receive UDP packets at a respectable rate (800 Mbps+), but sending performs poorly. With 1472-byte packets, netperf reports less than 200 Mbps, and the netperf process pegs a CPU at 100%.

$ netperf -4 -l 60 -t UDP_STREAM -H 192.168.7.16 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 () port 0 AF_INET to 192.168.7.16 () port 0 AF_INET : demo
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00      903491      0     177.32
212992           60.00      903491            177.32

If my maths is right, that’s only about 15,000 packets per second (and the throughput scales with packet size, so it seems to be a per-packet-overhead issue rather than a bandwidth issue). My laptop can sustain 40x that. I expect the TX1 to be slower, but not that much slower, and the asymmetry seems particularly odd: in my experience, receive is normally harder than send.
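
For reference, a quick sanity check of those numbers from the netperf output above (plain bc arithmetic, nothing board-specific):

$ echo '903491 / 60' | bc -l                     # ≈ 15058 packets/sec
$ echo '903491 * 1472 * 8 / 60 / 10^6' | bc -l   # ≈ 177.3 Mbit/s, matching netperf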

The CPUs are at max clock using the script provided in the main TX1 thread:

$ for i in 0 1 2 3 ; do     echo "CPU${i}: `cat /sys/devices/system/cpu/cpu$i/cpufreq/scaling_cur_freq`"; done
CPU0: 1912500
CPU1: 1912500
CPU2: 1912500
CPU3: 1912500
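
Since scaling_cur_freq only shows the instantaneous clock, it may be worth confirming the governor as well, so the clocks can’t drop mid-test (standard cpufreq sysfs, so this should apply here):

$ for i in 0 1 2 3 ; do echo "CPU${i}: `cat /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor`"; done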

If I do a TCP test with a large message size, performance is good, presumably because the NIC is doing TCP segmentation offload (TSO).
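
A TCP run along these lines exercises that path, and ethtool can confirm the offload is actually enabled (the 65536-byte message size and the eth0 interface name here are illustrative):

$ netperf -4 -l 60 -t TCP_STREAM -H 192.168.7.16 -- -m 65536
$ ethtool -k eth0 | grep -i segmentation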

It looks like the gigabit on TK1 is via a separate Realtek RTL8111GS (on the PM375 board, not the SoC) using an x1 PCIe lane (see "lspci -s 01:00.0 -v"). The driver is r8169. I’m wondering if the bottleneck is the driver itself…if so, it might have room for optimization. On the other hand, the netperf tool itself might be at issue. You could try launching netperf at increased priority, where a nice value of "-1" should be the sweet spot (see "man nice").
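
Something like this would be the priority experiment (same invocation as above; negative nice values require root):

$ sudo nice -n -1 netperf -4 -l 60 -t UDP_STREAM -H 192.168.7.16 -- -m 1472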

EDIT: One thing I notice is that this chip (or PCIe lane) appears to only operate at 2.5 GT/s, which is Gen. 1 speed…yet the PCIe controller is capable of Gen. 2 speeds (5 GT/s). Technically, gigabit network traffic could not saturate even a Gen. 1 PCIe x1 lane, but each transaction takes roughly twice as long at Gen. 1, so the driver spends more time servicing the bus per operation…the more PCIe activity there is, the more that per-operation handling gets in the way (at Gen. 2 a given PCIe operation would take about half the time versus Gen. 1).
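
To compare the supported versus negotiated link speed (assuming the device is still at bus address 01:00.0):

$ sudo lspci -s 01:00.0 -vv | grep -iE 'LnkCap|LnkSta'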

What is the network topology…does the network traffic involve a router at all, or is it direct between the two machines (other than a switch)?