Jetson TK1 USB 3.0 throughput

I have an HDMI-to-USB 3.0 frame capture device that can generate 1080p60 video in the YUY2 colorspace. I wrote a simple application that dequeues a buffer from the v4l2 driver and then just returns it to the driver. The highest frame rate I measured was about 45 fps, which I calculate to be about 1.5 Gbps of throughput. Has anyone measured the USB 3.0 throughput on the Jetson TK1? I have enabled USB 3.0 on the board, disabled USB auto-suspend, and cranked the CPU scaling up to max. I noticed the keyboard and mouse become quite unresponsive during the test, but top didn’t report excessive CPU utilization for the application or kernel threads. Perhaps the tegra-xhci driver needs some work, or maybe the USB chipset firmware does, or I’m just missing something? I have confirmed that the frame capture device outputs full 1080p60 YUY2 video with the same sample application on an Intel NUC.

Here’s a thread which may be related to what you’re researching:
[url]https://devtalk.nvidia.com/default/topic/811034/?comment=4475113[/url]

The gist is that xhci seems capable of running at full speed, or at least of carrying RGB color 1920x1080 @ 60 Hz without dropping frames. Assuming the actual cabling and hubs are wired to take advantage of USB3, and given what you’ve already done to keep performance settings high, I’d start looking at other components or memory copy bottlenecks.

One of the first things to look at would be the details of any USB hub you use. I really doubt any current USB hub lacks a transaction translator (TT) on each port, but if yours does, that could account for other USB peripherals (such as the keyboard) becoming sluggish. What is the exact model of the USB hub (or its specifications)? Are you sure all cables are correct for USB3? Which Jetson port is it connected to? Does lsusb -t show any lines ending in “5000M”?

There may also be issues with how memory is being copied, especially if it is copied between kernel space and user space. The amount of bandwidth involved with USB3 approaches the limits of a PCIe data lane for just a single connection. There is a good reason why most USB3 hubs have only two ports…and if they have more than two ports, they tend to daisy chain them, with performance penalties if more than one device on a hub approaches any kind of USB3 speed. Whether your memory copy uses DMA or the CPU would make a big difference. The latter is something I’d be suspicious of when other parts of the system (such as the keyboard) become sluggish during the operation.

Thanks for the reply linuxdev. I had seen that link and I didn’t understand what xlz meant by the term “depth stream”, but he mentioned a peak throughput of ~170 MBps which is actually a little less than my measurement (~190 MBps). My capture device is providing uncompressed 1080p60 video at 16 bpp, which actually requires about 250 MBps throughput - still well under USB 3.0 limits.
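
For anyone who wants to check the numbers, here is a rough sketch of the arithmetic. It assumes YUY2 packs 16 bits per pixel (as stated above) and ignores USB protocol and UVC header overhead; the function name is just something made up for illustration.

```python
def video_throughput(width, height, bits_per_pixel, fps):
    """Return (bytes_per_second, bits_per_second) for uncompressed video.

    Payload only: no USB protocol or UVC header overhead is included.
    """
    bits_per_second = width * height * bits_per_pixel * fps
    return bits_per_second // 8, bits_per_second


# 45 fps is the rate measured on the Jetson, 60 fps is what the device can produce.
for fps in (45, 60):
    byte_rate, bit_rate = video_throughput(1920, 1080, 16, fps)
    print(f"{fps} fps: {byte_rate / 1e6:.1f} MB/s, {bit_rate / 1e9:.2f} Gbps")
```

At 45 fps this works out to roughly 187 MB/s (about 1.5 Gbps), and at 60 fps to roughly 249 MB/s (about 2 Gbps), in line with the figures quoted above.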

As to my USB connections, I am not using a hub between the capture device and the Jetson. The mouse and keyboard are plugged into a hub connected to the micro USB 2.0 port, and I have a direct USB 3.0 cable to the USB 3.0 port. I tried two different cables, then tested on the NUC with no issues, which verifies that the device and cables are capable of providing the proper throughput.

lsusb -t:
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=tegra-xhci/2p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Video, Driver=uvcvideo, 5000M
    |__ Port 1: Dev 2, If 1, Class=Video, Driver=uvcvideo, 5000M
    |__ Port 1: Dev 2, If 2, Class=Audio, Driver=snd-usb-audio, 5000M
    |__ Port 1: Dev 2, If 3, Class=Audio, Driver=snd-usb-audio, 5000M
    |__ Port 1: Dev 2, If 4, Class=Human Interface Device, Driver=, 5000M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=tegra-xhci/6p, 480M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=tegra-ehci/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 3: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
        |__ Port 4: Dev 4, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
        |__ Port 4: Dev 4, If 1, Class=Human Interface Device, Driver=usbhid, 1.5M
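
If it helps anyone repeating this check, below is a small sketch that picks the SuperSpeed (“5000M”) entries out of lsusb -t text like the listing above. The regular expression is only an assumption based on the lines shown here; other lsusb versions may format their output differently. The function names and the shortened sample are mine, for illustration only.

```python
import re

# Pattern inferred from the listing above; an assumption, not a formal grammar.
LINE_RE = re.compile(
    r"(?:Bus (?P<bus>\d+)\.)?Port (?P<port>\d+): Dev (?P<dev>\d+),"
    r"(?: If (?P<intf>\d+),)? Class=(?P<cls>[^,]+),"
    r" Driver=(?P<driver>[^,]*), (?P<speed>[\d.]+M)"
)


def parse_lsusb_tree(text):
    """Parse `lsusb -t` text into a list of dicts, one per line that matches."""
    entries = []
    bus = None
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        if m.group("bus") is not None:
            bus = int(m.group("bus"))  # root hub line starts a new bus
        entries.append({
            "bus": bus,
            "port": int(m.group("port")),
            "dev": int(m.group("dev")),
            "class": m.group("cls"),
            "driver": m.group("driver"),
            "speed": m.group("speed"),
        })
    return entries


def superspeed_devices(entries):
    """Return non-root-hub entries that negotiated 5000M (USB 3.0 SuperSpeed)."""
    return [e for e in entries
            if e["speed"] == "5000M" and e["class"] != "root_hub"]


# Shortened sample taken from the listing above.
SAMPLE = """\
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=tegra-xhci/2p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Video, Driver=uvcvideo, 5000M
    |__ Port 1: Dev 2, If 2, Class=Audio, Driver=snd-usb-audio, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=tegra-ehci/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 3: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
"""

for entry in superspeed_devices(parse_lsusb_tree(SAMPLE)):
    print(entry["bus"], entry["dev"], entry["driver"], entry["speed"])
```

The parser ignores indentation, so it does not reconstruct the hub hierarchy; it only answers whether a given interface enumerated at 5000M.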

For fun I disconnected the mouse/keyboard hub after starting the test, but I saw no difference in throughput. My test app uses mmap’d buffers, with no memory copies. It’s a very simple v4l2 buffer loopback, which I could post if anyone were interested. I’d be curious to know if xlz saw similar sluggishness on the mouse and keyboard when pushing ~170 MBps across the USB 3.0 port. Have you or anyone else tested the throughput on the USB 3.0 port? I think we have two measurements now: mine at 190 MBps and xlz’s at 170 MBps.

I currently have nothing which is true USB3, especially nothing which challenges the speed. Something like an external USB3 hard drive enclosure would only help if the drives were insanely fast via RAID. Even an external USB3 gigabit network card would not challenge a USB3 port…the only use would be to go fast enough to verify that USB2 max speed has been exceeded. Any information I have is from third party reports.

One topic which has come up before in relation to peripherals and latency/sluggishness is IRQ starvation. Any hardware device must be serviced through a hardware IRQ to the first CPU, which means that even if there isn’t a lot of CPU load, a flood of IRQs would cause delays as each device competes for time on CPU0. I have no way of knowing if this is an issue in your situation, but if your memory operations cause rapid, repeated interrupts for any hardware driver (USB and your camera included), there could be a slowdown which is not truly a hardware throughput bottleneck. In such cases, if you adjust by handling larger blocks of data less often (versus lots of smaller blocks more often), performance can go up dramatically (or at least peripherals like the keyboard would suddenly appear fluid again).
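
To put rough numbers on the “larger blocks less often” idea, here is a back-of-the-envelope sketch. It assumes one interrupt per completed transfer, which may not match how the actual driver batches completions, and the two block sizes are hypothetical values picked only to show the trend.

```python
def interrupts_per_second(bytes_per_second, bytes_per_interrupt):
    """Interrupt rate if every completed block of data raises one interrupt."""
    return bytes_per_second / bytes_per_interrupt


# 1080p60 at 16 bits per pixel, payload only.
stream_rate = 1920 * 1080 * 2 * 60

# Hypothetical block sizes: 16 KiB versus 1 MiB per interrupt.
for block_size in (16 * 1024, 1024 * 1024):
    rate = interrupts_per_second(stream_rate, block_size)
    print(f"{block_size:>8} bytes per interrupt -> {rate:,.0f} interrupts/s")
```

Under that assumption, moving from small blocks to large ones cuts the interrupt rate by the same factor as the block size grows, which is why it can relieve pressure on CPU0 without changing the raw throughput.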

@gafinn - In working with the Kinect V2, I encountered several variables with regard to USB 3.0. First, the Kinect produces three streams: Depth, IR and Color. The Depth and IR streams are 512x424, and Color is 1920x1080. Together, they seem to reserve half the USB bandwidth of a 3.0 device.

USB devices allocate a certain amount of bandwidth when they’re attached. With the keyboard and mouse, that’s a small number, but if there are any devices on the hub hungrier than your capture device, it could be an issue. Typically a USB device reserves the maximum bandwidth that it can use; i.e. a camera will reserve bandwidth for its highest resolution and frame rate even if it’s running at a lower resolution and frame rate.

On the Jetson, I believe that the requests for the USB ports are all handled through IRQs that are on core 0. That is, the IRQs are not split over the different CPU cores, they are only handled by one core. Therefore, if you have some computationally intense things going on and are not utilizing more than one core, that can slow down your USB processing.

Because the Jetson uses the 3.10.40 kernel, a lot of the fixes put into place in later kernels (3.16+) for USB 3.0 are not in place yet. This is readily apparent in the way that the isochronous max packet size works for 3.0 endpoints. I know that on the Kinect, you have to jigger libusb to set the max iso packet size so that the Jetson doesn’t lay down and choke on the default.

No answers, just observations.

Did you also force all cores online? Try also maximising GPU and especially the EMC clocks.

@Kangalow - I checked the reserved bandwidth with ‘lsusb -v’ and it appears sufficient for the two 1080p video descriptors with values for dwMaxBitRate at 1990656000 and 2211840000. I also checked the interrupt counts under /proc/interrupts and confirmed that tegra-xhci interrupts were only on CPU0, as you were suggesting. They didn’t seem excessive at about 180 interrupts per second for the tegra-xhci:usb2 listing. There isn’t anything else taxing the processor other than standard Ubuntu tasks. Thanks for the ideas.
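
In case it is useful to anyone repeating the /proc/interrupts check, this is roughly how a per-CPU interrupt rate can be worked out from two snapshots of that file. It’s only a sketch: it assumes the usual layout (a header row of CPU names, then one row per IRQ with a count per CPU followed by a description), and the IRQ numbers and counts in the sample are made up, not copied from my board.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: (per_cpu_counts, description)}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # some rows (e.g. summary rows) have fewer counters
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def irq_rates(before, after, seconds, name):
    """Per-CPU interrupts/s for every IRQ whose description contains `name`."""
    rates = {}
    for label, (new_counts, description) in after.items():
        if name not in description or label not in before:
            continue
        old_counts = before[label][0]
        rates[label] = [(n - o) / seconds for n, o in zip(new_counts, old_counts)]
    return rates


# Made-up snapshots taken 5 seconds apart (illustrative values only).
BEFORE = """\
           CPU0       CPU1       CPU2       CPU3
 81:       1000          0          0          0       GIC  tegra-xhci:usb2
 99:        500        400        300        200       GIC  some-other-device
"""
AFTER = """\
           CPU0       CPU1       CPU2       CPU3
 81:       1900          0          0          0       GIC  tegra-xhci:usb2
 99:        510        410        310        210       GIC  some-other-device
"""

print(irq_rates(parse_interrupts(BEFORE), parse_interrupts(AFTER), 5.0, "tegra-xhci"))
```

On the board itself, the two snapshots would come from reading /proc/interrupts twice with a known delay in between. Non-zero rates only in the first column would match what I described above: everything landing on CPU0.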

@kulve - All cores are online. I bumped up the gpu rate with this:

echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state

But no difference in measured USB throughput. I wasn’t sure how to bump up the EMC so I went under /sys/kernel/debug/clock and maxed out all rates related to usb (including one named xusb.emc). Again, no difference.

It might be time for me to start digging through the kernel code some more. I’ve already removed the xhci-fw-log kernel thread which was consuming 10-20% of the processor during my testing. Removing the thread did gain me about 1-2 fps. I think I’ll look at the clocks next and see if I can play with them in the code at all. My real concern is if there is an issue in the USB chipset firmware then I’m stuck, unless anyone knows if the firmware code is available for us users to hack up.

Don’t forget that this IRQ count is cumulative across all hardware IRQ generation. The total available by default is, I believe, 1000/s (1000 Hz polling). So the sum of camera IRQs, USB IRQs, disk controller IRQs, NIC controller IRQs, and so on must be sufficiently sparse that no two drivers are colliding. Some drivers with buffers, such as a disk drive’s, won’t be hurt much by service latency (how much buffer does your camera have? How much buffer does your camera kernel driver have?). Once an IRQ is issued, the driver won’t let go for another device’s IRQ until it is done…so if it takes 100 ms to service the IRQ, you just lost 10% of your IRQ time. Realize that every time your USB communicates and uses an IRQ, the device driver for whatever is connected to USB could require more time before releasing CPU0. How long do those IRQs take to service?

I tested an IDS UEYE USB3 4MP machine vision camera (not USB Vision, though) and got 90 fps on the TK1 (360 MB/s or 2.88 Gb/s). I didn’t have any other USB device on the board.

Thanks for the data point rdong. It’s good to know that the hardware is capable of higher rates than what I’m currently seeing.

What version of the Linux kernel are you using? Is it the standard 3.10.40 that’s part of L4T 21.3?

I noticed on the IDS support site that they provide a custom driver for their USB3 cameras. Did you use their custom driver or the standard V4L2 driver provided by the kernel?

I’m still on 3.10.24-grinch-19.3.6 (I need to update it). I was using IDS’ custom driver and I’ll test again when the USB3 Vision cameras are out.

@gafinn I did have a similar issue with IRQ starvation, except mine wasn’t vision/USB 3.0 related. I was experiencing something similar over a usb-serial converter which I had hooked up to Roboteq motor controllers, communicating with it over a serial port driver I wrote. Just thought I’d throw my 2 cents into the thread. I’m not sure where to start in figuring out this problem of all IRQs being on CPU0, but I’d be glad to try and help with some kernel work, although, like I said, I haven’t done much of that up till now.