How to get 1080p h.264 encoding with 60 fps on Jetson TK1

I have an NVIDIA Jetson Tegra K1 with the current kernel, R21.3. I capture 1080p at 60 fps with a Basler USB3 camera and feed this data via appsrc into a GStreamer 1.0 pipeline. The pipeline uses omxh264enc to encode the video as h.264. But it is surprisingly slow: I only get around 40 fps! At higher framerates the system drops frames because the encoder isn’t fast enough.

I thought 60 or at least 50 fps should be possible with the K1, but I haven’t found concrete numbers. Is there something I need to do to reach 60 fps?

And how can I configure the h.264 profile? I always get “profile/level not supported”!
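Would requesting the profile through a capsfilter after the encoder be the right approach? Something like this (untested; “high” is just an example profile name, and I don’t know whether omxh264enc on R21.3 honours downstream profile caps):

gst-launch-1.0 videotestsrc num-buffers=100 ! 'video/x-raw, format=I420, width=1920, height=1080, framerate=60/1' ! omxh264enc ! 'video/x-h264, profile=high' ! matroskamux ! filesink location=profile-test.mkv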

I have tested it with the following dataset (1080p videos):
ftp://vqeg.its.bldrdoc.gov/HDTV/NTIA_source/

I have the Grinch kernel 21.3.4 with maximized CPU performance (Jetson/Performance - eLinux.org)

The maximum I get is 23-25 fps @ 1080p, 55-58 fps @ 720p and 108-148 fps @ 480p.
Example pipeline for 1080p:
gst-launch-1.0 filesrc location=Aspen_8bit.avi ! avidemux ! videoconvert ! progressreport ! omxh264enc ! matroskamux ! filesink location=Aspen264.mkv

To scale video to 720p:
gst-launch-1.0 filesrc location=Aspen_8bit.avi ! avidemux ! videoconvert ! videoscale ! video/x-raw,width=1280,height=720 ! avimux ! filesink location=Aspen720.avi

To get available parameters for the encoder:
gst-inspect-1.0 omxh264enc

With the software encoder (x264enc) you can set the speed preset to 1 (ultrafast), but it will result in bad quality output:
gst-launch-1.0 filesrc location=Aspen_8bit.avi ! avidemux ! videoconvert ! progressreport ! x264enc speed-preset=1 ! matroskamux ! filesink location=Aspen264.mkv

videoconvert and videoscale both run in software only and are thus very, very slow.

What pipeline exactly are you using? Did you check the examples in the guide:

http://developer.download.nvidia.com/embedded/L4T/r21_Release_v3.0/L4T_Jetson_TK1_Multimedia_User_Guide_V2.1.pdf

I capture raw Bayer images from a Basler Ace camera and then convert them with a custom ARM NEON optimized BayerBG-to-I420 algorithm. This I420 image is fed through an appsrc into a pipeline with omxh264enc, a matroskamux and a filesink. At 40 fps this seems to work, but if I try 50 fps it works for some time, then the memory usage increases and I get a stop signal for the feeding. So it seems the encoder is not fast enough. I read somewhere that the Tegra K1 should be able to encode 2x 1080p30. Does this mean I can’t do 1x 1080p60? Do I need two encoders, or can one omxh264enc do 60 fps? Or at least 50 fps!

How many milliseconds does your ARM NEON optimised algorithm take for the BayerBG-to-I420 conversion?

6-7 ms, but this doesn’t seem to be the problem, because if I use a plain copy algorithm (BayerBG as Y, UV filled with 128) it is the same.
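The copy variant is essentially this (simplified sketch without the NEON intrinsics; bayer is the raw 8-bit frame):

#include <string.h>
#include <stdint.h>

/* Baseline "copy" conversion: treat the 8-bit BayerBG data directly as the
 * Y plane and fill U/V with 128 (neutral chroma), giving a grey I420 frame. */
static void bayer_as_grey_i420(const uint8_t *bayer, uint8_t *i420,
                               int width, int height)
{
    size_t y_size = (size_t)width * height;

    memcpy(i420, bayer, y_size);             /* Bayer samples become Y      */
    memset(i420 + y_size, 128, y_size / 2);  /* U and V planes together     */
}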

OK, I have tested the pipeline again. At Full HD with 44 fps the encoder is fast enough, but at 45 fps or more the memory usage increases until the appsrc buffer is full and it sends the enough-data signal.

I have tested a similar pipeline with gst-launch and it seems to work correctly: no memory increase, and the encoding is faster than realtime. With:

gst-launch-1.0 videotestsrc horizontal-speed=-8 num-buffers=3000 ! video/x-raw, format=I420, framerate=50/1, width=1920, height=1080 ! omxh264enc ! fakesink

I get an execution time of 0:00:50.212705629. That is faster than the 0:01:00.0 of realtime. So why doesn’t this work in my application? Is this a problem with the encoder, or with the appsrc element and my pipeline?

Hi,

I’m sorry I don’t have an answer for you about this. But I’m extremely curious how you’re able to encode from an application. I’m also using a Basler USB3.0 camera. Can you show some code snippets on how to perform this encoding using GStreamer?

Kind regards,
Error323

Sorry, my application is too complex to show code here. But the basic concept is to grab the frames with the Basler Pylon SDK, convert each frame to YUV and feed it with an “appsrc” into a GStreamer pipeline to write it out or send it over the network. The Pylon SDK has examples for grabbing frames, and there are also a lot of examples on the net on how to use GStreamer with “appsrc” (e.g. http://gstreamer.freedesktop.org/data/doc/gstreamer/head/manual/html/section-data-spoof.html). A minimal sketch of the concept follows below.
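(This is not my actual code, just a minimal sketch of the idea; the pipeline string, frame count and dummy payload are examples, and error handling is omitted.)

#include <gst/gst.h>
#include <gst/app/gstappsrc.h>
#include <string.h>

int main(int argc, char *argv[])
{
    gst_init(&argc, &argv);

    /* appsrc delivering raw I420 1080p50, encoded by the OMX HW encoder;
     * block=true makes pushing wait instead of growing the queue */
    GstElement *pipeline = gst_parse_launch(
        "appsrc name=src format=time block=true "
        "caps=\"video/x-raw,format=I420,width=1920,height=1080,framerate=50/1\" "
        "! omxh264enc ! matroskamux ! filesink location=out.mkv", NULL);
    GstElement *src = gst_bin_get_by_name(GST_BIN(pipeline), "src");

    gst_element_set_state(pipeline, GST_STATE_PLAYING);

    gsize size = 1920 * 1080 * 3 / 2;          /* one I420 frame */
    for (int i = 0; i < 500; i++) {
        GstBuffer *buf = gst_buffer_new_allocate(NULL, size, NULL);
        GstMapInfo map;
        gst_buffer_map(buf, &map, GST_MAP_WRITE);
        /* here you would grab a frame with the Pylon SDK and convert
         * BayerBG to I420 into map.data; this just writes a dummy frame */
        memset(map.data, i & 0xff, size);
        gst_buffer_unmap(buf, &map);
        GST_BUFFER_PTS(buf) = gst_util_uint64_scale(i, GST_SECOND, 50);
        GST_BUFFER_DURATION(buf) = gst_util_uint64_scale(1, GST_SECOND, 50);
        gst_app_src_push_buffer(GST_APP_SRC(src), buf);  /* takes ownership */
    }
    gst_app_src_end_of_stream(GST_APP_SRC(src));

    /* wait for EOS so the muxer can finalise the file */
    GstBus *bus = gst_element_get_bus(pipeline);
    GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
        GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
    gst_message_unref(msg);
    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(bus);
    gst_object_unref(src);
    gst_object_unref(pipeline);
    return 0;
}

(Build against the gstreamer-1.0 and gstreamer-app-1.0 pkg-config modules.)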

@cplussharp
Hi, I tried to do this on my side. I am using the 21.4 BSP release (available for download).
As I don’t have a USB3 camera, I created a raw file using the following command:
gst-launch-1.0 videotestsrc num-buffers=1000 ! 'video/x-raw, format=(string)I420, width=(int)640, height=(int)480, framerate=(fraction)30/1' ! filesink location=test_480p.yuv -e

Then I used the file thus created in your command:
gst-launch-1.0 filesrc location=test_480p.yuv ! videoparse width=1920 height=1080 format=2 framerate=60 ! videoconvert ! progressreport ! omxh264enc ! matroskamux ! filesink location=aspen264_1.mkv
Setting pipeline to PAUSED …
Inside NvxLiteH264DecoderLowLatencyInitNvxLiteH264DecoderLowLatencyInit set DPB and MjstreamingPipeline is PREROLLING …
Framerate set to : 60 at NvxVideoEncoderSetParameterNvMMLiteOpen : Block : BlockType = 4
===== MSENC =====
NvMMLiteBlockCreate : Block : BlockType = 4
===== MSENC blits (mode: 1) into tiled surfaces =====
Pipeline is PREROLLED …
Setting pipeline to PLAYING …
New clock: GstSystemClock
progressreport0 (00:00:03): 2 / 2 seconds (100.0 %)
Got EOS from element “pipeline0”.
Execution ended after 0:00:03.758976224
Setting pipeline to PAUSED …
Setting pipeline to READY …
Setting pipeline to NULL …
Freeing pipeline …

I checked the mediainfo for the output file and it seems OK.
Can you please repeat your experiment on the 21.4 BSP release and share your result?

Manoj, you are not really answering the question that was presented here…

cplussharp, did you make progress on that? You did mention that you convert something to YUV. That is already a very slow operation, especially if you do it on the CPU. The gst-launch example that you showed is a very straightforward operation using only the video encoder HW with a minimum of copying memory around. The 6-7 ms you mentioned is already about half of the whole frame time, and that’s completely extra time (and bandwidth usage) compared to the plain gst-launch encoder pipeline.

I am not sure how the user application is programmed and working, so I can’t comment on that as of now, but I was trying to address the initial concern that 60 fps is not achievable: with my example I am able to use YUV raw data and achieve 60 fps.

Thanks for your comments. @Kulve, you were right: the problem is the time needed for the conversion. I had replaced my conversion algorithm with a straight memcpy (Bayer as Y plane) + memset (UV planes to 128), but I had overlooked that this also needs 3-6 ms (which I don’t understand). So if I feed the encoder directly with prepared and preprocessed frames (without any copy), I can reach 50+ fps.

But sadly I need to take the image frames from the camera and convert them, as my Basler camera only produces BayerBG. I thought the video encoder was a separate part of the processor, working in parallel to the ARM cores, so that I could do some work on the CPU and run the encoding in parallel.

How is this with the CUDA cores? If I move my algorithm to the CUDA cores, do they run in parallel to the encoder, or does this also affect the encoder performance? And how can I work with pinned memory (zero-copy) faster? Copy operations from/to this memory are very slow, even slower than my current conversion algorithm.
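For reference, by zero-copy I mean mapped pinned memory along these lines (a sketch of my understanding of the CUDA runtime calls; the comment about caching is my guess at why the CPU side is slow):

#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* allow mapping host allocations into the GPU address space */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    size_t size = (size_t)1920 * 1080 * 3 / 2;   /* one I420 frame */
    uint8_t *host_ptr = NULL, *dev_ptr = NULL;

    /* pinned + mapped: on the TK1 the CPU and GPU share physical memory,
     * so a kernel can work on the buffer without any cudaMemcpy ... */
    cudaHostAlloc((void **)&host_ptr, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);

    /* ... but as far as I understand, such memory is uncached on the CPU
     * side, which would explain why plain CPU copies from/to it are slow */
    printf("host %p -> device %p\n", (void *)host_ptr, (void *)dev_ptr);

    cudaFreeHost(host_ptr);
    return 0;
}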

Hi again,

I think I can help you here. For some reason (which I didn’t look into) memcpy with gcc is very slow; you could try the clang-3.6 compiler (if you’re not compiling CUDA yourself directly), it reduces the time by at least an order of magnitude.

Here is a nice tiny tutorial on zero copy for the TK1. Like you, I also have a Basler camera (4600-10uc) and created a demosaic algorithm in CUDA which does a zero-copy demosaic for a 4608x3288 Bayer input. I’m applying 3 convolution filters (R, G, B) and scaling the image down to 1/2 in 26 ms. I might be able to discuss whether I can release the source if people are interested. I’d also be interested in whether more speedups could be gained. As usual, its main bottleneck is bandwidth.

Cheers,
Error323

CPU cores, the video encoder and the GPU (CUDA cores) all work in parallel, but they all share the same memory and the same memory bus. I’m not an expert in optimisations, but I’m assuming you are memory bandwidth limited and thus need to optimise your memory accesses. One way to do that is to optimise the conversion algorithm so that it benefits as much as possible from the L2 cache and uses the external memory less. But I’m not able to give you any suggestions on how to actually do that.

Hello everyone.

Despite all the answers, no solution was given for the main issue of the topic: GStreamer stops calling the “need-data” callback.

I’ve got an application which streams image data via appsrc to a GStreamer pipeline. The pipeline is built according to the official NVIDIA examples from the Multimedia User Guide.

After several calls to my “need-data” callback, GStreamer stops calling it.

Any ideas would be much appreciated.

Thanks for the help, now I get 55 fps at 1080p and 100 fps at 720p. There were multiple problems in my code. I have optimized my memory usage: I now use a custom memory allocator for the Basler Pylon camera driver, and I use aligned memory. And my biggest, and dumbest, problem: I now test the release build of the program and not the debug build! This means less log output, and the code is optimized by the compiler.
And I have changed my BayerBG-to-I420 algorithm: I’m using the green value now as Y. The colors are not 100% correct anymore, but it still looks good, and the algorithm now needs 5 ms for a 1080p frame. A simplified sketch of the Y-plane trick is below.
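(Scalar sketch of the Y-plane part only; my NEON version is vectorised and also derives the U/V planes from the blue/red samples, which is omitted here.)

#include <stdint.h>

/* Take the nearest green Bayer sample in the same row as luma instead of
 * computing a weighted RGB sum. In BayerBG, even rows are B G B G ... and
 * odd rows are G R G R ..., so x^1 always indexes a green sample. */
static void bayerbg_green_to_y(const uint8_t *bayer, uint8_t *y_plane,
                               int width, int height)
{
    for (int y = 0; y < height; y++) {
        const uint8_t *in  = bayer   + (size_t)y * width;
        uint8_t       *out = y_plane + (size_t)y * width;
        int green_at_odd = !(y & 1);   /* green column parity of this row */
        for (int x = 0; x < width; x++)
            out[x] = ((x & 1) == green_at_odd) ? in[x] : in[x ^ 1];
    }
}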

@sergeyk789 appsrc stops emitting “need-data” if its internal buffer is full. Have you tried setting a higher value for the “max-bytes” property of the appsrc?
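For example (the default is only 200000 bytes, which is less than a tenth of a single 1080p I420 frame; assuming your appsrc pointer is called appsrc):

/* one 1080p I420 frame is ~3.1 MB, so let the queue hold a few frames */
g_object_set(G_OBJECT(appsrc), "max-bytes", (guint64)12 * 1024 * 1024, NULL);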

Hello C+sharp,

With the Jetson TX1, would it be possible to encode 1080p 60 fps video (the source format is YUV422)? What do you think the maximum delay could be?

Thanks,

Hi samsangani,

I’m still waiting for my TX1 DevKit, so I only have the information from the datasheets. The TX1 should be able to encode 1080p 120 fps with h.264. I think the input for the encoder is the same as on the TK1, which means I420. So you need to convert your YUV422 to I420, but this is mainly rearranging the bytes from an interleaved format to a planar format (plus dropping or averaging every second chroma line for the 4:2:2 to 4:2:0 subsampling).
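Assuming your YUV422 is packed YUY2 (Y0 U0 Y1 V0 …), the rearranging could look roughly like this (scalar sketch; dropping every second chroma line is the crudest option, averaging two lines would look slightly better):

#include <stdint.h>

/* Packed YUY2 (Y0 U0 Y1 V0) to planar I420: split the components and keep
 * the chroma of every second line only (4:2:2 -> 4:2:0 subsampling). */
static void yuy2_to_i420(const uint8_t *yuy2, uint8_t *i420,
                         int width, int height)
{
    uint8_t *y = i420;
    uint8_t *u = i420 + (size_t)width * height;
    uint8_t *v = u + (size_t)width * height / 4;

    for (int row = 0; row < height; row++) {
        const uint8_t *in = yuy2 + (size_t)row * width * 2;
        for (int col = 0; col < width; col += 2) {
            *y++ = in[2 * col];        /* Y0 */
            *y++ = in[2 * col + 2];    /* Y1 */
            if ((row & 1) == 0) {      /* take chroma from even lines only */
                *u++ = in[2 * col + 1];
                *v++ = in[2 * col + 3];
            }
        }
    }
}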

The delay not only depends on the encoder, but also on your camera, the network connection and the decoding hardware/software.

Rearranging the bytes (~2-3 ms) + copying memory in (~1 ms) + encoding (~10-15 ms) + copying memory out (~1 ms) + network transport (>5 ms) => so I would estimate the delay at around 25-30 ms, plus the delay on the decoding side.

You forgot to take into account the exposure time of the camera. Say, for example, the camera streams 1080p@120fps; a frame can then be exposed within 8.33 ms, so the worst-case delay compared to real time would be about 8 ms in addition to your calculation.

If the camera streams 1080p@60fps, a frame can be exposed with a maximum exposure time of 16.67 ms, so the maximum delay in that case would be about 17 ms in addition to all the other processing.