Do you have a code sample or advice on how to link modules with NVMM memory?
If NVMM is a closed format used only in NVIDIA modules, is there another way to access nvvidconv results without making a memory copy?
NVIDIA has many good GStreamer plugins optimized for the GPU.
We also have our own GStreamer plugins for image stabilization, sensor fusion (fusing a daylight camera and a night-vision camera), neural-network-based object detection, etc. All our modules are also GPU-optimized, and they work amazingly fast.
The problem is how to use them together without copying memory, because every copy ruins the advantages of the NVIDIA GStreamer plugins. At the start of an NVIDIA plugin, CPU memory is copied to CUDA memory, processed, and then copied back to the CPU. After that we do the same thing again: we take the output of the NVIDIA GStreamer plugin (CPU memory), allocate CUDA managed memory, and make another copy.
These questions remain relevant:
Do you have a code sample or advice on how to link modules with NVMM memory?
If NVMM is a closed format used only in NVIDIA modules, is there another way to access the output of nvvidconv (or of other NVIDIA GStreamer plugins) without making a memory copy? How can we access the result in managed or CUDA memory?
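For context on what the copy-free path looks like between NVIDIA's own elements: they stay in device memory by negotiating `video/x-raw(memory:NVMM)` caps on every link. A hypothetical sketch (element names such as nvarguscamerasrc and nvv4l2h264enc vary across JetPack/L4T releases, so treat them as assumptions):

```shell
# Buffers stay in NVMM (device) memory end to end because every link
# negotiates "video/x-raw(memory:NVMM)" caps; no CPU copy is made.
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080' ! \
  nvvidconv ! 'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h264enc ! h264parse ! qtmux ! filesink location=out.mp4

# Dropping the (memory:NVMM) caps feature forces nvvidconv to copy the
# frame into system memory so that an ordinary CPU element can read it:
gst-launch-1.0 nvarguscamerasrc ! nvvidconv ! 'video/x-raw' ! \
  identity ! fakesink
```

The second pipeline is exactly the copy the original question is trying to avoid: inserting any plain CPU element between NVIDIA elements breaks the NVMM caps negotiation.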
But you have mentioned two inputs and one output. It sounds like you will perform camera frame stitching in CUDA. In that case, you have to combine Argus, CUDA, and GStreamer.
We have a sample about Argus + gstreamer: tegra_multimedia_api/argus/samples/gstVideoEncode
I have a very similar use case to the original poster’s.
I’m trying to write a custom GStreamer element that uses CUDA to stitch multiple frames. The pipeline looks like this:
v4l2src-\
v4l2src->my_element->nvenc->…
v4l2src-/
Obviously, the optimal way is to read the captured frame directly from a CUDA kernel, avoiding all H2D and D2H memcpys. Would you share your thoughts on how to achieve this? Thank you!
Reading this sample code, I found HandleEGLImage being called in conv_capture_dqbuf_thread_callback. This seems to be where the CUDA kernel touches the frame data directly from capture.
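For reference, the HandleEGLImage-style path boils down to the CUDA/EGL interop calls below. This is a sketch under the assumption of a Jetson-style EGLImage obtained from the capture buffer's dmabuf fd (e.g. via NvEGLImageFromFd); it needs target hardware to run:

```cpp
// Zero-copy mapping of an EGLImage into CUDA address space.
// No pixel data is copied: register/map only create a device mapping.
#include <cuda.h>
#include <cudaEGL.h>

CUdeviceptr map_frame(EGLImageKHR egl_image, CUgraphicsResource *out_res) {
    CUgraphicsResource res = nullptr;
    // Register the EGLImage with CUDA (mapping setup, not a memcpy).
    cuGraphicsEGLRegisterImage(&res, egl_image,
                               CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    CUeglFrame frame;
    cuGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);
    *out_res = res;
    // For a pitched frame, plane 0's device pointer; a CUDA kernel can
    // read/write it directly.
    return (CUdeviceptr)frame.frame.pPitch[0];
}

void unmap_frame(CUgraphicsResource res) {
    cuCtxSynchronize();                  // make sure the kernel finished
    cuGraphicsUnregisterResource(res);   // tears down the mapping only
}
```

The register/unregister pair is bookkeeping for the mapping, not a data transfer, which is relevant to the profiling question below.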
But because of the stitcher algorithm’s N-in-1-out nature, I cannot put it inside this callback. I need to pass the pointer out to a single thread that has access to all the input buffers. Do you see any problem with using the frame pointer outside the converters’ dequeue callback thread?
Moreover, running this sample on a TX2, I profiled it with nvprof:
==8861== NVPROF is profiling process 8861, command: ./camera_v4l2_cuda -d /dev/video1 -s 1920x1080 -f YUV420 -c
==8861== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
^CQuit due to exit command from user!
Quit due to exit command from user!
----------- Element = renderer0 -----------
Total Profiling time = 15.3871
Average FPS = 30.0252
Total units processed = 463
Num. of late units = 63
-------------------------------------
App run was successful
==8861== Profiling application: ./camera_v4l2_cuda -d /dev/video1 -s 1920x1080 -f YUV420 -c
==8861== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 4.7186ms 468 10.082us 3.4570us 11.718us addLabelsKernel(int*, int)
==8861== API calls:
Time(%) Time Calls Avg Min Max Name
34.98% 1.12203s 468 2.3975ms 347.79us 49.046ms cuGraphicsUnregisterResource
27.57% 884.09ms 468 1.8891ms 275.43us 34.506ms cudaLaunch
15.19% 487.09ms 468 1.0408ms 93.390us 22.706ms cuGraphicsEGLRegisterImage
14.83% 475.77ms 936 508.30us 24.588us 16.841ms cuCtxSynchronize
7.10% 227.75ms 468 486.64us 3.6500us 198.53ms cudaFree
0.12% 3.7150ms 468 7.9380us 1.3130us 291.31us cudaConfigureCall
0.11% 3.6158ms 936 3.8620us 640ns 131.27us cudaSetupArgument
0.10% 3.0881ms 468 6.5980us 1.2810us 224.62us cuEGLStreamProducerPresentDevicePtr
0.00% 92.204us 91 1.0130us 384ns 28.974us cuDeviceGetAttribute
0.00% 5.3470us 3 1.7820us 672ns 2.4020us cuDeviceGetCount
0.00% 4.9950us 1 4.9950us 4.9950us 4.9950us cuDeviceTotalMem
0.00% 3.1700us 3 1.0560us 608ns 1.8570us cuDeviceGet
0.00% 1.9210us 1 1.9210us 1.9210us 1.9210us cuDeviceGetName
cuGraphicsUnregisterResource and cuGraphicsEGLRegisterImage still take considerable execution time. Is this really zero-copy, or is it still doing a memcpy under the hood?
You may find the following information about the GstCUDA framework interesting; I think it is exactly what you are looking for.
GstCUDA is a RidgeRun-developed GStreamer plug-in enabling easy CUDA algorithm integration into GStreamer pipelines. GstCUDA offers a framework that allows users to develop custom GStreamer elements that execute any CUDA algorithm. The GstCUDA framework is a series of base classes abstracting the complexity of both CUDA and GStreamer. With GstCUDA, developers avoid writing elements from scratch and can focus on the algorithm logic, thus accelerating time to market.
GstCUDA also offers a GStreamer plugin containing a set of elements ideal for quick GStreamer/CUDA prototyping. These elements are filters with different input/output pad combinations that load, at run time, an external custom CUDA library containing the algorithm to be executed on the GPU for each video frame that passes through the pipeline. Users develop their own CUDA processing library and pass it to the GstCUDA filter element that best fits the algorithm’s requirements; the element executes the library on the GPU, passing upstream frames from the GStreamer pipeline to the GPU and the modified frames downstream to the next element. These elements were created with the CUDA algorithm developer in mind, supporting quick prototyping and abstracting all GStreamer concepts, and they adapt to different project needs, making GstCUDA a powerful tool for CUDA/GStreamer project development.
One remarkable feature of GstCUDA is that it provides a zero-memory-copy interface between CUDA and GStreamer on Jetson TX1/TX2 platforms. This enables heavy algorithms and large amounts of data (up to 2x 4K 60 fps streams) to be processed in CUDA without the performance hit caused by copies or memory conversions. GstCUDA provides the APIs needed to handle NVMM buffers directly, achieving the best possible performance on Jetson TX1/TX2 platforms, along with a series of base classes and utilities that abstract the complexity of the memory interface between GStreamer and CUDA, so the developer can focus on what actually adds value to the end product. GstCUDA ensures optimal performance for GStreamer/CUDA applications on Jetson platforms.
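As an illustration of how such a run-time-loadable filter is typically used, a hypothetical pipeline is sketched below. The element name (cudafilter), its `location` property, and the library name are taken from RidgeRun's public examples and are assumptions here; they may differ between GstCUDA releases:

```shell
# Hypothetical: load a user-supplied CUDA library into a GstCUDA filter
# element; frames reach the library on the GPU via NVMM buffers.
gst-launch-1.0 nvcamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080' ! \
  cudafilter location=./libmycudaalgo.so ! \
  nvoverlaysink
```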