Adding a post-processing step to the CUDA Video Decoder

I’m trying to modify the cudaDecodeGL sample from the CUDA Samples that ship with CUDA 8.0. As you probably know, the code reads a compressed file, decodes it, and performs a single post-processing step that color converts each frame from YCbCr to RGBA. I’m trying to figure out how to inject my own additional post-processing step into this pipeline. For example, I’d like to take the RGBA output and perform some additional steps on it in CUDA.

The decoded frame (i.e. the memory on the GPU that holds the decoder output) is obtained through the g_pVideoDecoder->mapFrame function, which calls the cuvidMapVideoFrame function.

The interop frame (the memory that OpenGL operates on) is obtained through the g_pImageGL->map operation, which calls the cuGLMapBufferObject function. Presumably, this performs all the interoperability steps required to get CUDA to talk to OpenGL.

I want to create and operate on an intermediate frame buffer. Neither of the above mapping steps seems to do this job (the first interfaces with the decoder and the second sets up interop with OpenGL). Is there a mapping function within the CUDA Video Decoder library that allows for the creation of an intermediate array? I’m currently trying to use cudaMalloc to create the memory array. However, I’m failing when it comes to launching a kernel … I learned my CUDA from the ‘CUDA by Example’ book, where kernels were launched with a good ol’ fashioned:

kernel<<<block,threads>>>(---argument list---)

But in the Cuda Video Decoder library, there is the new cuLaunchKernel function that adds in things like pointers to CUfunctions and some kind of *.ptx file.
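
For reference, here is a rough sketch of how a <<<...>>> launch maps onto the driver API that the sample uses. It assumes a kernel compiled to PTX (for example with nvcc -ptx) and declared extern "C" so its name is not mangled. The helper names loadKernel and launchMyPostProcess, the kernel name MyPostProcess, and its argument list are placeholders for illustration and are not part of the sample. It also assumes a CUDA context is already current, as it is inside cudaDecodeGL.

```cuda
#include <cuda.h>   // CUDA driver API: cuModuleLoad, cuModuleGetFunction, cuLaunchKernel

// Number of blocks needed to cover n items with blocks of the given size.
static inline unsigned int divUp(unsigned int n, unsigned int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Load a kernel by name from a PTX file (both strings are supplied by the caller;
// the kernel must be declared extern "C" in its .cu file).
static CUresult loadKernel(const char *ptxPath, const char *kernelName,
                           CUmodule *module, CUfunction *kernel)
{
    CUresult r = cuModuleLoad(module, ptxPath);
    if (r != CUDA_SUCCESS) return r;
    return cuModuleGetFunction(kernel, *module, kernelName);
}

// Driver-API equivalent of the runtime-style launch
//   dim3 block(32, 16), grid(divUp(width, 32), divUp(height, 16));
//   MyPostProcess<<<grid, block, 0, stream>>>(frame, pitch, width, height);
// MyPostProcess and its parameter list are hypothetical.
static CUresult launchMyPostProcess(CUfunction kernel, CUdeviceptr frame, size_t pitch,
                                    unsigned int width, unsigned int height,
                                    CUstream stream)
{
    const unsigned int bx = 32, by = 16;

    // One pointer per kernel parameter, in declaration order.
    void *args[] = { &frame, &pitch, &width, &height };

    return cuLaunchKernel(kernel,
                          divUp(width, bx), divUp(height, by), 1,  // grid dimensions
                          bx, by, 1,                               // block dimensions
                          0,                                       // dynamic shared memory (bytes)
                          stream,
                          args,
                          nullptr);                                // no "extra" options
}
```

The CUfunction plays the role of the kernel name in the <<<...>>> syntax, and the args array replaces the argument list; the types behind each pointer have to match the kernel's parameters exactly, since the driver API does not check them.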

Can anybody provide some clarity or point me to an example showing how to integrate this intermediate processing step?

I was finally able to get this working by defining my own array in GPU memory with commands such as the following:

CUdeviceptr pnewFrame;
checkCudaErrors(cuMemAlloc(&pnewFrame, size_of_frame));
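
As a minimal sketch of where size_of_frame could come from: assuming the intermediate buffer holds 4-byte ARGB pixels and uses the same pitch (bytes per row) that is later passed to the color conversion, the size is pitch times height. The helper names frameBytes and allocIntermediateFrame are made up for this example.

```cuda
#include <cuda.h>     // CUDA driver API: cuMemAlloc
#include <cstddef>

// Bytes needed for a frame stored with the given pitch (bytes per row).
// For tightly packed ARGB, pitchBytes would be width * 4.
static inline size_t frameBytes(size_t pitchBytes, size_t height)
{
    return pitchBytes * height;
}

// Allocate the intermediate frame on the device.
// Returns CUDA_SUCCESS on success; release the buffer later with cuMemFree(*frame).
static CUresult allocIntermediateFrame(CUdeviceptr *frame, size_t pitchBytes, size_t height)
{
    return cuMemAlloc(frame, frameBytes(pitchBytes, height));
}
```

Allocating once up front and reusing the buffer for every frame avoids paying for cuMemAlloc and cuMemFree inside the decode loop.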

Then, in the cudaDecodeGL PostProcess function, I passed pnewFrame as the destination for the NV12toARGB color conversion instead of the interop frame. For example, I changed the line in the original sample to read:

// pnewFrame is a CUdeviceptr (not a pointer to one), so it is passed without dereferencing
eResult = cudaLaunchNV12toARGBDrv(*ppDecodedFrame, nDecodedPitch, pnewFrame, pFramePitch,
                                   nWidth, nHeight, fpCudaKernel, streamID);

Then I launched my own processing kernel on this data and sent the result on to the OpenGL interop buffer, and it all seemed to work.
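
To illustrate that last step, here is a minimal sketch of what such a processing kernel could look like. It is not the kernel I actually used: InvertARGB is a made-up example that reads the intermediate ARGB frame and writes a color-inverted copy straight into the interop buffer, assuming both buffers share the same pitch and hold one 32-bit ARGB value per pixel.

```cuda
#include <cstddef>

// Invert the three color channels of a packed 32-bit ARGB pixel, keeping alpha.
__host__ __device__ inline unsigned int invertPixel(unsigned int argb)
{
    return (argb & 0xFF000000u) | (~argb & 0x00FFFFFFu);
}

// Hypothetical post-process kernel: one thread per pixel.
// src        : intermediate frame written by the NV12toARGB conversion
// dst        : mapped OpenGL interop buffer
// pitchBytes : bytes per row, assumed identical for src and dst
// extern "C" keeps the name unmangled so cuModuleGetFunction can find it.
extern "C" __global__ void InvertARGB(const unsigned int *src, unsigned int *dst,
                                      size_t pitchBytes,
                                      unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // threads outside the frame do nothing

    size_t idx = (size_t)y * (pitchBytes / sizeof(unsigned int)) + x;
    dst[idx] = invertPixel(src[idx]);
}
```

Compiled to PTX and loaded with cuModuleLoad and cuModuleGetFunction, a kernel like this would be launched through cuLaunchKernel on the same stream as the color conversion, with pnewFrame as src and the mapped interop frame as dst, so the two steps run in order without extra synchronization.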