Zero Copy Memory vs Unified memory CUDA processing

lamegeorge · July 20, 2017, 6:57pm

Hi Guys,

I am trying to implement a frame difference kernel on CUDA. I obtain my frame using NVBuffer which I cudaMemcpy into Zero Copy Memory ( memory allocated using cudaHostAlloc ) and use it to do the operation. However, when I try to accomplish the same task by cudaMemcpy of frame buffer into Unified Memory ( memory allocated using cudaMallocManaged )the time taken by the kernel is more than that taken using Zero Copy Memory.

Why is this happening? Is it that Unified Memory access is not as efficient as Zero Copy memory access ?

Thanks

AastaLLL · July 21, 2017, 3:24am

Hi,

Usually, unified memory should have better performance.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Not sure why your use case is not in the expectation. (Some memory copy behind?)
Could you share detail source code for us?

Thanks.

lamegeorge · July 21, 2017, 5:58am

Hi AastaLLL,

Yes. There is some memory copy which is happening. Please find the code snippets below:

code snippet using zero copy memory :

int fsize1 = 2048 * 1080 ; // for 1920x1080 resolution

	char *m_datamem;

	char *m_hcurrframe;
	char *m_dcurrframe;

	char *m_hcudaout;
	char *m_dcudaout;

	int alloc1 = cudaHostAlloc((void **)&m_hcurrframe, fsize1, cudaHostAllocMapped);
	int getPtr1 = cudaHostGetDevicePointer((void **)&m_dcurrframe, (void *) m_hcurrframe, 0);

	int alloc2 = cudaHostAlloc((void **)&m_hcudaout, fsize1, cudaHostAllocMapped);
	int getPtr2 = cudaHostGetDevicePointer((void **)&m_dcudaout, (void *) m_hcudaout, 0);	

while (true)
    {
        // Acquire a Frame.
        UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);
        if (!iFrame)
            break;

        // Get the Frame's Image.
        Image *image = iFrame->getImage();
        EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
              = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
        TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

        int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
               NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);
        if (status != STATUS_OK)
               TEST_ERROR_RETURN(status != STATUS_OK, "Failed to create a native buffer");

 #if 1

	cudaSetDeviceFlags(cudaDeviceMapHost);

        NvBufferParams params;
        NvBufferGetParams(fd, &params);

	cout<<"params.pitch[0] : "<< params.pitch[0] <<endl;
	cout<<"params.offset[0] : "<< params.offset[0] <<endl;

        int fsize = params.pitch[0] * m_framesize.height ;
        
        m_datamem = (char *)mmap(NULL, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, params.offset[0]);

  	struct timeval tp1;
    	gettimeofday(&tp1, NULL);
    	long start1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;
	
	int copy1 = cudaMemcpy (m_dcurrframe,m_datamem,fsize,cudaMemcpyHostToDevice) ;	

	cout<<endl<<"copy1 :  "<<copy1 <<endl;	

	float timediff = diff10(m_dcudaout,m_dcurrframe, m_framesize.width, m_framesize.height,params.pitch[0]);
	
	
  	printf("Finished diff operation after %f ms.\n", timediff);
	
	cudaDeviceSynchronize();	

  	gettimeofday(&tp1, NULL);
    	long end1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;

	long deltaT = end1 - start1;

	cout<< "Time Taken : " << deltaT <<" ms"<<endl;

Code Snippet using Unified memory :

int fsize1 = 2048 * 1080 ; // for 1920x1080 resolution

	char *m_datamem;

	char *m_hcurrframe;
	char *m_dcurrframe;

	char *m_hcudaout;
	char *m_dcudaout;

	int alloc1 = cudaMallocManaged(&m_dcurrframe, fsize1);

	int alloc2 = cudaHostAlloc((void **)&m_hcudaout, fsize1, cudaHostAllocMapped);
	int getPtr2 = cudaHostGetDevicePointer((void **)&m_dcudaout, (void *) m_hcudaout, 0);	

while (true)
    {
        // Acquire a Frame.
        UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);
        if (!iFrame)
            break;

        // Get the Frame's Image.
        Image *image = iFrame->getImage();
        EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
              = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
        TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

        int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
               NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);
        if (status != STATUS_OK)
               TEST_ERROR_RETURN(status != STATUS_OK, "Failed to create a native buffer");

 #if 1

	cudaSetDeviceFlags(cudaDeviceMapHost);

        NvBufferParams params;
        NvBufferGetParams(fd, &params);

	cout<<"params.pitch[0] : "<< params.pitch[0] <<endl;
	cout<<"params.offset[0] : "<< params.offset[0] <<endl;

        int fsize = params.pitch[0] * m_framesize.height ;
        
        m_datamem = (char *)mmap(NULL, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, params.offset[0]);

  	struct timeval tp1;
    	gettimeofday(&tp1, NULL);
    	long start1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;
	
	int copy1 = cudaMemcpy (m_dcurrframe,m_datamem,fsize,cudaMemcpyHostToDevice) ;	

	cout<<endl<<"copy1 :  "<<copy1 <<endl;	

	float timediff = diff10(m_dcudaout,m_dcurrframe, m_framesize.width, m_framesize.height,params.pitch[0]);
	
	
  	printf("Finished diff operation after %f ms.\n", timediff);
	
	cudaDeviceSynchronize();	

  	gettimeofday(&tp1, NULL);
    	long end1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;

	long deltaT = end1 - start1;

	cout<< "Time Taken : " << deltaT <<" ms"<<endl;

THe time taken by cuda kernel when memory is allocated using cudaMallocManaged (Unified Memory) is usually more than that taken by cudaHostAlloc (Zero Copy Memory)

Thanks.

AastaLLL · July 24, 2017, 6:40am

Hi,

There are some memcpy and even including mapped → managed copy.
This will lower the performance.

Want to confirm your use-case first:

Read camera first.(Argus)
Do a kernel code(diff10)

Is this correct?

AastaLLL · July 24, 2017, 6:46am

Hi,

If you are looking for Argus → CUDA kernel sample, please check this:
/home/ubuntu/tegra_multimedia_api/samples/11_camera_object_identification/

Thanks.

lamegeorge · July 24, 2017, 10:39am

Hi AastaLLL,

Thanks for the response. Yes you are right about the use-case. I intend to read camera first and then perform some operations in kernel code. There is some memcpy happening from data_mem obtained using NVBuffer. However, the same memcpy happens in case of zero copy code.

The time which I have emphasized is only the time taken in CUDA kernel code returned in the variable ‘time’ which means it is devoid of time taken to copy. The time taken by CUDA kernel code alone is much higher in the case when I use ‘Unified Memory’ compared to the case when I use ‘Zero Copy Memory’.

Thanks.

AastaLLL · July 25, 2017, 8:48am

Hi,

From your code, the time = memcpy + CUDA kernel.
Could you calculate kernel function only?

Since the cudaMemcpy of these two cases is quite different:

Host → Mapped
Host → Unified

Another weakness of zero-copy memory is that access may be slow sometimes depending on where it is being accessed from.
We will try to test this case internally and update information to you.

Thanks.

AastaLLL · July 25, 2017, 9:14am

Hi,

Looks into your code, there doesn’t Kernel code at all.
Is the kernel code is embedded in the diff10()?

lamegeorge · July 25, 2017, 1:19pm

Hi AastaLLL,

Please note that the time which is emphasized is only time taken to run kernel function. If we use unified memory in kernel code then it takes 10x more time than mapped zero copy memory.
Yes you are right kernel code is hidden in diff10 call. The code snippet of diff10 looks like the following :

__global__ void diff_mats_char_atomics11(
    char *output,char *input1,char *input2,
    unsigned int width,
    unsigned int height,unsigned int pitch)
{
	
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  output[index] = abs(input1[index] -input2[index]) ;
    
  __syncthreads();

}



float run_mats_char_atomics11(
    char *output,char *input1,char *input2,
    unsigned int width,
    unsigned int height, unsigned int pitch)
{

     
  // Launch kernel on 1M elements on the GPU
    int blockSize = 256;

    int numBlocks = (pitch*height ) / blockSize;
 
    cudaEvent_t start;
    cudaEvent_t stop;

    cudaEventCreate(&stop);
    cudaEventCreate(&start);

    cudaEventRecord(start, 0);

    diff_mats_char_atomics11<<<numBlocks, blockSize>>>( output,input1,input2,width,height,pitch);
    

    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);
    float elapsed_millis;
    cudaEventElapsedTime(&elapsed_millis, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

 
    return elapsed_millis;
}


float diff10(char *output,char *input1,char *input2, unsigned int width, unsigned int height , unsigned int pitch)
{
	return run_mats_char_atomics11(output,input1,input2, width, height,pitch);
}

Please let me know what you think about this.

Thanks.

AastaLLL · July 26, 2017, 9:45am

Hi,

I modified your code to a sample but found the execution time is similar.

#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <helper_timer.h>

bool used_UM = true;

__global__ void diff10(char* output, char* input, size_t width, size_t height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    if( idx<width && idy<height ) output[idy*width+idx] = input[idy*width+idx];
}

int main ()
{

    cv::Mat img = cv::imread("cat.jpg");
    cv::Mat gray;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    int width = img.size().width;
    int height = img.size().height;

    int fsize1 = width*height*sizeof(unsigned char);
    char *m_hcurrframe;
    char *m_dcurrframe;

    char *m_hcudaout;
    char *m_dcudaout;

    if(!used_UM) {
        int alloc1 = cudaHostAlloc((void **)&m_hcurrframe, fsize1, cudaHostAllocMapped);
        int getPtr1 = cudaHostGetDevicePointer((void **)&m_dcurrframe, (void *) m_hcurrframe, 0);

        int alloc2 = cudaHostAlloc((void **)&m_hcudaout, fsize1, cudaHostAllocMapped);
        int getPtr2 = cudaHostGetDevicePointer((void **)&m_dcudaout, (void *) m_hcudaout, 0);
    } else {
        int alloc1 = cudaMallocManaged(&m_dcurrframe, fsize1);
        int alloc2 = cudaHostAlloc((void **)&m_hcudaout, fsize1, cudaHostAllocMapped);
        int getPtr2 = cudaHostGetDevicePointer((void **)&m_dcudaout, (void *) m_hcudaout, 0);
    }

    struct timeval tp1;
    gettimeofday(&tp1, NULL);
    long start1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;

    int copy1 = cudaMemcpy (m_dcurrframe, gray.data,fsize1,cudaMemcpyHostToDevice);

    dim3 dimBlock(32,32);
    dim3 dimGrid;
    dimGrid.x = ( width+dimBlock.x-1) /  dimBlock.x;
    dimGrid.y = (height+dimBlock.y-1) /  dimBlock.y;
    diff10<<<dimGrid, dimBlock>>>(m_dcudaout,m_dcurrframe, width, height);
    cudaDeviceSynchronize();	

    gettimeofday(&tp1, NULL);
    long end1 = tp1.tv_sec * 1000 + tp1.tv_usec / 1000;
    long deltaT = end1 - start1;
    std::cout<< "Time Taken : " << deltaT <<" ms"<< std::endl;

    cv::Mat CudaOUTimgbuf1 = cv::Mat(height, width, CV_8UC1, (void *)m_hcudaout);
    cv::imshow("ALL GOODS", CudaOUTimgbuf1);
    cv::waitKey(0);
    return 0;
}

bool used_UM = false; (zero-copy)
Time Taken : 1 ms

bool used_UM = true; (unified memory)
Time Taken : 1 ms

To check not a hardware issue, could you also try this sample on your device?

nvcc topic_1018809.cu -lopencv_core -lopencv_highgui -lopencv_imgproc -I/home/ubuntu/NVIDIA_CUDA-8.0_Samples/common/inc -o test && ./test

Thanks.

lamegeorge · October 11, 2017, 6:11am

Hi AastaLLL,

Thanks for the help. I have tried your code and I am measuring ~9 ms for cudaMemcpy on 1080p resolution image. I intend to directly read frame from camera to Unified Memory without using cudaMemcpy as it is expensive and cannot be afforded by the application I am trying to develop. In this context, I browsed through the samples in NVIDIA_CUDA-8.0_Samples , tegra_multimedia_api , VisionWorks-1.6-Samples , VisionWorks-SFM-0.90-Samples but failed to get hold of any sample which uses Unified Memory in this way.

Could you please let me know if I am missing something here or can you point me to any other place where I can look out for similar samples.

Thanks.

AastaLLL · October 12, 2017, 8:13am

Hi,

If you are looking for an optimal example for the pipeline like:
V4l2 camera → DMA buffer → CUDA

It’s recommeneded to check our jetson_inference sample.
This sample read camera via GStreamer, feed the data to CUDA for preprocess, and parse the buffer to TensorRT for deep learning inferencing.

Similar workflow should also be available in MMAPI.
Thanks.

anas.abuzaina427ed · October 12, 2017, 2:18pm

Hi I have the same problem with Unified memory on the Jetson TX2. I have a custom camera that gives a pointer to image data at 30 fps, i want to run some OpenCV Cuda kernels on the images, I managed to do it with Zero copy but I want to use unified memory to improve the speed. However I keep getting the following error :

what():  /home/nvidia/OpenCV/3/opencv-3.3.0/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:270: error: (-217) unspecified launch failure in function call

My code looks like this:

int main(int argc, char * argv[])
{

int rows = 2160;
    int cols = 3840;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    
    uchar* ptr;
    
    cudaMallocManaged(&ptr, sizeof(uchar)*rows*cols);

cv::cuda::Stream stream;

   for (int loop=0; loop <30; ++loop)
    {
    
     ptr = getframePtrFromCamera();

    cv::cuda::GpuMat gm_1(cv::Size(cols, rows), CV_8UC3 , ptr);     cudaDeviceSynchronize();

    cv::cuda::cvtColor(gm_1, gm_1, cv::COLOR_RGB2BGR, 0,stream);   

   }

return 0;
}

AastaLLL can you please edit my code to work using unified memory ? Many thanks

AastaLLL · October 13, 2017, 2:57am

Hi,

Support your camera capture color image, please allocate buffer for three color channel:

cudaMallocManaged(&ptr, sizeof(uchar)*rows*cols);
->
cudaMallocManaged(&ptr, <b>3</b>*sizeof(uchar)*rows*cols);

lamegeorge · October 13, 2017, 8:51am

Hi anas.abuzaina427ed / AastaLLL,

It seems like you are trying to use Unified memory similar to what I intend to.

Could you please let me know what getframePtrFromCamera() does and how you are using cv::cuda::Stream ? If possible, could you please share the code snippet to get the pointer from camera ?

Thanks.

anas.abuzaina427ed · October 13, 2017, 3:23pm

AsstaLLL that did not work, I still get the same error. I noticed something strange. If you remove the for loop it will work, so if you do it once it will work, but if it is recursive ( in the for loop) it gives that error? Do you have any idea why? It is important for me that it works recursively as I am capturing live frames.

lamegeorge, that functions is from the camera api I am using, it simply returns a uchar pointer to the image data.

lamegeorge · October 13, 2017, 6:00pm

Hi anas.abuzaina427ed,

I have the following queries in this context :

If we replace the pointer ‘ptr’ by the frame data pointer returned by camera API, then how is Unified memory which has been allocated earlier still in play ?
The frame data pointer returned by camera API refers to memory on the CPU or GPU ?

Thanks.

AastaLLL · October 16, 2017, 3:38am

Hi anas.abuzaina427ed,

Not sure if the memory pointer refreshes by frame.
Usually, we update frame data to the same buffer location.

We will check this issue later and update information to you.

Thanks.

anas.abuzaina427ed · October 16, 2017, 1:50pm

Hi AsataLLL, I will be waiting for your reply.

Hi Lamegeorge, for question 1, according to my understanding you with unified memory you can alocate the memory then fill it, but I am not sure about it. For question too. Since the CPU and the GPU share the same memory it should not make a difference, and the idea behind all this is to avoid copying between GPU and CPU. In my code the pointer is in the CPU but I want to make it unified so the GPU can use it without copying it first.

AastaLLL · October 18, 2017, 5:43am

Hi,

Sorry for the late.
From this command, the buffer pointer changes from time to time:

ptr = getframePtrFromCamera();

Not sure if this pointer is a GPU-touchable pointer.
Please recheck the getframePtrFromCamera() function for details.

Thanks.