Tegra K1 MatVec Multiplication Benchmark Revision (Zero Copy vs Unified Memory)

Hello CUDA Performance Forum,

First of all, I would like to apologize for any mistakes I make, since English is not my native language. Second, the code below might be a little confusing. I was responsible for implementing MPI and CUDA for our university project and recently switched to running some benchmark tests. (The .cu file is compiled separately and linked by the C++ compiler, which compiles a .cpp file containing the main routine.)

Well, I am currently comparing the memory options known as unified memory and zero copy, running the same kernel with both. The kernel at hand is a matrix-vector multiplication with the matrix stored in the ELLPACK storage format. Since the matrix consists of 7 diagonals, the data and indices arrays have a dimension of dim*7. I get the following run time results:
[Graph: zero copy vs. unified memory run time]

On the left side we see the kernel run time, where the unified memory kernel performs about 3x better than the zero copy kernel. I hit a peak bandwidth of close to 12 GB/s out of the possible 14.9 GB/s using unified memory. Overall, the two methods do not differ very much.
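(The bandwidth figure is estimated roughly like the sketch below: only the streaming accesses to data, indices and result are counted, and the gathered fvec reads are ignored since they mostly hit the cache. The exact accounting may be slightly different, so take it as an approximation for the float case.)

// Rough effective-bandwidth estimate for the float case (simplification:
// only data, indices and the result write are counted as DRAM traffic).
double effective_bandwidth_gb_per_s(float kernel_ms, int dim_local, int max_row_length)
{
    double bytes = (double)dim_local * max_row_length * (sizeof(float) + sizeof(int))
                 + (double)dim_local * sizeof(float);          // result write
    return bytes / ((double)kernel_ms * 1.0e6);                // ms -> s and B -> GB combined
}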

Now, please compare my results to the graphs on pages 53 and 54 of a presentation held at the GPU Technology Conference 2014. There, the kernel using zero copy scores better than or equal to the kernel using unified memory, which is a discrepancy with my results. Overall, the zero copy method has a slight offset compared to the other options, but that does not seem too significant and is roughly the same result I got.

So (finally) here are my questions:

  • Why is the zero copy option slower than the unified memory option in my example?
  • Why is the zero copy option better or equal in the GPU Technology Conference example?
  • Why is there a discrepancy between the two?
  • Did I time everything correctly, or is there a mistake in my code?

As stated above, all CUDA functions are in a different .cu file than the main routine. In between the following allocation and kernel call functions, data is copied from a CPU-computed sample matrix. I do multiple runs to guarantee the measurement runs long enough for smaller dimensions; the standard is 1000 iterations.
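In essence, one unified memory measurement is driven roughly like this (simplified illustration only; maxrowlength, dimlocal, dimfvec and the *_host arrays come from the CPU-side sample matrix, and cleanup mode 0 = cudaFree is assumed for the managed allocations):

float *data = NULL, *fvec = NULL, *result = NULL;
int *indices = NULL;

alloc_unified(&data, &fvec, &result, &indices, maxrowlength, dimlocal, dimfvec);
set_values(data_host, indices_host, fvec_host, data, indices, fvec,
           maxrowlength, dimlocal, dimfvec);

// runs = 1000 by default; the function returns the average kernel time in ms
float kernel_ms = mult_vec_unified_time(data, fvec, result, indices,
                                        maxrowlength, dimlocal, dimfvec, runs);

cleanup(data, fvec, result, indices, 0);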

Kernel:

template<typename type>
__global__ void gpu_ax(type* data, type* fvec, type* result, int* indices, int max_row_length, int dim_local)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < dim_local)
    {
        int col;
        type svalue = 0, value;
        // ELLPACK arrays are stored column-major, so consecutive threads read consecutive addresses
        for (int i = 0; i < max_row_length; i++)
        {
            value = data[i*dim_local+idx];
            col = indices[i*dim_local+idx];
            svalue += value*fvec[col];
        }
        result[idx] = svalue;
    }
}
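To make the indexing clearer: entry i of row idx sits at i*dim_local+idx. A purely illustrative host-side fill for a matrix with 7 diagonals could look like the sketch below. The diagonal offsets here are made up (e.g. a 7-point stencil); the real values come from our precomputed sample matrix.

// Illustration only: column-major ELLPACK fill for a 7-diagonal matrix.
// Padded entries get value 0 and column index 0, so the kernel adds 0*fvec[0].
template<typename Scalar>
void fill_ellpack(Scalar* data, int* indices, int dim_local, int max_row_length)
{
    const int offsets[7] = {-100, -10, -1, 0, 1, 10, 100};   // example offsets only
    for (int i = 0; i < max_row_length; i++)                  // one diagonal per ELLPACK "column"
    {
        for (int row = 0; row < dim_local; row++)
        {
            int col = row + offsets[i];
            bool valid = (col >= 0 && col < dim_local);
            data[i*dim_local + row]    = valid ? Scalar(1) : Scalar(0);  // dummy value
            indices[i*dim_local + row] = valid ? col : 0;
        }
    }
}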


Unified Memory:

Allocation:

template<typename Scalar>
void alloc_unified(Scalar **data, Scalar **fvec, Scalar **result, int **indices, int max_row_length, int dim_local,int dim_fvec)
{
    cudaMallocManaged((void **)data, sizeof(Scalar)*dim_local*max_row_length);
    cudaMallocManaged((void **)fvec, sizeof(Scalar)*dim_fvec);
    cudaMallocManaged((void **)result, sizeof(Scalar)*dim_local);
    cudaMallocManaged((void **)indices, sizeof(int)*dim_local*max_row_length);
}
template void alloc_unified<int>(int **data, int **fvec, int **result, int **indices, int max_row_length, int dim_local, int dim_fvec);
template void alloc_unified<float>(float **data, float **fvec, float **result, int **indices, int max_row_length, int dim_local, int dim_fvec);
template void alloc_unified<double>(double **data, double **fvec, double **result, int **indices, int max_row_length, int dim_local, int dim_fvec);

Kernel setup and call:

template<typename Scalar>
float mult_vec_unified_time(Scalar *data, Scalar *fvec, Scalar *result, int *indices, int max_row_length, int dim_local, int dim_fvec, int runs)
{
    cudaEvent_t start_unified, stop_unified;
    cudaEventCreate(&start_unified);
    cudaEventCreate(&stop_unified);
    
    // One thread per row: at most 1024 threads per block, rounded up to a multiple of 32
    int num_blocks = ceil((double)dim_local/1024);
    int num_threads = ceil(((double)dim_local/num_blocks)/32)*32;
    
    cudaEventRecord(start_unified);
    for (int i = 0; i < runs; i++)
    {
        gpu_ax<<<num_blocks,num_threads>>>(data,fvec,result,indices,max_row_length, dim_local);
        
    }
    cudaEventRecord(stop_unified);
    
    cudaEventSynchronize(stop_unified);
    float elapsedTime_unified = 0.0;
    cudaEventElapsedTime(&elapsedTime_unified, start_unified, stop_unified);

    cudaEventDestroy(start_unified);
    cudaEventDestroy(stop_unified);
    cudaDeviceSynchronize();
    return (elapsedTime_unified / (float)runs);
}
template float mult_vec_unified_time<int>(int* data, int* fvec, int* result, int* indices, int max_row_length, int dim_local,int dim_fvec, int runs);
template float mult_vec_unified_time<float>(float* data, float* fvec, float* result, int* indices, int max_row_length, int dim_local, int dim_fvec, int runs);
template float mult_vec_unified_time<double>(double* data, double* fvec, double* result, int* indices, int max_row_length, int dim_local, int dim_fvec, int runs);

Zero Copy:
Allocation:

template<typename Scalar>
void alloc_zero(Scalar **data, Scalar **fvec, Scalar **result, int ** indices, int max_row_length, int dim_local, int dim_fvec)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop,0);

    if(prop.canMapHostMemory)
    {
      cudaSetDeviceFlags(cudaDeviceMapHost);

      cudaHostAlloc((void **)data, sizeof(Scalar)*max_row_length*dim_local, cudaHostAllocMapped);
      cudaHostAlloc((void **)fvec, sizeof(Scalar)*dim_fvec, cudaHostAllocMapped);
      cudaHostAlloc((void **)result, sizeof(Scalar)*dim_local, cudaHostAllocMapped);
      cudaHostAlloc((void **)indices, sizeof(int)*max_row_length*dim_local, cudaHostAllocMapped);
    }
}
template void alloc_zero<int>(int **data, int **fvec, int **result, int **indices, int max_row_length, int dim_local, int dim_fvec);
template void alloc_zero<float>(float **data, float **fvec, float **result, int **indices, int max_row_length, int dim_local, int dim_fvec);
template void alloc_zero<double>(double **data, double **fvec, double **result, int **indices, int max_row_length, int dim_local, int dim_fvec);

Kernel setup and call:

template<typename Scalar>
float mult_vec_zero_time(Scalar *data, Scalar *fvec, Scalar *result, int *indices, int max_row_length, int dim_local, int dim_fvec, int runs)
{
    cudaEvent_t start_zero, stop_zero;
    cudaEventCreate(&start_zero);
    cudaEventCreate(&stop_zero);

    Scalar *d_data, *d_fvec, *d_result;
    int *d_indices;

    cudaHostGetDevicePointer((void **)&d_data,(void *)data, 0);
    cudaHostGetDevicePointer((void **)&d_fvec, (void *)fvec, 0);
    cudaHostGetDevicePointer((void **)&d_result, (void *)result, 0);
    cudaHostGetDevicePointer((void **)&d_indices, (void *)indices, 0);

    int num_blocks = ceil((double)dim_local/1024);
    int num_threads = ceil(((double)dim_local/num_blocks)/32)*32;

    cudaEventRecord(start_zero);
    for (int i=0;i<runs;i++)
    {
        gpu_ax<<<num_blocks,num_threads>>>(d_data, d_fvec, d_result, d_indices, max_row_length, dim_local);
   
    }
    cudaEventRecord(stop_zero);

    cudaEventSynchronize(stop_zero);
    float elapsedTime_zero = 0.0;
    cudaEventElapsedTime(&elapsedTime_zero, start_zero, stop_zero);

    cudaEventDestroy(start_zero);
    cudaEventDestroy(stop_zero);
    cudaDeviceSynchronize();
    cleanup(d_data, d_fvec, d_result, d_indices, 0);
    return (elapsedTime_zero /(float)runs);
}
template float mult_vec_zero_time<int>(int* data, int* fvec, int* result, int* indices, int max_row_length, int dim_local, int  dim_fvec, int runs);
template float mult_vec_zero_time<float>(float* data, float* fvec, float* result, int* indices, int max_row_length, int dim_local, int dim_fvec, int runs);
template float mult_vec_zero_time<double>(double* data, double* fvec, double* result, int* indices, int max_row_length, int dim_local, int dim_fvec, int runs);

I am looking forward to any discussion and answers, and would like to thank you in advance!

I don’t have experience with the TK1, or unified memory, but would make the following observations:

(1) In a TK1 system the host and device memories are physically the same memory
(2) As pointed out in other threads, 12-13 GB/sec is the maximum practically achievable bandwidth for that memory, and your app is achieving it
(3) According to your graph, app performance using either unified memory or zero copy is essentially identical. While individual components of overhead may be distributed differently, overall overhead is identical for both versions as they are ultimately limited by memory throughput.

Based on that, I don’t find your high-level results surprising. People familiar with the TK1 may be able to point out relevant issues affecting this application.
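As a quick sanity check on those numbers (assuming the usual TK1 configuration, a single 64-bit LPDDR3 interface at 933 MHz, i.e. 1866 MT/s): 1866e6 transfers/s x 8 bytes ≈ 14.9 GB/sec theoretical peak, so 12-13 GB/sec corresponds to the roughly 80-85% of peak that well-tuned memory-bound code typically reaches.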

Let's start with your code. How have you measured the overall runtime for the zero-copy version? It's 10x larger than the kernel times; am I correct that you essentially measured the time required to allocate and initialize memory on the CPU side?

That is a point I do not fully understand yet. Since the host and device memory are physically the same, I would expect both kernel calls to take the same amount of time (like the examples from the GPU Technology Conference). I would like to dig deeper here and get a precise answer on why the zero copy version of my app is significantly slower when it comes to kernel time.

This is the routine I use to get my overall timings. The code below shows the zero copy version; the unified memory version is practically the same. The function set_values does, as described in my first post, a copy from a sample matrix into the data currently in use. The other functions can be seen above:

timer_overall.start();
    for (int r = 0; r<iteration; r++)
    {
        float *data_zero = NULL;
        float *fvec_zero = NULL;
        float *result_zero = NULL;
        int *indices_zero = NULL;

        alloc_zero(&data_zero, &fvec_zero, &result_zero, &indices_zero, maxrowlength, dimlocal, dimfvec);
        set_values(data_host, indices_host, fvec_host, data_zero, indices_zero, fvec_zero, maxrowlength, dimlocal, dimfvec);

        //Kernel
        mult_vec_zero(data_zero, fvec_zero, result_zero, indices_zero, maxrowlength, dimlocal, dimfvec);

        //TODO: test (0=cudaFree, 1=cudaFreeHost, 2=delete[])
        //cleanup(data_zero, fvec_zero, result_zero, indices_zero, 0);
        cleanup(data_zero, fvec_zero, result_zero, indices_zero, 1);
        //cleanup(data_zero, fvec_zero, result_zero, indices_zero, 2);
    }
    float elapsed_zero_overall = timer_overall.stop()/(float) iteration;

I measure everything here: the whole process of allocating, launching the kernel, and cleaning everything up again. This is more or less the setup we are using within our university project for one node.
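For completeness, set_values itself is essentially just a straight copy of the precomputed host data into the freshly allocated buffers, roughly along these lines (simplified sketch, not the exact code):

// Simplified sketch of set_values: plain copies of the precomputed sample
// matrix into the freshly allocated (zero copy or unified) buffers.
// Needs <cstring> for memcpy.
template<typename Scalar>
void set_values(const Scalar* data_host, const int* indices_host, const Scalar* fvec_host,
                Scalar* data, int* indices, Scalar* fvec,
                int max_row_length, int dim_local, int dim_fvec)
{
    memcpy(data,    data_host,    sizeof(Scalar)*max_row_length*dim_local);
    memcpy(indices, indices_host, sizeof(int)*max_row_length*dim_local);
    memcpy(fvec,    fvec_host,    sizeof(Scalar)*dim_fvec);
}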