Questions regarding allocation of buffers/memory

Below are two snippets of code that perform the same task. The kernel takes an input array, adds a constant to it, and returns the output.

Could someone explain to me the differences between these two snippets of code?

The first one is significantly faster than the second snippet.

Is this maybe because the first one is using pinned memory that resides on the GPU?

If you see/know something, please let me know!

Thanks, BHa

// First snippet: mapped (zero-copy) pinned host buffers, passed straight to the kernel
float *inputA  = NULL;
float *outputA = NULL;

cudaHostAlloc((void**)&inputA, dataSize, cudaHostAllocMapped);
cudaHostAlloc((void**)&outputA, dataSize, cudaHostAllocMapped);

add<<<numBlocks, blockSize>>>(inputA, outputA);

cudaFreeHost(inputA);
cudaFreeHost(outputA);

// Second snippet: pinned host buffers plus explicit device buffers and copies
float *inputA  = NULL;
float *devA    = NULL;
float *outputA = NULL;
float *devOutA = NULL;

cudaMallocHost(&inputA, dataSize);
cudaMalloc(&devA, dataSize);
cudaMallocHost(&outputA, dataSize);
cudaMalloc(&devOutA, dataSize);

cudaMemcpyAsync(devA, inputA, dataSize, cudaMemcpyDefault);
add<<<numBlocks, blockSize>>>(devA, devOutA);
cudaMemcpyAsync(outputA, devOutA, dataSize, cudaMemcpyDeviceToHost);

cudaFreeHost(inputA);
cudaFree(devA);
cudaFreeHost(outputA);
cudaFree(devOutA);

Pinned memory doesn’t reside on the GPU.

If you are only focused on kernel execution time (it’s not clear how you are measuring time), then I would expect the second case to be faster than the first.

If you are timing the entire code sequence, I suppose it might be possible that the first case is faster; perhaps you are timing CUDA start-up overhead, or perhaps the additional cudaMalloc/cudaMemcpy operations are enough to make the difference. A well-written kernel that only reads and writes the data once, and does so in an orderly fashion, will incur approximately the same cost reading from mapped host memory as it would by performing the cudaMemcpy operations before and after.
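
To make that concrete, here is a minimal sketch of what the mapped (zero-copy) path does: the buffers stay in pinned host RAM, and the kernel’s loads and stores travel over PCIe. This is just an illustration (the kernel and buffer names are placeholders, not your code); on a UVA system the host pointers can be passed to the kernel directly, as in your first snippet.

#include <cstdio>

__global__ void add(const float *in, float *out, int n) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = in[idx] + 1.0f;   // every load/store here crosses PCIe
}

int main() {
  const int N = 1 << 20;
  const int blockSize = 512;
  float *hIn = NULL, *hOut = NULL, *dIn = NULL, *dOut = NULL;

  cudaSetDeviceFlags(cudaDeviceMapHost);  // needed on older non-UVA setups; harmless otherwise

  // Both buffers live in pinned host RAM; nothing is allocated in GPU memory.
  cudaHostAlloc((void**)&hIn,  N * sizeof(float), cudaHostAllocMapped);
  cudaHostAlloc((void**)&hOut, N * sizeof(float), cudaHostAllocMapped);

  // Device-side aliases of the host pointers (with UVA the host pointers work directly).
  cudaHostGetDevicePointer((void**)&dIn,  hIn,  0);
  cudaHostGetDevicePointer((void**)&dOut, hOut, 0);

  for (int i = 0; i < N; ++i) hIn[i] = (float)i;

  add<<<(N + blockSize - 1) / blockSize, blockSize>>>(dIn, dOut, N);
  cudaDeviceSynchronize();

  // The result is already in host memory; no cudaMemcpy is needed.
  printf("hOut[0] = %f\n", hOut[0]);

  cudaFreeHost(hIn);
  cudaFreeHost(hOut);
  return 0;
}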

txbob

I’m looking at the overall execution in NVVP.

I’m also timing the kernel execution times in my code (I left it out to be concise).

Overall execution is roughly the same (1.46 s vs 1.34 s), but the kernel execution times are different (these include transfer times).

In my actual code, I’m comparing a 3-stream system using the second snippet against a single stream using the first snippet, and the single stream is 33% faster on 16 MB data sets (4 sets of 2): ~12 ms vs ~18 ms.

I also assumed that the second method would be faster, txbob. I followed Mark Harris’ “How to Optimize Data Transfers” and “How to Overlap Data Transfers” posts, which are closer to the second method, assuming he knew better than I do.
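
For reference, my 3-stream version follows roughly the pattern from those posts, something like the sketch below (not my exact code; the stream count, chunk size, and kernel are placeholders):

#include <cstdio>

const int N = 12 * 1024 * 1024;   // placeholder size, chosen to divide evenly by the stream count
const int nStreams = 3;
const int blockSize = 512;

__global__ void add(const float *in, float *out, int n) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = in[idx] + 1.0f;
}

int main() {
  const int chunk = N / nStreams;
  const size_t chunkBytes = chunk * sizeof(float);
  float *hIn = NULL, *hOut = NULL, *dIn = NULL, *dOut = NULL;

  cudaMallocHost(&hIn,  N * sizeof(float));   // pinned host buffers so the
  cudaMallocHost(&hOut, N * sizeof(float));   // async copies can overlap
  cudaMalloc(&dIn,  N * sizeof(float));
  cudaMalloc(&dOut, N * sizeof(float));

  cudaStream_t streams[nStreams];
  for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

  for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunk;
    // Each stream gets its own H2D copy, kernel, and D2H copy; copies in one
    // stream can overlap with the kernel running in another.
    cudaMemcpyAsync(dIn + offset, hIn + offset, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    add<<<(chunk + blockSize - 1) / blockSize, blockSize, 0, streams[i]>>>(
        dIn + offset, dOut + offset, chunk);
    cudaMemcpyAsync(hOut + offset, dOut + offset, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
  cudaFreeHost(hIn); cudaFreeHost(hOut);
  cudaFree(dIn); cudaFree(dOut);
  printf("done\n");
  return 0;
}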

This is why I brought this question up: I don’t understand what is happening and was hoping someone else would.

Thanks,

BHa

(If required/desired, I would be willing to post or email both code bases, as well as upload screenshots of NVVP for both.)

At the kernel level, the 2nd method should definitely be faster.

My proof case:

$ cat t103.cu
#include <stdio.h>

const int N = 16384*1024;
const float offset = 1.0f;
const size_t dataSize = N*sizeof(float);
const int blockSize = 512;

__global__ void add(float *in, float *out){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;

  if (idx < N) out[idx] = in[idx] + offset;
}

int main(){

  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int numBlocks = (N+blockSize-1)/blockSize;
  float *inputA  = NULL;
  float *outputA = NULL;

  cudaHostAlloc((void**)&inputA, dataSize, cudaHostAllocMapped);
  cudaHostAlloc((void**)&outputA, dataSize, cudaHostAllocMapped);

  add<<<numBlocks, blockSize>>>(inputA, outputA); // warm up
  cudaEventRecord(start);
  add<<<numBlocks, blockSize>>>(inputA, outputA);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("mapped: %fms\n", et);

  cudaFreeHost(inputA);
  cudaFreeHost(outputA);

  float *devA    = NULL;
  float *devOutA = NULL;

  cudaMallocHost(&inputA, dataSize);
  cudaMalloc(&devA, dataSize);
  cudaMallocHost(&outputA, dataSize);
  cudaMalloc(&devOutA, dataSize);

  cudaMemcpyAsync(devA, inputA, dataSize, cudaMemcpyDefault);
  add<<<numBlocks, blockSize>>>(devA, devOutA); // warm up
  cudaEventRecord(start);
  add<<<numBlocks, blockSize>>>(devA, devOutA);
  cudaEventRecord(stop);
  cudaMemcpyAsync(outputA, devOutA, dataSize, cudaMemcpyDeviceToHost);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("ordinary: %fms\n", et);

  cudaFreeHost(inputA);
  cudaFree(devA);
  cudaFreeHost(outputA);
  cudaFree(devOutA);
}
$ nvcc -arch=sm_61 -o t103 t103.cu
$ ./t103
mapped: 6.300672ms
ordinary: 0.387072ms
$

Ubuntu 14.04, CUDA 8.0.61, Titan X (Pascal)

Since you are on Windows, you want to be sure that you are building a release project for timing analysis, not a debug project. The Windows WDDM driver model can also get in the way of accurate timing, although the profiler should give accurate numbers for kernel execution.

Maybe it is my side then…
Running your example, this is what I got.

mapped: 0.002528ms
ordinary: 0.002560ms

On Windows 10 with a K5100M.

This is quite interesting…

Maybe something to do with Compute Capability?

Those kernels are not launching correctly.

Yes, you need to set the compute capability of the project to match your GPU; otherwise my kernels will not launch correctly (they won’t launch if you are compiling for the default cc2.0 architecture). I didn’t put error checking in the code, but proper error checking, and/or running the code with cuda-memcheck or with Nsight’s memory checking enabled, should also identify the issue.
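
A minimal error-checking sketch along those lines (the macro name and the kernel’s n parameter are just illustrative, not taken from the code above):

#include <cstdio>
#include <cstdlib>

// Simple error-checking helper: print the error string and bail out.
#define cudaCheck(call)                                              \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
              cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

__global__ void add(const float *in, float *out, int n) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = in[idx] + 1.0f;
}

int main() {
  const int N = 16384 * 1024;
  const int blockSize = 512;
  const int numBlocks = (N + blockSize - 1) / blockSize;
  float *devA = NULL, *devOutA = NULL;

  cudaCheck(cudaMalloc(&devA, N * sizeof(float)));
  cudaCheck(cudaMalloc(&devOutA, N * sizeof(float)));

  add<<<numBlocks, blockSize>>>(devA, devOutA, N);
  cudaCheck(cudaGetLastError());       // catches launch-configuration problems (e.g. a grid too large for the compiled arch)
  cudaCheck(cudaDeviceSynchronize());  // surfaces errors that occur during kernel execution

  cudaCheck(cudaFree(devA));
  cudaCheck(cudaFree(devOutA));
  printf("kernel launched and completed without errors\n");
  return 0;
}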

Okay!

I’ll look into that, txbob.

I think you might be on to something, as I have not been error checking.
In my original code I was checking the output against the input, but maybe something else is happening that isn’t evident.

Thanks!

BHa

You can also work around (eliminate) the problem with the default cc2.0 compilation by either reducing N by a factor of 2 or increasing the threadblock size from 256 to 512. I’ve edited the code above to make the latter change.
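
The arithmetic behind that, as a back-of-the-envelope sketch (my own illustration): a cc2.x build is limited to 65535 blocks in the grid x-dimension, and 16384*1024 elements at 256 threads per block needs 65536 blocks, one too many; halving N or doubling the block size brings it down to 32768.

#include <cstdio>

// Quick check of the grid sizes involved (host-only code, just for the arithmetic).
int main() {
  const int N = 16384 * 1024;
  const int cc2xGridLimitX = 65535;   // max grid x-dimension for compute capability 2.x

  for (int blockSize = 256; blockSize <= 512; blockSize *= 2) {
    int numBlocks = (N + blockSize - 1) / blockSize;
    printf("blockSize %3d -> %5d blocks (%s the cc2.x limit of %d)\n",
           blockSize, numBlocks,
           numBlocks > cc2xGridLimitX ? "exceeds" : "within",
           cc2xGridLimitX);
  }
  return 0;
}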

Thanks!

I am running in Release x64 mode due to the exact reasons you mentioned earlier.

Rerunning your code, the numbers make a lot more sense now.
mapped: 7.020800ms
ordinary: 3.066848ms

This is more like what I expected to see as well, comparing my GPU to the Titan X (Pascal). (=P Love the fact that this needs to be included hahahaha.)

Thanks for all of your help and guidance txbob!

Hrmm this is quite interesting.

Using the same timing method that you used, the ordinary kernels execute roughly 3 times faster than the mapped ones in my code. But when transfer times are included, the mapped version comes out roughly 2 times faster than the ordinary one.

Speculating from that, ordinary memory might be better when you do extensive operations on your data, whereas mapped memory might be better for small operations.

Looking inside NVVP supports what I’m saying here as well.
Very interesting…

Yes, that is also expected for this particular kernel, which has very orderly memory access patterns and no data reuse. It will also depend on the GPU you are using; a fast GPU will not show a 2x improvement between the cases.

For the mapped case, we have 3 items to account for:

  1. data transfer to the device
  2. computation
  3. data transfer to host

For the “ordinary” case, we have 5 items to account for:

  1. data transfer from host to device memory
    2a. data transfer from device memory to the device (the kernel’s reads)
    2b. computation
    2c. data transfer from the device back to device memory (the kernel’s writes)
  3. data transfer from device memory to host

Items 1 and 3 are roughly comparable in their “cost” for each case.

For a relatively slow GPU, in the mapped case the activity associated with item 1 is actually a form of latency which can be partially “hidden” by GPU parallelism. As the code is constructed in the “ordinary” case, however, latency hiding doesn’t help with the host->device transfer. Furthermore, it’s evident that there are fewer operations to account for in the mapped case. But since the kernel in the mapped case absorbs all the time associated with items 1-3, whereas in the ordinary case the kernel only absorbs the time associated with 2a-2c, it should be evident that the kernel will be faster in the “ordinary” case.
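
If you want to see that accounting in numbers, you can put the events around the whole copy/kernel/copy sequence in the “ordinary” case, so that items 1 and 3 are included, and compare the result against the mapped kernel time (which already absorbs them). A rough sketch along the lines of t103.cu above, not something I’ve benchmarked here:

#include <stdio.h>

const int N = 16384*1024;
const float offset = 1.0f;
const size_t dataSize = N*sizeof(float);
const int blockSize = 512;

__global__ void add(float *in, float *out){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < N) out[idx] = in[idx] + offset;
}

int main(){
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int numBlocks = (N+blockSize-1)/blockSize;
  float *inputA  = NULL;
  float *outputA = NULL;
  float *devA    = NULL;
  float *devOutA = NULL;

  cudaMallocHost(&inputA, dataSize);
  cudaMallocHost(&outputA, dataSize);
  cudaMalloc(&devA, dataSize);
  cudaMalloc(&devOutA, dataSize);

  // warm up
  cudaMemcpyAsync(devA, inputA, dataSize, cudaMemcpyHostToDevice);
  add<<<numBlocks, blockSize>>>(devA, devOutA);
  cudaMemcpyAsync(outputA, devOutA, dataSize, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();

  // time the whole sequence: item 1 (H2D), items 2a-2c (the kernel), item 3 (D2H)
  cudaEventRecord(start);
  cudaMemcpyAsync(devA, inputA, dataSize, cudaMemcpyHostToDevice);
  add<<<numBlocks, blockSize>>>(devA, devOutA);
  cudaMemcpyAsync(outputA, devOutA, dataSize, cudaMemcpyDeviceToHost);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("ordinary, end to end: %fms\n", et);  // compare against the mapped kernel time from t103.cu

  cudaFreeHost(inputA);
  cudaFreeHost(outputA);
  cudaFree(devA);
  cudaFree(devOutA);
  return 0;
}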

You don’t want to draw a lot of conclusions from this however. If this were the exact workload you wanted, it would be foolish to do it on the GPU. The CPU would be faster than any of these approaches. Real workloads tend to benefit from having their data resident in GPU memory.

Thanks txbob,

This gives me insight into the inner workings. I totally agree with you that this workload is stupid for a GPU and this was mainly done just so that I could learn CUDA.

I really do appreciate all of your time and help on this though.

Thanks again,

BHa