If a block of host memory is registered via cudaHostRegister, does that speed up cudaMemcpy operations to any part of the block, or only copies where the address passed to cudaMemcpy* is the same address that was passed to cudaHostRegister?
That is:
void * ptr = malloc(4096);
void *dev; cudaMalloc(&dev, 4096);
cudaHostRegister(ptr, 4096, cudaHostRegisterDefault);
cudaMemcpy(ptr, dev, 2048, cudaMemcpyDeviceToHost); // accelerated
cudaMemcpy((char *)ptr + 2048, (char *)dev + 2048, 2048, cudaMemcpyDeviceToHost); // accelerated ???
- Will the second cudaMemcpy call recognize that the memory is registered?
- Is there any (substantive) penalty for using the interior pointer?
Thanks.