How to use cudaMallocHost?

Can someone explain how to use cudaMallocHost? My code works with cudaMalloc. Naively, I thought I could simply change cudaMalloc to cudaMallocHost and the code would still work, but it seems other changes are required. I have tried everything I can think of and need help.

Here are code snippets that show the way I am making all the calls. The code does exactly what I want in the kernel; everything works, although it is too slow.
So I thought I would change cudaMalloc to cudaMallocHost to see what difference in speed there would be, but then I get no results at all. The code compiles and I get no run-time errors, but the kernel call FunctionOnDevice does nothing, or if it does, the way I am coding the calls is faulty…

I have tried all I can think of. Should cudaMemcpyHostToDevice also be changed to cudaMemcpyHostToHost? I tried that, but it did not help. I am sure someone will know instantly what trivial thing I am doing wrong, and I hope they can tell me. Thanks. And maybe the answer will help others who are as naive as me.

All I want to do is change cudaMalloc to cudaMallocHost and be told what else to alter in the code snippets below. Or maybe I do not understand something and it is not possible? Thanks.

// gs is used by the helper functions as well as main, so it lives at file scope
const int gs = 8;

// kernel defined elsewhere
__global__ void FunctionOnDevice(int* nots);

void dumb_rejecting(int* g_nots, int* dev_nots)
{
    cudaMemcpy(dev_nots, g_nots, gs*gs*sizeof(int), cudaMemcpyHostToDevice);

    int blockSize = gs*gs;   // always less than 512
    int nBlocks = 1;

    FunctionOnDevice <<< nBlocks, blockSize >>> (dev_nots);

    cudaMemcpy(g_nots, dev_nots, gs*gs*sizeof(int), cudaMemcpyDeviceToHost);
}

void dumb_not_allowed(int* g_nots, int* dev_nots)
{
    dumb_rejecting(g_nots, dev_nots);
}

int main(int argc, char** argv)
{
    int* g_nots = NULL;
    g_nots = new int[gs*gs];

    int* dev_nots;
    cudaMalloc((void **) &dev_nots, sizeof(int)*gs*gs);

    dumb_not_allowed(g_nots, dev_nots);

    cudaFree(dev_nots);
    delete[] g_nots;
    return 0;
}

AFAIK you can't use cudaMallocHost just instead of cudaMalloc, because cudaMallocHost allocates "pinned memory" on the host (RAM), not on the device (GPU).

To expand on this… remember the CUDA device has its own memory space, and you manipulate that with cudaMalloc and cudaFree. When you do a computation, you copy data from the CPU to the GPU (using cudaMemcpy), run your calculation on the GPU, and then copy your results back to the CPU.[1] As explained in some other thread (I don't have the reference to hand), cudaMemcpy does a DMA transfer from the CPU to the GPU. For this to work, it's rather important that the memory on the CPU doesn't get paged out mid-transfer. Memory allocated using malloc() or new is pageable, and hence unsuitable. What you need is pinned memory, which can't be swapped. Internally, cudaMemcpy first copies the designated bytes to an internal buffer of pinned memory, and then DMAs that to the GPU.

This extra copy obviously incurs extra overhead, which would be nice to avoid. Enter cudaMallocHost. This allocates pinned memory on the CPU. More to the point, it's pinned memory which the CUDA driver knows about. If you do a host-to-device transfer from memory allocated via cudaMallocHost, the CUDA library knows that the source memory is pinned, and so it does the DMA directly (skipping the copy to an internal buffer). This substantially increases the effective bandwidth to the GPU (a factor of two is typical).
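
To make the difference concrete, here is a minimal self-contained sketch (my own example, not from the thread) that times the same host-to-device copy from a pageable malloc'd buffer and from a pinned cudaMallocHost buffer; the buffer size and the helper name time_h2d are arbitrary choices for illustration.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy of 'bytes' bytes from 'src' into 'dst'.
static float time_h2d(void* dst, const void* src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, arbitrary test size

    float* d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // Pageable host buffer (plain malloc): cudaMemcpy stages it through
    // an internal pinned buffer before the DMA.
    float* h_pageable = (float*)malloc(bytes);
    printf("pageable: %.2f ms\n", time_h2d(d_buf, h_pageable, bytes));

    // Pinned host buffer (cudaMallocHost): the driver can DMA directly.
    float* h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);
    printf("pinned:   %.2f ms\n", time_h2d(d_buf, h_pinned, bytes));

    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}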

[1] Neglecting ZeroCopy, introduced in CUDA 2.2. However, that's something you'll probably want to neglect unless you really need it.
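
For completeness, here is a minimal sketch of the zero-copy path mentioned in the footnote (again my own example, with arbitrary names and sizes): the host buffer is allocated as mapped pinned memory with cudaHostAlloc, and the kernel accesses it through a device pointer obtained from cudaHostGetDevicePointer, so no explicit cudaMemcpy is needed.

#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // reads/writes go over the bus, no explicit cudaMemcpy
}

int main()
{
    const int n = 1024;

    // Must be set before the context is created if host mapping is wanted.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped, pinned host allocation: visible to the device (zero-copy).
    float* h_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Device-side alias of the same host memory.
    float* d_alias;
    cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();      // results are already in h_data, no copy back

    cudaFreeHost(h_data);
    return 0;
}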


Thank you for your replies. Reading them made me understand how I had to alter the code.

I had to alter the allocation of 'g_nots' in the code above (and not 'dev_nots').

I changed

int* g_nots = NULL;
g_nots = new int[gs*gs];

TO

int* g_nots = NULL;
cudaMallocHost((void **) &g_nots, sizeof(int)*gs*gs);

The performance was almost twice as fast (as you predicted). It is still not as fast as not using CUDA at all, but at least I have some new understanding of how best to use CUDA.
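
One detail worth adding here (not mentioned in the original post): memory allocated with cudaMallocHost must be released with cudaFreeHost rather than delete[] or free, roughly like this:

int* g_nots = NULL;
cudaMallocHost((void **) &g_nots, sizeof(int)*gs*gs);

// ... kernel launches and copies using g_nots ...

cudaFreeHost(g_nots);   // not delete[] g_nots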

I'm still a little confused about how to use cudaMallocHost. I want to put a chunk of pinned memory on the device so it can be accessed faster than global or shared memory. I had the same naive approach of just swapping out cudaMalloc for cudaMallocHost. What am I doing wrong?

I've included a code snippet of what I'm doing.

I am running on Windows 7, a Tesla C2050, CUDA driver 4.2.

...

        host_R1 = (float*)malloc(999*sizeof(float));
        host_R2 = (float*)malloc(999*sizeof(float));

...

        /* Pointers for the device memory */
        float  *device_a;
        double *device_b;
        float  *device_R1, *device_R2;   // Complex

        /* Allocate memory on the device */
        cudaMalloc( (void **) &device_a,  sizeof(float)*999);
        cudaMalloc( (void **) &device_b,  sizeof(double)*999);
        cudaMalloc( (void **) &device_R1, sizeof(double)*512);
        cudaMalloc( (void **) &device_R2, sizeof(double)*512);

        /* Allocate memory on the device as pinned */
//      cudaMallocHost( (void **) &device_a,  sizeof(float)*999);
//      cudaMallocHost( (void **) &device_b,  sizeof(double)*999);
//      cudaMallocHost( (void **) &device_R1, sizeof(double)*512);
//      cudaMallocHost( (void **) &device_R2, sizeof(double)*512);

        /* Copy inputs to the device */
        cudaMemcpy(device_a, host_a, sizeof(float)*999,  cudaMemcpyHostToDevice);
        cudaMemcpy(device_b, host_b, sizeof(double)*999, cudaMemcpyHostToDevice);

        launchKernel<<<1,512>>>(device_a, device_b, ......, device_R1, device_R2);

        /* Get the results back off the device */
        cudaMemcpy(host_R1, device_R1, sizeof(float)*512, cudaMemcpyDeviceToHost);
        cudaMemcpy(host_R2, device_R2, sizeof(float)*512, cudaMemcpyDeviceToHost);

        /* Free all CUDA allocated memory */
        cudaFree(device_a);
        cudaFree(device_b);
        cudaFree(device_R1);
        cudaFree(device_R2);

        return;

There is no pinned memory on the device; pinned memory is always located on the host. Or, to put it differently, device memory is always pinned, because there is no mechanism to swap out device memory.
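
Applied to the snippet above, that means keeping cudaMalloc for device_a, device_b, device_R1 and device_R2, and using cudaMallocHost only for the host-side buffers that cudaMemcpy reads from or writes to. A rough sketch, reusing the names from the snippet (sizes simplified for the example):

        /* Host buffers: pinned, so the copies back from the device can DMA directly */
        float *host_R1, *host_R2;
        cudaMallocHost((void **) &host_R1, 999*sizeof(float));
        cudaMallocHost((void **) &host_R2, 999*sizeof(float));

        /* Device buffers: plain cudaMalloc, exactly as before */
        float *device_R1, *device_R2;
        cudaMalloc((void **) &device_R1, sizeof(float)*512);
        cudaMalloc((void **) &device_R2, sizeof(float)*512);

        /* ... launch the kernel, then copy results into the pinned host buffers ... */
        cudaMemcpy(host_R1, device_R1, sizeof(float)*512, cudaMemcpyDeviceToHost);
        cudaMemcpy(host_R2, device_R2, sizeof(float)*512, cudaMemcpyDeviceToHost);

        /* Pinned host memory is released with cudaFreeHost, device memory with cudaFree */
        cudaFreeHost(host_R1);
        cudaFreeHost(host_R2);
        cudaFree(device_R1);
        cudaFree(device_R2);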

So what is it that I am thinking of, then? I was under the impression it would be possible to transfer a chunk of data to pinned/page-locked memory on the device to make access faster than accessing it from shared memory. Or is this wrong?
If so, aside from concurrent access (copy, compute), what can I do to improve performance in my application? I'm currently using shared memory in the kernel to increase performance, but this is still insufficient.