cudaMallocHost How to use
Can someone explain how to use cudaMallocHost? My code works using cudaMalloc. Naively, I thought I could simply change cudaMalloc to cudaMallocHost and the code would still work, but it seems other changes are required. I have tried everything I can think of. I need help.

Here are code snippets that show the way I am making all the calls. The code does exactly what I want in the kernel. Everything is fine, although too slow.
So I thought I would change cudaMalloc to cudaMallocHost to see what difference in speed there would be, but then I get no results at all. The code compiles and I get no run-time errors, but I know the kernel call RejectingonDevice does nothing, or if it does, the way I am coding the calls is faulty.

I have tried everything I can think of. Should cudaMemcpyHostToDevice also be changed to cudaMemcpyHostToHost? I tried that, but it did not help. I am sure someone will know instantly what trivial thing I am doing wrong, and I hope they will tell me. Thanks. And maybe the answer will help others who are as naive as me.

All I want to do is change cudaMalloc to cudaMallocHost and be told what else to alter in the code snippets below. Or maybe I do not understand something and it is not possible? Thanks.

[code]
#include <cuda_runtime.h>

const int gs = 8;   // grid side length; gs*gs stays below 512

void dumb_rejecting(int g_nots[], int* dev_nots)
{
    // copy the input to the device, run the kernel, copy the result back
    cudaMemcpy(dev_nots, g_nots, gs*gs*sizeof(int), cudaMemcpyHostToDevice);

    int blockSize = gs*gs;   // always less than 512
    int nBlocks = 1;

    FunctionOnDevice<<< nBlocks, blockSize >>>(dev_nots);

    cudaMemcpy(g_nots, dev_nots, gs*gs*sizeof(int), cudaMemcpyDeviceToHost);
}

void dumb_not_allowed(int g_nots[], int* dev_nots)
{
    dumb_rejecting(g_nots, dev_nots);
}

int main(int argc, char** argv)
{
    int* g_nots = NULL;
    g_nots = new int[gs*gs];            // pageable host buffer

    int* dev_nots;
    cudaMalloc((void **) &dev_nots, sizeof(int)*gs*gs);   // device buffer

    dumb_not_allowed(g_nots, dev_nots);
    return 0;
}
[/code]

#1
Posted 05/13/2009 11:13 AM   
AFAIK you can't just use cudaMallocHost in place of cudaMalloc, because cudaMallocHost allocates "pinned memory" on the host (RAM), not on the device (GPU).

#2
Posted 05/13/2009 12:21 PM   
To expand on this: remember that the CUDA device has its own memory space, and you manipulate that with cudaMalloc and cudaFree. When you do a computation, you copy data from the CPU to the GPU (using cudaMemcpy), run your calculation on the GPU, and then copy your results back to the CPU.[1] As explained in another thread (I don't have the reference to hand), cudaMemcpy does a DMA transfer from the CPU to the GPU. For this to work, it's rather important that the memory on the CPU doesn't get paged out mid-transfer. Memory allocated using malloc() or new is pageable, and hence unsuitable. What you need is pinned memory, which can't be swapped out. Internally, cudaMemcpy first copies the designated bytes to an internal buffer of pinned memory, and then DMAs that buffer to the GPU.

This extra copy obviously incurs extra overhead, which would be nice to avoid. Enter cudaMallocHost. This allocates pinned memory on the CPU. More to the point, it's pinned memory which the CUDA driver knows about. If you do a host-to-device transfer from memory allocated via cudaMallocHost, the CUDA library knows that the source memory is pinned, and so it does the DMA directly (skipping the copy to an internal buffer). This substantially increases the effective bandwidth to the GPU (a factor of two is typical).

[1] Neglecting ZeroCopy, introduced in CUDA 2.2. However, that's something you'll probably want to neglect unless you really need it.
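
A minimal sketch of the difference, with illustrative names and size (N, d_buf, h_pageable, h_pinned are not from the code above): the cudaMemcpy call is identical in both cases, but only the cudaMallocHost buffer can be DMAed directly.

[code]
// Sketch: pageable vs. pinned host buffer feeding the same host-to-device copy.
const size_t N = 1 << 20;

float* d_buf;
cudaMalloc((void**)&d_buf, N * sizeof(float));            // device memory, as before

float* h_pageable = (float*)malloc(N * sizeof(float));    // pageable: staged through a driver-internal pinned buffer
float* h_pinned   = NULL;
cudaMallocHost((void**)&h_pinned, N * sizeof(float));     // pinned: transferred by DMA directly

cudaMemcpy(d_buf, h_pageable, N * sizeof(float), cudaMemcpyHostToDevice);  // slower path
cudaMemcpy(d_buf, h_pinned,   N * sizeof(float), cudaMemcpyHostToDevice);  // faster path

free(h_pageable);
cudaFreeHost(h_pinned);   // pinned host memory is released with cudaFreeHost, not free()
cudaFree(d_buf);
[/code]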

#3
Posted 05/13/2009 04:37 PM   
Thank you for your replies. Reading them made me understand how I had to alter the code.

I had to alter the allocation of 'g_nots' in the code above (and not 'dev_nots').

I changed

int* g_nots = NULL;
g_nots = new int[gs*gs];

TO

int* g_nots = NULL;
cudaMallocHost((void **) &g_nots, sizeof(int)*gs*gs);

The performance was almost twice as fast (as you predicted). It is still not as fast as not using CUDA at all, but at least I have some new understanding of how best to use CUDA.
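
For completeness, a sketch of how main() looks after that change; the cleanup at the end is an addition not shown in the snippets above (pinned memory from cudaMallocHost has to be released with cudaFreeHost, not delete[]).

[code]
int main(int argc, char** argv)
{
    int* g_nots = NULL;
    cudaMallocHost((void **) &g_nots, sizeof(int)*gs*gs);   // pinned host buffer instead of new[]

    int* dev_nots;
    cudaMalloc((void **) &dev_nots, sizeof(int)*gs*gs);     // device buffer, unchanged

    dumb_not_allowed(g_nots, dev_nots);

    cudaFree(dev_nots);       // release device memory
    cudaFreeHost(g_nots);     // release pinned host memory
    return 0;
}
[/code]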

#4
Posted 05/14/2009 12:01 AM   
I'm still a little confused about how to use cudaMallocHost. I want to put a chunk of pinned memory on the device so it can be accessed faster than global or shared memory. I had the same naive approach of just swapping cudaMalloc for cudaMallocHost. What am I doing wrong?
I've included a code snippet of what I'm doing.
I am running on Windows 7, a Tesla C2050, and CUDA driver 4.2.

[code]
...
host_R1 = malloc(999*sizeof(float));
host_R2 = malloc(999*sizeof(float));

...

/* Pointers for the device memory */
float *device_a;
double *device_b;
float *device_R1, *device_R2; // Complex

/* Allocate memory on the device */
cudaMalloc( (void **) &device_a, sizeof(float)*999);
cudaMalloc( (void **) &device_b, sizeof(double)*999);

cudaMalloc( (void **) &device_R1, sizeof(double)*512);
cudaMalloc( (void **) &device_R2, sizeof(double)*512);

/* Allocate memory on the device as pinned */
// cudaMallocHost( (void **) &device_a, sizeof(float)*999);
// cudaMallocHost( (void **) &device_b, sizeof(double)*999);
//
// cudaMallocHost( (void **) &device_R1, sizeof(double)*512);
// cudaMallocHost( (void **) &device_R2, sizeof(double)*512);


/* Copy inputs to the device */
cudaMemcpy(device_a, host_a, sizeof(float)*999, cudaMemcpyHostToDevice);
cudaMemcpy(device_b, host_b, sizeof(double)*999, cudaMemcpyHostToDevice);

launchKernel<<<1,512>>>(device_a, device_b, ......, device_R1, device_R2);

/* Get the results back from the device */
cudaMemcpy(host_R1, device_R1, sizeof(float)*512,cudaMemcpyDeviceToHost);
cudaMemcpy(host_R2, device_R2, sizeof(float)*512,cudaMemcpyDeviceToHost);


/* Free all CUDA allocated memory (cudaFree takes the device pointer itself, not its address) */
cudaFree(device_a);
cudaFree(device_b);

cudaFree(device_R1);
cudaFree(device_R2);

return;


[/code]

Physics is like sex,

Sure it has practical applications, but thats not why we do it.

- R. Feynman.

#5
Posted 04/26/2012 06:46 AM   
There is no pinned memory on the device, pinned memory is always located on the host. Or, to put it differently, device memory is always pinned because there is no mechanism to swap out device memory.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
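
As an aside, a minimal sketch of what checking return codes can look like; the CHECK_CUDA macro name is an illustrative choice, not part of the CUDA API.

[code]
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so failures are reported with file and line.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Example usage with the calls from the snippets above:
// CHECK_CUDA(cudaMalloc((void **) &device_a, sizeof(float)*999));
// CHECK_CUDA(cudaMemcpy(device_a, host_a, sizeof(float)*999, cudaMemcpyHostToDevice));
[/code]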

#6
Posted 04/26/2012 09:33 AM   
So what is it that I am thinking of, then? I was under the impression it would be possible to transfer a chunk of data into pinned/page-locked memory on the device so it could be accessed faster than global or shared memory, or is this wrong?
If so, aside from concurrent access (copy, compute), what can I do to improve performance in my application? I'm currently using shared memory in the kernel to increase performance, but this is still insufficient.
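
One note that may help with the copy/compute overlap mentioned above: cudaMemcpyAsync only overlaps with kernel execution when the host buffer is page-locked, i.e. allocated with cudaMallocHost. A rough sketch, assuming an illustrative kernel and a simple two-chunk split (myKernel, N and the split are not from the code above):

[code]
// Overlap host-to-device copies with kernel execution using two streams.
// Requires pinned host memory for the async copies to actually overlap.
__global__ void myKernel(float* p) { p[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

int main()
{
    const size_t N = 1 << 20;
    float *h_in, *d_in;
    cudaMallocHost((void**)&h_in, N * sizeof(float));   // pinned host buffer
    cudaMalloc((void**)&d_in, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    size_t half = N / 2;
    for (int i = 0; i < 2; ++i) {
        size_t off = i * half;
        cudaMemcpyAsync(d_in + off, h_in + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        // this launch can run while the other stream's copy is still in flight
        myKernel<<<half / 256, 256, 0, s[i]>>>(d_in + off);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
[/code]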

Physics is like sex,

Sure it has practical applications, but thats not why we do it.

- R. Feynman.

#7
Posted 04/26/2012 11:43 PM   