__constant__ and __device__ memory access
Hi all,

I'm studying CUDA for my thesis and I have a little problem with the way to access __constant__ and __device__ memory from the host.

Basically, my problem is to well understand what is the difference between these two functions:

[list]
[*][code]cudaError_t cudaMemcpyFromSymbol(void *dst, const char *symbol, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyDeviceToHost)[/code]
[*][code]cudaError_t cudaGetSymbolAddress(void **devPtr, const char *symbol)[/code]
[/list]

Suppose we have some variable in constant or global memory, declared with the __constant__ and/or __device__ qualifiers. Reading the programming guide, we find that these variables are [quote]accessible [...] from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime API and cuModuleGetGlobal() for the driver API).[/quote]

If I understand correctly, since these variables live on the device, the host can't use them directly; it has to transfer the data from device memory to host memory.
The first function does exactly what I expect: it copies count bytes from symbol to dst. If symbol is a __device__ or __constant__ variable and dst is a pointer into host memory, dst ends up holding a copy of symbol, and I can read and work with the data.
The second function seems to do the same thing while hiding the copy itself: I only have to specify the symbol and a pointer in host memory, and everything works like the previous function.

So why do we have two distinct functions that do the same thing? When should I use the first, and when the second?

Thanks a lot!

#1
Posted 04/10/2012 10:13 AM   
cudaGetSymbolAddress() just gives you the address of the variable, it does not copy any data. cudaMemcpyFromSymbol() corresponds to cudaGetSymbolAddress() followed by cudaMemcpy() from the address returned.
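Roughly, in code (using a hypothetical [code]__device__ float d_val[/code]; the string symbol name matches the [code]const char *symbol[/code] signatures quoted above):

[code]
__device__ float d_val;   // hypothetical device-side variable

float h_val;

// One call: copy the symbol's contents straight into host memory.
cudaMemcpyFromSymbol(&h_val, "d_val", sizeof(float));

// ...is roughly equivalent to: look up the device address, then copy from it.
float *d_ptr;
cudaGetSymbolAddress((void **)&d_ptr, "d_val");
cudaMemcpy(&h_val, d_ptr, sizeof(float), cudaMemcpyDeviceToHost);
[/code]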

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/10/2012 11:02 AM   
[quote name='tera' date='10 April 2012 - 12:02 PM' timestamp='1334055734' post='1394189']
cudaGetSymbolAddress() just gives you the address of the variable, it does not copy any data. cudaMemcpyFromSymbol() corresponds to cudaGetSymbolAddress() followed by cudaMemcpy() from the address returned.
[/quote]

Ok, so the address returned is in the device address space, right? And what can I do with it? I can't read the content of the variable through this address, can I?
Can someone give me an example of how to use cudaGetSymbolAddress?
Thanks.

#3
Posted 04/10/2012 11:58 AM   
[quote name='canemacchina' date='10 April 2012 - 06:58 AM' timestamp='1334059083' post='1394210']
Ok, so the address returned is in the device address space, right? And what can I do with it? I can't read the content of the variable through this address, can I?
Can someone give me an example of how to use cudaGetSymbolAddress?
Thanks.
[/quote]
You cannot dereference a device pointer on the host. But you can use cudaMemcpy to copy memory to/from the pointer.
[code]
__device__ float g_d_data;
.........
float h_data = 1.0f;   // host-side value to upload
float *d_data;
cudaGetSymbolAddress((void **)&d_data, "g_d_data");
cudaMemcpy(d_data, &h_data, sizeof(float), cudaMemcpyDefault);
[/code]

Generally, for device data it is preferable to use cudaMalloc to allocate memory dynamically. That way you control when memory is allocated and freed, and your code is more reusable because it isn't tied to specific global variables. There's no way to dynamically allocate __constant__ memory, though. However, one can argue that with the L1/L2 cache hierarchy, __constant__ memory is rarely needed anymore.
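For comparison, a minimal sketch of the cudaMalloc style (the buffer size [code]n[/code], host array [code]h_data[/code], and [code]my_kernel[/code] are hypothetical):

[code]
float *d_data = NULL;
cudaMalloc((void **)&d_data, n * sizeof(float));            // allocate when you need it
cudaMemcpy(d_data, h_data, n * sizeof(float),
           cudaMemcpyHostToDevice);                         // upload host data
my_kernel<<<blocks, threads>>>(d_data, n);                  // pointer passed explicitly
cudaFree(d_data);                                           // free when you're done
[/code]

The pointer travels through your code as an ordinary argument, so nothing is hard-wired to a particular global symbol.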

#4
Posted 04/10/2012 12:27 PM   
[quote name='DrAnderson42' date='10 April 2012 - 01:27 PM' timestamp='1334060863' post='1394228']
You cannot dereference a device pointer on the host. But you can use cudaMemcpy to copy memory to/from the pointer.
[code]
__device__ float g_d_data;
.........
float h_data = 1.0f;   // host-side value to upload
float *d_data;
cudaGetSymbolAddress((void **)&d_data, "g_d_data");
cudaMemcpy(d_data, &h_data, sizeof(float), cudaMemcpyDefault);
[/code]

Generally, for device data it is preferable to use cudaMalloc to allocate memory dynamically. That way you control when memory is allocated and freed, and your code is more reusable because it isn't tied to specific global variables. There's no way to dynamically allocate __constant__ memory, though. However, one can argue that with the L1/L2 cache hierarchy, __constant__ memory is rarely needed anymore.
[/quote]

Ok, thanks. But I still can't figure out a scenario where using cudaGetSymbolAddress is the better choice...

#5
Posted 04/10/2012 12:38 PM   