Shared memory passed to a function: does shared memory lose speed when referenced through a stack pointer?
If I pass a pointer to shared memory to a function, then does the speed advantage of shared memory disappear if I dereference it with a regular function argument pointer?

#1
Posted 04/22/2012 01:36 AM   
No, but there are some pitfalls with compute capability 1.x:

  • Since only four offset registers are available in hardware, using pointers to shared memory may result in additional address arithmetic and swapping of address registers. Disassemble your device code with cuobjdump -sass to see if this is the case.
  • You need to make sure the compiler can deduce that your pointers point to shared memory. This can be a bit tricky, since there is no construct to tell the compiler so.

On compute capability 2.x and 3.0, these should be non-issues thanks to generic addressing. I'm not entirely sure, though, about the use of offset registers and the number of address arithmetic instructions generated, as I have not yet analyzed as much 2.x code as 1.x code.
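
To make the question concrete, here is a minimal sketch of the pattern being discussed: a `__shared__` array handed to a `__device__` function through an ordinary pointer argument. The names `blockSum`, `sumKernel`, `runDemo` and `kTile` are made up for illustration and do not come from the original post. The sketch only shows the pattern; it makes no performance claim, and whether the compiler keeps the accesses as shared-memory accesses is something to verify in the disassembly.

```cuda
#include <cuda_runtime.h>

constexpr int kTile = 128;  // hypothetical block size for this sketch

// Hypothetical helper: takes a plain pointer argument. If the call is
// inlined, the compiler can usually see that `buf` refers to shared memory.
__device__ float blockSum(const float* buf, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; ++i)
        total += buf[i];
    return total;
}

__global__ void sumKernel(const float* in, float* out)
{
    __shared__ float tile[kTile];

    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();  // outside any conditional, so every thread reaches it

    if (threadIdx.x == 0)
        out[blockIdx.x] = blockSum(tile, blockDim.x);  // shared pointer passed as argument
}

// Host-side driver: sums kTile ones in a single block.
// Returns the sum, or -1.0f if any CUDA call fails.
float runDemo()
{
    float hostIn[kTile];
    for (int i = 0; i < kTile; ++i)
        hostIn[i] = 1.0f;

    float* devIn = nullptr;
    float* devOut = nullptr;
    float result = -1.0f;

    bool ok = cudaMalloc(&devIn, kTile * sizeof(float)) == cudaSuccess
           && cudaMalloc(&devOut, sizeof(float)) == cudaSuccess
           && cudaMemcpy(devIn, hostIn, kTile * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sumKernel<<<1, kTile>>>(devIn, devOut);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&result, devOut, sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(devIn);   // cudaFree on a null pointer is a no-op
    cudaFree(devOut);
    return ok ? result : -1.0f;
}
```

Calling `runDemo()` from a `main` should return 128.0f on a working device. To check what the compiler generated, one option is to build a cubin with `nvcc -cubin -arch=<your target> demo.cu -o demo.cubin` (the file name is just an example) and run `cuobjdump -sass demo.cubin`. On 2.x and later, shared-memory loads typically show up as `LDS`, while a plain `LD` suggests the compiler fell back to generic addressing.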

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/22/2012 07:31 AM   