__constant__ use
Just starting out in CUDA... lots to learn, but having fun!

Two questions:

Part of my code has the host provide a 1K or so lookup table that's copied to the device and used by the threads. It's constant.. but I'm not sure how to tell the device that it IS constant, or if it'd even matter.

CUDA provides the __constant__ keyword, but none of the example projects use it even once, and it's only mentioned in the programming guide in two sentences.

I thought it might be a type modifier on the kernel like:
[code]
__global__ void mykernel(__constant__ unsigned int *LookupTable,
    unsigned int tablesize,  unsigned int *results)
[/code]
but that won't even compile.

My code works fine with NO __constant__ anywhere but obviously somehow CUDA can do some optimizations if I can just tell it that my input argument array is read-only..

Any ideas?


#2:
In my very small kernel, I tried making a small 4-entry lookup table for a thread to use:

[code]
   __global__ void mykernel(unsigned int *results)  {
unsigned int N=threadIdx.x
unsigned int table[4];
table[0]=N;
table[1]=N*N;
table[2]=table[1]*N;
table[3]=table[2]*N;

.... use table as a lookup, probably many times in loops  
power=f(); // 0,1,2, or 3
x=x*table[power]; // multiply by a power of N

[/code]

This code works, but it is incredibly painfully slow! Using if statements and explicit redundant multiplies is about 10X faster.. even though we're now doing more work AND adding branching!

There's very little register use so there should be plenty of registers free.. but
I suspect what's happening is that the compiler can't make an indexed array with register memory? So it's using device memory and that drops speeds to Glacial Rates?


Thanks for any suggestions!
Just starting out in CUDA... lots to learn, but having fun!



Two questions:



Part of my code has the host provide a 1K or so lookup table that's copied to the device and used by the threads. It's constant.. but I'm not sure how to tell the device that it IS constant, or if it'd even matter.



CUDA provides the __constant__ keyword, but none of the example projects use it even once, and it's only mentioned in the programming guide in two sentences.



I thought it might be a type modifier on the kernel like:



__global__ void mykernel(__constant__ unsigned int *LookupTable,

    unsigned int tablesize,  unsigned int *results)


but that won't even compile.



My code works fine with NO __constant__ anywhere but obviously somehow CUDA can do some optimizations if I can just tell it that my input argument array is read-only..



Any ideas?





#2:

In my very small kernel, I tried making a small 4-entry lookup table for a thread to use:





   __global__ void mykernel(unsigned int *results)  {

unsigned int N=threadIdx.x

unsigned int table[4];

table[0]=N;

table[1]=N*N;

table[2]=table[1]*N;

table[3]=table[2]*N;



.... use table as a lookup, probably many times in loops  

power=f(); // 0,1,2, or 3

x=x*table[power]; // multiply by a power of N






This code works, but it is incredibly painfully slow! Using if statements and explicit redundant multiplies is about 10X faster.. even though we're now doing more work AND adding branching!



There's very little register use so there should be plenty of registers free.. but

I suspect what's happening is that the compiler can't make an indexed array with register memory? So it's using device memory and that drops speeds to Glacial Rates?





Thanks for any suggestions!

#1
Posted 06/13/2008 12:56 PM   
The GPU has so much computing horsepower compared to memory bandwidth, look up tables are often not the best for performance. Branching rarely impacts performance signifcantly in practice.

As mentioned in the manual __constant__ variables are implicit static and must be declared at file scope. The correct syntax for a lookup table is:
[code]
__constant__ unsigned int c_table[1000];

__global__ void kernel(...)
   {
   // use c_table here
   }

void setupTable
   {
   unsigned int d_table[1000];
   // generate data
   cudaMemcpyToSymbol(...); // copy data to __constant__ memory
   }
[/code]

The reason your per thread lookup table is slow is because a per thread array will use local memory and not fast registers.
The GPU has so much computing horsepower compared to memory bandwidth, look up tables are often not the best for performance. Branching rarely impacts performance signifcantly in practice.



As mentioned in the manual __constant__ variables are implicit static and must be declared at file scope. The correct syntax for a lookup table is:



__constant__ unsigned int c_table[1000];



__global__ void kernel(...)

   {

   // use c_table here

   }



void setupTable

   {

   unsigned int d_table[1000];

   // generate data

   cudaMemcpyToSymbol(...); // copy data to __constant__ memory

   }




The reason your per thread lookup table is slow is because a per thread array will use local memory and not fast registers.

#2
Posted 06/13/2008 01:38 PM   
[quote name='MisterAnderson42' date='Jun 13 2008, 05:38 AM']The GPU has so much computing horsepower compared to memory bandwidth, look up tables are often not the best for performance. Branching rarely impacts performance signifcantly in practice.

As mentioned in the manual __constant__ variables are implicit static and must be declared at file scope. The correct syntax for a lookup table is:
[/quote]

Thanks (yet again) for quick and useful answers.
The __constant__ example is exactly what I was looking for!



[quote]The reason your per thread lookup table is slow is because a per thread array will use local memory and not fast registers.
[/quote]
Ouch. This is unpleasant, since the small (4 or 8 entry) "local cache" trick is very useful to me, each local precompute is actually about 1000 math and branch ops, not the simple multiply I showed as an example. The workaround I guess is to just use 4 explicit registers with 3 branch tests for every lookup.

Thanks again very much for the help!
[quote name='MisterAnderson42' date='Jun 13 2008, 05:38 AM']The GPU has so much computing horsepower compared to memory bandwidth, look up tables are often not the best for performance. Branching rarely impacts performance signifcantly in practice.



As mentioned in the manual __constant__ variables are implicit static and must be declared at file scope. The correct syntax for a lookup table is:





Thanks (yet again) for quick and useful answers.

The __constant__ example is exactly what I was looking for!







The reason your per thread lookup table is slow is because a per thread array will use local memory and not fast registers.



Ouch. This is unpleasant, since the small (4 or 8 entry) "local cache" trick is very useful to me, each local precompute is actually about 1000 math and branch ops, not the simple multiply I showed as an example. The workaround I guess is to just use 4 explicit registers with 3 branch tests for every lookup.



Thanks again very much for the help!

#3
Posted 06/14/2008 12:48 AM   
If you aren't using shared memory for anything else in that kernel, you can easily stuff a small local LUT into it. It would probably always be work bencmarking vs a version of code that used branches as shared memory use can change occupancy and potentially drastically change performance.
If you aren't using shared memory for anything else in that kernel, you can easily stuff a small local LUT into it. It would probably always be work bencmarking vs a version of code that used branches as shared memory use can change occupancy and potentially drastically change performance.

#4
Posted 06/14/2008 12:56 AM   
[quote name='MisterAnderson42' date='Jun 13 2008, 04:56 PM']If you aren't using shared memory for anything else in that kernel, you can easily stuff a small local LUT into it. It would probably always be work bencmarking vs a version of code that used branches as shared memory use can change occupancy and potentially drastically change performance.
[right][snapback]393297[/snapback][/right]
[/quote]

I'm not using any shared memory at all, and just a handful of registers, so it's very feasible... my per-thread cache only needs to be 4 or 8 entries long.

Is there any speed difference at all between use of local memory and shared memory?
I thought they came from the same "pool" and have the same speed penalties for accessing. I should just try it to see I guess.
[quote name='MisterAnderson42' date='Jun 13 2008, 04:56 PM']If you aren't using shared memory for anything else in that kernel, you can easily stuff a small local LUT into it. It would probably always be work bencmarking vs a version of code that used branches as shared memory use can change occupancy and potentially drastically change performance.

[snapback]393297[/snapback]






I'm not using any shared memory at all, and just a handful of registers, so it's very feasible... my per-thread cache only needs to be 4 or 8 entries long.



Is there any speed difference at all between use of local memory and shared memory?

I thought they came from the same "pool" and have the same speed penalties for accessing. I should just try it to see I guess.

#5
Posted 06/14/2008 01:08 AM   
local memory actually sits in device memory (the same place as global), so it is subject to the 70 GiB/s bandwidth limit (8800 GTX), assuming the acceses are coalesced (which isn't likely with many threads accessing different elements in the LUT).

Shared memory is on chip and can be accessed in one or two clocks. I forget. It's basically as fast as a register, though a cubin expert might correct me on this. Performance penalties for multiple threads accessing different elements in the LUT will be minimal.
local memory actually sits in device memory (the same place as global), so it is subject to the 70 GiB/s bandwidth limit (8800 GTX), assuming the acceses are coalesced (which isn't likely with many threads accessing different elements in the LUT).



Shared memory is on chip and can be accessed in one or two clocks. I forget. It's basically as fast as a register, though a cubin expert might correct me on this. Performance penalties for multiple threads accessing different elements in the LUT will be minimal.

#6
Posted 06/14/2008 01:14 AM   
[quote name='MisterAnderson42' date='Jun 13 2008, 05:14 PM']local memory actually sits in device memory (the same place as global), so it is subject to the 70 GiB/s bandwidth limit (8800 GTX), assuming the acceses are coalesced (which isn't likely with many threads accessing different elements in the LUT).

Shared memory is on chip and can be accessed in one or two clocks. I forget. It's basically as fast as a register, though a cubin expert might correct me on this. Performance penalties for multiple threads accessing different elements in the LUT will be minimal.
[right][snapback]393302[/snapback][/right]
[/quote]

Wow, this gets complex.. kind of a fun puzzle!
It's interesting that "Local" memory would be slower than shared memory..
even in the memory heirarchy diagram (figure 2.2 in the programming guide) it feels like local memory is "closer" to the thread.. but of course the docs don't say so explicitly so that was just an assumption on my part.

Thanks YET AGAIN for your clarification, I have lots to experiment with now!
[quote name='MisterAnderson42' date='Jun 13 2008, 05:14 PM']local memory actually sits in device memory (the same place as global), so it is subject to the 70 GiB/s bandwidth limit (8800 GTX), assuming the acceses are coalesced (which isn't likely with many threads accessing different elements in the LUT).



Shared memory is on chip and can be accessed in one or two clocks. I forget. It's basically as fast as a register, though a cubin expert might correct me on this. Performance penalties for multiple threads accessing different elements in the LUT will be minimal.

[snapback]393302[/snapback]






Wow, this gets complex.. kind of a fun puzzle!

It's interesting that "Local" memory would be slower than shared memory..

even in the memory heirarchy diagram (figure 2.2 in the programming guide) it feels like local memory is "closer" to the thread.. but of course the docs don't say so explicitly so that was just an assumption on my part.



Thanks YET AGAIN for your clarification, I have lots to experiment with now!

#7
Posted 06/14/2008 01:54 AM   
Scroll To Top