Just starting out in CUDA… lots to learn, but having fun!
Two questions:
Part of my code has the host provide a 1K or so lookup table that’s copied to the device and used by the threads. It’s constant… but I’m not sure how to tell the device that it IS constant, or if it’d even matter.
CUDA provides the __constant__ keyword, but none of the example projects use it even once, and the programming guide only mentions it in a couple of sentences.
I thought it might be a type modifier on the kernel argument, like:

    __global__ void mykernel(__constant__ unsigned int *LookupTable,
                             unsigned int tablesize,
                             unsigned int *results)

but that won't even compile.
My code works fine with NO __constant__ anywhere, but presumably CUDA could do some optimizations if I could just tell it that my input argument array is read-only…
Any ideas?
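In case it helps frame the question, here's a minimal sketch of the pattern I *think* is intended (names and sizes are my own placeholders): declare the table as a file-scope __constant__ array rather than a kernel parameter, and fill it from the host with cudaMemcpyToSymbol before launching.

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 256  // hypothetical size; a ~1K table fits easily in the 64 KB constant space

// __constant__ arrays live at file scope, not in the kernel's parameter list.
__constant__ unsigned int LookupTable[TABLE_SIZE];

__global__ void mykernel(unsigned int *results)
{
    unsigned int n = threadIdx.x;
    // Reads from LookupTable go through the constant cache.
    results[n] = LookupTable[n % TABLE_SIZE];
}

void run(const unsigned int *hostTable, unsigned int *d_results, int nThreads)
{
    // Copy the host table into constant memory once, before the launch.
    cudaMemcpyToSymbol(LookupTable, hostTable,
                       TABLE_SIZE * sizeof(unsigned int));
    mykernel<<<1, nThreads>>>(d_results);
}
```

My understanding is that the constant cache is fastest when all threads in a warp read the same address (the read is broadcast); divergent indices get serialized — but I'd welcome a correction on that.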
#2:
In my very small kernel, I tried making a small 4-entry lookup table for a thread to use:
    __global__ void mykernel(unsigned int *results) {
        unsigned int N = threadIdx.x;
        unsigned int table[4];
        table[0] = N;
        table[1] = N * N;
        table[2] = table[1] * N;
        table[3] = table[2] * N;

        // ... use table as a lookup, probably many times in loops
        power = f();          // 0, 1, 2, or 3
        x = x * table[power]; // multiply by a power of N
This code works, but it is incredibly, painfully slow! Using if statements and explicit redundant multiplies is about 10X faster… even though that does more work AND adds branching!
Register use is low, so there should be plenty of registers free… but I suspect what's happening is that the compiler can't make an indexed array out of registers, so the table is placed in device memory, and that drops speed to Glacial Rates?
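For reference, the branchy version I compared against looks roughly like this (a sketch; f() is a stand-in for however the exponent is actually produced):

```cuda
__global__ void mykernel_branchy(unsigned int *results)
{
    unsigned int N = threadIdx.x;
    unsigned int x = 1;          // hypothetical running product
    unsigned int power = N & 3;  // stand-in for f(): 0, 1, 2, or 3

    // Explicit conditional multiplies instead of an indexed local array:
    // more arithmetic, but the value never needs to be indexed, so the
    // compiler can keep everything in registers.
    unsigned int factor = 1;             // N^0
    if (power >= 1) factor  = N;         // N^1
    if (power >= 2) factor *= N;         // N^2
    if (power >= 3) factor *= N;         // N^3
    x *= factor;

    results[N] = x;
}
```

The cascaded ifs also keep the branches cheap since every thread evaluates the same short sequence of predicated multiplies.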
Thanks for any suggestions!