While profiling a Monte Carlo simulation I noticed that the initial setup of the random number states takes much longer than expected.
I seed with a 64-bit unsigned integer and call this kernel, which fills in 1e6 states:
__global__ void setup_rand_states(
    curandState *rngState,
    const int num_to_do,
    const unsigned long long seed){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < num_to_do){
        // same seed, one subsequence per thread
        curand_init(seed, tid, 0, &rngState[tid]);
    }
}
extern "C" void setup_rand_states_wrap(
    curandState *rngState,
    const int num_to_do,
    const unsigned long long seed){
    dim3 grid((num_to_do + THREADS_BIG - 1) / THREADS_BIG, 1, 1);
    setup_rand_states<<<grid, THREADS_BIG>>>(rngState, num_to_do, seed);
    // cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() is the current API
    cudaError_t err = cudaDeviceSynchronize();
    if(err != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    }
}
That step alone takes 3 seconds on a GTX Titan X, which can be longer than the rest of the simulation.
Rather than initializing 1e6 states and using one per thread, could I instead initialize one state per thread block and keep it in shared memory?
So rather than doing this:
curandState localState = rngStates[tid]; // per-thread copy of the generator state
and drawing from it per thread like this:
theta = TWO_PI * curand_uniform(&localState);
Instead I move to something like this:
__shared__ curandState localState;
if(threadIdx.x == 0){
    localState = rngStates[blockIdx.x]; // one state per block
}
__syncthreads();
Then have all threads in a block use that state for their uniform random number generation.
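Concretely, the per-block variant I have in mind would look something like this. This is only a sketch of the pattern I am asking about, not code I know to be correct; the output buffer `theta` and the kernel name are placeholders, `TWO_PI` is the constant from my existing code, and I am aware that every thread here would be reading and mutating the same shared state:

```
// Hypothetical sketch of the per-block shared-state idea.
// rngStates now holds one curandState per block, not per thread.
__global__ void draw_angles(curandState *rngStates,
                            float *theta,
                            const int num_to_do){
    __shared__ curandState localState;
    if(threadIdx.x == 0){
        localState = rngStates[blockIdx.x]; // load the block's state
    }
    __syncthreads();

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < num_to_do){
        // all threads in the block draw from the same shared state --
        // this concurrent access is exactly what I am unsure about
        theta[tid] = TWO_PI * curand_uniform(&localState);
    }
}
```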
Is there a fundamental problem with that approach?