Question about optimal cuRAND use

While profiling a Monte Carlo simulation I have noticed that the initial setup of the random number states takes much longer than expected.

Initially I seed with a 64-bit unsigned integer and call this kernel, which fills in 1e6 states:

__global__ void setup_rand_states(curandState *rngState,
                                  const int num_to_do,
                                  const unsigned long long seed)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < num_to_do) {
        curand_init(seed, tid, 0, &rngState[tid]);
    }
}

extern "C" void setup_rand_states_wrap(curandState *rngState,
                                       const int num_to_do,
                                       const unsigned long long seed)
{
    dim3 grid((num_to_do + THREADS_BIG - 1) / THREADS_BIG, 1, 1);

    setup_rand_states<<<grid, THREADS_BIG>>>(rngState, num_to_do, seed);
    cudaError_t err = cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    }
}

That step alone takes 3 seconds on a GTX Titan X, which can be longer than the rest of the simulation.

Rather than setting up 1e6 states and using one per thread, could I instead set one state per thread block and store it in shared memory?

So rather than doing this:

curandState localState = rngStates[tid];//set up random number generator for this specific thread

and using per thread like this:

theta=TWO_PI*curand_uniform(&localState);

Instead I move to something like this:

__shared__ curandState localState;

if (threadIdx.x == 0) {
    localState = rngStates[blockIdx.x];
}
__syncthreads();

Then have all threads in a block use that state for their uniform random number generation.

Is there a fundamental problem with that approach?

hmm, I already see problems with that approach…

But in general I find that the frequent calls to curand_uniform() are the bottleneck in the application. Any tricks to speed that up?

Are you sure the way you initialize for each thread is as intended by the CURAND design? Have you cross-checked with the CURAND examples?

I do not have experience with CURAND, but using a different seed per thread sounds like a bad idea, or at least suboptimal. In general, for parallel random number generation, one desires independence between the sequences for each thread, i.e., that they are not correlated. The typical approach to that is to partition a long sequence into non-overlapping portions, one portion per thread. The sequences may be interleaved (leapfrogging) or contiguous. Use of different seeds does not guarantee lack of overlap between sequences for different threads.

As for performance, you may want to try one of the cheaper PRNGs, such as Philox. From what I understand, Philox belongs to a class of algorithms that use an encryption algorithm to encode a linear sequence of integers into pseudo-random numbers. Twenty years ago, people would use DES with reduced round count for this. Not sure what Philox uses, AES maybe? The advantage of this approach is that the advancing state boils down to integer additions, and should be very fast.

With many other PRNGs, advancing the state to a particular point in the sequence involves matrix exponentiation, which can be relatively costly.

I do use the same seed for the curand_init() kernel, similar to this Nvidia CURAND example:

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // Each thread gets the same seed, a different sequence number
    curand_init(1234, id, 0, &state[id]);
}

In this case they use the hardcoded value ‘1234’, while my version takes in a single unsigned 64-bit integer from the host and uses that instead.

In the end, the number of random numbers used by each thread can vary quite a bit, so they cannot be completely pre-generated and saved in global memory. A given thread may need to generate anywhere from 10 to 20,000 uniform random numbers over the course of the simulation.

The current approach generates 1e6 curandStates and saves them in global memory. Each state is loaded into thread-local register memory by the simulation kernel, and each participating thread invokes curand_uniform() with that ‘unique’ state to generate all the values required by that thread.

For 1e6 items being simulated, the ‘setup state’ kernel takes 3 seconds, while the entire simulation takes about 2 seconds, including a great deal of global memory operations.

You don’t need to know the exact number of random numbers generated to separate the subsequences. For a long sequence, you can simply partition it into N subsequences, one for each thread. The subsequences would be spaced 2**64 or so apart, probably more for the complex PRNGs with massive state.

Your sequences of up to 20,000 numbers per thread are very, very short. Each number may only require 20 or so instructions, but skipping to the starting point of the subsequences may require matrix exponentiation followed by matrix-vector multiply, requiring potentially thousands of instructions. So state skipping starts to dominate the total PRNG run time.

As I said, I would suggest trying the Philox generator due to the cheap state handling and overall high performance.
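For reference, the Philox switch in the device API is roughly the following (a sketch, not tested here): as far as I can tell, only the state type changes, and curand_init()/curand_uniform() keep the same call signatures.

```cuda
#include <curand_kernel.h>

// Same setup pattern as before, but with the Philox state type.
// curand_init(seed, subsequence, offset, state) keeps the same signature.
__global__ void setup_rand_states_philox(curandStatePhilox4_32_10_t *rngState,
                                         const int num_to_do,
                                         const unsigned long long seed)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < num_to_do) {
        curand_init(seed, tid, 0, &rngState[tid]);
    }
}

// In the simulation kernel the per-thread usage is unchanged:
//   curandStatePhilox4_32_10_t localState = rngState[tid];
//   float theta = TWO_PI * curand_uniform(&localState);
// Philox can also return four floats per call via curand_uniform4(),
// which may help if the consuming code can use them in batches.
```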

If you want to try something hokey but fast, try this (experiment only, do not use for production code). Use George Marsaglia’s KISS generator in each thread:

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

This generator uses only 128 bits of state, comprising four 32-bit ‘unsigned int’ variables z, w, jsr, and jcong. One of the constituent parts is an LFSR, which can be initialized differently on a per-thread basis, except that the starting state cannot be 0. To minimize (not eliminate!) the chance of getting overlapping sequences, we can use a mix function to generate the LFSR starting state ‘jsr’ from the thread ID and the block ID. A suitable mixing function ‘mix()’ with very good avalanching properties can be found here: http://www.burtleburtle.net/bob/hash/doobs.html

#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12);  \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5); \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

unsigned int z=362436069, w=521288629, jcong=123456789; // initialize MWC and congruential generators
unsigned int block_ID = [...];
unsigned int thread_ID = [...];
unsigned int jsr = 0x96f563ae;  // magic number of your choice
mix (block_ID, thread_ID, jsr);
jsr |= 1; // must avoid initial state of 0

Wow, those are great resources, thanks!

I do see that the Philox generator is available via cuRAND, so will give that a go on your recommendation.

Switching to the Philox generator dramatically reduced the first setup call, but (very modestly) increased the overall simulation time.

Before, with the default curandState type:

Time rand gen= 3.006000 

Time MC kernel = 1.763000 

Time debug kernel = 0.000000 

Tot detected= 719594, Tot timed out= 143672

With the curandStatePhilox4_32_10_t type:

Time rand gen= 0.001000 

Time MC kernel = 1.821000 

Time debug kernel = 0.000000 

Tot detected= 726055, Tot timed out= 138977

I think CURAND defaults to the XORWOW generator? If so, that would explain your observations, as XORWOW is the fastest of the generators offered by CURAND, but it does have fairly expensive skip-ahead. Best I know Philox is the second-fastest generator, followed by MRG32k3a. I believe in terms of skip-ahead, the fastest is Philox, followed by MRG32k3a, followed by XORWOW. All this is from memory, so you may want to check this in the context of your particular use case. The reason I suggested Philox is because it seemed to me it would offer the minimum combined time for startup (which involves skip-ahead) and generation.

I gave you the somewhat hokey option of a poorly parallelized version (in terms of quality) of a well-regarded PRNG from my serial computing days so you could get yet another data point for the speed of light of uniform random number generation. Honestly, I do not know how the parallelized KISS generator fares against any of the CURAND generators.

I looked at my DES-based PRNG code from about 1992 and that looks like it would easily run at only 1/10 the speed of KISS. It does have the advantage that by choosing a different key for each thread it would provide a non-correlated stream of random numbers for each thread. From that perspective, Philox, which is also derived from an encryption algorithm, may offer a fairly “optimal” combination of speed and robustness for applications that do not need massive amounts of random numbers.