Investigating RNGs for large numbers of parallel streams?

danuk · January 19, 2011, 12:46pm

I’m currently investigating peusdo random number generators for use with particularly large numbers of parallel streams.

The Mersenne Twister appears to be the best available and the adaptation of Matsumoto’s MTGP in the CUDA SDK examples seems to work well for 4096 different parallel streams.

However, I need one that can feasibly generate up to 10 million streams of parallel numbers or more and I’m concerned about the memory usage required to store all of the parameters for each stream. I don’t need the most accurate generator though and I don’t even necessarily need the streams to use different characteristic polynomials as produced by the dynamic creator for MTGP. I have been considering the Park-Miller PRNG, as the memory requirements would be much less, however the relatively short period compared to the Mersenne Twister does put me off (and I understand the base implementation of the MT to be faster).

I’d be interested in any discussion comparing these two algorithms or any others I’ve yet to consider.

My main requirements are a very high number of independent streams, repeatability (based on a single initial seed spawning all other streams), low memory usage (ie one integer per stream at most). Finally a high period would of course be desirable. Any suggestions?

DrAnderson42 · January 19, 2011, 1:16pm

In HOOMD, we use Saru for this (developed by SPWorley on these forums). It stores 2 integers per thread and has a period of 2^(63.77), and comes with very robust hashes for seeding all parallel streams from a single global seed. We’ve got a paper submitted on how we effectively use the hashes, in particular, for a number of MD simulation techniques. If you are interested, I could send you a preprint.

seibert · January 19, 2011, 1:20pm

You should take a look at the CURAND library that is now included with the CUDA 3.2 toolkit:

http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CURAND_Library.pdf

They use the XORWOW algorithm, which has a period greater than 2**190. That’s not quite as big as the Mersenne Twister, but XORWOW does pass many tests of RNG quality and has a small enough state to make per-thread generators practical.

The nice thing about the CURAND API is that it is well-designed for the parallel generation use case. You call the curand_init() function with a seed, a sequence number, and an offset. Every thread in a kernel should get the same seed, but then a unique thread ID number (like blockIdx.x * blockDim.x + threadIdx.x) should be passed as the sequence number. Then you are guaranteed to have statistically independent sequences for every thread (at least for 2**67 calls, see page 14 of the manual).

danuk · January 19, 2011, 3:46pm

Thanks guys. It seems the CURAND library should do what I’m after suitably. I would like to be able to use a CPU implementation on the host before moving this over to the GPU without introducing a dependency to the cuda runtime, but it seems this isn’t possible.

Saru sounds interesting too, let me have a go with the CURAND library first and see how that pans out.

SPWorley · January 19, 2011, 6:48pm

Yes, efficient GPU PRNG generation is research I’ve been working on in the background for a few years now, and in fact I’m in the middle of developing a new generation of PRNGs specifically for CUDA.

It’s a balance of seeding quality, strength, period, advancement efficiency, state size, portability, and of course, fundamentally, random quality.
For now, I’d recommend the CURAND library since it’s polished and works for most applications very well. But I’ll be publishing my own work (and releasing new code) hopefully next month. (I’ve had that “next month” goal for 4 months now though…)

seibert · January 19, 2011, 8:06pm

I’m not aware of a CPU implementation all ready to go, but the CURAND manual does reference the paper that the algorithm was taken from. In addition, the device-side implementation source code can be found in the curand_kernel.h header in the toolkit. It’s a little hard to decipher the code since CURAND has some algorithm-agnostic pieces, but between the header and the paper, you could write a CPU implementation.