Inconsistent offsets in CURAND Host API
Hi,

I've experimented with CURAND's host API using multiple GPUs, and also multi-threaded CPUs as a fallback solution. To make the sequences generated consistent with a single-GPU or single CPU sequence, I need to use the curandSetGeneratorOffset function. The problem is, that this function has inconsistent behaviour for different generators, and for the MRG32K3A, I can't achieve what I want at all. Here are my findings:

Generator XORWOW:
[list]
[*] offsets work correctly for curandGenerate, i.e., offsetting by 1 result in the second number in original sequence
[*] same for curandGenerateUniform (float version) - offset of 1 -> second number in original sequence
[*] for curandGenerateUniformDouble, each number seems to require 2 items, so offset of 2 is needed
[/list]

Generator MRG32K3A:
[list]
[*] offset of 1 results in offset of 4096 in original sequence
[*] float and double version both only require a single number from the generator (factor 2 not needed)
[*] if I want an offset of 1 (and not 4096), I cannot achieve this. I probably need to offset by 2^76, which is impossible with the unsigned long long parameter type (unless calling the offset func 2^12 times)
[/list]

Questions:
[list]
[*] Did anyone come across similar problems?
[*] Is there any chance to get the offsetting function to work in the same way for all generators?
[*] How can I achieve an offset of 1 (2, 3, ...) with MRG32K3a?
[*] Any workarounds (both CPU and GPU)?
[/list]
Hi,



I've experimented with CURAND's host API using multiple GPUs, and also multi-threaded CPUs as a fallback solution. To make the sequences generated consistent with a single-GPU or single CPU sequence, I need to use the curandSetGeneratorOffset function. The problem is, that this function has inconsistent behaviour for different generators, and for the MRG32K3A, I can't achieve what I want at all. Here are my findings:



Generator XORWOW:


  • offsets work correctly for curandGenerate, i.e., offsetting by 1 result in the second number in original sequence
  • same for curandGenerateUniform (float version) - offset of 1 -> second number in original sequence
  • for curandGenerateUniformDouble, each number seems to require 2 items, so offset of 2 is needed




Generator MRG32K3A:


  • offset of 1 results in offset of 4096 in original sequence
  • float and double version both only require a single number from the generator (factor 2 not needed)
  • if I want an offset of 1 (and not 4096), I cannot achieve this. I probably need to offset by 2^76, which is impossible with the unsigned long long parameter type (unless calling the offset func 2^12 times)




Questions:


  • Did anyone come across similar problems?
  • Is there any chance to get the offsetting function to work in the same way for all generators?
  • How can I achieve an offset of 1 (2, 3, ...) with MRG32K3a?
  • Any workarounds (both CPU and GPU)?

#1
Posted 04/24/2012 12:04 PM   
Hi,
Thanks for the post, you have indeed found some discrepancies in CURAND's offset functionality. A bug has been filed on our end to get the offset function to work the same for MRG32K3A as for the others (and look into other related issues). Currently, as you have observed, each unit of offset advances MRG32K3A by 4K positions in the sequence.

As a wokaround, I've been able to get the following to work for me for MRG32K3A:

curandSetGeneratorOffset( generator, offset>>12);
if ((offset % 4096) != 0){
curandGenerate( generator, deviceIntPtr, (offset % 4096) );
}

... where deviceIntPtr points to a scratch area on the device large enough to hold 4K ints.... The above code puts the generator in the same state as when I call curandGenerate(...) iteratively to generate offset ints.

Let me know if this isn't working for your case.

regards
Paul S.
Hi,

Thanks for the post, you have indeed found some discrepancies in CURAND's offset functionality. A bug has been filed on our end to get the offset function to work the same for MRG32K3A as for the others (and look into other related issues). Currently, as you have observed, each unit of offset advances MRG32K3A by 4K positions in the sequence.



As a wokaround, I've been able to get the following to work for me for MRG32K3A:



curandSetGeneratorOffset( generator, offset>>12);

if ((offset % 4096) != 0){

curandGenerate( generator, deviceIntPtr, (offset % 4096) );

}



... where deviceIntPtr points to a scratch area on the device large enough to hold 4K ints.... The above code puts the generator in the same state as when I call curandGenerate(...) iteratively to generate offset ints.



Let me know if this isn't working for your case.



regards

Paul S.

#2
Posted 05/02/2012 09:04 PM   
Thanks Paul for the workaround. It worked in the most basic configuration. But I am running into a whole load of other problems with the offsets - far too many different situations to describe them all here (I will give one example later).

What I am trying to achieve is a multi-GPU or multi-threaded CPU implementation of a random number generator, wrapping cuRAND. It should always produce the same sequence, no matter which GPUs or CPUs are involved - so clearly I need the offset functions to work well. Also it should be able to generate float and double, for uniform, normal, and lognormal distributions. Here are some of the issues I run into:

[b]Offsets behave differently for different distributions, data types, and host/device (already applying the MRG32K3A workaround from previous post):[/b]
[list]
[*] CPU, XORWOW, uniform/normal/lognormal, float: offset n
[*] CPU, XORWOW, uniform, double: offset 2*n
[*] CPU, XORWOW, normal/lognormal, double: offset 4*n
[*] GPU, XORWOW, uniform, float: offset n
[*] GPU, XORWOW, uniform, double: offset 2*n
[*] GPU, XORWOW, normal/lognormal, float: offset 2*n
[*] GPU, XORWOW, normal/lognormal, double: offset 4*n
[*] CPU, MRG32K3A, uniform/normal/lognormal, float/double: offset n
[*] CPU, MRG32K3A, normal/lognormal, double: offset 2*n
[*] GPU, MRG32K3A, uniform, float/double: offset n
[*] GPU, MRG32K3A, normal/lognormal, float/double: offset 2*n
[/list]

[b]Offsets and above adjustments behave differently if n % 4096 != 0[/b]
All of the above works fine as long as I generate the numbers with a split of multiples of 4096 in each generator. If I use different splits (always taking care that normal/lognormal has a multiple of two), things start to go really wrong. For example, using XORWOW, normal distribution, double on the GPU, and applying an offset of 8194 (with above corrections, so 4*8194), the numbers are all correct compared to a continues host sequence up to index 16383, and from then on they are all wrong. There are many more examples like this, where offset is correct in the first place, and starts going wrong after a few numbers.

Further, the above adjustments seem to change if the number is not a multiple of 4096. The MRG32K3A suddenly needs an offset of 1 (not 2*n) for double/normal. Etc. ...

Basically, I can't find a set of offset adjustments and workarounds like the one you posted that makes the offsets work in a consistent manner and let me generate the same sequence from the same seed, no matter where I execute things.

Do you have any suggestions what I can do? Or why the above is happening?
Thanks Paul for the workaround. It worked in the most basic configuration. But I am running into a whole load of other problems with the offsets - far too many different situations to describe them all here (I will give one example later).



What I am trying to achieve is a multi-GPU or multi-threaded CPU implementation of a random number generator, wrapping cuRAND. It should always produce the same sequence, no matter which GPUs or CPUs are involved - so clearly I need the offset functions to work well. Also it should be able to generate float and double, for uniform, normal, and lognormal distributions. Here are some of the issues I run into:



Offsets behave differently for different distributions, data types, and host/device (already applying the MRG32K3A workaround from previous post):


  • CPU, XORWOW, uniform/normal/lognormal, float: offset n
  • CPU, XORWOW, uniform, double: offset 2*n
  • CPU, XORWOW, normal/lognormal, double: offset 4*n
  • GPU, XORWOW, uniform, float: offset n
  • GPU, XORWOW, uniform, double: offset 2*n
  • GPU, XORWOW, normal/lognormal, float: offset 2*n
  • GPU, XORWOW, normal/lognormal, double: offset 4*n
  • CPU, MRG32K3A, uniform/normal/lognormal, float/double: offset n
  • CPU, MRG32K3A, normal/lognormal, double: offset 2*n
  • GPU, MRG32K3A, uniform, float/double: offset n
  • GPU, MRG32K3A, normal/lognormal, float/double: offset 2*n




Offsets and above adjustments behave differently if n % 4096 != 0

All of the above works fine as long as I generate the numbers with a split of multiples of 4096 in each generator. If I use different splits (always taking care that normal/lognormal has a multiple of two), things start to go really wrong. For example, using XORWOW, normal distribution, double on the GPU, and applying an offset of 8194 (with above corrections, so 4*8194), the numbers are all correct compared to a continues host sequence up to index 16383, and from then on they are all wrong. There are many more examples like this, where offset is correct in the first place, and starts going wrong after a few numbers.



Further, the above adjustments seem to change if the number is not a multiple of 4096. The MRG32K3A suddenly needs an offset of 1 (not 2*n) for double/normal. Etc. ...



Basically, I can't find a set of offset adjustments and workarounds like the one you posted that makes the offsets work in a consistent manner and let me generate the same sequence from the same seed, no matter where I execute things.



Do you have any suggestions what I can do? Or why the above is happening?

#3
Posted 05/16/2012 04:50 PM   
Hi,
This is indeed a bit confusing, and at the very least warrants better documentation. It’s also made more complicated by a bug in the way offsets are handled in the normal and log normal distributions. To understand why you can’t get back in sync unless you draw multiples of 4096, here’s an example of what happens when you draw uniform doubles on an XORWOW generator.

The curand host API invokes the kernel generators with a block count of 64 and a thread count of 64. Each of the 4096 threads is operating on a state in a different part of the main sequence, spaced apart by 2^67. The basic issue is that when you generate integers or uniform floats, each thread advances its state by 1 for each output sample. When you generate uniform doubles with XORWOW, each thread advances its state by two, so unless each thread produces the same number of output samples, some threads get out of sync (i.e. they are too far ahead of the other threads)

Suppose all threads start at position zero. If you draw 6144 integers, all 4096 states get advanced by one, and states 0..2047 are advanced a second time to state 2. If you draw 4096 more samples, curand starts with state 2048 and advances states 2048..4095 to state 2, and wraps to advance states 0..2047 to state 3.

When you call curandGenerateUniformDouble with XORWOW, each thread draws two integers, which it combines to make a 53 bit mantissa. Suppose again you start with all threads at position zero. If you draw 2048 samples, 2048 of the threads advance their state to position 2, and the remaining threads (2048..4095) are still at position zero. Now if you draw 4096 integers, states 2048..4095 get advanced to state 1, and states 0..2047 get advanced to state 3. There’s no way (short of a lot of complicated book-keeping) to get back to the “usual” situation where some of the states are only one ahead of the others. This is why, as you have observed, you can only get offsets to work if you draw multiples of 4096 when calling curandGenerateUniformDouble with an XORWOW generator.

When you generate normals or log-normals with XORWOW or MRG32K3A, curand uses the Box-Muller transform to get from the uniform distribution to the normal/log-normal distribution. In this case, each thread draws two uniforms, and returns two results. Since each thread returns two results, you have to draw 8192 samples to cycle through and advance all 4096 states.

As I mentioned this is all made more complicated by a bug I found while looking at the normal/log-normal code, to see what’s happening. The workaround for that bug, to get offsets to work, is to generate something other than normal or log-normal results the first time you invoke curand after creating the generator or setting the offset.

To summarize then,

For XORWOW uniform double you need to draw multiples of 4096

MRG32K3A draws a single result to generate a uniform double, so it does not have the above restriction.

For XORWOW and MRG32K3A, normal and log-normal, single and double, you need to draw multiples of 8192.

For XORWOW all double precision results draw two samples for each result.

For MRG23K3A all results draw one input sample for each result.

That’s the situation as of CUDA 4.2. I’ve fixed the offset bug in normal and log-normal, and expect that fix to get into CUDA 5.0. That would eliminate the need to generate ints or uniforms before generating any normals or log-normals

For the future, I could envision adding an ordering type to curandSetGeneratorOrdering, that would cause curand to use only one sample per result, and to use the icdf method for generating normal/log-normal.

This would effectively eliminate all of the above restrictions. Generating n samples of any distribution would result in the same state as if you created a new generator and set the offset to n. Would this resolve your issues, or is there something more we should be providing? We’re always interested in feedback on our library functions.
Hi,

This is indeed a bit confusing, and at the very least warrants better documentation. It’s also made more complicated by a bug in the way offsets are handled in the normal and log normal distributions. To understand why you can’t get back in sync unless you draw multiples of 4096, here’s an example of what happens when you draw uniform doubles on an XORWOW generator.



The curand host API invokes the kernel generators with a block count of 64 and a thread count of 64. Each of the 4096 threads is operating on a state in a different part of the main sequence, spaced apart by 2^67. The basic issue is that when you generate integers or uniform floats, each thread advances its state by 1 for each output sample. When you generate uniform doubles with XORWOW, each thread advances its state by two, so unless each thread produces the same number of output samples, some threads get out of sync (i.e. they are too far ahead of the other threads)



Suppose all threads start at position zero. If you draw 6144 integers, all 4096 states get advanced by one, and states 0..2047 are advanced a second time to state 2. If you draw 4096 more samples, curand starts with state 2048 and advances states 2048..4095 to state 2, and wraps to advance states 0..2047 to state 3.



When you call curandGenerateUniformDouble with XORWOW, each thread draws two integers, which it combines to make a 53 bit mantissa. Suppose again you start with all threads at position zero. If you draw 2048 samples, 2048 of the threads advance their state to position 2, and the remaining threads (2048..4095) are still at position zero. Now if you draw 4096 integers, states 2048..4095 get advanced to state 1, and states 0..2047 get advanced to state 3. There’s no way (short of a lot of complicated book-keeping) to get back to the “usual” situation where some of the states are only one ahead of the others. This is why, as you have observed, you can only get offsets to work if you draw multiples of 4096 when calling curandGenerateUniformDouble with an XORWOW generator.



When you generate normals or log-normals with XORWOW or MRG32K3A, curand uses the Box-Muller transform to get from the uniform distribution to the normal/log-normal distribution. In this case, each thread draws two uniforms, and returns two results. Since each thread returns two results, you have to draw 8192 samples to cycle through and advance all 4096 states.



As I mentioned this is all made more complicated by a bug I found while looking at the normal/log-normal code, to see what’s happening. The workaround for that bug, to get offsets to work, is to generate something other than normal or log-normal results the first time you invoke curand after creating the generator or setting the offset.



To summarize then,



For XORWOW uniform double you need to draw multiples of 4096



MRG32K3A draws a single result to generate a uniform double, so it does not have the above restriction.



For XORWOW and MRG32K3A, normal and log-normal, single and double, you need to draw multiples of 8192.



For XORWOW all double precision results draw two samples for each result.



For MRG23K3A all results draw one input sample for each result.



That’s the situation as of CUDA 4.2. I’ve fixed the offset bug in normal and log-normal, and expect that fix to get into CUDA 5.0. That would eliminate the need to generate ints or uniforms before generating any normals or log-normals



For the future, I could envision adding an ordering type to curandSetGeneratorOrdering, that would cause curand to use only one sample per result, and to use the icdf method for generating normal/log-normal.



This would effectively eliminate all of the above restrictions. Generating n samples of any distribution would result in the same state as if you created a new generator and set the offset to n. Would this resolve your issues, or is there something more we should be providing? We’re always interested in feedback on our library functions.

#4
Posted 05/21/2012 11:49 PM   
Thanks for the clarifications Paul. I think I understood the issues and how it works, and can probably find a workaround to make it all work for my code.

Remaining questions:
[list]
[*] Why does the XORWOW use 2 samples for a double and the MRG only 1 sample? I understand that a sample is a 32bit integer, so generating the 53bit mantissa for a double would always need two samples for a full range of values, right? Or is there another reason for this?
[*] How is the different behaviour between Host and Device versions explained? As seen in the post above, the results/offsets are not consistent between CPU and GPU.
[/list]

Looking a the API, I think there is another inconsistency that I saw (but I guess that's hard to change now). The uniform distribution also has two parameters - a lower and upper bound. So I think the three functions for generating values according to a distribution should have been:
[code]
curandGenerateNormal (gen, size, mean, stddev)
curandGenerateLogNormal(gen, size, mu, sigma) // mu/sigma parameters of lognormal are not the mean/stddev of the generated samples
curandGenerateUniform (gen, size, lower, upper)
[/code]
Of course it is no problem to transform [0,1] results, but for consistency I expected the above functions in the API.
Thanks for the clarifications Paul. I think I understood the issues and how it works, and can probably find a workaround to make it all work for my code.



Remaining questions:


  • Why does the XORWOW use 2 samples for a double and the MRG only 1 sample? I understand that a sample is a 32bit integer, so generating the 53bit mantissa for a double would always need two samples for a full range of values, right? Or is there another reason for this?
  • How is the different behaviour between Host and Device versions explained? As seen in the post above, the results/offsets are not consistent between CPU and GPU.




Looking a the API, I think there is another inconsistency that I saw (but I guess that's hard to change now). The uniform distribution also has two parameters - a lower and upper bound. So I think the three functions for generating values according to a distribution should have been:



curandGenerateNormal (gen, size, mean, stddev)

curandGenerateLogNormal(gen, size, mu, sigma) // mu/sigma parameters of lognormal are not the mean/stddev of the generated samples

curandGenerateUniform (gen, size, lower, upper)


Of course it is no problem to transform [0,1] results, but for consistency I expected the above functions in the API.

#5
Posted 05/22/2012 11:17 AM   
Managed to get it all working now, by using only curandGenerate (with the MRG32K3A workaround for the offsets you proposed earlier) and doing the normal/lognormal and float/double transformations manually in custom kernels (or on CPU).

That's probably an idea to fix all the above issues - if the transformations to float/double and the distributions are done in a separate kernel call, after the generate, the problems with the Box-Mueller transform bringing things out of sync don't exist anymore. And always using two unsigned int samples for a double, and one for a single, makes everything consistent. The only problem for the library is then that the device API generates a different sequence if the curand_normal etc. are used there...
Managed to get it all working now, by using only curandGenerate (with the MRG32K3A workaround for the offsets you proposed earlier) and doing the normal/lognormal and float/double transformations manually in custom kernels (or on CPU).



That's probably an idea to fix all the above issues - if the transformations to float/double and the distributions are done in a separate kernel call, after the generate, the problems with the Box-Mueller transform bringing things out of sync don't exist anymore. And always using two unsigned int samples for a double, and one for a single, makes everything consistent. The only problem for the library is then that the device API generates a different sequence if the curand_normal etc. are used there...

#6
Posted 05/23/2012 04:53 PM   
If you are doing your own Box-Muller transforms, I strongly recommend the use of sincos(). If you are already using the CUDA 5.0 preview, you would want to use sincospi() instead for an additional performance boost. For single-precision, try __sincosf() for best performance.
If you are doing your own Box-Muller transforms, I strongly recommend the use of sincos(). If you are already using the CUDA 5.0 preview, you would want to use sincospi() instead for an additional performance boost. For single-precision, try __sincosf() for best performance.

#7
Posted 05/23/2012 05:54 PM   
Thanks, I'll try that.
Thanks, I'll try that.

#8
Posted 05/23/2012 06:32 PM   
Scroll To Top