XORWOW Generator Understanding in CURAND

I have implemented XORWOW generator given in the paper http://www.jstatsoft.org/v08/i14/paper in C and got a set of random numbers as output. Below is the snippet of the logic:

uint32_t xorwow() {
  static uint64_t  x = 123456789, y = 362436069, z = 521288629,
                   w = 88675123, v = 5783321, d = 6615241;
  uint64_t;
  t = (x ^ (x >> 2));
  x = y;
  y = z;
  z = w;
  w = v;
  v = (v ^ (v << 4)) ^ (t ^ (t << 1));
  return (d += 362437) + v;
}
//Marsaglia's paper uses unsigned long as the word type. This has been changed to uint32_t to ensure the word type is an unsigned 32-bit integer.

The output which I got was:

Output:
246875399  3690007200  1264581005  3906711041  1866187943  2481925219  2464530826  2677782455  3653403911  2504343560

Then I tried using the same XORWOW generator using curandCreateGenerator function with CURAND_RNG_PSEUDO_XORWOW as parameter and set the seed value to zero (ie. A value of 0 for seed sets the state to the values of the original published version of the xorwow algorithm). The code that is used is an example given in the cuRAND documentation http://docs.nvidia.com/cuda/curand/host-api-overview.html#host-api-example

The output which I got was:

output : 3179217846 3955638199 167591721 4161663997 3973448494   1917059131 2866113984 472148682 2019573489 2204150021

According to my understanding both the implementation should give the same set of random numbers because its the same logic. But I am getting different; Is my understanding wrong or Is there anything that I missing out ?

No, it doesn’t. You are mistaken. That is not published anywhere, and it is not correct, even according to your own test.

The original algorithm had at least 4 32-bit seeds. The CURAND XORWOW algorithm has a single 64-bit seed, and it is not published anywhere how that seed relates to the original algorithm seeds.

Hey txbob,

Thank you.

The line does appear in the documentation http://docs.nvidia.com/cuda/curand/group__DEVICE.html#group__DEVICE under the device ​ void curand_init which initializes the state of XORWOW.

They have also mentioned that the XORWOW generator uses XORWOW implementation of the Xorshift RNG and given this paper as reference http://www.jstatsoft.org/v08/i14/paper.

My doubt is since we use only one seed and it is given that

Here what does “values” mean? Does it mean the x, y, z, w, v, d that is used in the paper?

The CURAND documentation does indeed say that “when the seed is set to 0, state is initialized to the values of the original xorwow algorithm”, and it is a reasonable interpretation that this means that the internal state of CURAND’s XORWOW generator is initialized the way it is specified in Marsaglia’s original paper.

What is not clear is whether your code that sets up CURAND configures XORWOW in all aspects (not just the seed) such that one would expect the output to match the serial implementation. Parallelization of PRNGs involves splitting the original sequence into portions for individual threads, which could happen by interleaving or block-partitioning, so it is not immediately clear that any particular thread should produce the original sequence produced by the serial implementation.

I have not used CURAND so I cannot help assess the validity of your CURAND setup. I actually do not see buildable code in the original post above that would someone let knowledgeable asses it, either.

I guess the bigger question here is: Why would it be important whether CURAND’s XORWOW generator, for any particular thread, produces a sequence identical to the serial XORWOW generator? The significance of that escapes me.

Hi njuffa,

Thank you. But,

Below is an implementation that i got from:http://cs.umw.edu/~finlayson/class/fall14/cpsc425/notes/23-cuda-random.html

#include <unistd.h>
#include <stdio.h>

/* we need these includes for CUDA's random number stuff */
#include <curand.h>
#include <curand_kernel.h>


/* this GPU kernel function calculates a random number and stores it in the parameter */
__global__ void random(unsigned int* result) {
  /* CUDA's random number library uses curandState_t to keep track of the seed value
     we will store a random state for every thread  */
  curandStateXORWOW_t state;

  /* we have to initialize the state */
  curand_init(0, /* the seed controls the sequence of random values that are produced */
              0, /* the sequence number is only important with multiple cores */
              0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
              &state);

  /* curand works like rand - except that it takes a state as a parameter */
  *result = curand(&state);
}

int main( ) {
  /* allocate an int on the GPU */
  unsigned int* gpu_x;
  cudaMalloc((void**) &gpu_x, sizeof (unsigned int));

  /* invoke the GPU to initialize all of the random states */
  random<<<1, 1>>>(gpu_x);

  /* copy the random number back */
  unsigned int x;
  cudaMemcpy(&x, gpu_x, sizeof(unsigned int), cudaMemcpyDeviceToHost);

  printf("Random number = %u.\n", x);

  /* free the memory we allocated */
  cudaFree(gpu_x);

  return 0;
}

The above code is equal to a serial code in all aspects since the parameters are set to <<1,1>>, while calling the kernel function. Used nvcc to compile.

Below is the normal C implementation of the Marsaglia’s XORWOW random number generator. Used gcc to compile.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint32_t xorwow() {
  static uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123,
                  v = 5783321, d = 6615241;
  uint32_t t = (x ^ (x >> 2));
  x = y;
  y = z;
  z = w;
  w = v;
  v = (v ^ (v << 4)) ^ (t ^ (t << 1));
  return (d += 362437) + v;
}

int main(void) {
  size_t i;
  printf("xorwow:");
  printf("  %" PRIu32, xorwow());
  printf("\n");

  return 0;
}

Both give different random numbers.

Since, it both implements Marsaglia’s XORWOW random number generator, should it not give the same output given the same seed?

As I said, I am not familiar with CURAND, so I cannot assess whether the initialization shown in the code above should result in the same sequence being returned as the serial code. If the initialization is appropriate, presumably the answer to your question is “yes”. Assuming that is the case, I see at least two possible scenarios (which doesn’t mean yet other scenarios couldn’t apply):

[1] The CURAND documentation could be in error, and passing a seed of zero is not in fact intended to be equivalent to the seeding specified for the published serial algorithm. Cut & paste errors have known to occur in CUDA documentation, and I see that identical language about a seed of zero is mentioned for at least one other generator.

[2] The CURAND documentation correctly states what is supposed to happen, but there is an error in the implementation of the internal seeding (this could be as simple as a typo in one of the published seed values). As far as I understand, given a fairly short sequence of output values, one can determine the internal state of simple generators like XORWOW, but I do not know off-hand how to do it, so I can’t determined which seed values are actually used.

Best I can tell, you have posted nearly identical questions about CURAND’s XORWOW generator to at least three different forums, so if you are worried about possibilities [1] and [2] above, I would suggest filing a bug report with NVIDIA. Personally, I don’t see how the exact seed values matter in practical terms.

In case anyone ever wonders, there is something on top on the initialisation of the state. This creates the curand numbers:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint32_t xorwow_curand(unsigned long long seed) {

  // Initialisation
  static uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123, v = 5783321, d = 6615241;
  unsigned int s0 = ((unsigned int) seed) ^ 0xaad26b49UL;
  unsigned int s1 = (unsigned int)(seed >> 32) ^ 0xf7dcefddUL;
  unsigned int t0 = 1099087573UL * s0;
  unsigned int t1 = 2591861531UL * s1;
  d = 6615241 + t1 + t0;
  x = x + t0;
  y = y ^ t0;
  z = z + t1;
  w = w ^ t1;
  v = v + t0;

  // Run
  uint32_t t = (x ^ (x >> 2));
  x = y;
  y = z;
  z = w;
  w = v;
  v = (v ^ (v << 4)) ^ (t ^ (t << 1));
  return (d += 362437) + v;
}

int main(void) {
  unsigned long long seed = 0;

  printf("xorwow:");
  printf("  %" PRIu32, xorwow());
  printf("\n");

  return 0;
}

Edit: Only the first number fits, as you would have to split the initilisation from the run part.

1 Like