curand_init sequence number problem

Hi!

I’ve got a problem when trying to initialize a number of curandStates in a kernel. I have the following code:

#include "cuda_runtime.h"
#include "curand_kernel.h"
#include <time.h>

//Defined elsewhere, but put here for readability
#define DISPLAY_WIDTH 1920
#define DISPLAY_HEIGHT 1080

__global__ void CurandSetup(curandState* states, const unsigned long seed, const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x; //Range(0,2073599)
	
	curand_init(seed, idx, 0, &states[idx]); //Problem seems to be here
	if (x == 1919 && y == 1079)
	{
		printf("Yey, it works\n"); //If this prints, I think it works
	}
}

curandState* CurandInit()
{
	curandState* states;
	cudaMallocManaged(&states, DISPLAY_WIDTH * DISPLAY_HEIGHT * sizeof(curandState));
	
	const int block_x = 32;
	const int block_y = 32;
	const dim3 block(block_x, block_y);
	const int grid_x = (DISPLAY_WIDTH + block_x - 1) / block_x; //=60
	const int grid_y = (DISPLAY_HEIGHT + block_y - 1) / block_y; //=34
	const dim3 grid(grid_x, grid_y);
	cudaError_t err = cudaGetLastError(); //Reads and clears any previous error before the launch
	CurandSetup <<<grid, block>>> (states, time(NULL), DISPLAY_WIDTH, DISPLAY_HEIGHT);
	cudaDeviceSynchronize();
	err = cudaGetLastError(); //Success when run in "Release" in Visual Studio

	return states;
}

I have an OpenGL window, and I want a dedicated curandState set up for each pixel, since I spawn a thread per pixel in my main kernel. After reading the docs and looking at multiple samples, the above is what I came up with. However, for some reason it fails (nothing is printed after the initialization for x=1919 and y=1079), and the memory is labeled “Unable to read memory” in Visual Studio’s debugger. I’ve spent hours trying to find the issue, and it seems to be the idx passed to curand_init that causes the problem. If I pass 0 it works fine, but if I pass the highest index that can occur in my setting (2073599 = 1920*1080 - 1), it fails. I can’t find any reason why this happens…

All insight is greatly appreciated, thanks :)

When I run your code, it prints out “Yey, it works”.

However, your kernel takes 22 seconds to run on my GTX 960.

If you haven’t disabled the Windows WDDM TDR timeout, you will certainly be hitting that.

Indeed you are correct. Never thought this kernel would take that long; on my 850M it took 60 seconds. Thank you :)

Taking a look at the code for curand_init, it is clear that, due to the way the sequence number is used in for-loops in the subcall to _skipahead_sequence_scratch, the larger the sequence number, the longer initialization takes.

I guess the next question is whether this is the “best” way of doing this? The reason I want a curandState for each pixel is that, if I don’t, there will be trouble when multiple curand_uniform calls are made on the same curandState simultaneously. There are of course ways around that, but they would likely hurt performance at render time, which is undesirable for my application (path tracing). Any thoughts?

Correct, you wouldn’t want to do that.

Is there some reason you don’t just use a different seed for each state, instead?

When I do this:

curand_init(seed+idx, 0, 0, &states[idx]);

the kernel runs in 1.7ms.

Using large values of either subsequence or offset will result in increased computation times for curand_init. There might be reasons why you would want to pick a particular subsequence or offset, but I can’t discern them from what you’ve provided so far. Using a different seed for each thread should give a different sequence for each thread.

http://docs.nvidia.com/cuda/curand/device-api-overview.html#pseudorandom-sequences

“Different seeds are guaranteed to produce different starting states and different sequences.”

Also, I think you know this already, but CUDA code generally runs faster in “Release” projects vs. the corresponding “Debug” projects in VS. The debug projects add the -G switch, which results in generation of generally slower device code.

Different seeds imply different sequences, but not uncorrelated streams of pseudo-random numbers. Where uncorrelated streams are needed, proper offsetting or leapfrogging must be applied.

The Philox generator typically offers a favorable balance between the cost of properly offsetting many independent, uncorrelated PRNG streams and the cost of PRNG generation itself, which makes it suitable for many applications.
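For concreteness, here is a minimal sketch (mine, not from the thread) of that idea applied to the setup kernel above: the seed stays fixed, the per-pixel index goes into the subsequence parameter, and the Philox state type keeps that skip-ahead cheap. The kernel name PhiloxSetup is hypothetical.

__global__ void PhiloxSetup(curandStatePhilox4_32_10_t* states, const unsigned long long seed, const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;

	//Same seed everywhere, distinct subsequence per pixel: uncorrelated
	//streams by construction, and Philox skips ahead cheaply
	curand_init(seed, idx, 0, &states[idx]);
}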

I see; since most examples I saw used a thread’s global ID for the sequence number, I just assumed that was the best way to go. For my use case of generating new paths on the hemisphere defined by a point on a mesh and its normal, I think just using a different seed will work fine. Thank you both :)

On my 850M, making the change you mention got me down to 29ms, which is practically irrelevant as the kernel is only called once at start-up. And yes, I’ve been running in Release mode. When running in Debug, I get cudaErrorLaunchOutOfResources after the kernel call in the error checking seen in the code above. From what I’ve read, this is likely due to the number of registers the kernel uses, given the number of spawned threads, vs. the number of registers available in Debug mode, correct?

The output from compiling with -Xptxas="-v":

Used 63 registers, 6704 bytes cumulative stack size, 340 bytes cmem[0]

Yes, I would normally think the out-of-resources error is likely due to a registers-per-thread issue. However, the 850M is a compute capability 5.0 device (I believe), and it should have 64K registers per SM, so 63 registers per thread should not be an issue. It may be some other issue (stack/memory, etc.), unless that is the compiler output from Release mode instead of Debug mode.

Based on the suggestion by njuffa, I took your original code and the only modification I made was to replace each instance of curandState with curandStatePhilox4_32_10_t, and the kernel execution time dropped from ~23s to ~4ms. It seems that should give you uncorrelated sequences with relatively short setup time.

I didn’t benchmark actual generation times, and I can’t comment on other differences between Philox and XORWOW, but the documentation covers those.
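To make the swap concrete, here is a minimal sketch of a hypothetical consumer kernel (the name Shade and the out buffer are illustrative, not from the thread). curand_uniform accepts the Philox state just as it does the default curandState, so the type change is the only modification needed on the generation side too.

__global__ void Shade(curandStatePhilox4_32_10_t* states, float* out, const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;

	curandStatePhilox4_32_10_t local = states[idx]; //Work on a register copy
	float u = curand_uniform(&local); //Same device API as with curandState
	out[idx] = u;
	states[idx] = local; //Write the advanced state back for the next launch
}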

The trade-offs between XORWOW and Philox are as follows:

XORWOW provides the fastest PRNG generation of any generator in CURAND. It also provides the lowest-quality PRNGs (as measured by standard test suites of randomness) of the generators offered, though still sufficient for most applications. Its skip-ahead (offsetting) and leap-frogging are expensive.

Philox generates PRNGs more slowly than XORWOW, but is not the slowest generator offered by CURAND. The quality of its PRNGs is in the middle of the generators offered and should be suitable for all applications except possibly the most demanding ones (for which Mersenne Twister is indicated). Offsetting and leap-frogging are very cheap, cheaper than with any other generator offered by CURAND.

In situations like this one, where execution time is dominated by offsetting / leap-frogging of the generators (to provide uncorrelated PRNG streams), Philox is the obvious generator to try. I don’t offhand recall the trade-offs in terms of state size between XORWOW and Philox, but I believe the difference is minor.
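If the state size matters for your memory budget, a trivial host program like this sketch will report it for your toolkit (curandState is a typedef of curandStateXORWOW_t):

#include <cstdio>
#include <curand_kernel.h>

int main()
{
	//Prints the state struct sizes as seen by the host compiler
	printf("XORWOW state: %zu bytes\n", sizeof(curandStateXORWOW_t));
	printf("Philox state: %zu bytes\n", sizeof(curandStatePhilox4_32_10_t));
	return 0;
}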

Thanks for all the info, very much appreciated :)