Unspecified Launch Failure: adding "volatile" to shared memory causes random ULF

Hi,

I’m experiencing a ULF when executing the same kernel repeatedly on the same data, but this error seems to occur at different iterations each time I run the program. I’ve gone through the forums and read that a few people have experienced similar problems.

Have you guys got any suggestions on how to move forward? In particular, my kernel doesn’t exhibit the ULF at all if I remove the “volatile” keyword from shared memory, which is really strange.

I’ve kind of hit a brick wall on this :(

I can post my code if you think it will help, but it’s a bit long. The reason for using “volatile” is that I need threads to perform atomic operations on shared memory using the “write-combining” approach shown in the Histogram SDK example.

-Shihab

I have a lot of experience with random ULFs. Here are all the causes I’ve found:

  1. Reading/writing past the end of allocated memory.

This is easy to do: a common case is when rounding up to a whole number of blocks puts some threads past the end of the array (see the bounds-guard sketch at the end of this post). You can check for these cases by compiling in emulation and debug mode and running the program through valgrind (Linux only).

  2. Overheating GPU. After running for ~5 hours at 79°C in a severely cramped and under-cooled box, I’ve seen random kernels ULFing or hitting the 5s launch timeout.

  3. A hardware bug (my supposition: NVIDIA hasn’t confirmed this)

Some kernels with particular sequences of operations just seem to trigger random ULFs every ~40,000 calls. I have had two kernels that caused this. The first contained a particularly complicated set of warp divergences:

// rough code outline....
idx = threadIdx.x + blockIdx.x*blockDim.x;

for (i = 0; i < NN[idx]; i++)
   {
   ... do stuff ...
   }

NN[idx] varies from 40 to 120 from thread to thread, causing warp divergence at the tail of the loop. Changing the loop to the version below fixes the random ULF (tested for 150 million+ calls):

// rough code outline....
idx = threadIdx.x + blockIdx.x*blockDim.x;
NN_temp = NN[idx];

for (i = 0; i < maximum_NN; i++)
   {
   if (i < NN_temp)
      {
      ... do stuff ...
      }
   }

Here, the worst divergence that can happen is a two-way divergence on the if().

The other kernel that caused the problem was a massively complicated beast with hundreds of __syncthreads() and significant warp divergences. The problem went away when I rewrote the kernel in a much simpler way without shared memory (and thus no __syncthreads() anymore), even though the pattern of warp divergence is similar to before.
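To make cause 1 concrete, here is a minimal sketch (a made-up kernel and sizes, not from any of my real code) of the usual guard for when the grid is rounded up to a whole number of blocks:

// Sketch of cause 1 (hypothetical kernel): the grid is rounded up to a
// whole number of blocks, so the last block contains threads whose idx
// falls past the end of the array.
__global__ void scaleKernel(float *d_data, int N)
{
   int idx = threadIdx.x + blockIdx.x*blockDim.x;
   if (idx >= N)      // without this guard the tail threads read/write
      return;         // past the end of d_data -> candidate for random ULFs
   d_data[idx] *= 2.0f;
}

// host side: round the block count up, then rely on the in-kernel guard
// int blocks = (N + blockSize - 1) / blockSize;
// scaleKernel<<<blocks, blockSize>>>(d_data, N);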

Thanks for your explanation MisterAnderson42.

I had been doing bounds checking for global memory but not shared memory. As soon as this was rectified, the ULFs disappeared.

The reason for the randomness was that a single-precision float, converted to an int using “floorf”, could be negative if the float value was approximately zero but slightly on the negative side. Something like:

while (index_f > 2*pi)
  index_f -= 2*pi;

int index = floorf(index_f); // index could sometimes be < 0 on the GPU

I was using this value to index into the shared memory array.
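For anyone hitting the same thing, here is a sketch of the kind of guard that fixes it (variable names made up):

// Sketch of the fix (hypothetical names): the range reduction can leave
// index_f a hair below zero, so floorf() returns -1. Clamp before using
// it as a subscript into the shared memory array.
while (index_f > 2*pi)
  index_f -= 2*pi;

int index = floorf(index_f);
if (index < 0)   // catches the "slightly negative" case
  index = 0;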

Actually, upon more testing, it turns out the ULF still appears randomly.

I’ve checked the temp on my GPU and it doesn’t go past 70°C.

The number of iterations it dies on varied from 500 to 10000.

I kept cutting my code down till I could get a consistent ULF every time (and tried to mimic the Histogram-256 SDK code as much as possible).

Here’s the minimal code:

// forward declaration (the definition is below)
__device__ void addData256(volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag);

__global__ void buildDescriptorsMultiThreaded(int nPoints, float* d_MinMax_X, float* d_MinMax_Y, float* d_MinMax_S, float* d_MinMax_O, float* d_Descriptors, int octave, int w, int h)
{
  volatile __shared__ unsigned int hist[128];

  const int n  = blockIdx.x;
  const int tx = threadIdx.x;

  // tag the top 5 bits with the thread index, as in the histogram example
  const unsigned int threadTag = threadIdx.x << (32 - 5);

  // force a collision on index = 1 in shared memory
  // (hist is deliberately left uninitialised in this cut-down repro)
  addData256(hist, 1, threadTag);

  // need to output something
  d_Descriptors[128*n + tx] = __int_as_float(hist[tx]);
}

__device__ void addData256(volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag){
  unsigned int count;
  do{
    // read the current count, stripping the old 5-bit thread tag
    count = s_WarpHist[data] & 0x07FFFFFFU;
    // increment and stamp it with this thread's tag
    count = threadTag | (count + 1);
    s_WarpHist[data] = count;
    // retry until this thread's tagged write is the one that survived
  }while(s_WarpHist[data] != count);
}

This code crashes immediately with a ULF when I run it (run with 128 threads/block).
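For reference, here’s roughly how I’m launching it (a sketch only; the sizes are the ones from my test and the unused d_MinMax_* inputs are passed as NULL):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
   int nPoints = 300;   // 300 blocks -> immediate ULF for me
   float *d_Descriptors;
   cudaMalloc((void**)&d_Descriptors, 128 * nPoints * sizeof(float));

   // the d_MinMax_* inputs are unused in this cut-down repro, so pass NULL
   buildDescriptorsMultiThreaded<<<nPoints, 128>>>(nPoints, NULL, NULL, NULL,
                                                   NULL, d_Descriptors, 0, 0, 0);

   cudaError_t err = cudaThreadSynchronize();   // wait for the kernel
   if (err != cudaSuccess)
      printf("launch failed: %s\n", cudaGetErrorString(err));   // the ULF shows up here

   cudaFree(d_Descriptors);
   return 0;
}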

Removing the loop that does the atomic shared memory update gets rid of the ULF. Even just removing the “volatile” keyword gets rid of the ULF.

Is there anything obvious that looks wrong with the code?

Well, I have never heard of an atomic shared memory update. There is no such thing, I’m afraid.
I am not really good at C, so I don’t see what you want to do in that function. All I see is that there will be 4 warps accessing the same shared memory location, and a while loop that depends on that memory location. So for all I know, this while loop might be running for >5 sec, giving you the error (all 128 threads need the while loop to terminate for them).

Maybe by atomic shared mem update shihab means putting the update for each thread in a different if/else branch (a trick to do an atomic shared mem update; see the sketch below). I experienced ULFs when I had some conflicting shared memory writes months ago. But since the result was unpredictable and it was actually a bug in my code, I got rid of it after fixing that bug. BTW, conflicting writes to global memory didn’t cause a ULF in my case.
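Something like this, if I guess right (hypothetical code, not from the histogram example):

// Sketch of the if/else serialisation trick: walk the lanes of the warp
// one at a time so only one thread writes the shared bin per step. There
// is no race within a warp; with per-warp sub-histograms (as in the
// histogram example) there is no race across warps either.
__device__ void addSerialised(unsigned int *s_WarpHist, unsigned int bin)
{
   for (int lane = 0; lane < 32; lane++)   // one warp = 32 threads
   {
      if ((threadIdx.x & 31) == lane)      // exactly one thread matches
         s_WarpHist[bin]++;
   }
}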

Yeah, the atomic shared memory update is a software “trick”.

From my understanding, the loop should only execute 16 times at most (half-warp size), so I’m not sure why it’d be looping forever, or for all 128 threads. Although I’m not getting random ULFs anymore, the ULF now occurs consistently, on the first iteration.

This “trick” is explained in:

http://developer.download.nvidia.com/compu…c/histogram.pdf

Basically, each thread in the warp tags the shared memory value with its thread number (relative to the warp). If the value in shared memory is equal to the value the thread wrote, it exits the loop; otherwise it retries until success. After each iteration of the while loop, at least one thread must have successfully written to the shared memory, leading to a maximum of 16 iterations.

Is it possible for someone to try the code I’ve posted to verify there is a problem and it’s not just my machine? (I only have one CUDA machine.) Or is there anything obviously wrong with the code?

I think conflicting writes to the same shared memory location get serialised (like a bank conflict) and one of the writes succeeds. Has anyone else experienced similar issues with write conflicts, or with using “volatile” to ensure threads do not cache shared memory in registers?

Further, the error seems to go away if the number of blocks in the grid is reduced (!?).

So for example, if I launch with 300 blocks I get a ULF straight away. If I launch with 100 blocks, I get a ULF randomly after a few (3-4) iterations. If I launch with only 10 blocks, I can run for 10,000 iterations without any problem.

Any ideas?

I have exactly the same problem.

I do histogram analysis on RGB data as described in the previously mentioned example. When I run the program I randomly get ULFs (within 5000 runs). I tested it on both a GeForce 8800 GTX (compute capability 1.0) and a GeForce 8800 GT (1.1).

Note: I’m using the stable 1.1 and NOT 2.0 (beta).

edit: I tested running with WARP_LOG_SIZE set to 0 and then I get no ULFs; the ULF also does not occur when I set the WarpCount to 1.