atomicAdd crash

Here’s a small test program:

__device__ __managed__ unsigned int count ;

// ----------------------------------------------------------------------------
__global__ void	Test( void ) 
{
	atomicAdd( &count, 1 ) ;
}

// ----------------------------------------------------------------------------
void	Cuda_Compute( void )
{
	dim3	numBlocks( 16, 16 ) ;

	for ( int i=0; i < 5; i++ ) {
	    numBlocks.x <<= 1 ;
	    numBlocks.y <<= 1 ;

	    Test <<< numBlocks, 64 >>> () ;

	    cudaError_t status = cudaDeviceSynchronize() ;
	    if ( status != cudaSuccess )
		assert( 0 ) ;

	    printf( "(%dx%dx%d) count=%d\n",  numBlocks.x,  numBlocks.y, 64, count ) ;
	    count = 0 ;
	}
}

On my GTX1080, it works for the steps up to 256x256x64 (i=0 to 3) but crashes on 512x512x64 (i=4).

Now, before you tell me: I know that atomicAdd serializes the accesses and slows everybody down. I use it for my undos: I compare 2 volumes and get a list of the changes. Usually these are small and I keep my real-time interaction. But some operations affect a big chunk of the volume. Those come from a menu selection in my application, and I don’t mind waiting a bit for the undo processing, but I do mind a crash!
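
Roughly, the idea looks like this (a simplified sketch, not my actual code; the names and the flat voxel layout are just for illustration):

__device__ __managed__ unsigned int changeCount ;

// One thread per voxel: compare the two volumes and append the index of
// every differing voxel to the change list, one atomicAdd per change.
__global__ void DiffVolumes( const unsigned char *before, const unsigned char *after,
                             unsigned int *changeList, unsigned int numVoxels )
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x ;

    if ( i < numVoxels && before[i] != after[i] ) {
        unsigned int slot = atomicAdd( &changeCount, 1 ) ;   // serializes on every change
        changeList[slot] = i ;
    }
}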

I guess there is a limit associated with atomicAdd? I didn’t see any mention of that in the docs…

Yves

What happens if you add proper status checks to all CUDA API functions and all kernel launches? Alternatively (or better: additionally), what happens if you run your code under the control of cuda-memcheck, with all relevant checking tools enabled?
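
For example, something along these lines (one common checking pattern; adapt as you see fit):

#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// A kernel launch needs two checks: one for the launch itself,
// one for errors that occur during execution:
//     Test<<<numBlocks, 64>>>();
//     CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//     CUDA_CHECK(cudaDeviceSynchronize());  // execution errors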

Bonjour njuffa and thank you for your suggestions.

But there’s no memory involved (just one unsigned int, the variable “count”), and it crashes before it overflows (512x512x64 = 0x0100 0000).

And as far as “proper status checks” go, I only have 1 launch, and I do test it. The only API call would be the atomicAdd, and I don’t think there’s a “proper status check” for that one?

The code I presented is not an excerpt, that’s the entire code! It’s just there to demonstrate the problem; I run it all by itself and… crash!

Yves

There aren’t any limits on atomicAdd along the lines of what you are suggesting.

I ran your code on CUDA 8RC on a Tesla K20X on RHEL 7 and it completed successfully for me with the following test case:

$ cat t1246.cu
#include <stdio.h>
#include <assert.h>
__device__ __managed__ unsigned int count ;

// ----------------------------------------------------------------------------
__global__ void Test( void )
{
        atomicAdd( &count, 1 ) ;
}

// ----------------------------------------------------------------------------
void    Cuda_Compute( void )
{
        dim3    numBlocks( 16, 16 ) ;

        for ( int i=0; i < 5; i++ ) {
            numBlocks.x <<= 1 ;
            numBlocks.y <<= 1 ;

            Test <<< numBlocks, 64 >>> () ;

            cudaError_t status = cudaDeviceSynchronize() ;
            if ( status != cudaSuccess )
                assert( 0 ) ;

            printf( "(%dx%dx%d) count=%d\n",  numBlocks.x,  numBlocks.y, 64, count ) ;
            count = 0 ;
        }
}

int main(){

  Cuda_Compute();
}
$ nvcc -arch=sm_35 -o t1246 t1246.cu
$ ./t1246
(32x32x64) count=65536
(64x64x64) count=262144
(128x128x64) count=1048576
(256x256x64) count=4194304
(512x512x64) count=16777216
$

What OS are you running this on? Is the GTX1080 servicing a display?

Hmm…

Indeed, there does not seem to be a limit on atomicAdd.

I changed my test a little bit: now I run multiple times with the same parameters (it seems to have problems when numBlocks reaches 232x232), and it crashes (to a black screen) after a number of iterations, not always the same number (5 to 10):

...
	dim3	numBlocks( 232, 232 ) ;

	for ( int i=0; i < 32; i++ ) {

	    Test <<< numBlocks, 64 >>> () ;
	...

If I increase from 232 to 234, it crashes around the second iteration. At 512, it always crashes on the first iteration.

I’m running on Windows 7 Pro, Service Pack 1.

And yes, the 1080 drives my 3 screens (res = 7680x144).

Now I’m completely lost…

txbob,

Would you mind emailing me the executable of your test (t1246.exe) so that I can run it here?

I’m using version 7.5 of the compiler. I don’t think using 8.0 would make a difference, but I’d like to try it.

Yves

martel(AT)acm.org

The crash might be due to the WDDM timeout on Windows. Have you made any changes to it?
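
You can check whether the watchdog applies to a given device with cudaGetDeviceProperties; a quick sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop ;
    cudaGetDeviceProperties( &prop, 0 ) ;   // device 0 = the GTX1080 here
    printf( "%s: run-time limit on kernels is %s\n", prop.name,
            prop.kernelExecTimeoutEnabled ? "enabled (watchdog active)" : "disabled" ) ;
    return 0 ;
}

If kernelExecTimeoutEnabled is 1, any single kernel that runs longer than the TDR delay (2 seconds by default) will get the device reset out from under it.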

Yes! That’s it!

Thank you txbob.

I changed my timeout from the default 2 seconds to 20 seconds, and now I can get to 512x512x64.

But I guess I should come up with a better algorithm for my undos in this situation. I don’t mind a couple of seconds’ wait, but if it takes 20 seconds on a GTX1080…

Can anybody suggest something? I have 2 volumes of the same dimensions, and I need a list of the voxels that are different…

Yves

A simple measure would be to aggregate the count of changed voxels at the block level, then do a single atomicAdd() per block, and distribute the resulting base index back to all threads in the block.
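
A sketch of that idea, building on the diff kernel earlier in the thread (untested, illustrative names):

__device__ __managed__ unsigned int changeCount ;

// Each block counts its changed voxels with cheap shared-memory atomics,
// reserves a contiguous range of the change list with a single global
// atomicAdd, then every changed thread writes into its own slot.
__global__ void DiffVolumesAggregated( const unsigned char *before, const unsigned char *after,
                                       unsigned int *changeList, unsigned int numVoxels )
{
    __shared__ unsigned int blockCount ;   // changes found by this block
    __shared__ unsigned int blockBase ;    // this block's offset in changeList

    if ( threadIdx.x == 0 )
        blockCount = 0 ;
    __syncthreads() ;

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x ;
    bool changed = ( i < numVoxels ) && ( before[i] != after[i] ) ;

    unsigned int local = 0 ;
    if ( changed )
        local = atomicAdd( &blockCount, 1 ) ;                 // shared-memory atomic
    __syncthreads() ;

    if ( threadIdx.x == 0 )
        blockBase = atomicAdd( &changeCount, blockCount ) ;   // one global atomic per block
    __syncthreads() ;

    if ( changed )
        changeList[blockBase + local] = i ;
}

This divides the global atomic traffic by blockDim.x. On sm_30 and later you could additionally aggregate within each warp first (e.g. with __ballot() and __popc()) before touching shared memory, which cuts down the shared-memory atomics as well.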