atomicAdd crash

Here’s a small test program:

__device__ __managed__ unsigned int count ;

// ----------------------------------------------------------------------------
__global__ void	Test( void ) 
{
	atomicAdd( &count, 1 ) ;
}

// ----------------------------------------------------------------------------
void	Cuda_Compute( void )
{
	dim3	numBlocks( 16, 16 ) ;

	for ( int i=0; i < 5; i++ ) {
	    numBlocks.x <<= 1 ;
	    numBlocks.y <<= 1 ;

	    Test <<< numBlocks, 64 >>> () ;

	    cudaError_t status = cudaDeviceSynchronize() ;
	    if ( status != cudaSuccess )
		assert( 0 ) ;

	    printf( "(%dx%dx%d) count=%d\n",  numBlocks.x,  numBlocks.y, 64, count ) ;
	    count = 0 ;
	}
}

On my GTX1080, it works for the steps up to 256x256x64 (i=0 to 3) but crashes on 512x512x64 (i=4).

Now, before you tell me: I know that atomicAdd serializes the accesses and slows everybody down. I use it for my undos: I compare 2 volumes and get a list of the changes. Usually these are small and I keep my real-time interaction. But some operations affect a big chunk of the volume. Those come from a menu selection in my application, and I don’t mind waiting a bit for the undo processing, but I do mind a crash!
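
Roughly, the idea looks like this (a simplified sketch, not my actual code; the names and the flat voxel layout are just for illustration):

__device__ __managed__ unsigned int changeCount ;

// One thread per voxel: compare the two volumes and append the index of
// every differing voxel to the change list, one atomicAdd per change.
__global__ void DiffVolumes( const unsigned char *before, const unsigned char *after,
                             unsigned int *changeList, unsigned int numVoxels )
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x ;

    if ( i < numVoxels && before[i] != after[i] ) {
        unsigned int slot = atomicAdd( &changeCount, 1 ) ;   // serializes on every change
        changeList[slot] = i ;
    }
}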

I guess there is a limit associated with atomicAdd? I didn’t see any mention of that in the docs…

Yves

What happens if you add proper status checks to all CUDA API functions and all kernel launches? Alternatively (or better: additionally), what happens if you run your code under the control of cuda-memcheck, with all relevant checking tools enabled?
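
For example, something along these lines (one common checking pattern; adapt as you see fit):

#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// A kernel launch needs two checks: one for the launch itself,
// one for errors that occur during execution:
//     Test<<<numBlocks, 64>>>();
//     CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//     CUDA_CHECK(cudaDeviceSynchronize());  // execution errors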

Bonjour njuffa and thank you for your suggestions.

But there’s no memory involved (just one unsigned int, the variable “count”), and it crashes before it overflows (512x512x64 = 0x0100 0000).

And as far as “proper status checks” go, I only have 1 launch, and I do test it. The only API call would be the atomicAdd, and I don’t think there’s a “proper status check” for that one?

The code I presented is not an excerpt, that’s the entire code! It’s just there to demonstrate the problem; I run it all by itself and… crash!

Yves

There aren’t any limits on atomicAdd along the lines of what you are suggesting.

I ran your code on CUDA 8RC on a Tesla K20X on RHEL 7 and it completed successfully for me with the following test case:

$ cat t1246.cu
#include <stdio.h>
#include <assert.h>
__device__ __managed__ unsigned int count ;

// ----------------------------------------------------------------------------
__global__ void Test( void )
{
        atomicAdd( &count, 1 ) ;
}

// ----------------------------------------------------------------------------
void    Cuda_Compute( void )
{
        dim3    numBlocks( 16, 16 ) ;

        for ( int i=0; i < 5; i++ ) {
            numBlocks.x <<= 1 ;
            numBlocks.y <<= 1 ;

            Test <<< numBlocks, 64 >>> () ;

            cudaError_t status = cudaDeviceSynchronize() ;
            if ( status != cudaSuccess )
                assert( 0 ) ;

            printf( "(%dx%dx%d) count=%d\n",  numBlocks.x,  numBlocks.y, 64, count ) ;
            count = 0 ;
        }
}

int main(){

  Cuda_Compute();
}
$ nvcc -arch=sm_35 -o t1246 t1246.cu
$ ./t1246
(32x32x64) count=65536
(64x64x64) count=262144
(128x128x64) count=1048576
(256x256x64) count=4194304
(512x512x64) count=16777216
$

What OS are you running this on? Is the GTX1080 servicing a display?

Hmm…

Indeed, there does not seem to be a limit on atomicAdd.

I changed my test a little bit: now I run multiple times with the same parameters (it seems to have problems when numBlocks reaches 232x232), and it crashes (to a black screen) after a number of iterations, not always the same number (5 to 10):

...
	dim3	numBlocks( 232, 232 ) ;

	for ( int i=0; i < 32; i++ ) {

	    Test <<< numBlocks, 64 >>> () ;
	...

If I increase from 232 to 234, it crashes around the second iteration. At 512, it always crashes on the first iteration.

I’m running on Windows 7 Pro, Service Pack 1.

And yes, the 1080 drives my 3 screens (res = 7680x144).

Now I’m completely lost…

txbob,

Would you mind emailing me the executable of your test (t1246.exe) so that I can run it here?

I’m using version 7.5 of the compiler. I don’t think using 8.0 would make a difference, but I’d like to try it.

Yves

martel(AT)acm.org

The crash might be due to the WDDM timeout on Windows. Have you made any changes to it?
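
You can check whether the watchdog applies to a given device with cudaGetDeviceProperties; a quick sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop ;
    cudaGetDeviceProperties( &prop, 0 ) ;   // device 0 = the GTX1080 here
    printf( "%s: run-time limit on kernels is %s\n", prop.name,
            prop.kernelExecTimeoutEnabled ? "enabled (watchdog active)" : "disabled" ) ;
    return 0 ;
}

If kernelExecTimeoutEnabled is 1, any single kernel that runs longer than the TDR delay (2 seconds by default) will get the device reset out from under it.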

Yes! That’s it!

Thank you txbob.

I changed my timeout from the default 2 seconds to 20 seconds, and now I can get to 512x512x64.

But I guess I should come up with a better algorithm for my undos in this situation. I don’t mind a couple of seconds’ wait, but if it takes 20 seconds on a GTX1080…

Can anybody suggest something? I have 2 volumes of the same dimensions, and I need a list of the voxels that are different…

Yves

A simple measure would be to aggregate the count of changed voxels at the block level, then do a single atomicAdd() per block, and distribute the resulting base index back to all threads in the block.
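
A sketch of that idea, building on the diff kernel earlier in the thread (untested, illustrative names):

__device__ __managed__ unsigned int changeCount ;

// Each block counts its changed voxels with cheap shared-memory atomics,
// reserves a contiguous range of the change list with a single global
// atomicAdd, then every changed thread writes into its own slot.
__global__ void DiffVolumesAggregated( const unsigned char *before, const unsigned char *after,
                                       unsigned int *changeList, unsigned int numVoxels )
{
    __shared__ unsigned int blockCount ;   // changes found by this block
    __shared__ unsigned int blockBase ;    // this block's offset in changeList

    if ( threadIdx.x == 0 )
        blockCount = 0 ;
    __syncthreads() ;

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x ;
    bool changed = ( i < numVoxels ) && ( before[i] != after[i] ) ;

    unsigned int local = 0 ;
    if ( changed )
        local = atomicAdd( &blockCount, 1 ) ;                 // shared-memory atomic
    __syncthreads() ;

    if ( threadIdx.x == 0 )
        blockBase = atomicAdd( &changeCount, blockCount ) ;   // one global atomic per block
    __syncthreads() ;

    if ( changed )
        changeList[blockBase + local] = i ;
}

This divides the global atomic traffic by blockDim.x. On sm_30 and later you could additionally aggregate within each warp first (e.g. with __ballot() and __popc()) before touching shared memory, which cuts down the shared-memory atomics as well.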