On my GTX 1080, it works for the steps up to 256x256x64 (i = 0 to 3) but crashes on 512x512x64 (i = 4).
Now, before you tell me: I know that atomicAdd serializes the instructions and slows everybody down. I use it in my undos: I compare 2 volumes and get a list of changes. Usually these are small and I keep my real-time interaction. But some operations affect a big chunk of the volume. These come from a menu selection in my application, and I don’t mind waiting a bit for the undo processing, but I do mind a crash!
I guess there is a limit associated with atomicAdd? I didn’t see any mention of that in the docs…
What happens if you add proper status checks to all CUDA API functions and all kernel launches? Alternatively (or better: additionally), what happens if you run your code under the control of cuda-memcheck, with all relevant checking tools enabled?
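For reference, a minimal sketch of such status checking (assuming the CUDA runtime API; the macro name is just an illustration) could look like this:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the status returned by any CUDA runtime API call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// A kernel launch returns no status directly, so check it in two steps:
// cudaGetLastError() catches launch-configuration errors, and the
// synchronize surfaces errors that occur during kernel execution.
//
//     Test<<<numBlocks, 64>>>();
//     CUDA_CHECK(cudaGetLastError());
//     CUDA_CHECK(cudaDeviceSynchronize());
```

A kernel killed by the watchdog timer typically shows up here as `cudaErrorLaunchTimeout` on the synchronize.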
Bonjour njuffa and thank you for your suggestions.
But there’s no memory involved (just one unsigned int, the variable “count”), and it crashes before it overflows (512x512x64 = 0x0100 0000).
And as far as “proper status checks” go, I only have 1 launch, and I do test it. The only other call would be the atomicAdd, and I don’t think there’s a “proper status check” on that one?
The code I presented is not an excerpt; that’s the entire code! It’s just to demonstrate the problem. I run it all by itself and… crash!
I changed my test a little bit: now I run multiple times with the same parameters (it seems to have problems when numBlocks reaches 232), and it crashes (to a black screen) after a number of iterations, not always the same number (5 to 10):
...
dim3 numBlocks( 232, 232 );
for ( int i = 0; i < 32; i++ ) {
    Test <<< numBlocks, 64 >>> ();
...
If I increase from 232 to 234, it crashes around the second iteration. At 512, it always crashes at the first iteration.
I’m running on Windows 7 Pro, Service Pack 1.
And yes, the 1080 drives my 3 screens (res=7680x144).
I changed my timeout from the default 2 to 20 seconds, and now I can get to 512x512x64.
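For anyone else hitting this: the timeout in question is the Windows TDR (Timeout Detection and Recovery) watchdog, which resets any display GPU whose kernel runs longer than the limit. The delay can be raised through the documented `TdrDelay` registry value (requires admin rights and a reboot; adjust at your own risk):

```shell
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 20 /f
```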
But I guess I should come up with a better algorithm for my undos in this situation. I don’t mind a couple of seconds’ wait, but if it takes 20 sec on a GTX 1080…
Can anybody suggest something? I have 2 volumes of the same dimensions, and I need a list of the voxels that are different…
A simple measure would be to aggregate the number of changed voxels to the block level, then do a single atomicAdd() per block, and distribute the index back to all threads in the block.
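A sketch of that idea, with hypothetical names and assuming the two volumes are flat arrays of equal length, with `changed` receiving the linear indices of differing voxels:

```cuda
#include <cuda_runtime.h>

// One global atomicAdd per block instead of one per changed voxel:
// each block counts its changed voxels, reserves a contiguous range
// of the output list in a single shot, then its threads fill the range.
__global__ void DiffVolumes(const unsigned char *a, const unsigned char *b,
                            unsigned int n,
                            unsigned int *changed,   // output index list
                            unsigned int *count)     // global counter
{
    __shared__ unsigned int blockCount;  // changed voxels in this block
    __shared__ unsigned int blockBase;   // block's slice of the output

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    unsigned int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    bool         diff = (idx < n) && (a[idx] != b[idx]);

    // Each differing thread grabs a slot within the block; shared-memory
    // atomics are far cheaper than contended global ones.
    unsigned int local = diff ? atomicAdd(&blockCount, 1u) : 0u;
    __syncthreads();

    // The single global atomicAdd per block reserves the whole range.
    if (threadIdx.x == 0)
        blockBase = atomicAdd(count, blockCount);
    __syncthreads();

    if (diff)
        changed[blockBase + local] = idx;
}
```

Note that the resulting list is unordered across blocks; if the undo logic needs the indices sorted, sort the list afterwards (or run a two-pass count-then-prefix-sum scheme, which avoids global atomics entirely).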