CUDA racecheck, shared memory array and cudaDeviceSynchronize()
I posted this problem on Stack Overflow but never got an answer (see http://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize). I may be luckier here.

This was tested on Linux with:

GPU: GeForce GT 650M
Driver Version: 313.09
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221


Consider the following simple program:

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X*SIZE_Y*NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;

    //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
    kernel_test<NTHREADS><<<64, NTHREADS>>>();

    cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}


This can be compiled with nvcc test.cu --ptxas-options=-v -o test

Running the program under cuda-memcheck with the racecheck tool (cuda-memcheck --tool racecheck test), I get errors depending on the number of blocks, and this seems to be caused by cudaDeviceSynchronize().

The errors detected by the tool look like this:


========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128

========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0


Let's consider two different cases:
  • kernel_test<NTHREADS><<<32, NTHREADS>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
  • kernel_test<NTHREADS><<<64, NTHREADS>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.


So what am I doing wrong here? Am I doing something wrong with shared memory?

#1
Posted 01/10/2013 01:57 PM   
For starters, the cudaDeviceSynchronize() isn't the cause; your kernel is. Kernel launches are asynchronous, so the error is simply reported at your call to cudaDeviceSynchronize().

As for the kernel, your shared memory array has SIZE_X*SIZE_Y*NTHREADS elements (which in the example translates to 512 elements per block). In your nested loops you index into it using [i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] -- this is where your problem is.

To be more specific, your i and j values range over [0, 4), your threadIdx.x over [0, 32), and your SIZE_{X | Y} values are 4.
When blockDim.x is 64, the maximum index used in the loop is 991 (from 3*64*4 + 3*64 + 31); when blockDim.x is 32, the maximum index is 511.

Based on your code, you should get errors whenever your NBLOCKS exceeds your NTHREADS.
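
For illustration only -- this is my own sketch, not code from your post -- if each thread is meant to own a SIZE_X x SIZE_Y slice of the shared array, tying the strides to the compile-time NTHREADS (the same constant used to size the array) keeps the largest index inside the allocation regardless of the launch configuration:

template <unsigned int NTHREADS>
__global__ void kernel_test_fixed()   // hypothetical name for the sketch
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    // The array is sized with the compile-time constant NTHREADS...
    __shared__ float tmp[SIZE_X*SIZE_Y*NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            // ...and the strides use the same constant, so the largest index is
            // 3*NTHREADS*SIZE_Y + 3*NTHREADS + (NTHREADS-1) = 511 for NTHREADS = 32,
            // which stays within the 512-element allocation for any grid size.
            tmp[i*NTHREADS*SIZE_Y + j*NTHREADS + threadIdx.x] = threadIdx.x;
}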

#2
Posted 01/11/2013 01:58 AM   
I concur with alrikai's analysis. @alrikai: If you are on Stackoverflow, it would be great if you could post your answer there. I would be happy to vote it up.

#3
Posted 01/11/2013 03:05 AM   
@njuffa That sounds good, I just did so. Thanks

#4
Posted 01/11/2013 04:07 AM   
Thanks, you got 10 more points now :-)

#5
Posted 01/11/2013 04:27 AM   
Oh I see, nice catch! So the "memcheck" tool of cuda-memcheck cannot catch shared memory errors, and only "racecheck" is able to do that?

#6
Posted 01/11/2013 09:41 AM   
Also, regarding the code: I have been tweaking the original code to test things out.

If one considers the following, which is actually closer to what I originally had (I do not know why I started linearizing everything, since in doing so I even introduced an error...):

__shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

for (unsigned int i = 0; i < SIZE_X; i++)
    for (unsigned int j = 0; j < SIZE_Y; j++)
        tmp[i][j][threadIdx.x] = threadIdx.x;


I get the same error. Yet, shared memory is of size SIZE_X * SIZE_Y * NTHREADS, and I believe that I stay within the memory range. So what is going on here?
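
(As a sanity check on my claim about the range -- my own arithmetic, not something the tool reports -- the largest element touched is tmp[SIZE_X-1][SIZE_Y-1][NTHREADS-1], i.e. a linear offset of (SIZE_X-1)*SIZE_Y*NTHREADS + (SIZE_Y-1)*NTHREADS + (NTHREADS-1) = 3*4*32 + 3*32 + 31 = 511, the last element of the 512-element array, so every access should indeed be in bounds.)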

#7
Posted 01/11/2013 09:49 AM   
Does it happen only for NBLOCKS > NTHREADS?

#8
Posted 01/11/2013 10:49 PM   
alrikai said: Does it happen only for NBLOCKS > NTHREADS?


Apparently yes.

#9
Posted 01/12/2013 12:27 PM   
Hi bchr, can you update to the latest available Linux driver (310.32) and retry your test?

#10
Posted 01/24/2013 12:50 AM   
vyas said: Hi bchr, can you update to the latest available Linux driver (310.32) and retry your test?


I just tested with the latest Linux driver (313.18) and apparently I no longer see the error.

#11
Posted 02/05/2013 03:02 AM   