race conditions inside thrust,cublas calls?

Hello , in my code I am using thrust and cublas functions in order to compute inclusive_scan and transpose.

Checking for race conditions

cuda-memcheck --tool racecheck --racecheck-report analysis ./mycode

it gives me a lot of:

Race reported between Read access at 0x0000e568 in /usr/local/cuda/bin/..//include/thrust/system/cuda/detail/detail/fast_scan.inl:349:void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
=========     and Write access at 0x0000ebb8 in /usr/local/cuda/bin/..//include/thrust/system/cuda/detail/detail/fast_scan.inl:361:void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1) [892 hazards]
......
Race reported between Read access at 0x0000ed50 in /usr/local/cuda/bin/..//include/thrust/system/cuda/detail/detail/fast_scan.inl:349:void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, thrust::device_ptr<float>, thrust::device_ptr<float>, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
=========     and Write access at 0x0000f3a0 in /usr/local/cuda/bin/..//include/thrust/system/cuda/detail/detail/fast_scan.inl:361:void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, thrust::device_ptr<float>, thrust::device_ptr<float>, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1) [4460 hazards]
.....
......

Race reported between Write access at 0x000003f8 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *)
=========     and Write access at 0x00000320 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [384 hazards]
=========     and Write access at 0x00000620 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [47 hazards]
=========     and Write access at 0x000003f8 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [2952 hazards]
=========     and Write access at 0x000004a0 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [1729 hazards]
......
......
Race reported between Write access at 0x00000320 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *)
=========     and Write access at 0x000003f8 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [384 hazards]
=========
.....
.....

Do I have to take these into consideration or not?

Also , in thrust if you can check it shows:

include/thrust/system/cuda/detail/detail/fast_scan.inl:349:
/include/thrust/system/cuda/detail/detail/fast_scan.inl:361

I found these :

https://github.com/cbuchner1/ccminer/blob/master/compat/thrust/system/cuda/detail/detail/fast_scan.inl
http://code.google.com/p/thrust/source/browse/trunk/thrust/detail/device/cuda/detail/fast_scan.inl?repo=wiki&r=cc9f8a792432fe53e994524ebc9b46554d4da176

but I am not sure what to do.

Thanks

Hi ggeo, could you post the complete source of the application (and the build command) that exhibits these races ? Also, could you provide the following :

  1. Which version of the CUDA toolkit and NVIDIA driver is on the system
  2. The OS being used
  3. The GPU being used

Thanks

Hello ,

unfortunately I can’t publish the code ,sorry.

I am compiling with :

nvcc -arch=sm_35 -g -G  -c app.cu -lcublas  -lcudart
  1. CUDA 6.0 , Cuda compilation tools, release 6.0, V6.0.1

and driver

NVIDIA-SMI 331.62 Driver Version: 331.62

  1. Description: Red Hat Enterprise Linux Server release 6.5 (Santiago)
    Release: 6.5

    Kernel : 2.6.32-431.11.2.el6.x86_64

  2. 3D controller: NVIDIA Corporation GK110GL [Tesla K20Xm] (rev a1)
    Subsystem: NVIDIA Corporation Device 097d

Also , some messages from cublas calls:

Race reported between Write access at 0x000003f8 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *)
=========     and Write access at 0x000004a0 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [1816 hazards]
=========     and Write access at 0x00000620 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [180 hazards]
=========     and Write access at 0x00000320 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [384 hazards]
=========     and Write access at 0x000003f8 in void transpose_readWrite_alignment_kernel<float, int=1, bool=0, int=6, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *) [2842 hazards]

Do I have to take into consideration these messages?

Thanks

Thanks ggeo. Could you also provide the following :

  1. The actual numeric parameter values passed into the cublas transpose call
  2. The results of running with “cuda-memcheck --tool racecheck --racecheck-report all --print-level error ./mycode”
  1. I am using a loop where I process every image:
for (int i = 0; i < N; i++) //N number of images
	
     cublasSgeam(handle,CUBLAS_OP_T, CUBLAS_OP_T, Rows , Cols , &alpha ,devRow + i * Rows * Cols ,Cols , &beta ,devRow + i * Rows * Cols , Cols , devTransRow + i * Rows * Cols ,Rows);

Rows = 80 , Cols = 64

cudaMalloc( (void**) &devRow, Rows*Cols*N * sizeof(float));
cudaMalloc( (void**) &devTransRow, Rows*Cols*N * sizeof(float));

The errors remain for 1 image also.

The devRow values are like:

-0.033261
-0.030971
-0.028928
0.078255
0.078208
0.077341
0.076870
0.075201
0.073764
0.071980
0.070063

but every time they are different.I can provide you a binary if you like.

  1. I think I can’t provide you all the results because :
RACECHECK SUMMARY: 616 hazards displayed (616 errors, 20832 warnings)

!

The errors refer to thrust and cublas.

Now, I tried to run some of the examples from thrust examples and I have the same issues

...
Race reported between Read access at 0x00000df8 in void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=160>>>(bool=1)
=========     and Write access at 0x00000ed8 in void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=160>>>(bool=1) [636 hazards]
=========

...
ERROR: Potential WAR hazard detected at __shared__ 0xd10 in block (0, 0, 0) :
=========     Read Thread (32, 0, 0) at 0x00000df8 in void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=160>>>(bool=1)
=========     Write Thread (31, 0, 0) at 0x00000ed8 in void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=160>>>(bool=1)
=========     Current Value : 0, Incoming Value : 0
...
(RACECHECK SUMMARY: 34 hazards displayed (34 errors, 1240 warnings)

Examples : simple_moving_average , scan_by_key

Also , running the simpleDevLibCUBLAS example from cublas , I am receiving :

...
ERROR: Potential RAW hazard detected at __shared__ 0x2101 in block (4, 0, 0) :
=========     Write Thread (15, 7, 0) at 0x00000008 in sgemm_sm35_ldg_nn_64x16x64x16x16
=========     Read Thread (0, 14, 0) at 0x000001d8 in sgemm_sm35_ldg_nn_64x16x64x16x16
=========     Current Value : 0
...
= ERROR: Potential RAW hazard detected at __shared__ 0x2100 in block (0, 1, 0) :
=========     Write Thread (15, 15, 0) at 0x00000008 in sgemm_sm35_ldg_nn_64x16x64x16x16
=========     Read Thread (0, 8, 0) at 0x000001d8 in sgemm_sm35_ldg_nn_64x16x64x16x16
=========     Current Value : 0
....
Race reported between Write access at 0x00000008 in sgemm_sm35_ldg_nn_64x16x64x16x16
=========     and Read access at 0x000001d8 in sgemm_sm35_ldg_nn_64x16x64x16x16 [1594 hazards]
=========     and Write access at 0x00000008 in sgemm_sm35_ldg_nn_64x16x64x16x16 [101403 hazards]
...

RACECHECK SUMMARY: 3514 hazards displayed (3514 errors, 0 warnings)

I also updated driver to 331.79

I am compiling with or without the -g -G .I also tried sm 35 , sm 30 , sm 20.

Thanks

The problem should be solved in Thrust 1.8 . Could you please verify that the race conditions are gone. Thanks!

Hello ,

unfortunately I don’t have access any more to that code, sorry.