kernel index bug?

For a kernel (`__global__` function) like this:

__global__ void ComputeOutput(float * const C, int const num_in)
{
    // Grid-stride loop
    // learned from https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int j_ = blockIdx.x * blockDim.x + threadIdx.x;
         j_ < num_in;
         j_ += blockDim.x * gridDim.x) {
        C[j_] = float(j_);
    }
}

Is it possible that the output C[j] != j?

I encountered this issue: most C[j] equal j, but a few of them do not.

The bug is present even if I launch the kernel with a single thread:

ComputeOutput<<<1,1>>>(  d_C, num_in);

You can reproduce the error using my code:

My environment is Matlab 2017a, Ubuntu 16.04 64-bit, CUDA-8.0, Tesla K80.

Update: I find the error only occurs when j is relatively large (on the order of 16 million). It's common for me to work with values that large.

At around 16 million (2^24 = 16,777,216) you reach the limit of what can be reliably stored in a float quantity, if you want to test for exact equality with an equivalent integer. This is not unique to CUDA. A float has a 24-bit effective significand (23 stored bits plus an implicit leading bit), so every integer up to 2^24 is exactly representable, but beyond that some integers must be rounded. As a simple test, try it with C declared as a double array. A double has a 53-bit significand, so it represents every integer exactly up to 2^53 (about 9×10^15), and you can test for equality well beyond 4 billion.

You may want to learn more about the use of floating point arithmetic in computers.

Floating Point and IEEE 754 :: CUDA Toolkit Documentation

Beyond the above comments, testing for exact equality of floating point values has a variety of challenges.

Thanks! I forgot about floating point precision limit…