System Configuration:
Windows 7
CUDA 8.0.61
Driver 376.51
Nsight 5.2.0.16321
I’ve recently hit an unusual problem debugging simple GPU code in Nsight. A few months ago I wrote a GPU algorithm that ran perfectly well on my 980 Ti. My co-worker recently built the algorithm on his machine with a 1060, but got bounds check errors in CUDA Debug mode. I went back and ran it again on my machine under the same conditions with the 980 Ti, and everything was fine. I then tried a 960, a 1060, and a 1080 in my machine, and ALL of them showed the same bounds check errors. cuda-memcheck reported NO errors on any of the cards.
I’ve created a very simple function that reproduces the weird behavior. All the code does is declare a block of shared memory and then iterate through it. All of the cards I’ve tried list 48 KB as their max shared memory size per block, so 15360 bytes shouldn’t be a problem. Thanks
An example error that gets reported is:
Summary of access violations:
c:\programdata\nvidia corporation\cuda samples\v8.0\0_simple\cppintegration\cppintegration.cu(47): error MemoryChecker: #misaligned=1 #invalidAddress=0
Memory Checker detected 1 access violations.
error = misaligned store (global memory)
gridid = 5
blockIdx = {17,6,0}
threadIdx = {0,0,0}
address = 0x10001000523
accessSize = 1
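As a sanity check on the report above, here is a minimal host-side sketch of the alignment rule I would expect a checker to apply: an access is naturally aligned when its address is a multiple of its size in bytes. The helper `is_aligned` and the constant names are my own, hypothetical names, not part of Nsight or the CUDA runtime; the address, access size, and memory sizes are copied from this post. Under that assumption a 1-byte access can never be misaligned, which is part of why the report looks suspicious to me.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Nsight or CUDA API): an access is naturally
// aligned when its address is a multiple of the access size in bytes.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t access_size)
{
    return access_size != 0 && address % access_size == 0;
}

// Values copied from the Memory Checker report.
constexpr std::uint64_t kReportedAddress = 0x10001000523ULL;
constexpr std::uint64_t kReportedAccessSize = 1;

// Shared memory numbers from the post: 15360 bytes requested per block,
// 48 KB listed as the per-block maximum on every card tried.
constexpr std::uint64_t kRequestedSharedBytes = 15360;
constexpr std::uint64_t kMaxSharedBytes = 48 * 1024;

// Every address is a multiple of 1, so the reported store passes this rule.
static_assert(is_aligned(kReportedAddress, kReportedAccessSize),
              "a 1-byte access is aligned at any address");

// The shared array fits within the per-block limit.
static_assert(kRequestedSharedBytes <= kMaxSharedBytes,
              "shared array exceeds the per-block limit");
```

The same address would fail the rule for a wider access, for example `is_aligned(kReportedAddress, 4)` is false because the address is odd, but the report gives an access size of 1.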
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// CUDA runtime
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_cuda.h>
#include <helper_functions.h>
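__global__ void kernel_big()
{
    const int MEMSIZE = 15360;
    __shared__ unsigned char data[MEMSIZE];
    // Only one thread per block touches the shared array.
    if (threadIdx.x == 0 && threadIdx.y == 0)
    {
        // Read every byte of shared memory (the value is unused).
        for (int i = 0; i < MEMSIZE; i++)
        {
            unsigned char tmp = data[i];
        }
        // Write every byte of shared memory.
        for (int i = 0; i < MEMSIZE; i++)
        {
            data[i] = 42;
        }
    }
    __syncthreads();
}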
////////////////////////////////////////////////////////////////////////////////
//! Entry point for Cuda functionality on host side
////////////////////////////////////////////////////////////////////////////////
extern "C" void
runTest(const int argc, const char **argv)
{
    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    findCudaDevice(argc, (const char **)argv);
    kernel_big<<<dim3(24, 24, 1), dim3(16, 16, 1)>>>();
    // check if kernel execution generated an error
    getLastCudaError("Kernel execution failed");
}