Shared Memory Debug Errors in Nsight
System Configuration:
Windows 7
Cuda 8.0.61
Driver 376.51
Nsight 5.2.0.16321

I've recently run into an unusual problem debugging simple GPU code in Nsight. A few months ago I wrote a GPU algorithm that ran perfectly well on my 980 Ti. My co-worker recently built the algorithm on his machine with a 1060, but was getting bounds check errors in Cuda Debug mode. I went back and ran it again on my machine under the same conditions with the 980 Ti, and everything was fine. I then tried a 960, a 1060, and a 1080 in my machine, and ALL of them showed the same bounds check errors. Cuda-memcheck reported NO errors on any of the cards.

I've created a very simple function that reproduces the weird behavior. All the code does is set up a block of shared memory and then iterate through it. Every card I've tried lists 48 KB as its maximum shared memory per block, so 15360 bytes shouldn't be a problem. The problem doesn't appear on the 980 Ti, but does on a 960, 1060, and 1080.

Thanks
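In case it helps, here is a minimal standalone sketch (not part of the repro below) of how that per-block limit can be queried at runtime; 48 KB is 49152 bytes, so 15360 is well within it:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    int smemPerBlock = 0;

    // Query the per-block shared memory limit of the active device.
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&smemPerBlock,
                           cudaDevAttrMaxSharedMemoryPerBlock, dev);

    printf("Max shared memory per block: %d bytes\n", smemPerBlock);
    return 0;
}
[/code]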

An example error that gets reported is:

Summary of access violations:
c:\programdata\nvidia corporation\cuda samples\v8.0\0_simple\cppintegration\cppintegration.cu(47): error MemoryChecker: #misaligned=1 #invalidAddress=0
================================================================================

Memory Checker detected 1 access violations.
error = misaligned store (global memory)
gridid = 5
blockIdx = {17,6,0}
threadIdx = {0,0,0}
address = 0x10001000523
accessSize = 1
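
One way to sanity-check what that address refers to (a hypothetical probe, not part of the original repro) would be to print the generic address of the shared array from inside the kernel in the listing below and compare it against the address in the report:

[code]
// Hypothetical probe for kernel_big: print the generic address of the
// shared array so it can be compared with the Memory Checker report.
if (threadIdx.x == 0 && threadIdx.y == 0 &&
    blockIdx.x == 0 && blockIdx.y == 0)
{
    printf("data = %p\n", (void *)data);
}
[/code]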



[code]
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// helper functions and utilities to work with CUDA
#include <helper_cuda.h>
#include <helper_functions.h>

__global__ void kernel_big()
{
    const int MEMSIZE = 15360;
    __shared__ unsigned char data[MEMSIZE];

    // Only one thread per block touches the shared buffer.
    if (threadIdx.x == 0 && threadIdx.y == 0)
    {
        // Read pass over every byte of the shared array.
        for (int i = 0; i < MEMSIZE; i++)
        {
            unsigned char tmp = data[i]; // value intentionally unused
        }

        // Write pass over every byte of the shared array.
        for (int i = 0; i < MEMSIZE; i++)
        {
            data[i] = 42;
        }
    }

    __syncthreads();
}

////////////////////////////////////////////////////////////////////////////////
//! Entry point for Cuda functionality on host side
////////////////////////////////////////////////////////////////////////////////
extern "C" void
runTest(const int argc, const char **argv)
{
// use command-line specified CUDA device, otherwise use device with highest Gflops/s
findCudaDevice(argc, (const char **)argv);

kernel_big << <dim3(24, 24, 1), dim3(16, 16, 1) >> >();

// check if kernel execution generated and error
getLastCudaError("Kernel execution failed");

}
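
For what it's worth, adding an explicit synchronize right after the launch makes any runtime-detected fault surface immediately instead of at a later API call. A sketch of that variant (checkCudaErrors also comes from helper_cuda.h):

[code]
kernel_big<<<dim3(24, 24, 1), dim3(16, 16, 1)>>>();
checkCudaErrors(cudaGetLastError());      // launch-configuration errors
checkCudaErrors(cudaDeviceSynchronize()); // faults detected during execution
[/code]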

#1
Posted 04/24/2017 01:21 PM   
Confirmed: it occurs on my 1070 but not on my 980 Ti, and it doesn't show up under cuda-memcheck either. For now I'd suggest using cuda-memcheck to test for memory errors. Sorry for the inconvenience.

#2
Posted 04/25/2017 06:26 AM   
Thanks for confirming the problem, Harry. Hopefully this will get on the list of bugs to fix, because it makes debugging Cuda code more difficult.

#3
Posted 04/25/2017 12:41 PM   
[quote=""]Thanks for confirming the problem Harry. Hopefully this will get on the list of bugs to fix because it makes debugging Cuda code more difficult.[/quote] Hi, I don't see this issue on nsight 5.3, could you please check the latest version?
said:Thanks for confirming the problem Harry. Hopefully this will get on the list of bugs to fix because it makes debugging Cuda code more difficult.


Hi, I don't see this issue on nsight 5.3, could you please check the latest version?

#4
Posted 07/21/2017 03:38 AM   
I'll give it a try and get back to you. Thanks for testing this out.

#5
Posted 07/21/2017 03:11 PM   