Shared Memory Debug Errors in Nsight
System Configuration:
Windows 7
Cuda 8.0.61
Driver 376.51
Nsight 5.2.0.16321

I've recently run into an unusual problem debugging simple GPU code in Nsight. A few months ago I wrote a GPU algorithm that ran perfectly well on my 980 Ti. My co-worker recently built the algorithm on his machine with a 1060, but was getting bounds check errors in Cuda Debug mode. I went back and ran it again on my machine under the same conditions with the 980 Ti, and everything was fine. I then tried a 960, a 1060, and a 1080 in my machine, and ALL of them showed the same bounds check errors. Cuda-memcheck reported NO errors on any of the cards.

I've created a very simple function that reproduces the weird behavior. All the code does is set up a block of shared memory and then iterate through it. Every card I've tried lists 48 KB as its maximum shared memory per block, so 15360 bytes shouldn't be a problem. The problem doesn't appear on the 980 Ti, but does on a 960, 1060, and 1080.

Thanks
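In case it helps, here is a minimal standalone sketch (not part of the repro below) of how that per-block limit can be queried at runtime; 48 KB is 49152 bytes, so 15360 is well within it:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    int smemPerBlock = 0;

    // Query the per-block shared memory limit of the active device.
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&smemPerBlock,
                           cudaDevAttrMaxSharedMemoryPerBlock, dev);

    printf("Max shared memory per block: %d bytes\n", smemPerBlock);
    return 0;
}
[/code]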

An example error that gets reported is:

Summary of access violations:
c:\programdata\nvidia corporation\cuda samples\v8.0\0_simple\cppintegration\cppintegration.cu(47): error MemoryChecker: #misaligned=1 #invalidAddress=0
================================================================================

Memory Checker detected 1 access violations.
error = misaligned store (global memory)
gridid = 5
blockIdx = {17,6,0}
threadIdx = {0,0,0}
address = 0x10001000523
accessSize = 1
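
One way to sanity-check what that address refers to (a hypothetical probe, not part of the original repro) would be to print the generic address of the shared array from inside the kernel in the listing below and compare it against the address in the report:

[code]
// Hypothetical probe for kernel_big: print the generic address of the
// shared array so it can be compared with the Memory Checker report.
if (threadIdx.x == 0 && threadIdx.y == 0 &&
    blockIdx.x == 0 && blockIdx.y == 0)
{
    printf("data = %p\n", (void *)data);
}
[/code]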



[code]
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// helper functions and utilities to work with CUDA
#include <helper_cuda.h>
#include <helper_functions.h>

__global__ void kernel_big()
{
    const int MEMSIZE = 15360;
    __shared__ unsigned char data[MEMSIZE];

    // Only one thread per block touches the shared buffer.
    if (threadIdx.x == 0 && threadIdx.y == 0)
    {
        // Read pass over every byte of the shared array.
        for (int i = 0; i < MEMSIZE; i++)
        {
            unsigned char tmp = data[i]; // value intentionally unused
        }

        // Write pass over every byte of the shared array.
        for (int i = 0; i < MEMSIZE; i++)
        {
            data[i] = 42;
        }
    }

    __syncthreads();
}

////////////////////////////////////////////////////////////////////////////////
//! Entry point for Cuda functionality on host side
////////////////////////////////////////////////////////////////////////////////
extern "C" void
runTest(const int argc, const char **argv)
{
// use command-line specified CUDA device, otherwise use device with highest Gflops/s
findCudaDevice(argc, (const char **)argv);

kernel_big << <dim3(24, 24, 1), dim3(16, 16, 1) >> >();

// check if kernel execution generated and error
getLastCudaError("Kernel execution failed");

}
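
For what it's worth, adding an explicit synchronize right after the launch makes any runtime-detected fault surface immediately instead of at a later API call. A sketch of that variant (checkCudaErrors also comes from helper_cuda.h):

[code]
kernel_big<<<dim3(24, 24, 1), dim3(16, 16, 1)>>>();
checkCudaErrors(cudaGetLastError());      // launch-configuration errors
checkCudaErrors(cudaDeviceSynchronize()); // faults detected during execution
[/code]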

#1
Posted 04/24/2017 01:21 PM   
Confirmed: it occurs on my 1070 but not on my 980 Ti, and it doesn't show up under cuda-memcheck either. For now I'd suggest using cuda-memcheck to test for memory errors. Sorry for the inconvenience.

#2
Posted 04/25/2017 06:26 AM   
Thanks for confirming the problem, Harry. Hopefully this will get on the list of bugs to fix, because it makes debugging Cuda code more difficult.

#3
Posted 04/25/2017 12:41 PM   
[quote=""]Thanks for confirming the problem Harry. Hopefully this will get on the list of bugs to fix because it makes debugging Cuda code more difficult.[/quote] Hi, I don't see this issue on nsight 5.3, could you please check the latest version?
said:Thanks for confirming the problem Harry. Hopefully this will get on the list of bugs to fix because it makes debugging Cuda code more difficult.


Hi, I don't see this issue on nsight 5.3, could you please check the latest version?

#4
Posted 07/21/2017 03:38 AM   
I'll give it a try and get back to you. Thanks for testing this out.

#5
Posted 07/21/2017 03:11 PM   