nvcc -arch sm_20 causes access violations in shared memory

RobadobDE · March 29, 2013, 8:27pm

I’ve been working with some CUDA code I believe to be a couple of years old. It comes as a visual studio solution which I have been moving across to a new solution with 2 other similar models. (I am producing a comparison of 3 pedestrian models).

Anyway, I have been debugging this project, at first I received an Unknown Error killing runtime (nsight later tracked this down to a CUDA_SYNCHRONISE call. After eventually sorting the nvcc parameters to build debug info I was able to find the line of code which causes the Illegal shared memory exception. Prior to using nsight however I had been placing printf’s throughout kernel parts in an attempt to trace down to the cause of the problem.

I know the line of code causing the exception, so as I knew the original solution to work I added the -arch sm20 build parameter and a printf to the same place, to check whether the values being called were the same. This then caused the old solution to have the same runtime error as my code (even with the printf removed). Removing the -arch sm20 parameter allows the code to run again without exception.

Does anyone know where I would be able to find a detailed explanation of the changes that -arch sm20 cause in an attempt to find the cause behind this problem.

For anyone who thinks they might have an idea about the cause, the nsight analysis and breakpointed code are attached below.

CUDA Memory Checker detected 3 threads caused an access violation:
Launch Parameters
    CUcontext    = 07399f60
    CUstream     = 07418fb8
    CUmodule     = 07484750
    CUfunction   = 0ee4fbb8
    FunctionName = _Z26GPUFLAME_avoid_pedestriansP26xmachine_memory_agent_listP41xmachine_message_pedestrian_location_listP40xmachine_message_pedestrian_location_PBMP10RNG_rand48
    GridId       = 194
    gridDim      = {1,1,1}
    blockDim     = {64,1,1}
    sharedSize   = 3200
    Parameters:
        agents = 0x06320000  {_position = {0, 1, 2, -1, -1, -1, -1, -1, -1, -1, ...}, _scan_input = {-25, -5810432, -16777216, -16742457, -16777129, -5773313, -14457, -16777129, -1, -1, ...}, x = {-0.58984375, 0.54296875, 0.83203125, NaN, NaN, NaN, NaN, NaN, NaN, NaN, ...}, y = {-0.90234375, -0.31640625, 0.78515625, NaN, NaN, NaN, NaN, NaN, NaN, NaN, ...}, ...}
        pedestrian_location_messages = 0x08960000  {_position = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}, _scan_input = {0, 1, 2, 3, -842150451, -842150451, -842150451, -842150451, -842150451, -842150451, ...}, x = {-0.58984375, 0.54296875, 0.83203125, 0, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, ...}, y = {-0.90234375, -0.31640625, 0.78515625, 0, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, ...}, ...}
        partition_matrix = 0x08aa0000  {start = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}, end = {-986896, -986896, -986896, -986896, -986896, -986896, -986896, -986896, -986896, -986896, ...}}
        rand48 = 0x098a0000  {A = {...}, C = {...}, seeds = {{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...}}
    Parameters (raw):
         0x06320000 0x08960000 0x08aa0000 0x098a0000
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx      PC  Source
-----------------------------------------------------------------------------------------------
  00000014    16    mis st    s           0       0          {0,0,0}    {0,0,0}  006f90  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:634
  00000044    16    mis st    s           0       1          {0,0,0}    {1,0,0}  006f90  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:634
  00000074    16    mis st    s           0       2          {0,0,0}    {2,0,0}  006f90  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:634


Summary of access violations:
c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu(634): error MemoryChecker: #misaligned=3  #invalidAddress=0
================================================================================

Memory Checker detected 3 access violations.
error = misaligned store (shared memory)
gridid = 194
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x00000014
accessSize = 16

(The line which the code breaks at is marked by >>)

//Using texture cache
  temp_message.x = tex1Dfetch(tex_xmachine_message_pedestrian_location_x, cell_index + d_tex_xmachine_message_pedestrian_location_x_offset); temp_message.y = tex1Dfetch(tex_xmachine_message_pedestrian_location_y, cell_index + d_tex_xmachine_message_pedestrian_location_y_offset); temp_message.z = tex1Dfetch(tex_xmachine_message_pedestrian_location_z, cell_index + d_tex_xmachine_message_pedestrian_location_z_offset); 

	//load it into shared memory (no sync as no sharing between threads)
	int message_index = SHARE_INDEX(threadIdx.x, sizeof(xmachine_message_pedestrian_location));
  printf("prememerror %i, %i, %i\n", d_SM_START, d_PADDING, message_index);
  xmachine_message_pedestrian_location* sm_message = ((xmachine_message_pedestrian_location*)&message_share[message_index]);
>>	sm_message[0] = temp_message;

	return true;

Any input is appreciated, Thanks

RobadobDE · March 29, 2013, 8:36pm

The SHARE_INDEX method from the 2nd code block in the op is defined as (for some reason it won’t let me save any edits to the op).

//PADDING WILL ONLY AVOID SM CONFLICTS FOR 32BIT
//SM_OFFSET REQUIRED AS FERMI STARTS INDEXING MEMORY FROM LOCATION 0 (i.e. NULL)??
__constant__ int d_SM_START;//4 at runtime
__constant__ int d_PADDING;//0 at runtime

//SM addressing macro to avoid conflicts (32 bit only)
#define SHARE_INDEX(i, s) (__mul24((s + d_PADDING), i)+d_SM_START) /**<offset struct size by padding to avoid bank conflicts */

njuffa · March 29, 2013, 8:45pm

I think it is quite likely that the use of -arch=sm_20 is not the cause of the problem, but merely exposes an existing (“latent”) error in the code. Of course, at this stage in the investigation some sort of code generation issue cannot be ruled out, but I consider that less likely. Make sure to use the latest CUDA version (5.0) and recent drivers.

cuda-memcheck complains about an unaligned 16-byte access to an address that is not 16-byte aligned.
I would suggest tracking the address used in the misaligned access backwards from the point of failure to see how this address is being computed. A red flag would for example be the casting of a pointer to a narrower type, say float, to a pointer of a wider type, say float4.

RobadobDE · March 29, 2013, 10:54pm

Following your advice I’ve checked the sizeof() the involved variables/types

message_share is char* which size=4
xmachine_msg type is defined as align(16), where pointer size 4, and struct size is 48

In the original code the assignment of sm_message sets its address to 1000004 which isn’t align16
on changing message_share to char4* the assignment of sm_message sets its address to 1000010 which is align16

This fixes that error, however causes a new one, :(
Thankyou for your insight, I’ll probably be debugging this all night.

P.S.
Do you know what the difference between the Types “mis ld” and “mis st” from the nsight memory watch output (my new bug is ld as opposed to the prior st) as google doesn’t show any relevant results.

njuffa · March 29, 2013, 11:04pm

“ld” = “load”, “st” = “store”. “mis” presumably stands for “misaligned”, but that’s a guess.

RobadobDE · March 29, 2013, 11:24pm

My new error occurs on the line which is calling the function which contained the previous error

if (load_next_pedestrian_location_message(messages, partition_matrix, message->_relative_cell, message->_cell_index_max, message->_agent_grid_cell, message->_cell_index))

The types of the parameters in order are

xmachine_message_pedestrian_location_list* messages
xmachine_message_pedestrian_location_PBM* partition_matrix,
int3 relative_cell, 
int cell_index_max, 
int3 agent_grid_cell, 
int cell_index

whereby the 2 xmachine types are structs defined as

/** struct xmachine_message_pedestrian_location_list
 * Spatial Partitioning
 * Structure of Array for memory coalescing 
 */
struct xmachine_message_pedestrian_location_list
{
    /* Non discrete messages have temp variables used for reductions with optional message outputs */
    int _position [xmachine_message_pedestrian_location_MAX];    /**< Holds agents position in the 1D agent list */
    int _scan_input [xmachine_message_pedestrian_location_MAX];  /**< Used during parallel prefix sum */
    
    float x [xmachine_message_pedestrian_location_MAX];    /**< Message memory variable list x of type float.*/
    float y [xmachine_message_pedestrian_location_MAX];    /**< Message memory variable list y of type float.*/
    float z [xmachine_message_pedestrian_location_MAX];    /**< Message memory variable list z of type float.*/
    
};

/** struct xmachine_message_pedestrian_location_PBM
 * Partition Boundary Matrix (PBM) for xmachine_message_pedestrian_location 
 */
struct xmachine_message_pedestrian_location_PBM
{
	int start[xmachine_message_pedestrian_location_grid_size];
	int end[xmachine_message_pedestrian_location_grid_size];
};

Of these types int and int3 are aligned to 4 and the 2 xmachine types are not marked as aligned. Am I to assume that structs without the align macro are made align(16) anyway?

The new nsight output incase i’ve missed anything;

CUDA Memory Checker detected 3 threads caused an access violation:
Launch Parameters
    CUcontext    = 07bb9f60
    CUstream     = 07708fb8
    CUmodule     = 077746c0
    CUfunction   = 0f1dfc68
    FunctionName = _Z26GPUFLAME_avoid_pedestriansP26xmachine_memory_agent_listP41xmachine_message_pedestrian_location_listP40xmachine_message_pedestrian_location_PBMP10RNG_rand48
    GridId       = 194
    gridDim      = {1,1,1}
    blockDim     = {64,1,1}
    sharedSize   = 3200
    Parameters:
        agents = 0x06320000  {_position = {0, 1, 2, -1, -1, -1, -1, -1, -1, -1, ...}, _scan_input = {-2037007, -2037007, -2037007, -2037007, -2037007, -2037007, -2037007, -2037007, -2037007, -2037007, ...}, x = {-0.58984375, 0.54296875, 0.83203125, NaN, NaN, NaN, NaN, NaN, NaN, NaN, ...}, y = {-0.90234375, -0.31640625, 0.78515625, NaN, NaN, NaN, NaN, NaN, NaN, NaN, ...}, ...}
        pedestrian_location_messages = 0x08960000  {_position = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}, _scan_input = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}, x = {-0.58984375, 0.54296875, 0.83203125, 0, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, ...}, y = {-0.90234375, -0.31640625, 0.78515625, 0, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, -4.3160208e+08, ...}, ...}
        partition_matrix = 0x08aa0000  {start = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}, end = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...}}
        rand48 = 0x098a0000  {A = {...}, C = {...}, seeds = {{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...}}
    Parameters (raw):
         0x06320000 0x08960000 0x08aa0000 0x098a0000
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx      PC  Source
-----------------------------------------------------------------------------------------------
  00000004    16    mis ld    s           0       0          {0,0,0}    {0,0,0}  004c48  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:678
  00000034    16    mis ld    s           0       1          {0,0,0}    {1,0,0}  004c48  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:678
  00000064    16    mis ld    s           0       2          {0,0,0}    {2,0,0}  004c48  c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu:678


Summary of access violations:
c:\users\robadob\dropbox\level3project\code\pedestrianflamegpu\vectorfieldpedmodel\src\dynamic\flamegpu_kernals.cu(678): error MemoryChecker: #misaligned=3  #invalidAddress=0
================================================================================

Memory Checker detected 3 access violations.
error = misaligned load (shared memory)
gridid = 194
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x00000004
accessSize = 16

RobadobDE · March 29, 2013, 11:29pm

Again it won’t let me edit my post, if it is as I assume above. I find it counter intuitive that both structs are untouched from being passed into the current method (containing the break), surely the break should have occurred at entry to this method rather than now.

On editing the 2 structs to be align(16) I’m receiving the same error, so I’ll continue with tracing them backwards.

RobadobDE · March 30, 2013, 12:35am

After realising that all my alignment issues were caused by

char* message_share

should be

char4* message_share

The program now runs through the cuda debugger without issue!
However on running the program outside of the cuda debugger the program fails with ‘unknown error’ (as it previously did before my sm20 issue) at the CUT_CHECK_ERROR after the kernel call;

GPUFLAME_avoid_pedestrians<<<grid, threads, sm_size>>>(d_agents, d_pedestrian_locations, d_pedestrian_location_partition_matrix, d_rand48);

Guess I need to play with the debugger settings until this error appears.

RobadobDE · March 30, 2013, 1:01am

Using the nsight analysis and not the memory debugger it picks up a single error

Call ID|Name                 |CudaError|Start Time (µs)|Duration (µs)|Context ID|Process ID|Thread ID
2263   |cudaDeviceSynchronize|30       |12,862,697.811 |4,551.503    |31        |6244      |6384

I’m unsure how to trace that error though, however I should probably ask in the nsight board, as its now become and nsight specific problem.

Thanks