SASS, LDS.128, LD.128 and DRAM allocation

Hi

I have many questions about binary code. I used cuobjdump to obtain the SASS code, and in it I found the instruction “LDS.128”. This instruction is generated automatically by the compiler only for shared memory, not for global memory, even though I have the same memory access pattern in both cases. I have a for loop of N iterations with a stride-8 memory access pattern. Here is an example:

int index = threadIdx.x;
int globalindex = blockDim.x * blockIdx.x + threadIdx.x;

if (globalindex < NUM_YUV * NUM_BLOCK_PER_IMAGE * 8) {
    // load data into shared memory
    for (int y = 0; y < 8; y++)
        matrix_f[index][y] = img[y + globalindex * 8];

where img resides in global memory and matrix_f in shared memory.

The SASS code generated is the following:

LD R8, [R7];
LD R9, [R7+0x4];
LD R10, [R7+0x8];
LD R11, [R7+0xc];
LD R4, [R7+0x10];
LD R5, [R7+0x14];
LD R6, [R7+0x18];
LD R7, [R7+0x1c];
STS.128 [R2], R8;
STS.128 [R2+0x10], R4;

The questions are:

  1. Why is this optimization (STS.128) applied only to shared memory and not also to global memory?
  2. Why only 128 bits, and not also 256 or 512? Is this related to the bus width?
  3. Is there a guide for SASS code? Where can I find documentation about this language?

Another question is about DRAM allocation. If, for example, I have 6 DRAM controllers, each 64 bits wide, and a vector of 32 integers, are these contiguous integers spread across the DRAM blocks, or do all 32 integers reside in the same DRAM block?

Thank you

Asking why the compiler applies a certain optimization calls for speculation without detailed knowledge of the internals of the compiler in question, and the exact complete code being compiled.

Keep in mind that on the GPU, all accesses must be naturally aligned. Wide loads can only be used when that alignment can be guaranteed. It may be easier for the compiler to “prove” proper alignment in this particular case of a shared memory object. The GPU hardware does not provide loads/stores wider than 128 bits.

Other than very brief descriptions of SASS instructions, NVIDIA does not, to my knowledge, make details of SASS available; note that the machine language of each major GPU family is different. One can reasonably reverse-engineer much of the machine language just by looking at disassembled machine code.
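
To make the alignment point concrete, here is a rough sketch (the kernel and array names are made up, and whether the compiler actually emits the 128-bit forms depends on the architecture and compiler version):

__global__ void sketch(const float *in)
{
    // The compiler places this array itself, so the base address is known at
    // compile time and can be forced to 16-byte alignment. That is exactly
    // the guarantee needed to make STS.128 / LDS.128 legal for a group of
    // four consecutive floats starting at a multiple of four.
    __shared__ __align__(16) float buf[4 * 256];

    // For this pointer the alignment is only known at run time, so a single
    // 128-bit load of in[i..i+3] cannot be used unless the compiler can
    // prove (or is told) that &in[i] is 16-byte aligned.
    int i = 4 * threadIdx.x;
    buf[i + 0] = in[i + 0];
    buf[i + 1] = in[i + 1];
    buf[i + 2] = in[i + 2];
    buf[i + 3] = in[i + 3];
    __syncthreads();
    // ... the rest of the kernel would use buf ...
}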

Probably difficult to answer without a short, complete code example. It may be difficult to answer even then. I’m actually pretty surprised that the compiler made that optimization even in the shared-memory case.

CUDA fundamentally is limited to a maximum transaction size per thread of 128 bits. This is covered in the documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

(128 bits = 16 bytes)
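
A small sketch of what that 128-bit ceiling means at the source level (kernel names are made up; the exact SASS emitted depends on the GPU generation and compiler version):

// float4 is 16 bytes, i.e. exactly the maximum per-thread access width:
// the copy below can typically be done with one 128-bit load and one
// 128-bit store per element.
__global__ void copy_f4(const float4 *in, float4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// double4 is 32 bytes, wider than any single access the hardware supports,
// so the same copy is typically split into two 128-bit loads and two
// 128-bit stores per element.
__global__ void copy_d4(const double4 *in, double4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}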

The only officially published “guide” I am aware of is here:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref

You can also learn things inferentially about SASS by studying PTX:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#abstract

However, if you study the maxas work by Scott Gray, you will likely learn a lot more about SASS than from anything in the NVIDIA documentation:

https://github.com/NervanaSystems/maxas

AFAIK the exact storage patterns are not defined and in fact may vary from GPU generation to generation.

DRAM has a certain “transaction size” and so data within the transaction size will be contiguous per memory controller. However a larger data region will generally be “swizzled” or “hashed” in some non-obvious pattern across memory controllers, so as to provide full bandwidth while avoiding partition camping effects for common data access patterns (e.g. contiguous access, columnar access, etc.)

I am not an expert on GPU memory design. My statements about hashing and partition camping come from this work, which itself is quite dated:

http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf

Thanks a lot, Njuffa and txbob. I will now read the linked documents.

In the case of shared memory, unless it is dynamically sized, the compiler can easily establish alignment as the starting address of each object is known at compile time. It could even actively force suitable alignment by placing the object in shared memory appropriately, but I don’t have evidence that this is occurring. As long as register pressure does not indicate otherwise, vectorizing aligned accesses should generally benefit performance.

In the case of data objects in global memory, addresses are mostly not known until run time, precluding simple vectorization of accesses, though it may be achievable by unrolling and/or cloning of loops, checking addresses at run time, and handling end cases. I don’t know that nvcc performs such extensive code reorganization just to be able to use wide accesses.
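
Just to illustrate the kind of reorganization meant here (this is hand-written; I am not claiming nvcc does this): a run-time alignment check can select a vectorized fast path, with a scalar path covering misaligned pointers and the tail elements.

#include <cstdint>

// Hypothetical copy of n ints that uses 128-bit accesses only when both
// pointers turn out to be 16-byte aligned at run time.
__global__ void copy_ints(const int *src, int *dst, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    bool aligned = (((uintptr_t)src | (uintptr_t)dst) & 0xF) == 0;
    if (aligned) {
        int n4 = n / 4;                                  // whole int4 chunks
        const int4 *s4 = reinterpret_cast<const int4 *>(src);
        int4       *d4 = reinterpret_cast<int4 *>(dst);
        for (int k = i; k < n4; k += stride)
            d4[k] = s4[k];                               // 128-bit loads/stores
        for (int k = n4 * 4 + i; k < n; k += stride)
            dst[k] = src[k];                             // end cases
    } else {
        for (int k = i; k < n; k += stride)
            dst[k] = src[k];                             // scalar fallback
    }
}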

Thanks, good insight. It wasn’t the alignment part that I was confused about, but the fact that the compiler could vectorize disparate accesses.

I was aware that the compiler could break a larger access into multiple smaller accesses (e.g. to satisfy native alignment), but I wasn’t aware that it could easily combine disparate accesses into a single vectorized instruction. I had always assumed you would have to coax the compiler to generate a vectorized load or store by making it obvious how to do so at the C source code level (by using vector types, at least). Cool.

I don’t know which part of the compiler actually performs the particular vectorization performed here. This is why all “analysis” as to why things work the way they do would have to be speculative.

To first order a load is a load is a load, regardless of which data object it belongs to, so the load combining is presumably purely mechanical. However, to guarantee correct operation when merging loads from multiple data objects the compiler would probably have to establish that there is no aliasing, which again is typically much easier for shared memory objects.
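
For completeness, the usual way to hand the compiler that no-aliasing guarantee at the source level is __restrict__ (sketch only; whether it changes the generated SASS in any given case is up to the compiler):

// Without __restrict__ the compiler has to assume out may overlap a or b,
// which can keep it from reordering or merging the loads and stores.
__global__ void blend(const float *__restrict__ a,
                      const float *__restrict__ b,
                      float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * (a[i] + b[i]);
}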

In general it is best to use short-vector data types in the source code to achieve vectorized loads and stores. Optimizations can change over time, and might even be disabled based on cost/benefit analysis (e.g. average increase in compile time vs average run-time improvement).
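
Applied to the loop from the original post, that advice would look roughly like this (THREADS_PER_BLOCK is a made-up block size, and the sketch assumes img comes from cudaMalloc, which guarantees alignment far beyond 16 bytes):

#define THREADS_PER_BLOCK 128        // hypothetical block size for this sketch

__global__ void load_rows(const int *img, int n)
{
    // Same storage as "int matrix_f[THREADS_PER_BLOCK][8]", viewed as int4.
    __shared__ int4 matrix_f4[THREADS_PER_BLOCK][2];

    int index       = threadIdx.x;
    int globalindex = blockDim.x * blockIdx.x + threadIdx.x;

    if (globalindex < n) {
        // Each row starts at a 32-byte offset from img, so if img itself is
        // 16-byte aligned, both int4 accesses below are properly aligned and
        // eligible for 128-bit loads and stores.
        const int4 *img4 = reinterpret_cast<const int4 *>(img) + globalindex * 2;
        matrix_f4[index][0] = img4[0];   // ints 0..3 of the row
        matrix_f4[index][1] = img4[1];   // ints 4..7 of the row
    }
    __syncthreads();
    // ... the rest of the kernel would use matrix_f4 ...
}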

The way I learned SASS is:

  1. the PTX manual: http://docs.nvidia.com/cuda/parallel-thread-execution/
  2. http://docs.nvidia.com/cuda/cuda-binary-utilities/#instruction-set-ref
  3. decuda and cudasm, low-level tools for NVIDIA G80 GPUs: https://github.com/laanwj/decuda
  4. the wiki of the asfermi project: https://github.com/hyqneuron/asfermi/wiki
  5. the Kepler SASS assembler manual: https://hpc.aliyun.com/doc/keplerAssemblerUserGuide
  6. there is also maxas, but its docs don’t describe the individual instructions

128-bit (16-byte) alignment is guaranteed for data types like float4, so you can try converting to these types to let the compiler know that you are making aligned accesses.

AFAIR, each successive 256 bytes goes to the next memory controller; it may be 128 or 512 bytes on some GPUs. This may be described in “Dissecting GPU Memory Hierarchy through Microbenchmarking” (hgpu.org) as an interpretation of the address bits; I don’t remember exactly.
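
To make the “interpretation of address bits” idea concrete, here is a toy host-side model (the 256-byte granularity and the plain round-robin mapping are assumptions for illustration; real GPUs hash the bits in undocumented ways, as noted above):

#include <cstdint>
#include <cstdio>

// Hypothetical mapping: consecutive 256-byte chunks go to consecutive
// controllers in round-robin order. Real hardware uses an undocumented
// hash, so this only illustrates what interpreting address bits means.
int controller_of(uintptr_t addr,
                  unsigned granularity = 256,   // assumed chunk size in bytes
                  unsigned controllers = 6)     // e.g. 6 x 64-bit controllers
{
    return (int)((addr / granularity) % controllers);
}

int main()
{
    // 32 contiguous ints are 128 bytes: under this model they all fall into
    // one 256-byte chunk, i.e. behind a single controller (depending on the
    // starting offset they could also straddle two chunks).
    for (uintptr_t offset = 0; offset < 32 * sizeof(int); offset += sizeof(int))
        printf("int at offset %3u -> controller %d\n",
               (unsigned)offset, controller_of(offset));
    return 0;
}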