Do const __restrict__ pointers *ever* generate LDG.CI loads on CUDA 7?

I’m quite clear on what __ldg() does and when to use it, but am surprised to see that, as far as I can tell, none of my dutifully decorated pointers across a large CUDA codebase generate an LDG.CI.

On CUDA 7 and sm_50, it appears that an explicit call to __ldg() is required to generate the expected constant load.
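For example, forcing the read-only path explicitly does produce the expected instruction (a minimal sketch; the kernel name and shape are illustrative, not from my codebase):

__global__ void copy_k(const double * __restrict__ in, double *out, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);   // should compile to LDG.E.CI.64 on sm_50
}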

Is this known? The compiler’s “likelihood” of detecting the read-only opportunity on its own appears to be approaching zero.

So is there a difference between CUDA 6.5 and CUDA 7.0 in this regard?

Is looking at the verbose compile output a valid indicator of a constant load?

For example:

1>0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 48 registers, 2884 bytes smem, 376 bytes cmem[0], 12 bytes cmem[2]
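(For reference, that resource summary comes from running ptxas in verbose mode; a typical invocation, with the architecture and file name as placeholders:)

$ nvcc -arch=sm_50 -Xptxas -v -c kernel.cu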

What is the difference between cmem[0] and cmem[2] ?

I don’t observe this, at least casually (1 data point).

When I compile the code in the answer here:

http://stackoverflow.com/questions/27239835/parallelizing-a-for-loop-1d-naive-convolution-in-cuda

(the one that has been decorated with const __restrict__) for either sm_35, sm_50, or sm_52 on CUDA 7 RC, and dump the SASS, I find LDG.E.CI instructions.

A fully worked example, based on the code above:

$ cat test.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

// RG*RG*MAXN must fit within mytype

#define MAXN 100000
#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256

typedef double mytype;

void conv(const mytype *A, const mytype *B, mytype* out, int N) {

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}



__global__ void conv_Kernel2(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *out, const int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      mytype my_sum = 0;
      for (int i = 0; i < N; i++)
        if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
      out[idx] = my_sum;
    }
}

int main(int argc, char *argv[]){


  mytype *h_A, *d_A, *h_result, *d_result, *result, *h_B, *d_B, *A, *B;
  if (argc != 2) {printf("must specify N on the command line\n"); return 1;}
  int my_N = atoi(argv[1]);
  if ((my_N < 1) || (my_N > MAXN)) {printf("N out of range\n"); return 1;}
  B   = (mytype *)malloc(my_N*sizeof(mytype));
  A   = (mytype *)malloc(my_N*sizeof(mytype));
  h_A = (mytype *)malloc(my_N*sizeof(mytype));
  h_B = (mytype *)malloc(my_N*sizeof(mytype));
  h_result = (mytype *)malloc(2*my_N*sizeof(mytype));
  result   = (mytype *)malloc(2*my_N*sizeof(mytype));

  cudaMalloc(&d_B, my_N*sizeof(mytype));
  cudaMalloc(&d_A, my_N*sizeof(mytype));
  cudaMalloc(&d_result, 2*my_N*sizeof(mytype));

  for (int i=0; i < my_N; i++){
    A[i] = rand()%RG;
    B[i] = rand()%RG;
    h_A[i] = A[i];
    h_B[i] = B[i];}

  for (int i=0; i < 2*my_N; i++){
    result[i]   = 0;
    h_result[i] = 0;}

  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, result, my_N);
  cpu_time = dtime_usec(cpu_time);

  cudaMemset(d_result, 0, 2*my_N*sizeof(mytype));

  unsigned long long gpu_time = dtime_usec(0);
  cudaMemcpy(d_A, h_A, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  conv_Kernel2<<<((2*my_N)-1+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_result, my_N);  // round up to cover all (2*N)-1 outputs
  cudaDeviceSynchronize();
  cudaMemcpy(h_result, d_result, 2*my_N*sizeof(mytype), cudaMemcpyDeviceToHost);
  gpu_time = dtime_usec(gpu_time);

  for (int i = 0; i < 2*my_N; i++) if (result[i] != h_result[i]) {printf("mismatch2 at %d, cpu: %f, gpu %f\n", i, result[i], h_result[i]); return 1;}
  printf("Finished.  Results match.  cpu time: %lluus, gpu time: %lluus\n", cpu_time, gpu_time);
  printf("cpu/gpu = %f\n", cpu_time/(float)gpu_time);
  return 0;
}
$ nvcc -arch=sm_50 -o test test.cu
$ cuobjdump -sass test |grep LDG
        /*0138*/               @P0 LDG.E.CI.64 R12, [R2];                                      /* 0xeed5a0000000020c */
        /*0148*/               @P0 LDG.E.CI.64 R6, [R4];                                       /* 0xeed5a00000000406 */
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Tue_Dec__9_18:10:46_CST_2014
Cuda compilation tools, release 7.0, V7.0.17
$

@CudaaduC, the PTXAS output won’t help. You can only tell whether LDG.CIs are being generated by dumping the SASS.
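(If it helps, besides the cuobjdump invocation shown above, you can also disassemble a cubin directly; the file names here are placeholders:)

$ nvcc -arch=sm_50 -cubin -o test.cubin test.cu
$ nvdisasm test.cubin | grep LDG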

@txbob, thanks for the simple counter-example to reassure me that they’re being generated.

I dug up code from a previously filed LDG bug report and added another test case and… now have a reproducer.

Filing a bug (#1617935).

Synopsis:

Packaging up your properly decorated pointers into a const struct, which is then passed to the kernel, appears to make the compiler ignore the const __restrict__ qualifiers:
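A minimal sketch of the pattern (illustrative only; the names Args, axpy_struct, and axpy_plain are hypothetical stand-ins, not the actual bug-report code):

struct Args {
    const float * __restrict__ x;
    const float * __restrict__ y;
    float *out;
    int n;
};

// pointers packaged in a struct: the SASS reportedly shows plain LDG.E
// (no .CI), even though the members carry const __restrict__
__global__ void axpy_struct(const Args a) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < a.n) a.out[i] = 2.0f * a.x[i] + a.y[i];
}

// the same pointers as plain kernel parameters: LDG.E.CI as expected
__global__ void axpy_plain(const float * __restrict__ x,
                           const float * __restrict__ y,
                           float *out, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) out[i] = 2.0f * x[i] + y[i];
}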

That’s a subtle bug catch!
NVIDIA should award you two T-shirts for that bug report instead of just one as usual. Or make it a different color to stand out at GTC.

I am not enough of a C/C++ language lawyer to say, but in my experience the various modifiers/qualifiers (const, volatile, restrict) have non-obvious semantics once they are used in non-trivial situations. I learned that the hard way. As best I recall, packaging up attributed pointers in a struct is one of those non-trivial uses that results in unintuitive semantics. Even after re-reading the relevant standards and talking to compiler engineers, I never developed a comprehensive understanding of these attributes.

So while it is entirely possible that the compiler misses a chance for optimization here, I think it is also possible that the semantics prescribed by C/C++ simply do not allow the compiler to extract all the information it needs to determine that use of LDG is in fact safe.

Filing a bug is of course an excellent course of action, as it will bring clarity about the situation. Either the observed behavior will be accepted as a bug, or the desired behavior will be accepted as an RFE, or you will get feedback that everything works as it should.

For now I would suggest sticking the const and restrict modifiers on simple pointer arguments to functions and leaving it at that.

I was waiting for a lawyer to point out the exact line of the C++ spec where my expectations were incorrect, but it looks like I still have legal standing. :)

Until then, there are two reasons to use the idiom I’m flogging:

  1. Clarity: it packages the constant kernel args under one name.
  2. Efficiency: one marshalling step on the host instead of one per kernel argument (see the sketch after this list).
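As a hedged host-side illustration, reusing the hypothetical Args struct from the sketch above (d_x, d_y, d_out, grid, and block are placeholders):

    Args a;
    a.x = d_x; a.y = d_y; a.out = d_out; a.n = n;   // marshal once on the host...
    axpy_struct<<<grid, block>>>(a);                // ...and pass a single argument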

The defense rests!

@SPWorley, I’m definitely a redshirt when it comes to CUDA bugs.

Posting pictures of code seems inconvenient. Perhaps that is your intent - you don’t want your code to be searchable or easily copyable?

Or is there a clever way to copy the code from that picture if I wanted to play with it?

Nope, that is never my intent.

I usually throw snippets up on gist but it was easier to cut and paste color-coded text since the actual bug report, which has a link, also has the code.

You can now find the snippet here.

It would be great if someone could find a quick workaround!
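One candidate workaround, building on the observation upthread that an explicit __ldg() call always generates the constant load (an untested sketch, again reusing the hypothetical Args struct):

__global__ void axpy_struct_ldg(const Args a) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < a.n)
        a.out[i] = 2.0f * __ldg(&a.x[i]) + __ldg(&a.y[i]);   // forces LDG.E.CI
}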