GPU cache coherence problem

Hello,

I am trying to understand how a GPU manages its L2 cache, so I ran an experiment to check when the cache becomes coherent with memory on the GPU.

Two kernels are launched, one on each of two GPUs (GPU0 and GPU1). Each kernel starts a single thread. The thread on GPU0 reads a value from memory and spins in a loop, checking whether the value has been modified; once it changes, the loop exits. The thread on GPU1 writes a different value to the same element; to make sure the write lands, it repeats the store 10000000 times in a for loop. "src" is allocated on GPU0 by cudaMalloc, and peer access is enabled on both GPU0 and GPU1. The kernel code is shown below:

__global__ void SimpleKernel(float *src, int gpuid)
{
  float a = src[0];
  if(gpuid==0) {
    while(src[0]==a);
    printf("data is changed to %f\n", src[0]);
  } else {
    for(int i=0; i<10000000; i++) {
      src[0] = 1111111;
    }
    printf("modify remote data %f\n", src[0]);
  }
}

Although there is a data race, I expected GPU0 to be able to see the changed value, but in my test GPU0 never observed the new value. It seems GPU0 keeps reading the data from its L2 cache, and the cache is not invalidated when the value is modified. When would the L2 cache be flushed? Is there any hardware coherence mechanism to guarantee correctness?

Thank you so much!

You don't seem to be considering the GPU L1 cache. The L1 cache is not necessarily coherent with the L2 cache. Perhaps more to the point, in the general case, the compiler is free to optimize loads into registers.

Try marking your pointer with volatile:

$ cat t1508.cu
#include <stdio.h>

__global__ void SimpleKernel(volatile float *src, int gpuid)
{
  float a = src[0];
  if(gpuid==0) {
    while(src[0]==a);
    printf("data is changed to %f\n", src[0]);
  } else {
    for(int i=0; i<1; i++) {
      src[0] = 1111111;
    }
    printf("modify remote data %f\n", src[0]);
  }
}

int main(){

  float *data;
  cudaSetDevice(0);
  cudaMalloc(&data, sizeof(float));
  cudaMemset(data, 0, sizeof(float));
  cudaDeviceEnablePeerAccess(1, 0);
  cudaSetDevice(1);
  cudaDeviceEnablePeerAccess(0, 0);
  cudaSetDevice(0);
  SimpleKernel<<<1,1>>>(data, 0);
  cudaSetDevice(1);
  SimpleKernel<<<1,1>>>(data, 1);
  cudaDeviceSynchronize();
  cudaSetDevice(0);
  cudaDeviceSynchronize();
  return 0;
}

$ nvcc -arch=sm_35 -o t1508 t1508.cu
$ CUDA_VISIBLE_DEVICES="2,3" ./t1508
modify remote data 1111111.000000
data is changed to 1111111.000000
$

Without volatile, the above code hangs after the first line of printout, according to my testing on my setup.

CUDA 10.1, CentOS7, dual K20 in peer-able relationship

Note that I'm not doing any proper CUDA error checking in this code. If you have trouble when you run it on your machine, my first recommendation would be to run it under cuda-memcheck.
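As an aside, here is a minimal sketch of the kind of error checking omitted above (my own addition, not part of the original program; the macro name cudaCheck is arbitrary):

#include <cstdio>
#include <cstdlib>

// Wrap each runtime API call, e.g.:
//   cudaCheck(cudaMalloc(&data, sizeof(float)));
#define cudaCheck(call)                                            \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",              \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)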


Hello,

Since NVLink supports cache coherence, this should not be a problem. How can I use NVLink's cache coherence instead of "volatile"?

Thanks!

I'm not sure you grasped either of my statements about L1 caching or about compiler optimizations.

Let's leave L1 aside for the moment.

If the compiler is allowed to take this:

while(src[0]==a);

and do this:

LOAD R0, src[0]
S100:
COMPARE  R0, R1
BRANCH_IF_NOT_EQUAL S100
...

I'm not aware of any method to fix that with hardware, of any kind. I've never heard any definition of "cache coherence" that means that a change in memory (or L2) contents also affects processor register contents. I'm not aware of any processor that works that way. Register contents are only modified when that register is loaded via an instruction (leaving aside things like status and flag registers).

If your expectation is that "NVLink coherence" implies that the above situation will be resolved, that is incorrect. In short, NVLink coherence is not a replacement for volatile, and cannot be used as a replacement for volatile.

On GPUs, volatile doesn't mean that the L2 is bypassed. It means that the L1 is bypassed, and it also prevents the kind of compiler optimizations/hazards indicated above. The L2 is never bypassed, so the fact that the code works means that the L2 eventually received the update that was written by the external processor to the local memory.
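To illustrate the distinction, a sketch of my own (not from the original post): the volatile access below compiles to a ld.volatile.global that is re-issued on every iteration and not served from L1, whereas a cache-hint intrinsic such as __ldcg() (available on newer architectures, if I recall correctly) also loads via L2 only, but is not a compiler barrier, so it is not a substitute for volatile in a polling loop:

// Spins until src[0] differs from its initial value. The volatile
// qualifier forces a fresh global load (bypassing L1) on each
// iteration; the L2 remains in the path, which is fine because the
// L2 eventually observes the peer write.
__device__ void spin_until_changed(volatile float *src, float initial)
{
    while (src[0] == initial);
}

// By contrast, __ldcg() only hints "cache at the global (L2) level";
// the compiler may still keep the loaded value in a register across
// loop iterations, so this alone does NOT fix the polling hazard
// shown in the pseudo-assembly above.
__device__ float load_l2_only(const float *src)
{
    return __ldcg(src);
}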

Hello!

"The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache). These optimizations can be disabled using the volatile keyword." This is from the CUDA C Programming Guide.

So does that mean the compiler will not cache X and Y in a private cache (the L1 cache) when I mark the variables (X, Y) with volatile?

If my understanding is wrong, where can I learn more about these compiler optimizations?

Yes, currently, marking something as volatile should disable the L1 cache for loads/stores of that item.
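One way to check this yourself (my suggestion, not from the original posts) is to compile to PTX and inspect the load/store opcodes:

$ nvcc -arch=sm_35 -ptx t1508.cu -o t1508.ptx
$ grep volatile t1508.ptx

Accesses through the volatile pointer should appear as ld.volatile.global.f32 / st.volatile.global.f32 rather than plain ld.global.f32 / st.global.f32 (exact registers and operands will vary).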

To the best of my knowledge, 'volatile' in CUDA has the same semantics it has in C/C++: a data object so annotated may be modified by an agent outside the scope of the code in which it was declared. As a corollary, any use of it must result in an access to the memory location underlying that data object.

So the loop in the example of post #4 would have to be structured something like this:

LOAD R0, src[0]
S100:
COMPARE  R0, R1
BRANCH_IF_NOT_EQUAL S200
[...]
LOAD R0, src[0]  // retrieve contents of src[0] on every loop iteration
BRANCH S100
S200:

Classical examples of volatile data objects in a PC environment are memory-mapped hardware status registers and memory locations updated by an interrupt service routine.

Applying the 'volatile' modifier to a data object is typically a necessary, but not sufficient, condition in situations where multiple agents (e.g. threads, processors) operate on the data object.

Specifically on processors with a cache hierarchy, an access to the memory location underlying a data object may or may not retrieve the latest data stored in physical memory. It depends on what coherency mechanisms are in place between levels of the memory hierarchy. Additional explicit cache-bypassing or cache-invalidating measures may need to be taken to retrieve the latest data.

Where data objects are shared by multiple agents, access to them may also need to be explicitly coordinated between them, often by means of a lock or mutex.
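To make the "necessary, but not sufficient" point concrete, here is a minimal one-way handoff sketch (my own illustration; producer, consumer, payload, and ready are hypothetical names, not from this thread). volatile keeps the flag accesses from being optimized away, but a fence is still needed to order the payload write before the flag write:

#include <cstdio>

__global__ void producer(volatile float *payload, volatile int *ready)
{
    *payload = 42.0f;
    __threadfence_system();  // order the payload store before the flag
                             // store for all observers, including other
                             // devices and the host
    *ready = 1;
}

__global__ void consumer(volatile float *payload, volatile int *ready)
{
    while (*ready == 0);     // volatile: the flag is re-loaded on every
                             // iteration instead of being cached in a register
    printf("payload = %f\n", *payload);
}

Newer CUDA versions also provide cuda::atomic in libcu++, which expresses this kind of handoff with explicit acquire/release semantics instead of volatile plus a fence.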

Hello!

If the variable X is declared with volatile as in the following, does a thread fetch X directly from shared memory or global memory? Or does the thread get X from the L2 cache?

Example:

__device__ volatile int X = 1;

or

__shared__ volatile int X;