L1-L2-global: how to clearly describe their interaction for a given kernel

Hi. I was wondering what clear statements one can make about a program regarding its reads and writes.

More precisely, this is my problem: in a 2D problem, one kernel overwrites the bottom (horizontal) boundary values and another kernel overwrites the left (vertical) boundary values; by overwrite I mean writing without reading the original value. How can I describe the behavior of the cache in each case?

Here’s some ASCII art of the data:

 y
  _____________
 |_|_|_|_|_|_|_|
 |_|_|_|_|_|_|_|
 |_|_|_|_|_|_|_|
 |_|_|_|_|_|_|_|
 |_|_|_|_|_|_|_|
 |_|_|_|_|_|_|_|
0|_|_|_|_|_|_|_| x
  0
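To make this concrete, here is a rough sketch of the two kernels I have in mind (the names and indexing are just for illustration; I am assuming the array is stored row-major with width nx):

// Hypothetical sketch: one thread per boundary element, row-major storage.
__global__ void set_bottom_boundary(float *a, int nx, float value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < nx)
        a[x] = value;        // row y = 0: consecutive 4-byte addresses
}

__global__ void set_left_boundary(float *a, int nx, int ny, float value)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < ny)
        a[y * nx] = value;   // column x = 0: addresses nx * 4 bytes apart
}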

This is what I am thinking, and most of it is just speculation since I am not very familiar with how caches work. First, say the 2D data set is large enough (larger than warpSize x warpSize), and the elements are floats, so 4 bytes each. Then a warp of the horizontal kernel should need only a minimal number of writes, as the data being written is contiguous in memory. The data needs to move from L1 to L2 and then to global memory; the cache line between L1 and L2 is 128 bytes and all the changes fit in it, so that is a single write, and from L2 to global the cache line is also 128 bytes, so the write happens once. For the vertical kernel, the involved elements sit far apart, but the L1-to-L2 write can perhaps still happen once; writing to global memory, however, will have to be done 32 times.

Is this anywhere near what is actually happening? How do cached writes compare to cached reads? I am not so sure about the vertical kernel’s L1-to-L2 write happening in a single step (as the cache is supposed to be an image of the memory); it might also be 32 separate writes. Also, I am not sure about the cache line sizes between the three levels.

I would really appreciate it if you could point out my misconceptions and, if possible, describe the cache’s behavior.

Thanks

Are there really no ideas in the forum?

Where do you think I can look to be able to say meaningful things about the cache behavior? I am not entirely sure what to make of the programming guide’s cache description; it doesn’t answer my questions directly, and I don’t want to rely on speculation. If you could help me out with the following, it would be a start:

First, the cache line size between each layer: global-L2-L1.
Can the L2 serve two different L1s simultaneously?
Does a new write need to reach global memory before it can be read, or can it be served from the L2?

I don’t see how the vertical kernel can group writes, as each element will be in a different L1 cache line. The cache line size of L1 is 128 bytes, and the cache line size of L2 is 32 bytes. This is why using PTX modifiers to skip the L1 cache can sometimes be beneficial.
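For example, something along these lines should issue stores that are cached in L2 only (just a sketch, assuming a 64-bit build and the PTX .cg cache operator; I haven’t measured whether it actually helps in your case):

__device__ void store_float_L2_only(float *p, float v)
{
    // st.global.cg: cache the store at the global (L2) level, bypassing L1.
    asm volatile("st.global.cg.f32 [%0], %1;" :: "l"(p), "f"(v) : "memory");
}

For loads, the same behavior can be requested for a whole compilation unit with nvcc -Xptxas -dlcm=cg.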

I don’t know if the L2 can service multiple requests simultaneously. I don’t think that writes have to be flushed back to global memory before being read, because atomic operations became very fast in Fermi due to the addition of the L2 cache. This suggests that atomic operations from different threads on the same memory location are serviced right out of the cache.
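To illustrate the kind of access I mean, here is a toy kernel (hypothetical names) where every qualifying thread hits the same counter; on Fermi this contention is, as far as I understand, resolved in the L2 rather than in DRAM:

__global__ void count_above(const float *data, int n, float threshold,
                            unsigned int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold)
        atomicAdd(counter, 1u);   // many threads, one address
}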

Thanks seibert.

Even if we skip the L1 cache, I don’t think that memory becomes available as shared memory. It would be a really interesting setting to have no L1 cache at all: just the L2, with the whole L1/shared-memory hardware used as shared memory only. Nvidia should give the option of having only shared memory or only L1, since they didn’t increase either in the new architecture.
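For reference, the closest existing knob I know of is the per-kernel preference between L1 and shared memory; a minimal sketch, using the hypothetical set_left_boundary kernel from above:

// Request the 48 KB shared / 16 KB L1 split for this kernel on Fermi;
// cudaDeviceSetCacheConfig() sets the same preference device-wide.
// There is no setting that turns the whole 64 KB into shared memory (or into L1).
cudaFuncSetCacheConfig(set_left_boundary, cudaFuncCachePreferShared);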

How much slower is the L2 than the L1?

Doesn’t global data always pass from the L2 to the L1 first? How would there be over-fetching if the data already sits in the L2, which fetches data in 32-byte chunks?

The way it is phrased in programming guide section F.4.2, it seems that to be able to use the L1, the same memory must be sitting in both the L1 and the L2: “Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions”. Is this intrinsic to caches, or is it an Nvidia/GPU thing? Would any change to the L1 automatically “ruin” (I forget the technical term) the same memory in the L2?
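If I read that right, the over-fetching would only matter for scattered accesses: a warp touching 32 consecutive floats moves 32 x 4 B = 128 B either way (one 128-byte transaction, or four 32-byte ones), but a warp whose 32 floats each fall in a different line would move 32 x 128 B = 4096 B when cached in L1 versus 32 x 32 B = 1024 B when cached in L2 only, for the same 128 B of useful data. (This is my own arithmetic, not something stated in the guide.)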

Also, in Figure F-1 in the same section, I suppose they mean the L1 cache for compute capability 2.0, right?

Do you think the 128-byte cache line of the L1 has anything to do with the hardware’s 32 banks of 4 bytes each, so that all banks can be updated simultaneously?
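Putting numbers on it: 32 banks x 4 B per bank = 128 B, and a fully coalesced float access by a warp is 32 threads x 4 B = 128 B as well, so one L1 cache line would correspond exactly to one coalesced warp access. Whether that is actually the reason for the 128-byte line size, I can only guess.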