constant cache

Hello,

I have a question about reading from constant memory versus reading from global memory in the context of a compute capability 2.0 device.

If all the threads in a half-warp read the same 4-byte word in global memory, there will be one 128-byte read request serviced by the L1 cache; on a miss it goes to the L2 cache, and on another miss, out to global memory. The loaded 4 bytes will then be provided to each thread in the half-warp.
If subsequent half-warps request that same memory address, that data is likely cached, so memory loading will be quick.

Am I right to assume the same sort of thing happens for constant memory?

i.e. if all threads in a half-warp read the same 4-byte word in constant memory, there is one 128-byte read request to the constant cache; if there is a constant cache miss, the request goes out to constant memory (which is as slow as a global memory read).
If subsequent half-warps request that same constant memory address, that data is likely cached, so memory loading will be quick.
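To make the access pattern concrete, here is a minimal sketch of the uniform read I have in mind, once through constant memory and once through global memory. All names (`kCoef`, `scale_constant`, `scale_global`) are made up for illustration:

```cuda
__constant__ float kCoef;   // lives in the 64 KB constant space

__global__ void scale_constant(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform read: all threads load kCoef from the same address,
    // so the constant cache can broadcast it to the whole warp.
    out[i] = in[i] * kCoef;
}

__global__ void scale_global(float *out, const float *in, const float *coef)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Same uniform pattern through global memory: serviced via L1/L2
    // on a compute capability 2.0 device.
    out[i] = in[i] * coef[0];
}
```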

What then, is the advantage of using constant memory? I have read that constant memory reads will be “broadcast” to the entire half-warp, provided all threads in that half-warp request the same constant memory address. But would this not happen for global memory accesses as well, as I describe above?

Additionally, I just wanted confirmation that there are no profiler counters to measure constant memory requests/constant cache hits & misses?

Global memory reads are indeed broadcast as well if possible.

Constant cache reads, OTOH, are serialised if not all accesses of the full warp (not the half-warp, which only applies to compute capability 1.x) go to the same address.
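A sketch of the serialisation case, with an illustrative table name and size:

```cuda
__constant__ float table[32];   // hypothetical small constant table

__global__ void divergent_read(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each lane reads a DIFFERENT constant address: the constant cache
    // services one address at a time, so the warp's 32 distinct reads
    // are serialised into up to 32 separate accesses.
    out[i] = table[threadIdx.x];

    // By contrast, out[i] = table[0] would be a single broadcast
    // to the entire warp.
}
```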

The main benefit of the constant cache is on compute capability 1.x devices where global memory is not cached.

So is the main advantage of having constant memory on devices of compute capability > 1.x the fact that it has its own dedicated cache of 8 KB per multiprocessor, thereby relieving some pressure on the L1 and L2 caches used for global memory?

In a sense, yes. I prefer to employ a slightly different perspective: by placing small, relatively frequently used pieces of read-only data with predominantly uniform access in the constant cache, one can often keep this data resident in a cache for a long time, effectively achieving close-to-register performance. By contrast, the massive amount of traffic that typically flows through the L1 and L2 caches, in combination with their relatively small sizes, causes data to be evicted from those caches relatively quickly.

So the constant cache is “ideal” storage for the kernel launch arguments, as well as constants the compiler cannot place into the immediate fields of machine instructions (regardless of whether these constants are derived from literal constants in the code or are compiler-generated). To that, add small programmer-provided tables, etc. The constant cache is very small (single-digit KBs), so utilizing the full 64 KB of user-visible constant memory may reduce the performance-enhancing value of the cache, and could lead to the kind of thrashing that frequently occurs in the other caches.
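The “small programmer-provided table” case might look like the following sketch, where a handful of polynomial coefficients are uploaded once and then read uniformly by every thread; all names here are illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float kWeights[8];   // small: fits easily in the constant cache

__global__ void eval_poly(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i], acc = 0.0f;
    // The loop index k is uniform across the warp, so every read of
    // kWeights[k] hits the same address and is broadcast.
    for (int k = 7; k >= 0; --k)
        acc = acc * x + kWeights[k];   // Horner's rule
    out[i] = acc;
}

void upload_weights(const float host_weights[8])
{
    // cudaMemcpyToSymbol copies host data into the __constant__ variable.
    cudaMemcpyToSymbol(kWeights, host_weights, 8 * sizeof(float));
}
```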