code instruction cache?

A quick hunt did not reveal anything recent or definitive on the GPU code cache.
For example do we know how big it is?
For a kernel source of a few hundred lines, is it all going to fit into the i-cache?
I guess system library routines must also fit into the i-cache?
Is there just one or does each SMX have its own?
Am I right in assuming the cache for programs is totally separate from the various
data caches? Or are there opportunities for trading code and data caches against
each other?

Thank you

Bill

There is an instruction cache. Each SM has one. The details of it are unpublished, AFAIK, which is why you’re having trouble locating them. The instruction cache is depicted as a separate, per-SM resource, for example on p.8 of the Fermi whitepaper:

[url]http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf[/url]

but there is essentially no mention of it elsewhere in the document.

This might help you or it might not.

http://www.stuffedcow.net/research/cudabmk
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

Does anyone know the effect of running different kernels concurrently on the instruction cache?
Can two blocks from different kernels be scheduled on the same SM at the same time?

Thanks.

There is a per-SM limit on the number of concurrent blocks, and a per-device limit on the number of concurrent kernels; both are stated in the Programming Guide.

If I am not mistaken, (some of) the kernel code is also loaded into global memory, so the instruction cache likely only needs to be large enough to cache instruction fetches from global memory for the maximum number of blocks per SM.

This gem of a paper discusses the effect of uber-kernels on the instruction cache (Figure 9):

Singe: Leveraging Warp Specialization for High Performance on GPUs

Note that the paper might be conflating max # of resident blocks per SM with i-cache limitations.

NVIDIA knows for sure. :)

The size of the per-SM instruction cache can be determined through a microbenchmark that uses a loop of increasing size: there is a small but measurable drop in execution speed once the loop body exceeds the Icache size. I performed such an experiment in the past, and from my recollection the Icache size was 4 KB, but I don’t recall what part I measured on (most likely a K20), and the size may easily differ between architectures.
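
In case anyone wants to reproduce that experiment, below is a minimal sketch of such a microbenchmark (the kernel name icache_probe and the BODY_REPS macro are my own inventions, not from any NVIDIA tool). Build it repeatedly with increasing -DBODY_REPS values and look for the jump in cycles per instruction:

[code]
// Sketch of the loop-size Icache probe described above.
// Build with e.g.: nvcc -O3 -DBODY_REPS=64 icache_probe.cu
#include <cstdio>
#include <cuda_runtime.h>

#ifndef BODY_REPS
#define BODY_REPS 64          // vary this across builds to grow the loop body
#endif

__global__ void icache_probe(float *out, long long *cycles, int iters)
{
    float a = out[0], b = 1.000001f;
    long long start = clock64();
    for (int i = 0; i < iters; i++) {
        #pragma unroll
        for (int r = 0; r < BODY_REPS; r++) {
            a = a * b + b;    // dependent FMA chain: one instruction per replica
        }
    }
    cycles[0] = clock64() - start;
    out[0] = a;               // keep the compiler from removing the work
}

int main(void)
{
    float *d_out;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemset(d_out, 0, sizeof(float));

    const int iters = 10000;
    icache_probe<<<1, 1>>>(d_out, d_cycles, iters);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);

    // Cycles per instruction should rise once the unrolled body
    // no longer fits in the Icache.
    printf("BODY_REPS=%d: %.2f cycles/instruction\n", BODY_REPS,
           (double)h_cycles / ((double)iters * BODY_REPS));

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
[/code]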

GPU instructions are in general 8 bytes long, and for the Maxwell (sm_5x) architecture one can easily see from a disassembled binary that an additional 8 bytes of control information are added for every three actual instructions. So a 4 KB Icache would hold 384 instructions for an sm_5x part (4096 B / 32 B per group = 128 groups of three instructions). In light of aggressive inlining by the compiler, loop bodies for various real-life scenarios can exceed this size. In my (pre-Maxwell) experience the performance penalty on a loop that exceeds the Icache was never larger than about 3%.

So I personally do not worry about Icache misses. As with other stall events on the GPU, a large number of threads running with zero-overhead context switching are generally able to cover the latency well.

It is unclear what kind of trade-offs between Icache and Dcache you are thinking of. Something like switch statements versus function pointers? Recomputation versus lookup tables? There are other mechanisms that affect those decisions and are likely of higher impact, such as branch divergence and serialization.
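
To make the recomputation-versus-lookup-table case concrete, here is an illustrative sketch (all names are invented): the table variant trades constant-cache traffic for a smaller instruction footprint, while the recomputing variant does the opposite.

[code]
// Illustrative sketch only: the same value computed two ways,
// showing the Icache-vs-Dcache trade-off.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define LUT_SIZE 256
#define TWO_PI   6.2831853f

__constant__ float sin_lut[LUT_SIZE];   // table variant: data-side footprint

// Table lookup: few instructions, but every call reads the constant cache.
__device__ float sin_table(float x)     // x assumed in [0, 2*pi)
{
    int idx = (int)(x * (LUT_SIZE / TWO_PI));
    return sin_lut[idx & (LUT_SIZE - 1)];
}

// Recomputation: no data traffic, but more instructions once inlined,
// i.e. a larger Icache footprint inside a hot loop.
__device__ float sin_recompute(float x)
{
    return sinf(x);
}

__global__ void demo(float *out, float x)
{
    out[0] = sin_table(x);
    out[1] = sin_recompute(x);
}

int main(void)
{
    float h_lut[LUT_SIZE];
    for (int i = 0; i < LUT_SIZE; i++)
        h_lut[i] = sinf(i * TWO_PI / LUT_SIZE);
    cudaMemcpyToSymbol(sin_lut, h_lut, sizeof(h_lut));

    float *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(float));
    demo<<<1, 1>>>(d_out, 1.0f);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("table: %f  recompute: %f\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
[/code]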

Yes. If you want a reference for this statement, take a look at slide 19 in this presentation:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

“Warps can come from different threadblocks and different concurrent kernels”
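
In practice, concurrent kernels come from launches into different non-default streams. A minimal sketch (kernel names invented):

[code]
// Two kernels launched into separate streams may execute concurrently,
// so warps from both can be resident on the same SM if resources allow.
#include <cuda_runtime.h>

__global__ void kernel_a(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernel_b(float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same device, different streams: the hardware is free to overlap these.
    kernel_a<<<(n + 255) / 256, 256, 0, s1>>>(d_x, n);
    kernel_b<<<(n + 255) / 256, 256, 0, s2>>>(d_y, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
[/code]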

@allanmac This is a very interesting paper indeed, thanks.

@allanmac: I think it is highly unlikely that Sean Treichler would conflate different architecture mechanisms in a GPU :-) [url]https://research.nvidia.com/users/sean-treichler[/url]

Ha!

Then the authors should be less coy about what’s actually happening under the hood!

:)

I’ll back up njuffa here. I did the same as he did, but with my assembler controlling the exact number and type of instructions (I also now use this code to probe the hardware for a number of implementation details). I measured it on Maxwell (sm_50; I haven’t measured sm_52 yet) as being 8K. That means you can have a stall-less loop of 768 instructions with 256 control codes in between (8192 B / 8 B per word = 1024 words, of which three in four are instructions). Though if you want to avoid instruction cache stalls, your loop will likely have to be smaller, because the start of the loop is probably not going to be aligned with the starting address of the cache.

Typically it’s not something you need to be worried about unless you’re working at very low occupancy (which I happen to do a lot).

Disassembly of code compiled for sm_5x suggests that the CUDA compiler may make an effort to align loops (by inserting a bunch of NOPs) on that architecture, although I have not experimented extensively enough to say how reliably it does so or under what conditions. When in doubt, cuobjdump is your friend. Good to hear the size of the instruction cache was bumped to 8K.