Disabling cache on Fermi architectures: trying to disable L1 and L2

In relation to some research, I am trying to disable caches on my Fermi card (GTX470).

So far, I have succeeded in disabling the level 1 cache entirely by passing the following compiler flag to nvcc:
-Xptxas -dlcm=cg
This decreases performance, so I assume the level 1 cache is actually disabled. However, I could not find any documentation for the ‘dlcm’ option, other than that it supports the values ca (enable L1) and cg (disable L1).

Secondly, I tried to disable the L2 cache, but so far without success. Is there any information available on this topic? A solution in any of the following forms would be welcome:

  • As a compiler flag (similar to disabling L1 cache)
  • As a function in the CUDA (host/kernel) code
  • As a workaround (tricking the compiler not to cache)

Input is welcome!

As far as I know, you cannot disable L2. Various Fermi architecture presentations state that every memory transaction goes through L2 (e.g., this one: https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf, slide 12; there is a video too: http://groups.google.com/group/vscse-many-core-processors-2010/web/course-presentations).

You might check the PTX manual and see if there is even an instruction to read from DRAM and not from L2.

Thanks, that pointed me in the correct direction. From the PTX manual we have:

(there is a more detailed description in the manual on page 109 (ptx_isa_20)).

Luckily, these options translate to the -dlcm option for ptxas. When I try the ‘cs’ option, performance decreases again! From the description, I guess that this means we omit both caches:

I’m using your topic as I’m currently doing the same analysis.

I have a small kernel that reads numbers from 7 different arrays, does simple arithmetic on them, and fills 12 other arrays in global memory.
When I disable the L1 cache, I get 20% better performance. I’m really struggling to interpret this correctly. Could someone give me some advice?

If I disable both the L1 and L2 caches, however, performance drops by ~10%.

I’m using a GTX 465, and all arrays are declared with the restrict qualifier to help the compiler do its job.

Any help would be really appreciated.

Maybe the L1 cache miss rate is too high, so a lot of time is wasted checking the cache on every access.

I’m not quite sure about the latency of cache operations. Another factor, though, is that when L1 is disabled, memory accesses are issued as 32-byte transactions instead of the default 128-byte cache lines. That saves some global memory bandwidth if you do not read/write contiguous regions.
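One way to probe the 32-byte versus 128-byte effect is a strided read, where each thread touches a different 128-byte line but uses only 4 bytes of it. The following is only a sketch: the kernel name, array sizes, and the file name probe.cu are made up for illustration, and error checking is minimal. The idea is to build it twice, for example with nvcc -arch=sm_20 -Xptxas -dlcm=ca probe.cu and again with -Xptxas -dlcm=cg, and compare the reported times on the same card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads a single float; neighbouring threads are `stride` floats apart.
// With stride = 32 (128 bytes), every thread touches a different 128-byte line.
__global__ void strided_read(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[(size_t)i * stride];
    }
}

int main() {
    const int n = 1 << 20;   // number of elements read (arbitrary choice)
    const int stride = 32;   // 32 floats = 128 bytes between consecutive reads
    const size_t in_bytes = (size_t)n * stride * sizeof(float);
    const size_t out_bytes = (size_t)n * sizeof(float);

    float* in = NULL;
    float* out = NULL;
    cudaMalloc(&in, in_bytes);
    cudaMalloc(&out, out_bytes);
    cudaMemset(in, 0, in_bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int block = 256;
    const int grid = (n + block - 1) / block;

    strided_read<<<grid, block>>>(in, out, n, stride);  // warm-up launch, not timed
    cudaEventRecord(start);
    strided_read<<<grid, block>>>(in, out, n, stride);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("strided read: %.3f ms (status: %s)\n",
           ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Setting stride to 1 gives the fully coalesced case for comparison. Whether the cg build actually comes out faster for the strided case depends on the card and the access pattern, so the numbers are something to measure rather than assume.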

The nvcc -Xptxas -dlcm option appears to be applied by the CUDA compiler to the kernel being compiled, but caches are global. When I start another kernel (compiled without -Xptxas), will the caches all revert to normal (i.e., their defaults)?
Thanks
Bill

-Xptxas -dlcm does not cause machine state to be changed. It changes the code generation, so a different flavor of load instructions for accessing global memory is generated. Only the global load instructions in a given compilation unit are affected. One can change the load behavior for individual global memory accesses by generating the desired load instruction flavor via inline PTX.
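A minimal sketch of the inline PTX approach, for illustration only: the helper names load_cg and load_ca and the scale kernel are hypothetical, the "l" pointer constraint assumes a 64-bit build, and the pointers are assumed to refer to global memory. The .ca and .cg cache operators on ld.global are the ones described in the PTX ISA manual.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: load one float with the .cg operator (cache in L2, bypass L1).
// Assumes a 64-bit build ("l" constraint) and that p points to global memory.
__device__ __forceinline__ float load_cg(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Hypothetical helper: the same load with .ca (cache at all levels, the default).
__device__ __forceinline__ float load_ca(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Only the read of `in` bypasses L1; the read of `bias` uses the default caching.
__global__ void scale(const float* in, const float* bias, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * load_cg(in + i) + load_ca(bias + i);
    }
}

int main() {
    const int n = 1024;
    float h_in[n], h_bias[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = (float)i; h_bias[i] = 1.0f; }

    float *d_in = NULL, *d_bias = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_bias, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bias, h_bias, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_bias, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Count elements that differ from the host-side reference 2*i + 1.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * (float)i + 1.0f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_bias);
    cudaFree(d_out);
    return 0;
}
```

Because the cache operator is part of each load instruction, this lets one kernel mix cached and uncached accesses without relying on the -dlcm setting for the whole compilation unit.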

Dear njuffa,
Thank you very much for the rapid and helpful reply.
Bill