Disabling cache on Fermi architectures: trying to disable L1 and L2
As part of some research, I am trying to disable the caches on my Fermi card (GTX 470).

So far, I have succeeded in completely disabling the L1 cache by using the following compiler flag for nvcc:
-Xptxas -dlcm=cg
This decreases performance, so I assume the L1 cache is actually disabled. However, there is no manual or help for the 'dlcm' option, other than that it supports the values ca (enable L1) and cg (disable L1).

Secondly, I have tried to disable the L2 cache, but so far without success. Is there any information available on this topic? A solution in any of the following forms would be welcome:
- A compiler flag (similar to disabling the L1 cache)
- A function in the CUDA (host/kernel) code
- A workaround (tricking the compiler into not caching)

Input is welcome!
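For reference, a minimal sketch of the setup I mean; the kernel and file names below are just placeholders, and the flag applies to the whole compilation unit:

```cuda
// saxpy.cu -- toy kernel; the interesting part is the compile line, not the code.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // global loads/stores are what -dlcm affects
}

// L1 disabled for global loads (loads compile to ld.global.cg):
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg saxpy.cu
// Default, cache at all levels (ld.global.ca):
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca saxpy.cu
```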

Parallel Architecture Research in Eindhoven:

http://parse.ele.tue.nl

#1
Posted 09/10/2010 12:48 PM   
As far as I know, you cannot disable L2. Various Fermi architecture presentations have stated that [b]every[/b] memory transaction goes through L2 (e.g., this one: [url="https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf"]https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf[/url], slide 12; there is a video too: [url="http://groups.google.com/group/vscse-many-core-processors-2010/web/course-presentations"]http://groups.google.com/group/vscse-many-core-processors-2010/web/course-presentations[/url]).

You might check the PTX manual to see whether there is even an instruction that reads from DRAM rather than from L2.

#3
Posted 09/10/2010 01:40 PM   
Thanks, that pointed me in the right direction. From the PTX manual we have:
[quote].ca Cache at all levels, likely to be accessed again.
.cg Cache at global level (cache in L2 and below, not L1).
.cs Cache streaming, likely to be accessed once.
.cv Cache as volatile (consider cached system memory lines stale, fetch again).[/quote]
(There is a more detailed description in the manual on page 109 of ptx_isa_20.)

Luckily, these options translate to the -dlcm option for ptxas. When I try the 'cs' option, performance decreases again! From the description, I guess this means both caches are bypassed:

[quote]The ld.cs load cached streaming operation allocates global lines with evict-first policy in L1 and L2 to limit cache pollution by temporary streaming data that may be accessed once or twice. When ld.cs is applied to a Local window address, it performs the ld.lu operation.[/quote]
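One way to double-check which load flavor ptxas actually emitted (a sketch; cuobjdump ships with the CUDA toolkit, and the file name is a placeholder):

```cuda
// Compile to a cubin with the streaming policy, then disassemble:
//
//   nvcc -arch=sm_20 -Xptxas -dlcm=cs -cubin -o kernel.cubin kernel.cu
//   cuobjdump -sass kernel.cubin
//
// In the SASS listing, the global loads should carry the cache operator as a
// suffix (e.g. LD.E.CS instead of the default LD.E), confirming the mapping
// from -dlcm to the PTX .ca/.cg/.cs/.cv operators quoted above.
```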

Parallel Architecture Research in Eindhoven:

http://parse.ele.tue.nl

#5
Posted 09/10/2010 02:07 PM   
I'm using your topic, as I'm currently doing the same analysis.

I have a small kernel that reads numbers from 7 different arrays, does simple arithmetic with them, and fills 12 other arrays in global memory.
When I disable the L1 cache, I get 20% better performance. I'm really struggling to interpret this correctly. Could someone maybe offer some advice?

If I disable both the L1 and L2 caches, however, performance drops by ~10%.

I'm using a GTX 465, and all arrays are declared with __restrict__ to help the compiler do its job.

Any help would be really appreciated.

#7
Posted 09/13/2010 09:15 AM   
[quote name='Magorath' date='13 September 2010 - 05:15 PM' timestamp='1284369347' post='1116431']
I'm using your topic, as I'm currently doing the same analysis.

I have a small kernel that reads numbers from 7 different arrays, does simple arithmetic with them, and fills 12 other arrays in global memory.
When I disable the L1 cache, I get 20% better performance. I'm really struggling to interpret this correctly. Could someone maybe offer some advice?

If I disable both the L1 and L2 caches, however, performance drops by ~10%.

I'm using a GTX 465, and all arrays are declared with __restrict__ to help the compiler do its job.

Any help would be really appreciated.
[/quote]

Maybe the L1 cache miss rate is too high, so a lot of time is wasted on cache checks.

#9
Posted 05/23/2011 06:43 AM   
I'm not quite sure about the latency of the cache operations. But another thing is that when L1 is disabled, the hardware issues 32-byte memory transactions instead of the default 128-byte cache lines. That saves some global memory bandwidth if you do not read/write contiguous regions.
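To put rough numbers on that (my own arithmetic, not from this thread): for fully scattered 4-byte loads, the useful fraction of each transaction differs by 4x between the two modes:

```cuda
// Worst case: each thread loads one float from an unrelated cache line.
//   L1 enabled  (ca): 128-byte transaction per load ->  4/128 ~  3% useful
//   L1 disabled (cg):  32-byte transaction per load ->  4/32  = 12.5% useful
// So scattered access patterns can waste up to 4x less DRAM bandwidth with
// L1 disabled, which would explain a speedup like the ~20% reported above.
```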

Working on a Fermi assembler.. for the fun of it! :)

#10
Posted 05/23/2011 11:57 AM   
The nvcc -Xptxas -dlcm flag appears to be applied by the CUDA compiler to the kernel being compiled. But the caches are global. When I start another kernel (compiled without -Xptxas), will the caches all revert to normal (i.e., their defaults)?
Thanks,
Bill

#11
Posted 08/29/2013 03:19 PM   
-Xptxas -dlcm does not cause machine state to be changed. It changes the code generation, so a different flavor of load instructions for accessing global memory is generated. Only the global load instructions in a given compilation unit are affected. One can change the load behavior for individual global memory accesses by generating the desired load instruction flavor via inline PTX.
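A sketch of that per-access approach, using inline PTX to force a single load to bypass L1 (the kernel itself is hypothetical; only the asm statement is the point):

```cuda
__global__ void copy_bypass_l1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v;
        // Emit ld.global.cg (cache in L2 only, skip L1) for this one access,
        // independent of the -dlcm setting used for the rest of the file:
        asm volatile("ld.global.cg.f32 %0, [%1];"
                     : "=f"(v)
                     : "l"(in + i));
        out[i] = v;
    }
}
```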

#12
Posted 08/29/2013 05:46 PM   
Dear njuffa, thank you very much for the rapid and helpful reply. Bill

#13
Posted 08/30/2013 10:31 AM   