How can I check whether my GPU is using the L1 cache?

Hi, everyone,

I have a project to compare the performance of shared memory and the L1 cache. However, I don’t know how to check whether my GPU is using the L1 cache while computing. I found nothing in the PTX code related to ‘l1’ or ‘prefetch’… when I compiled the program using ‘nvcc ***.cu -arch=compute_20 -code=sm_20 -ptx’.

I can get the following PTX code showing that shared memory is in use:

ld.global.f64 	%fd6, [%rd28+0];

...

st.shared.f64 	[%rd30+728], %fd6;

But I can’t get PTX code like the following when I want to make use of the L1 cache:

ld.global.f64 ....

st.l1.f64 ...

or

prefetch.global.l1 ....

Or maybe my understanding of the L1 cache is incorrect. I thought the compiler predicts when a memory access will happen during kernel execution and then generates PTX code to “prefetch” the related data into the L1 cache. Is that right?

PS: my GPU is an M2050.

Thanks.
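Rather than looking for an explicit ‘l1’ opcode in the PTX, one way to check what the hardware actually does is to disassemble the compiled machine code and inspect the load instructions, and to run the profiler and look at the L1 global load hit/miss counters. A rough sketch using the standard toolkit tools (mykernel.cu stands in for your source file, and the exact SASS load mnemonics are architecture-dependent, so take the grep as an approximation):

nvcc mykernel.cu -arch=sm_20 -cubin -o mykernel.cubin
cuobjdump -sass mykernel.cubin | grep LD

On Fermi, loads built with the default caching policy and loads built with -Xptxas -dlcm=cg show up as different load variants in the SASS, and the profiler’s L1 hit/miss counters should be non-zero when L1 is actually being used.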

L1 cache use is completely transparent (as on a CPU); no code changes are needed to use it. And while there is a prefetch instruction in PTX, the compiler does not emit it.
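If you want to see a prefetch anyway, you can emit one yourself with inline PTX. This is only a minimal sketch, not something the compiler generates: the helper name is made up, and it assumes sm_20 or later, 64-bit device pointers, and that p points into global memory.

__device__ void prefetch_l1(const void *p)
{
    // Ask the hardware to bring the line containing p into L1.
    // prefetch.global.L1 requires sm_20+; the "l" constraint assumes a 64-bit pointer.
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}

Whether this actually buys anything on Fermi is a separate question.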

Excuse me, so basically can I or can I not use something like

ld.global.f64 ....

st.l1.f64 ...

to explicitly access the L1 cache?

ld.global always goes through the L1 cache (I think this is true even when L1 is disabled). On a hit, that’s it; on a miss, the L1 then goes to L2, and so on.

There is no st.l1 instruction. If you are looking for an explicitly user-managed cache, use shared memory.
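For reference, the user-managed-cache pattern that the PTX earlier in the thread corresponds to looks roughly like this. A minimal sketch only: the kernel name, the tile size of 256, and the reuse step are placeholders, and it assumes the block size is 256 threads.

__global__ void stage_in_shared(const double *g_in, double *g_out, int n)
{
    __shared__ double tile[256];                  // explicitly managed on-chip storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_in[i];              // compiles to ld.global.f64 + st.shared.f64
    __syncthreads();

    // ... reuse tile[] here instead of re-reading global memory ...

    if (i < n)
        g_out[i] = tile[threadIdx.x];
}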

Hi MichealHou,

I’m also struggling to disable the L1 cache. I have a program consisting of two CUDA files (.cu) and a C++ file, and I want to disable L1 by adding the flag “-Xptxas -dlcm=cg” in the makefile. This approach worked fine when I tested it on some applications from the CUDA SDK, but it has no effect on my code. Even when I set the flag, the performance of my code does not change, and the profiler reports almost identical results for counters like l1_gld_hit and l1_gld_miss. As I understand it, the values of these two counters should be zero after L1 is disabled. I really have no idea what is going on…

Do you have any clue about this, or have you had a similar experience in your previous work?

Thanks
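One thing worth double-checking here (just a guess, since the makefile isn’t shown): -Xptxas -dlcm=cg only takes effect on the nvcc lines that actually compile the .cu files. If it only appears on the link line or on the .cpp compile line, the kernels are still built with the default -dlcm=ca. Roughly (kernel1.cu and kernel2.cu are placeholders for your two CUDA files):

nvcc -arch=sm_20 -Xptxas -dlcm=cg -c kernel1.cu -o kernel1.o
nvcc -arch=sm_20 -Xptxas -dlcm=cg -c kernel2.cu -o kernel2.o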

Hi hyqneuron,

I’m also somewhat confused about the L1 cache. You say that global load accesses always go through the L1 cache. I believe this is the case when L1 is enabled. However, if we disable L1 by setting the flag “-Xptxas -dlcm=cg”, do global loads still go through the L1? If the answer is yes, as you suggest, then what’s the difference after L1 is disabled?

Thanks

The cache line size becomes 32 bytes instead of 128 bytes. When your accesses have little locality, this saves a lot of global memory bandwidth.
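As an illustration (a hypothetical kernel, not from this thread): with a scattered access pattern like the one below, every 8-byte miss pulls in a full 128-byte line under the default -dlcm=ca, but only a 32-byte L2 segment under -dlcm=cg, so the over-fetch drops from 16x to 4x.

__global__ void gather(const double *src, const int *idx, double *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // scattered 8-byte loads with little spatial locality
}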