How can I check and see if my GPU is using L1 cache
Hi, everyone,

I have a project to compare the performance of shared memory and L1 cache. However, [b]I don't know how to check whether my GPU is using the L1 cache while computing.[/b] I found nothing in the PTX code regarding 'l1' or 'prefetch' when I compiled the program using 'nvcc ***.cu -arch=compute_20 -code=sm_20 -ptx'.

I can get the following PTX code showing that shared memory is in use.
[code]
ld.global.f64 %fd6, [%rd28+0];
...
st.shared.f64 [%rd30+728], %fd6;
[/code]

But I can't get PTX code like this when I want to make use of the L1 cache.
[code]ld.global.f64 ....
st.l1.f64 ...[/code]
or
[code]prefetch.global.l1 ....[/code]

Or is my understanding of the L1 cache incorrect? I thought the compiler predicts when memory accesses will happen during kernel execution and then generates PTX code to "prefetch" the related data into the L1 cache, doesn't it?

PS: my GPU is an M2050.

thanks.

#1
Posted 05/21/2011 03:04 AM   
L1 cache use is completely transparent (like on a CPU); no code change is needed to use it. And while there is a prefetch instruction in PTX, the compiler does not emit it.
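For example, here is a minimal sketch (kernel and variable names are made up) of the same neighbor-averaging computation written both ways. Only the first version needs any special code; the second just reads global memory and lets the hardware cache the overlapping reads in L1.

[code]
// Both kernels compute out[i] = 0.5 * (in[i] + in[i+1]).
// Adjacent threads re-read overlapping elements, so the re-reads in the
// second kernel can be served by L1 without any special syntax.

// Version 1: explicit staging in shared memory (assumes blockDim.x == 256).
__global__ void avg_shared(const double *in, double *out, int n)
{
    __shared__ double tile[257];                    // 256 elements + 1 halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];
    if (threadIdx.x == 0 && i + blockDim.x < n)
        tile[blockDim.x] = in[i + blockDim.x];      // load the halo element
    __syncthreads();
    if (i < n - 1)
        out[i] = 0.5 * (tile[threadIdx.x] + tile[threadIdx.x + 1]);
}

// Version 2: plain global loads; L1 caching is transparent.
__global__ void avg_l1(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        out[i] = 0.5 * (in[i] + in[i + 1]);         // in[i+1] is likely an L1 hit
}
[/code]

To see whether L1 is actually being hit, don't look at the PTX; run the second kernel under the profiler and watch counters such as l1_gld_hit.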

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 05/21/2011 08:59 AM   
Excuse me, so basically, can I or can I not use something like
[code]ld.global.f64 ....
st.l1.f64 ...[/code]
to explicitly access the L1 cache?

#3
Posted 05/22/2011 09:08 PM   
ld.global always goes to the L1 cache first (I think this would be true even when L1 is disabled). When you have a hit, that's it. When you have a miss, the request then goes on to L2, and so on.
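In case it helps at the PTX level: the PTX ISA (2.x) does expose cache operators on loads, which is the closest thing to what the original poster was looking for. A sketch of the documented forms (not actual compiler output):

[code]
ld.global.ca.f64  %fd1, [%rd1];   // .ca: cache at all levels, L1 and L2 (the default)
ld.global.cg.f64  %fd2, [%rd1];   // .cg: cache in L2 only, bypassing L1
[/code]

As I understand it, building with -Xptxas -dlcm=cg makes ptxas treat global loads as if they carried .cg, which is why disabling L1 is a compile-time switch rather than something you see in ordinary PTX.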

Working on a Fermi assembler.. for the fun of it! :)

#4
Posted 05/23/2011 07:06 AM   
There is no [font="Courier New"]st.l1[/font] instruction. If you are looking for an explicitly user-managed cache, use shared memory.


#5
Posted 05/23/2011 08:24 AM   
[quote name='MichealHou' date='20 May 2011 - 09:04 PM' timestamp='1305947049' post='1240322']
Hi, everyone,

I have a project to compare the performance of shared memory and L1 cache. However, [b]I don't know how to check if my Gpu is using L1 cache while computing.[/b] I found nothing from the ptx code regarding 'l1' or 'prefetch'... when I compiled the program using 'nvcc ***.cu -arch=compute_20 -code=sm_20 -ptx'.

I can get the following ptx code to show the shared memory is in use.
[code]
ld.global.f64 %fd6, [%rd28+0];
...
st.shared.f64 [%rd30+728], %fd6;
[/code]

But I can't get ptx code like this when I want to make use of L1 cache.
[code]ld.global.f64 ....
st.l1.f64 ...[/code]
or
[code]prefetch.global.l1 ....[/code]

Or my understanding about L1 cache is incorrect. I think compiler predicts when the memory access would happen during the kernel execution and then generate ptx code to "prefetch" related data to L1 cache, doesn't it?

ps: my gpu is m2050.

thanks.
[/quote]

Hi MichealHou,

I'm also struggling with disabling the L1 cache. I have a program consisting of two CUDA files (.cu) and a cpp file, and I want to disable L1 by adding the flag "-Xptxas -dlcm=cg" in the makefile. This approach worked fine when I tested it on some applications from the CUDA SDK, but it has no impact on my code. Even when I set the flag, the performance of my code does not change, and the profiler reports almost identical results for counters like l1_gld_hit and l1_gld_miss. According to my understanding, the values of these two counters should be zero after L1 is disabled. I really have no idea what is going on....

Do you have any clue about this? Or have you had a similar experience in your previous work?

Thanks
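A minimal sketch of the kind of makefile rules I mean (file names are placeholders, not my actual project):

[code]
# -Xptxas -dlcm=cg must appear on every nvcc line that compiles device code.
kernel1.o: kernel1.cu
	nvcc -c -arch=sm_20 -Xptxas -dlcm=cg kernel1.cu -o kernel1.o
kernel2.o: kernel2.cu
	nvcc -c -arch=sm_20 -Xptxas -dlcm=cg kernel2.cu -o kernel2.o
# Host-only files are unaffected by the flag.
main.o: main.cpp
	g++ -c main.cpp -o main.o
[/code]

One way to confirm the flag took effect would be to inspect the generated machine code with cuobjdump -sass: with -dlcm=cg, the global loads should carry a .CG modifier.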

#6
Posted 06/08/2011 03:05 PM   
[quote name='hyqneuron' date='23 May 2011 - 01:06 AM' timestamp='1306134360' post='1241154']
ld.global always goes to the L1 cache (I think this would be true even when L1 is disabled). When you have a hit, that's it. When you have a miss, the L1 then goes to L2 and so on.
[/quote]

Hi hyqneuron,

I'm also somewhat confused about the L1 cache. You say the global load accesses always go to the L1 cache; I believe this is the case when L1 is enabled. However, if we disable L1 by setting the flag "-Xptxas -dlcm=cg", do the global loads still go through L1? If the answer is yes, as you thought, then what is the difference after L1 is disabled?

Thanks

#7
Posted 06/08/2011 03:12 PM   
[quote name='edisonying1984' date='08 June 2011 - 11:12 PM' timestamp='1307545930' post='1249228']
Hi hyqneuron,

I'm also sort of confused on the L1 cache. you say the global load accesses always go to the L1 cache. I believe this is the case when L1 is enabled. However, if we disable the L1 by setting the flag "-Xptxas -dlcm=cg", do the global loads still go through the L1? If the answer is positive as you thought, then what's the difference after the L1 is disabled?

Thanks
[/quote]

The cache line size becomes 32 bytes instead of 128 bytes. When your accesses have little locality, this saves a lot of global memory bandwidth.
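To put numbers on it (a sketch with a made-up kernel): suppose each thread of a warp loads one 4-byte float from addresses far enough apart that every load falls in a different cache line.

[code]
// Hypothetical worst case for locality: with stride >= 32, the 32 loads of a
// warp each touch a different 128-byte line.
__global__ void scattered(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = 2.0f * in[i];
}
// L1 enabled  (-dlcm=ca): 32 loads x 128-byte lines    = 4096 bytes fetched
// L1 bypassed (-dlcm=cg): 32 loads x  32-byte segments = 1024 bytes fetched
// Useful data either way: 32 loads x    4 bytes        =  128 bytes
[/code]

So in this pattern, bypassing L1 cuts the fetched traffic by 4x, even though only 128 of those bytes are actually used.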


#8
Posted 06/09/2011 01:10 AM   