Kepler global memory latency: What is it?
How do we find out Kepler's global memory latency (etc.)?

Thanks.

#1
Posted 04/15/2012 11:29 PM   
Has anyone updated this for 64-bit pointers, as mentioned on the page?

http://www.stuffedcow.net/research/cudabmk

#2
Posted 04/16/2012 01:14 PM   
[quote name='RezaRob3' date='15 April 2012 - 04:29 PM' timestamp='1334532565' post='1396813']
How do we find out Kepler's global memory latency (etc.)?
[/quote]

I used microbenchmarks and got 270 cycles when unloaded, i.e. the minimum latency. For the latency under load (memcpy) I got 450-1400 cycles on average.

However, my EVGA GTX 680 has only a 700 MHz clock rate; I wonder why...
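
To give an idea of the kind of microbenchmark meant here, a minimal pointer-chasing sketch could look like the code below. This is illustrative only, not the code actually used; the buffer size, stride, and names are made up. One thread follows a chain of dependent loads, so every load's full latency is exposed, and the total cycle count divided by the number of loads gives the average latency per access.

[code]
// Sketch of a pointer-chasing latency microbenchmark (illustrative, not the exact code used).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const unsigned int *buf, int iters,
                      long long *cycles, unsigned int *sink)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = buf[j];                          // each load depends on the previous result
    long long stop = clock64();
    *cycles = stop - start;
    *sink = j;                               // keep the chain live so it isn't optimized away
}

int main()
{
    const int n = 1 << 22, iters = 1 << 16, stride = 1031;   // large odd stride: single cycle, accesses far apart
    unsigned int *h = new unsigned int[n];
    for (int i = 0; i < n; ++i) h[i] = (i + stride) % n;     // element i points to the next element of the chain

    unsigned int *d_buf, *d_sink; long long *d_cycles;
    cudaMalloc(&d_buf, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_buf, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_buf, iters, d_cycles, d_sink);         // one thread: unloaded latency

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("average latency: %lld cycles per load\n", cycles / iters);

    cudaFree(d_buf); cudaFree(d_sink); cudaFree(d_cycles);
    delete[] h;
    return 0;
}
[/code]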

#3
Posted 04/19/2012 09:40 AM   
[quote name='vvolkov' date='19 April 2012 - 03:40 AM' timestamp='1334828439' post='1398232']
I used microbenchmarks and got 270 cycles when unloaded, i.e. the minimum latency. For the latency under load (memcpy) I got 450-1400 cycles on average.

However, my EVGA GTX 680 has only a 700 MHz clock rate; I wonder why...
[/quote]

The nvidia-settings tool (and possibly other device queries) is not reporting the correct values in the current driver:

http://www.phoronix.com/scan.php?page=news_item&px=MTA4ODc

Apparently, the dynamic clocking in GK104 is making this confusing.

Edit: Oops. Didn't see you got the same answer already in another thread. :)

#4
Posted 04/19/2012 02:12 PM   
From my experience, memory latency varies depending on the load on the memory bus AND on the access pattern itself (coalesced or not; 32-bit, 64-bit, or 128-bit; streaming, random, etc.).

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#5
Posted 04/19/2012 06:37 PM   
[quote name='parallelis' date='19 April 2012 - 11:37 AM' timestamp='1334860671' post='1398425']
From my experience, memory latency varies depending on the load on the memory bus AND on the access pattern itself (coalesced or not; 32-bit, 64-bit, or 128-bit; streaming, random, etc.).
[/quote]
Care to share any particular numbers?

#6
Posted 04/20/2012 05:40 AM   
[quote name='vvolkov' date='20 April 2012 - 01:40 AM' timestamp='1334900407' post='1398602']
Care to share any particular numbers?
[/quote]
They just have no meaning on their own. If you compare a single memory access on a 9400M GT, which is pretty fast (memory latency is low, it can be under 100 CUDA-core cycles!), to a fully loaded Fermi card with every CUDA core trying to read memory at random simultaneously (and at least 6 warps per SM), you may see over 1500 cycles of latency.
Coalescing is *ALWAYS* better; the worst case is naturally simultaneous random reads and writes (atomic operations excepted).

RezaRob3, depending on the project and the optimization time allowed, when it's possible I begin by comparing memory access patterns, writing mock code that just reproduces the memory access pattern the algorithm will use. Since global memory (and local memory too) is the usual bottleneck in serious GPGPU development, when the code needs to be fast it's the first thing I consider, even trading off compute cycles to reduce global memory accesses.
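
Purely as an illustration, with made-up names, such a mock kernel can be as simple as replaying the planned reads, one gather per output element, and timing that before any of the real computation is written:

[code]
// Illustrative mock kernel (hypothetical names): it performs only the global-memory
// reads the real algorithm would issue, so its runtime isolates the cost of the
// memory access pattern from the cost of the arithmetic.
__global__ void mockPattern(const float *in, const int *indices, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[indices[tid]];   // replay the planned gather; coalesced or not,
                                       // depending on how 'indices' is filled
}
[/code]

Filling indices with tid gives the coalesced case; filling it with a shuffled permutation gives the random case, so both patterns can be timed (for example with CUDA events) on the target GPU before the real kernel exists.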

The latest example is the GPGPU OpenCL chess engine I am writing in my spare time, where I generate each move twice to avoid storing it in memory (such a pity!). In this particular case it's faster to regenerate a move than to wait for a memory read!

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#7
Posted 04/23/2012 08:37 PM   
[quote name='parallelis' date='23 April 2012 - 01:37 PM' timestamp='1335213444' post='1400023']
They just have no meaning on their own. If you compare a single memory access on a 9400M GT, which is pretty fast (memory latency is low, it can be under 100 CUDA-core cycles!), to a fully loaded Fermi card with every CUDA core trying to read memory at random simultaneously (and at least 6 warps per SM), you may see over 1500 cycles of latency.
[/quote]

According to the programming guide, the minimum latency is 400 cycles. Of course this is only a ballpark estimate, but it seems to fit observations across a number of GPUs, at least in microbenchmarks. (Maybe not on Kepler.)

In light of that, a latency of under 100 cycles on a GPU without caches is quite surprising. When you say CUDA-core cycles, do you mean the slower "core clock"/"graphics clock" that is ~500 MHz, or the faster "shader clock"/"processor clock" that is 1400 MHz or so?

Loaded and unloaded latency are surely different, but that only makes it more interesting. So, what is the maximum latency? The programming guide cites 800 cycles; is that anywhere close to the truth?

Could you also say a few words about how you measure it? Do you use clock()/clock64()? Where do you place them? Do you check that the compiler doesn't move these instructions around? Does this instrumentation affect the performance of the kernel? (Writing the timings to memory may consume substantial bandwidth.)
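
For concreteness, here is roughly the idiom I have in mind, purely as a sketch with made-up names:

[code]
// Sketch only (made-up names): per-load timing with clock64().
// The index of each load comes from the previous loaded value, so the loads
// form a dependent chain and cannot be reordered with respect to each other.
// Whether the second clock read really waits for the load to complete should
// be verified in the generated SASS, and storing the per-load timings to
// global memory itself consumes bandwidth, as noted above.
__global__ void timeLoads(const unsigned int *buf, int iters,
                          long long *times, unsigned int *sink)
{
    unsigned int j = 0;
    for (int i = 0; i < iters; ++i) {
        long long t0 = clock64();
        j = buf[j];               // dependent load being timed
        long long t1 = clock64();
        times[i] = t1 - t0;
    }
    *sink = j;                    // keep the chain live so it is not optimized away
}
[/code]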

#8
Posted 04/23/2012 09:23 PM   
Seibert, thanks a lot for the cudabmk link.

And what about the Kepler/GTX-680 shared memory throughput? It seems like 32 memory banks is quite limiting for 192 cores!
Do we yet know what's in the "7 billion transistor" version?

#9
Posted 04/24/2012 04:53 AM   
[quote name='RezaRob3' date='23 April 2012 - 09:53 PM' timestamp='1335243213' post='1400169']
And what about the Kepler/GTX-680 shared memory throughput? It seems like 32 memory banks is quite limiting for 192 cores!
[/quote]

Banks are twice as wide: 8 bytes per bank.

#10
Posted 04/24/2012 10:03 AM   
[quote name='vvolkov' date='24 April 2012 - 03:03 AM' timestamp='1335261785' post='1400241']
Banks are twice as wide: 8 bytes per bank.
[/quote]

Yuk! How do you optimize for that? Wouldn't you have to do an 8-byte read to take full advantage of it?

Also, how do you know that?

Thanks.

#11
Posted 04/24/2012 05:20 PM   
[quote name='RezaRob3' date='24 April 2012 - 10:20 AM' timestamp='1335288000' post='1400393']
Yuk! How do you optimize for that? Wouldn't you have to do an 8-byte read to take full advantage of it?
[/quote]
Yep, I guess you have to...

[quote name='RezaRob3' date='24 April 2012 - 10:20 AM' timestamp='1335288000' post='1400393']
Also, how do you know that?
[/quote]
It's in the new Programming Guide, Section F.5.3: "Shared memory has 32 banks... Each bank has a bandwidth of 64 bits per clock cycle."
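
Purely as a sketch of what exploiting the wider banks might look like (illustrative only, assuming 256 threads per block): if each thread accesses a full 64-bit word, e.g. a double, consecutive threads still map to distinct banks and the whole bank width is used.

[code]
// Hypothetical sketch: 8-byte shared-memory accesses on Kepler (256 threads per block).
// With 64-bit banks, thread t of a warp touching s[t] hits bank t % 32, so a warp
// reading 32 consecutive doubles is conflict-free and uses the full bank width;
// the same data read as 32-bit floats would use only half of the available bandwidth.
__global__ void sharedWide(const double *in, double *out)
{
    __shared__ double s[256];
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;
    s[t] = in[g];                      // one 64-bit store per thread, one bank per thread
    __syncthreads();
    out[g] = s[(t + 1) % blockDim.x];  // one 64-bit load per thread, still conflict-free
}
[/code]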

#12
Posted 04/24/2012 06:00 PM   
[quote name='vvolkov' date='23 April 2012 - 05:23 PM' timestamp='1335216213' post='1400039']
According to the programming guide, the minimum latency is 400 cycles. Of course this is only a ballpark estimate, but it seems to fit observations across a number of GPUs, at least in microbenchmarks. (Maybe not on Kepler.)

In light of that, a latency of under 100 cycles on a GPU without caches is quite surprising. When you say CUDA-core cycles, do you mean the slower "core clock"/"graphics clock" that is ~500 MHz, or the faster "shader clock"/"processor clock" that is 1400 MHz or so?

Loaded and unloaded latency are surely different, but that only makes it more interesting. So, what is the maximum latency? The programming guide cites 800 cycles; is that anywhere close to the truth?

Could you also say a few words about how you measure it? Do you use clock()/clock64()? Where do you place them? Do you check that the compiler doesn't move these instructions around? Does this instrumentation affect the performance of the kernel? (Writing the timings to memory may consume substantial bandwidth.)
[/quote]
I'm talking about the CUDA core clock (so around 1100 MHz on my old 9400M GT equipped laptop). The 9400M GT (MCP79 chipset) uses laptop DDR3, which seems to have lower latency than GDDR, hence the lower latency.
The maximum latency also varies with the GPU generation; for example, the worst latencies I saw were when mixing reads and writes on pre-Fermi GeForce cards, while on GCN the L2 cache seems very efficient at hiding write latencies, and read latencies seem lower.

I measure them statistically after a warm-up, because what interests me is not the highest latency but the average latency (including the warp effect that keeps up to 32 threads waiting for the last one when they all read from global memory simultaneously).

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#13
Posted 04/24/2012 06:06 PM   
[quote name='vvolkov' date='24 April 2012 - 11:00 AM' timestamp='1335290457' post='1400414']
It's in the new Programming Guide, Section F.5.3: "Shared memory has 32 banks... Each bank has a bandwidth of 64 bits per clock cycle."
[/quote]

I'm hoping that means the bank can supply two separate 4-byte accesses per clock cycle. That would make sense in a device that is, for practical purposes, a 32-bit single-precision machine.

EDIT: Nope. From section F.5.3 it appears that the bank can "resolve the conflict" only if the same warp accesses the same 64-bit segment.

#14
Posted 04/24/2012 06:18 PM   
I always like to see some numbers...
From what I can tell, the memory bandwidth has been kept about the same as for the Fermi generation.
Comparing a GTX 480 to a GTX 680, I get 147 GB/s vs. 149 GB/s for a device-to-device transfer.
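
Purely as a sketch of how such a number can be obtained (not necessarily the exact test used): a plain device-to-device copy timed with CUDA events. Whether one counts the copied bytes once or the read-plus-write traffic is a matter of convention; the version below counts both directions.

[code]
// Sketch (hypothetical): measure device-to-device copy bandwidth with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u << 20;            // 256 MiB per buffer
    char *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    // Count both the read and the write of every byte (2x) as the traffic:
    printf("device-to-device bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));

    cudaFree(src); cudaFree(dst);
    return 0;
}
[/code]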

As the memory clock rate has been increased substantially, the latency of a memory access should be lower when there is little memory pressure.
For this example, I can give you some numbers (measured with the same test as described in [url="http://forums.nvidia.com/index.php?showtopic=203627"]Fermi L2 cache[/url])

[code]
Latency in cycles:
                                 GTX 480   GTX 680
non-cached                           492       300
L2                                   258       162
L1                                    20        20

atomic non-cached                    822       357
atomic L2                            584       214
atomic non-cached 2x conflict        808       357
atomic L2 2x conflict                572       226
atomic non-cached 4x conflict        830       349
atomic L2 4x conflict                600       278
atomic non-cached 8x conflict       1336       386
atomic L2 8x conflict               1120       386
atomic non-cached 16x conflict      2384       610
atomic L2 16x conflict              2166       610
atomic non-cached 32x conflict      4016      1058
atomic L2 32x conflict              3800      1058
[/code]

Although the memory latency has been reduced relative to the clock rate of the device, keep in mind that these numbers really don't say much about your algorithm's performance.
Given that the number of cores went up while the memory bandwidth stayed about the same, I would expect the notion that "arithmetic cycles are for free" to apply even more strongly to the current Kepler cards. This might also be the reason why our previous algorithms don't run much better on Kepler...

Anyway, I think the atomic operation performance is also interesting. Atomics really do seem to be a lot faster than on Fermi. :)
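
For the atomic rows, a hypothetical sketch of how an N-way conflict can be generated (not necessarily the exact test used): within each warp, groups of N lanes hammer the same address, so the hardware has to serialize N atomics per address.

[code]
// Hypothetical sketch (not the exact test): generate N-way atomic conflicts.
// Launched with one warp per block; lane / N maps N consecutive lanes to the
// same address, so every atomicAdd competes with N-1 others in its warp.
// 'data' needs at least gridDim.x * 32 elements.
template <int N>
__global__ void atomicConflict(unsigned int *data, int iters)
{
    int lane = threadIdx.x & 31;
    unsigned int *addr = &data[blockIdx.x * 32 + lane / N];
    for (int i = 0; i < iters; ++i)
        atomicAdd(addr, 1u);
}

// Example launch for the "8x conflict" case:
//   atomicConflict<8><<<128, 32>>>(d_data, 1000);
[/code]

The kernel time (or clock64() around the loop) divided by the number of iterations then gives the per-atomic cost.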

#15
Posted 04/26/2012 10:59 AM   