Kepler global memory latency: what is it?

How do we find out Kepler’s global memory latency (etc.)?

Thanks.

Has anyone updated this for 64-bit pointers, as mentioned on the page?

http://www.stuffedcow.net/research/cudabmk

I used microbenchmarks and got 270 cycles unloaded, i.e. the minimum latency. For the latency under load (memcpy) I got 450-1400 cycles on average.
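In case it helps, this is the kind of pointer-chasing kernel such unloaded-latency microbenchmarks typically use (a minimal sketch, not the actual cudabmk code; the array size, stride, and iteration count are just illustrative):

// Minimal pointer-chasing sketch: each load depends on the previous one,
// so the loads cannot overlap and elapsed_cycles / iters approximates the
// unloaded global memory latency.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const unsigned int *next, int iters,
                      long long *cycles, unsigned int *sink)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = next[j];                       // dependent load: latency-bound
    long long stop = clock64();
    *cycles = stop - start;
    *sink = j;                             // keep the chain from being optimized away
}

int main()
{
    const int n = 1 << 22, iters = 10000, stride = 1021;   // large array, odd stride to defeat caching
    unsigned int *h = new unsigned int[n];
    for (int i = 0; i < n; ++i) h[i] = (i + stride) % n;

    unsigned int *d_next, *d_sink; long long *d_cycles;
    cudaMalloc(&d_next, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, iters, d_cycles, d_sink);       // single thread: unloaded latency
    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld cycles per dependent load\n", cycles / iters);

    delete[] h;
    cudaFree(d_next); cudaFree(d_sink); cudaFree(d_cycles);
    return 0;
}

Running the same chain from many blocks at once, or alongside a memcpy, gives the loaded figures.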

However, my EVGA GTX 680 has only a 700 MHz clock rate; I wonder why…

The nvidia-settings tool (and possibly other device queries) is not reporting the correct values with the current driver.

Apparently, the dynamic clocking in GK104 is making this confusing.

Edit: Oops. Didn’t see you got the same answer already in another thread. :)

From my experience, memory latency varies depending on the load on the memory bus AND the load pattern itself (coalesced or not, 32-bit, 64-bit, 128-bit, streaming, random, etc.).

Care to share any particular numbers?

They just have no meaning. If you compare a single memory access on a 9400M GT, which is pretty fast (memory latency is low, it could be under 100 CUDA-core cycles!), to a fully loaded Fermi card with every CUDA core trying to read memory at random simultaneously (and at least 6 warps per SM), you may see over 1500 cycles of latency.

Coalescing is ALWAYS better, the worst case naturally being simultaneous random reads & writes (atomic operations excepted).

RezaRob3, depending on the project and the optimization time allowed, when it’s possible I begin by comparing memory access patterns: I write mock code that just duplicates the access pattern the algorithm will use. As global memory (and local memory too) is the usual bottleneck in serious GPGPU development, it’s the first thing I consider when the code needs to be fast, even trading off compute cycles to reduce global memory accesses.
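To illustrate what I mean by mock code, here is a minimal sketch (an illustration only, not code from any of the projects mentioned): it performs no real work, only the planned loads and stores, so the pattern can be timed in isolation before the algorithm exists.

// Mock kernels that only replay a planned access pattern, for timing the
// memory system in isolation.
__global__ void mock_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                         // adjacent threads hit adjacent addresses
}

__global__ void mock_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride % n];    // scattered reads, poor coalescing
}

Timing each variant with cudaEventRecord() around the launch gives an upper bound on what the memory system will allow before any arithmetic is added, which is usually enough to decide whether the planned layout is worth it.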

The latest example is the GPGPU OpenCL chess engine I am writing in my spare time, where I generate each move twice to avoid storing it in memory (such a pity!). In this particular case it’s faster to re-generate a move than to wait for a memory read!

According to the programming guide, the minimum latency is 400 cycles. Of course this is only a ballpark estimate, but it seems to fit observations across a number of GPUs, at least in microbenchmarks. (Maybe not on Kepler.)

In light of that, a latency under 100 cycles on a GPU without caches is quite surprising. When you say CUDA-core cycles, do you mean the slower “core clock”/“graphics clock” that is ~500 MHz, or the faster “shader clock”/“processor clock” that is 1400 MHz or so?

Loaded and unloaded latency are surely different, but that only makes it more interesting. So what is the maximum latency? The programming guide cites 800 cycles; is that anywhere close to the truth?

Could you also say a few words about how you measure it? Do you use clock()/clock64()? Where do you place them? Do you check that the compiler doesn’t move these instructions around? Does this instrumentation affect the performance of the kernel? (Writing the timings to memory may consume substantial bandwidth.)
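For concreteness, this is the kind of clock64() placement I have in mind (a generic sketch, not the measurement code being asked about): the timed section is a chain of dependent loads, only one value is written per block so the instrumentation traffic stays negligible, and the SASS can be dumped (cuobjdump -sass) to confirm the compiler kept the clock reads around the loads.

// Sketch: per-block timing of dependent global loads with clock64().
// 'next' is assumed to hold a permutation covering all thread indices,
// 'elapsed' has one slot per block, 'sink' one slot per thread.
__global__ void time_loads(const unsigned int *next, int iters,
                           long long *elapsed, unsigned int *sink)
{
    unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        j = next[j];                           // each load depends on the previous one
    long long t1 = clock64();
    if (threadIdx.x == 0)
        elapsed[blockIdx.x] = t1 - t0;         // one 8-byte write per block
    sink[blockIdx.x * blockDim.x + threadIdx.x] = j;   // keep the loads live
}

Averaging elapsed[] over blocks after a warm-up run, and dividing by iters, gives an estimate of the per-load latency under whatever load the rest of the grid generates.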

Seibert, thanks a lot for the cudabmk link.

And what about the Kepler/GTX-680 shared memory throughput? It seems like 32 memory banks is quite limiting for 192 cores!
Do we yet know what’s in the “7 billion transistor” version?

The banks are twice as wide: 8 bytes per bank.

Yuk! How do you optimize for that? Wouldn’t you have to do an 8-byte read to take full advantage of it?

Also, how do you know that?

Thanks.

Yep, I guess you have…

It’s in the new Programming Guide, Section F.5.3: “Shared memory has 32 banks… Each bank has a bandwidth of 64 bits per clock cycle.”

I’m talking about the CUDA core clock (so around 1100 MHz on my old 9400M GT equipped laptop). The 9400M GT (MCP79 chipset) uses laptop DDR3, which seems to have lower latency than GDDR, hence the lower latency figure.

The maximum latency also varies with the GPU generation. For example, the worst latencies I saw were when mixing reads and writes on pre-Fermi GeForces, while on GCN the L2 cache seems very efficient at hiding write latency, and read latency seems lower.

I measure latency statistically after a warm-up, because what interests me is not the highest latency but the average latency (including the warp effect that keeps up to 32 threads waiting for the last thread when they all read from global memory simultaneously).

I’m hoping that means the bank can supply 2 separate 4-byte accesses per clock cycle. That would make sense in a practically 32-bit single-precision device.

EDIT: Nope. From section F.5.3 it appears that the bank can “resolve the conflict” only if the accesses within the warp fall in the same 64-bit segment.
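If the 64-bit banks work the way F.5.3 reads, the straightforward way to use the full bank width is simply to make 8-byte accesses (doubles, float2, and so on). A hedged sketch, assuming a 256-thread block and the runtime’s 8-byte bank-size setting:

// Sketch: 8-byte shared memory accesses on Kepler's 64-bit-wide banks.
// Consecutive threads touch consecutive doubles, i.e. consecutive banks,
// so the accesses below should be conflict-free in 8-byte bank mode.
__global__ void smem_double_stencil(const double *in, double *out, int n)
{
    __shared__ double buf[256];                // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    int left  = (threadIdx.x + 255) % 256;
    int right = (threadIdx.x + 1) % 256;
    if (i < n)
        out[i] = 0.5 * buf[threadIdx.x] + 0.25 * (buf[left] + buf[right]);
}

// Host side, before launching (runtime call to request the 8-byte bank size):
//   cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);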

I always like to see some numbers…

From what I can tell, the memory bandwidth has been kept about the same as for the Fermi generation.

Comparing a GTX 480 to a GTX 680, I get 147 GB/s vs. 149 GB/s for a device-to-device transfer.

As the memory clock rate has been increased substantially, the latency for memory accesses should be lower when there is little memory pressure.
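In case anyone wants to reproduce the bandwidth figure, something along these lines works (a sketch only, not the exact test I used):

// Sketch: device-to-device copy bandwidth with CUDA event timing.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u << 20;                        // 256 MiB
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);  // warm-up

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The copy reads and writes every byte, hence the factor of 2.
    printf("%.1f GB/s\n", 2.0 * bytes / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}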

For the latency, I can give you some numbers (measured with the same test as described in the Fermi L2 cache thread):

Latency in cycles:

                                 GTX 480     GTX 680
non-cached                           492         300
L2                                   258         162
L1                                    20          20
atomic non-cached                    822         357
atomic L2                            584         214
atomic non-cached  2x conflict       808         357
atomic L2          2x conflict       572         226
atomic non-cached  4x conflict       830         349
atomic L2          4x conflict       600         278
atomic non-cached  8x conflict      1336         386
atomic L2          8x conflict      1120         386
atomic non-cached 16x conflict      2384         610
atomic L2         16x conflict      2166         610
atomic non-cached 32x conflict      4016        1058
atomic L2         32x conflict      3800        1058

Although the memory latency has been reduced relative to the clock rate of the device, one should keep in mind that these numbers really don’t mean a lot for your algorithm’s performance.

Given that the number of cores went up while the memory bandwidth stayed about the same, I would expect the concept of “arithmetic cycles are for free” to apply even more strongly to the current Kepler cards. This might also be the reason why our previous algorithms don’t perform that much better on Kepler…

Anyway, I think the atomic operation performance is also interesting. They really seem to work a lot faster than on Fermi. :)
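For reference, by “Nx conflict” I mean a pattern along these lines (a simplified sketch, not the exact test code): N lanes of each warp hit the same word with atomicAdd, so the hardware has to serialize the updates.

// Simplified sketch: generate an N-way atomic conflict per warp.
// conflict = 1  -> every lane updates its own counter
// conflict = 32 -> all 32 lanes of a warp hammer the same word
__global__ void atomic_conflict(unsigned int *counters, int conflict, int iters)
{
    int lane = threadIdx.x & 31;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    // 'conflict' lanes of each warp share one counter
    unsigned int *target = &counters[warp * 32 + lane / conflict];
    for (int i = 0; i < iters; ++i)
        atomicAdd(target, 1u);
}

Timing this for conflict = 1, 2, 4, …, 32, with the counters either resident in L2 or evicted first, gives the kind of scaling shown in the table.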

pinta, thanks a lot for those numbers and the link (I still haven’t read the test source code).

I absolutely agree: lowering the bandwidth per core is NOT fun! However, the atomics improvements seem to be super duper. It seems that they might be doing smarter atomics-aware scheduling of warps, which would be great.

Thanks for the numbers pinta!

Keep in mind though that a lot of the latency improvement in cycles just comes from the slower clock: latency in nanoseconds shows less of an improvement (though at least it’s still an improvement).

EDIT: move bracket to the right place

Guess I should have included the latency in ns right away:

Latency in cycles/ns:

                                    GTX 480          GTX 680
non-cached                        492 /  351       300 /  298
L2                                258 /  184       162 /  161
L1                                 20 /   14        20 /   20
atomic non-cached                 822 /  587       357 /  355
atomic L2                         584 /  417       214 /  213
atomic non-cached  2x conflict    808 /  577       357 /  355
atomic L2          2x conflict    572 /  408       226 /  225
atomic non-cached  4x conflict    830 /  593       349 /  347
atomic L2          4x conflict    600 /  428       278 /  276
atomic non-cached  8x conflict   1336 /  954       386 /  384
atomic L2          8x conflict   1120 /  800       386 /  384
atomic non-cached 16x conflict   2384 / 1702       610 /  606
atomic L2         16x conflict   2166 / 1547       610 /  606
atomic non-cached 32x conflict   4016 / 2869      1058 / 1052
atomic L2         32x conflict   3800 / 2714      1058 / 1052

Oh yes, very good point tera. However, assuming this is part of significantly scaling up the transistor count at reasonable power consumption, I’m personally happy with clock cycles as a latency measure. Absolute timings don’t matter quite so much for data processing.

That said, things like memory throughput per core matter a lot, as many others in these forums have been discussing!

Pinta, great numbers on atomics. Thanks again. :) That does seem to point to better (atomics aware) warp scheduling.