Instruction Latency

Does anyone know the latency for bitwise instructions or logical comparisons?

The programming guide says:

“4 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, type conversion instruction”

:)

That refers to throughput for a warp, not latency.

As a related guideline, the programming guide does state that 6 warps (or 192 threads) per multiprocessor are needed to hide register dependencies.

Paulius

Well, that implies over twenty cycles of latency. Does that merely mean that the pipeline has twenty-plus stages? Or does it mean that two sequential machine instructions in a single thread must have twenty cycles between them to avoid stalls? Many CPUs have deep pipelines but utilize a variety of hardware techniques to eliminate data hazards.

Yes, there are over 20 cycles of latency. But unlike a serial processor that needs to use lots of hardware techniques to hide this latency, the GPU can just run enough threads to hide this latency. (We trade all that hardware complexity for more computation units.)

So, as Paulius says, as long as you have at least 6 warps of threads resident, you will hide the ALU pipeline latency. Note that these 6 warps need not be from the same thread block. So even if you have 64-thread blocks, as long as each uses at most 25% of the register file (up to 42 registers per thread, since the multiprocessor’s 8192 registers are split among the 192 resident threads) and at most 25% of shared memory (4096 bytes per thread block), you will be able to hide this latency.

In other words: 25% occupancy to hide ALU pipeline latency.

Mark


This is a very interesting thread. I’ve found it through the AnandTech article that tries to draw some conclusions from your discussion. However, I am not fully convinced that things have been named explicitly here, so I’ll ask for a more specific answer to chris22’s last post.

Are we indeed dealing with a pipeline some 20 stages deep? Or are there other factors that force a latency of over 20 cycles between consecutive instructions in a single thread?

If it’s the former, does it mean that we have in fact four small, 5- or 6-stage deep pipelines concatenated into one long SP pipeline?

20 cycles is the latency of the operand gather, execution and writeback stages only (between two consecutive instructions that have a read-after-write dependency). The fetch, decode and schedule stages take around 8 more cycles, IIRC.

Where the 20-cycle figure comes from:

  • operand gather:

    Input operands from registers are read sequentially from the same bank(s) of the register file.

    Takes from 2 cycles (1 32-bit input operand) to 12 cycles (3 64-bit input operands).

    (actually 1 to 6 cycles as the RF clock is half the shader clock)

  • execution:

    The 32 operations of the warp are dispatched to the available units.

    Takes from 4 cycles (MAD, 8 units) to 32 cycles (double-precision, 1 unit) just to start the pipeline.

    Then the execution pipe latency is around 10 cycles.

  • writeback:

    Same as gather, takes 2 to 4 cycles.

Summing it up, you get ~18 cycles for a MOV, ~20 cycles for an ADD and ~22 cycles for a MAD.

GT200 added two more cycles to most (all?) instructions compared to G9x.

Of course, everything above is mostly speculation from microbenchmark results, patent reading and educated guesses…

Wow, that’s an impressive analysis.

Thanks a lot for taking the time to explain! :)

Hi, Sylvain,

I execute a chain of dependent operations, a += b*c, to measure the pipeline latency.

My GPU is a 9600 GT, and I launch only one thread.

With a, b, c all residing in registers, the latency is 24 cycles;

with a, c in registers and b in constant memory, the latency is 22 cycles;

with a, c in registers and b in shared memory, the latency is 26 cycles.

Does the difference in operand sources explain these results?

Can anyone give me a detailed explanation or other related measurement data?

Do you have any comments on my test?
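
Roughly, the test looks like the sketch below (simplified; the kernel name, chain length and seed values are just placeholders, and my real harness differs in details). It times a read-after-write dependent chain with clock() from a single thread, so nothing else is resident to hide the latency:

    // Sketch: time a dependent chain of a += b*c with clock().
    // Launched with <<<1, 1>>> so only one warp is resident and no other
    // work can hide the latency. Check the generated code to confirm the
    // compiler emits one MAD per iteration rather than hoisting b*c.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITER 256                     // length of the dependent chain

    __global__ void mad_latency(float b, float c, float *out, clock_t *cycles)
    {
        float a = b + c;                 // live value feeding the chain
        clock_t start = clock();
        #pragma unroll
        for (int i = 0; i < ITER; ++i)
            a += b * c;                  // each MAD depends on the previous result
        clock_t stop = clock();
        *out = a;                        // keep the chain from being optimized away
        *cycles = stop - start;
    }

    int main()
    {
        float *d_out;
        clock_t *d_cycles;
        cudaMalloc(&d_out, sizeof(float));
        cudaMalloc(&d_cycles, sizeof(clock_t));

        mad_latency<<<1, 1>>>(1.0f, 1.0f, d_out, d_cycles);
        cudaDeviceSynchronize();

        clock_t cycles;
        cudaMemcpy(&cycles, d_cycles, sizeof(clock_t), cudaMemcpyDeviceToHost);
        printf("%.3f cycles per dependent MAD\n", (double)cycles / ITER);
        return 0;
    }

Swapping the expression in the loop body is a one-line change, so the same kernel covers a += b, a *= b and similar variants.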

Your results are consistent with those obtained by vvolkov in his paper:

http://bebop.cs.berkeley.edu/pubs/volkov20…enchmarking.pdf

I consistently get 2 cycles less in my own experiments, but this is probably due to a difference in measurement setup (I assemble my own code with cudasm for instruction benchmarks).

Registers are read sequentially for one instruction; this latency can be overlapped with reads from constant memory, while shared memory has a higher latency.

For the gory details about registers, refer to this patent.

Thank you very much.

BTW, below is my test data.

    Operation   Latency (cycles)   a     b                  c
    a += b*c    22.01              reg   constant memory    reg
    a += b*c    24.003             reg   reg                reg
    a += b*c    26.006             reg   shared memory      reg
    a += b      20.01              reg   constant memory    -
    a += b      20.006             reg   reg                -
    a += b      24.007             reg   shared memory      -
    a *= b      20.01              reg   constant memory    -
    a *= b      20.009             reg   reg                -
    a *= b      24.007             reg   shared memory      -

Do you have any benchmark or data that can validate that registers are read sequentially for one instruction?

If that is true, do my data mean that the latency of getting an operand from the constant cache to the ALU is less than 4 cycles, and getting it from shared memory is about 8 cycles?

Well, you already answered your own question. ;)

Also try:

a=b*a+a

a=a*a+a

a=a+a

a=a*a

…

For the constant cache it would make sense, since the 32-wide load is split into two halves, and the constant cache runs at half the shader clock (according to the documentation).

Shared memory requires arbitration logic and a full crossbar between the execution units and the SRAM banks, so it would not be surprising if it required one extra (slow) clock cycle.

thank you, Sylvain.

Does anyone know the throughput and latency of integer divide and modulus operations?

Or have source code for these types of tests?

The problem with these is that they are not single instructions but are implemented in software. As the code is not branch-free, the performance will depend on what you put into them. Joy.
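
If you still want numbers for your own operand patterns, the clock()-based dependent chain used earlier in this thread works for integer division as well. A rough sketch (the kernel name, chain length and operand values are arbitrary; the divisor is deliberately a kernel argument, because division by a compile-time constant gets strength-reduced to multiplies and shifts):

    // Sketch: time a dependent chain of integer divisions with clock().
    // The emitted code is a multi-instruction, branchy software sequence,
    // so the result depends on the operand values you feed in.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITER 64                      // length of the dependent chain

    __global__ void udiv_latency(unsigned a0, unsigned b, unsigned *out, clock_t *cycles)
    {
        unsigned a = a0;
        clock_t start = clock();
        #pragma unroll
        for (int i = 0; i < ITER; ++i)
            a = a / b + 7;               // each division depends on the previous result
        clock_t stop = clock();
        *out = a;                        // keep the chain live
        *cycles = stop - start;
    }

    int main()
    {
        unsigned *d_out;
        clock_t *d_cycles;
        cudaMalloc(&d_out, sizeof(unsigned));
        cudaMalloc(&d_cycles, sizeof(clock_t));

        udiv_latency<<<1, 1>>>(123456789u, 3u, d_out, d_cycles);
        cudaDeviceSynchronize();

        clock_t cycles;
        cudaMemcpy(&cycles, d_cycles, sizeof(clock_t), cudaMemcpyDeviceToHost);
        printf("~%.1f cycles per dependent unsigned division\n", (double)cycles / ITER);
        return 0;
    }

Replacing / with % gives the modulus figure. Since the sequence is branchy, expect the number to move with the operand values, and to get considerably worse for 64-bit operands.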

By the way, it is not clear to me how the operation a += b*c; takes only ~24 clock cycles. b*c would require 20+ clock cycles to be finished. Then it needs another 20+ clock cycles for the addition, which has to wait till the multiplication is finished.

Thanks

I don’t know the low-level stuff as well as Sylvain Collange or Gregory Diamos, but if I had to guess, it would be that a += b*c is actually using an FMA (fused multiply-add) instruction (a = a + b * c), which probably doesn’t have a whole lot more latency than a single multiply or add instruction.
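
One way to check that guess is to dump the PTX for a one-line kernel and look for a multiply-add; a quick sketch (file names are arbitrary):

    // madcheck.cu -- does a += b*c contract into a single multiply-add?
    __global__ void madcheck(float *a, float b, float c)
    {
        *a += b * c;   // expected to become one mad.f32 / fma.rn.f32
                       // (nvcc contracts mul+add by default; -fmad=false disables it)
    }

    // Compile to PTX and inspect:
    //   nvcc -ptx madcheck.cu -o madcheck.ptx
    //   grep -E "mad|fma" madcheck.ptx

On the G80/GT200 generation the single-precision MAD truncates the intermediate product rather than fusing it, but either way the multiply and the add go through the pipeline as a single instruction, which is why the dependent chain costs ~24 cycles rather than 2 × 20.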

Thanks profquail, this makes sense.