Instruction Latency
Does anyone know the latency for bitwise instructions or logical comparisons?

#1
Posted 05/02/2008 09:16 PM   
The programming guide knows:

"4 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, type conversion instruction"

:)

#2
Posted 05/12/2008 12:23 AM   
That refers to throughput for a warp, not latency.

#3
Posted 05/12/2008 04:04 PM   
As a related guideline, the programming guide does state that 6 warps (or 192 threads) per multiprocessor are needed to hide register dependencies.

Paulius

#4
Posted 05/12/2008 06:50 PM   
Well, that implies over twenty cycles of latency. Does that merely mean that the pipeline has twenty-plus stages? Or does it mean that two sequential machine instructions in a single thread must have twenty cycles between them to avoid stalls? Many CPUs have deep pipelines but utilize a variety of hardware techniques to eliminate data hazards.

#5
Posted 05/18/2008 02:43 AM   
Yes, there are over 20 cycles of latency. But unlike a serial processor that needs to use lots of hardware techniques to hide this latency, the GPU can just run enough threads to hide this latency. (We trade all that hardware complexity for more computation units.)

So, as Paulius says, as long as you have at least 6 warps of threads resident, you will hide the ALU pipeline latency. Note that these 6 warps need not be from the same thread block. So even if you have 64-thread blocks, as long as each uses at most 25% of the register file (up to 42 registers per thread) and at most 25% of shared memory (4096 bytes per thread block), you will be able to hide this latency.

In other words: 25% occupancy to hide ALU pipeline latency.
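As a back-of-the-envelope check of the 6-warp figure (a sketch; the ~24-cycle dependency latency and the 4-cycles-per-warp issue rate are the numbers discussed in this thread, not official specifications):

```python
import math

# Assumed figures from this thread (G8x/G9x-era hardware):
ALU_LATENCY = 24   # approx. register read-after-write latency, in shader cycles
ISSUE_CYCLES = 4   # cycles for an SM to issue one instruction for a full warp
WARP_SIZE = 32

# While one warp waits out the pipeline latency, other resident warps
# can issue instructions; enough warps keep the ALUs fully busy.
warps_needed = math.ceil(ALU_LATENCY / ISSUE_CYCLES)
threads_needed = warps_needed * WARP_SIZE

print(warps_needed, threads_needed)  # 6 warps, 192 threads
```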

Mark

#6
Posted 05/22/2008 10:26 AM   
This is a very interesting thread. I've found it through the AnandTech [url="http://anandtech.com/video/showdoc.aspx?i=3336"]article[/url] that tries to draw some conclusions from your discussions. However, I am not fully convinced that things have been named explicitly here, so I'll ask for a more specific answer to chris22's last post.

Are we indeed dealing with a pipeline 20-something stages deep? Or are there other factors that force a latency of over 20 cycles between consecutive instructions in a single thread?

If it's the former, does it mean that we have in fact four small, 5- or 6-stage deep pipelines concatenated into one long SP pipeline?

#7
Posted 03/17/2009 11:40 AM   
[quote name='Axure' post='519308' date='Mar 17 2009, 12:40 PM']Are we indeed dealing with a pipeline 20-sth deep? Or are there other factors that force a latency of over 20 cycles between consecutive instructions in a single thread?[/quote]

20 cycles is the latency of the operand gather, execution and writeback stages only (between two consecutive instructions that have a read-after-write dependency). The fetch, decode and schedule stages take around 8 more cycles, IIRC.

Where the 20-cycle figure comes from:
- operand gather:
Input operands from registers are read sequentially from the same bank(s) of the register file.
Takes from 2 cycles (1 32-bit input operand) to 12 cycles (3 64-bit input operands).
(actually 1 to 6 cycles as the RF clock is half the shader clock)
- execution:
The 32 operations of the warp are dispatched to the available units.
Takes from 4 cycles (MAD, 8 units) to 32 cycles (double-precision, 1 unit) just to start the pipeline.
Then the execution pipe latency is around 10 cycles.
- readback:
Same as gather, takes 2 to 4 cycles.

Summing it up, you get ~18 cycles for a MOV, ~20 cycles for an ADD and ~22 cycles for a MAD.
GT200 added two more cycles to most (all?) instructions compared to G9x.

Of course, everything above is mostly speculation from microbenchmark results, patent reading and educated guesses...
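Restating the breakdown above as arithmetic (my own sketch of the model; the per-stage figures are the speculative estimates from this post, not official numbers):

```python
# Rough per-instruction latency model for G9x-class hardware, using the
# estimated per-stage figures above (all values are speculative).
GATHER_PER_OPERAND = 2   # cycles to read one 32-bit register operand
DISPATCH = 4             # cycles to start a warp on the 8 MAD units
EXEC_PIPE = 10           # approximate execution pipeline latency
WRITEBACK = 2            # cycles to write back one 32-bit result

def est_latency(n_register_operands):
    """Estimated read-after-write latency for a simple ALU instruction."""
    return (n_register_operands * GATHER_PER_OPERAND
            + DISPATCH + EXEC_PIPE + WRITEBACK)

print(est_latency(1))  # MOV (1 operand): ~18 cycles
print(est_latency(2))  # ADD (2 operands): ~20 cycles
print(est_latency(3))  # MAD (3 operands): ~22 cycles
```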
#8
Posted 03/21/2009 10:44 AM   
Wow, that's an impressive analysis.

Thanks a lot for taking the time to explain! :)

#9
Posted 03/21/2009 11:39 AM   
[quote name='Sylvain Collange' post='521004' date='Mar 21 2009, 10:44 AM']20 cycles is the latency of the operand gather, execution and writeback stages only (between two consecutive instructions that have a read-after-write dependency). The fetch, decode and schedule stages take around 8 more cycles, IIRC.

Where the 20-cycle figure comes from :
- operand gather:
Input operands from registers are read sequentially from the same bank(s) of the register file.
Takes from 2 cycles (1 32-bit input operand) to 12 cycles (3 64-bit input operands).
(actually 1 to 6 cycles as the RF clock is half the shader clock)
- execution:
The 32 operations of the warp are dispatched to the available units.
Takes from 4 cycles (MAD, 8 units) to 32 cycles (double-precision, 1 unit) just to start the pipeline.
Then the execution pipe latency is around 10 cycles.
- readback:
Same as gather, takes 2 to 4 cycles.

Summing it up, you get ~18 cycles for a MOV, ~20 cycles for an ADD and ~22 cycles for a MAD.
GT200 added two more cycles to most (all?) instructions compared to G9x.

Of course, everything above is mostly speculation from microbenchmark results, patent reading and educated guesses...[/quote]

Hi, Sylvain,

I execute the dependent operation a += b*c to measure pipeline latency.
My GPU is a 9600GT, and I launch only one thread.
With a, b, c all residing in registers, the latency is 24 cycles;
with a, c in registers and b in constant memory, the latency is 22 cycles;
with a, c in registers and b in shared memory, the latency is 26 cycles.

Does the difference in operand location explain the results above?
Can anyone give me a detailed explanation or other related measurement data?

Do you have any comments on my test?

#10
Posted 03/27/2009 05:39 AM   
[quote name='xjtusnail' post='523291' date='Mar 27 2009, 06:39 AM']Hi, Sylvain,

I execute the dependent operation a += b*c to measure pipeline latency.
My GPU is a 9600GT, and I launch only one thread.
With a, b, c all residing in registers, the latency is 24 cycles;
with a, c in registers and b in constant memory, the latency is 22 cycles;
with a, c in registers and b in shared memory, the latency is 26 cycles.

Does the difference in operand location explain the results above?
Can anyone give me a detailed explanation or other related measurement data?

Do you have any comments on my test?[/quote]

Your results are consistent with those obtained by vvolkov in his paper:
[url="http://bebop.cs.berkeley.edu/pubs/volkov2008-benchmarking.pdf"]http://bebop.cs.berkeley.edu/pubs/volkov20...enchmarking.pdf[/url]

I consistently get 2 cycles less in my own experiments, but this is probably due to a difference in measurement setup (I assemble my own assembly code with cudasm for benchmarks on instructions).

Registers are read sequentially for one instruction; this latency can be overlapped with reads from constant memory, while shared memory latency is higher.
For the gory details about registers, refer to [url="http://www.patents.com/Simulating-multiported-memories-using-lower-port-count-memories/US7339592/en-US/"]this patent[/url].
#11
Posted 03/27/2009 08:45 PM   
[quote name='Sylvain Collange' post='523523' date='Mar 27 2009, 09:45 PM']Your results are consistent with those obtained by vvolkov in his paper:
[url="http://bebop.cs.berkeley.edu/pubs/volkov2008-benchmarking.pdf"]http://bebop.cs.berkeley.edu/pubs/volkov20...enchmarking.pdf[/url]

I consistently get 2 cycles less in my own experiments, but this is probably due to a difference in measurement setup (I assemble my own assembly code with cudasm for benchmarks on instructions).

Registers are read sequentially for one instruction, this latency can be overlapped with reads from constant memory and shared memory latency is higher.
For the gory details about registers, refer to [url="http://www.patents.com/Simulating-multiported-memories-using-lower-port-count-memories/US7339592/en-US/"]this patent[/url].[/quote]

Thank you very much.
BTW, below is my test data:

Operation  Latency  a    b                c
a+=b*c     22.01    reg  constant memory  reg
a+=b*c     24.003   reg  reg              reg
a+=b*c     26.006   reg  shared memory    reg

a+=b       20.01    reg  constant memory
a+=b       20.006   reg  reg
a+=b       24.007   reg  shared memory

a*=b       20.01    reg  constant memory
a*=b       20.009   reg  reg
a*=b       24.007   reg  shared memory

Do you have a benchmark or data that can validate that registers are read sequentially for one instruction?
If it is true, then according to my data, does it mean the latency of getting an operand from the constant cache to the ALU is less than 4 cycles, and from shared memory about 8 cycles?
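Reading the deltas out of the MAD rows of the table above (a small sanity check; nothing here beyond the posted numbers):

```python
# Measured a += b*c latencies from the table above, keyed by where
# operand b lives (a and c stay in registers in every case).
mad_latency = {"constant": 22.01, "register": 24.003, "shared": 26.006}

baseline = mad_latency["register"]
deltas = {src: round(lat - baseline) for src, lat in mad_latency.items()}

# Constant memory saves ~2 cycles versus a third sequential register
# read; shared memory costs ~2 extra cycles over registers.
print(deltas)  # {'constant': -2, 'register': 0, 'shared': 2}
```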

#12
Posted 03/28/2009 06:43 AM   
[quote name='xjtusnail' post='523625' date='Mar 28 2009, 07:43 AM']Do you have a benchmark or data that can validate that registers are read sequentially for one instruction?[/quote]

Well, you already answered your own question. ;)
Also try:
a=b*a+a
a=a*a+a
a=a+a
a=a*a
...

[quote]If it is true, then according to my data, does it mean the latency of getting an operand from the constant cache to the ALU is less than 4 cycles, and from shared memory about 8 cycles?[/quote]

For the constant cache it would make sense, since the 32-wide load is split into two halves, and the constant cache runs at half the shader clock (according to the documentation).
Shared memory requires arbitration logic and a full crossbar between the execution units and the SRAM banks, so it would not be surprising if it required one extra (slow) clock cycle.
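The constant-cache timing argument above can be written out (a sketch of the arithmetic only; the two-half split and the half-rate cache clock are the assumptions stated in this post):

```python
# Assumptions from the post above: a warp's 32-wide constant load is
# split into two 16-wide halves, and the constant cache runs at half
# the shader clock (so each cache cycle is 2 shader cycles).
WARP_SIZE = 32
HALF_WARP = 16

cache_cycles = WARP_SIZE // HALF_WARP    # 2 slow (cache-clock) cycles
shader_cycles = cache_cycles * 2         # each slow cycle = 2 shader cycles

print(shader_cycles)  # 4 shader cycles to feed the whole warp
```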
#13
Posted 03/29/2009 11:03 AM   
[quote name='Sylvain Collange' post='523956' date='Mar 29 2009, 12:03 PM']Well, you already answered your own question. ;)
Also try:
a=b*a+a
a=a*a+a
a=a+a
a=a*a
...



For the constant cache it would make sense, since the 32-wide load is split into two halves, and the constant cache runs at half the shader clock (according to the documentation).
Shared memory requires arbitration logic and a full crossbar between the execution units and the SRAM banks, so it would not be surprising if it required one extra (slow) clock cycle.[/quote]

thank you, Sylvain.

#14
Posted 03/29/2009 02:10 PM   
Does anyone know the throughput and latency of integer divide and modulus operations?

Or have source code for these types of tests?

#15
Posted 03/29/2009 05:07 PM   