32-bit integer arithmetic performance

Would someone from NVIDIA shed some light on 16-bit and 32-bit integer arithmetic performance? There is plenty of discussion of floating-point performance in the documentation, but only a sentence or two on 16-bit and 32-bit integer performance. I would like to know more about integer arithmetic latencies, as most of my code involves 16-bit and 32-bit operations, both signed and unsigned.

Thanks.

I am writing a number-crunching application that uses integer arithmetic.

As far as I know, only Section 6.1.1.1 of the Programming Guide touches on this, saying that integer operations and 24-bit multiplication are about as fast as most floating-point operations.
I would also like to know how integer operations perform.

Wai, have you checked the *.ptx intermediate file?
It might be interesting.
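For anyone unfamiliar with that workflow: nvcc -ptx kernel.cu emits the intermediate kernel.ptx, where you can see which multiply the compiler chose. A minimal sketch (the file and kernel names here are just placeholders); in the generated PTX, a plain 32-bit multiply should appear as mul.lo.s32, while the intrinsic should appear as mul24.lo.s32:

// kernel.cu -- compile with: nvcc -ptx kernel.cu
__global__ void mulTest(int *out, const int *a, const int *b)
{
    int i = threadIdx.x;
    out[i] = __mul24(a[i], b[i]);   // look for mul24.lo.s32 in the PTX
}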

My application uses a lot of lookup tables, which might be its bottleneck.
I will post questions if I give up on tuning it myself.
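If the tables are small and read-only, one option worth trying is constant memory, which is cached and broadcasts efficiently when all threads of a warp read the same entry. A minimal sketch, assuming a hypothetical 256-entry table (all names here are made up for illustration):

// Hypothetical 256-entry read-only lookup table in constant memory.
// Constant-memory reads are fast when all threads of a warp hit the
// same entry; divergent indices serialize into multiple reads.
__constant__ int lut[256];

__global__ void applyLut(int *out, const unsigned char *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[in[i]];
}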

I would also be interested to hear whether anyone has noticed a performance improvement when compiling the .cubin with -fastimul (24-bit integer multiplies).

Peter

A:

int a = foo();
int b = bar();
int ab = a * b;         // full 32-bit multiply

B:

int a = foo();
int b = bar();
int ab = __mul24(a, b); // 24-bit multiply intrinsic

Code B (or code A compiled with --fastimul) should definitely compile to fewer instructions than code A (with default compiler options).
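One caveat worth keeping in mind, not stated in the posts above: __mul24 multiplies only the low 24 bits of its operands, so it matches a * b only while both values fit in a signed 24-bit range. A small illustration:

// __mul24 returns the low 32 bits of the product of the operands'
// low 24 bits, so it agrees with a plain multiply only while both
// operands fit in 24 signed bits.
int ok  = __mul24(1000, 2000);   // 2000000, same as 1000 * 2000
int bad = __mul24(1 << 23, 2);   // unreliable: (1 << 23) does not fit in 24 signed bits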

On G80:

24-bit integer multiplies are full speed. 32-bit integer multiplies require a multi-instruction sequence.

32-bit float mul, add, and mad, and 32-bit integer add, shifts, and logic operations are full speed.

full-speed = 2 cycles per 32-thread warp.

Mark

Can you tell us what the penalty is in machine cycles?
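While waiting for an official figure, one way to estimate it yourself is with the device-side clock() cycle counter. A rough sketch (the kernel name and iteration count are arbitrary, and the estimate includes the add and loop overhead along with the multiply):

// Times a dependent chain of 32-bit integer multiply-adds and reports
// the elapsed multiprocessor cycles; cycles[0] / ITERS approximates
// the latency of one iteration. Launch with, e.g., <<<1, 1>>>.
#define ITERS 1024

__global__ void timeMul(int *out, clock_t *cycles, int seed)
{
    int v = seed;
    clock_t start = clock();
    for (int i = 0; i < ITERS; ++i)
        v = v * seed + 1;       // runtime operand so the multiply is not strength-reduced
    clock_t stop = clock();
    out[0] = v;                 // keep v live so the loop is not optimized away
    cycles[0] = stop - start;
}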