Would people from nVidia shed some light on 16-bit and 32-bit integer arithmetic performance? There is a lot of talk about floating-point performance in the documentation, but only a sentence or two on 16-bit and 32-bit integer arithmetic. I would like to know more about integer arithmetic latencies, since most of my code involves 16-bit and 32-bit operations, both signed and unsigned.
Thanks.
I am writing a number-crunching application using integers.
As far as I know, only Section 6.1.1.1 of the Programming Guide touches on this, saying that integer operations and 24-bit multiplication are about as fast as most floating-point operations.
I, too, would like to know how integer operations perform.
Wai, have you checked the *.ptx intermediate file? I think it would be interesting.
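For example (a minimal sketch; the file name test_mul.cu and the kernel are made up), compiling with "nvcc -ptx test_mul.cu" lets you compare the multiply instructions the compiler emits:

__global__ void mul_test(int *out, int a, int b)
{
    out[0] = a * b;           // appears in the PTX as mul.lo.s32, which the
                              // hardware reportedly expands into several instructions
    out[1] = __mul24(a, b);   // appears as a single mul24.lo.s32
}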
My application uses lots of lookup tables, and they might be its bottleneck.
I will post questions if I give up on tuning.
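In case it is useful, here is roughly the kind of thing I am experimenting with (a sketch only; the names c_lut and lookup_kernel are made up): staging a 256-entry table in shared memory so each lookup hits on-chip memory instead of uncached global memory.

__constant__ unsigned char c_lut[256];   // table uploaded from the host

__global__ void lookup_kernel(const unsigned char *in, unsigned char *out, int n)
{
    __shared__ unsigned char s_lut[256];

    // Each block cooperatively copies the table into shared memory once.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_lut[i] = c_lut[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = s_lut[in[idx]];
}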
I would also be interested to hear whether anyone has noticed a performance improvement when compiling the .cubin with -fastimul (24-bit integer multiplies).
Peter
A:
int a = foo();
int b = bar();
int ab = a * b;
B:
int a = foo();
int b = bar();
int ab = __mul24(a,b);
Code B (or code A compiled with --fastimul) should definitely compile to fewer instructions than code A (with default compiler options).
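One caveat worth adding here (my note, not stated elsewhere in the thread; see the Programming Guide for the exact semantics): __mul24 multiplies only the least-significant 24 bits of each operand, so it only matches a full 32-bit multiply when both values fit in 24 bits. For example:

__global__ void mul24_caveat(int *out)
{
    int a = (1 << 24) + 1;     // 16777217: does not fit in 24 bits
    int b = 3;
    out[0] = a * b;            // 50331651
    out[1] = __mul24(a, b);    // 3: only the low 24 bits of 'a' take part
}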
On G80:
24-bit integer multiplies are full speed. 32-bit integer multiplies require a multi-instruction sequence.
32-bit float mul, add, and mad, and 32-bit integer add, shifts, and logic operations are full speed.
Full speed = 2 cycles per 32-thread warp.
Mark
Can you tell us what the penalty is in machine cycles?
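In the meantime, here is a rough way to estimate it (a sketch, untested; loop overhead is included, so treat the numbers as relative only, and launch with a single warp): time a dependent chain of 32-bit multiplies against a chain of __mul24 using the on-chip clock() counter.

__global__ void time_muls(int *data, unsigned int *cycles)
{
    int x = data[threadIdx.x];

    unsigned int start = clock();
    for (int i = 0; i < 64; ++i)
        x = x * x + i;            // dependent 32-bit multiply chain
    cycles[0] = clock() - start;

    start = clock();
    for (int i = 0; i < 64; ++i)
        x = __mul24(x, x) + i;    // dependent 24-bit multiply chain
    cycles[1] = clock() - start;

    data[threadIdx.x] = x;        // keep the compiler from removing the loops
}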