GTX 680 / 690: how many cycles does a 32 x 32 == 64 bits integer multiply take?
Hi,

First, congratulations to Nvidia on releasing this total kick-butt GPU, and also on the fast release of the 690.
I guess the 256-bit-wide connection from GPU to memory made it possible to release a dual-GPU card so quickly.

That card will annihilate the competition. Of course we are looking forward with great excitement to the Tesla version of it.

Someone (Jeff Gilchrist) asked me how fast the 680 / 690 would be for current codes.
The expensive operation in this CUDA code is multiplication of unsigned 32-bit integers,
though signed 32-bit integers would also work for this factorisation code.
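
A minimal host-side C++ sketch of the operation in question, assuming nothing about the actual factorisation code: the full 64-bit product of two unsigned 32-bit integers, and its low and high halves as separate 32-bit results. The function names are made up for illustration. In CUDA device code the high half is available through the `__umulhi` intrinsic, and the low half is the ordinary `*` on 32-bit operands.

```cpp
#include <cstdint>

// Full 64-bit product of two unsigned 32-bit integers.
// Widening one operand first requests a 32 x 32 -> 64 multiply.
inline uint64_t mul_32x32_to_64(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * b;
}

// Low 32 bits of the product: plain unsigned multiply, wraps modulo 2^32.
inline uint32_t mul_lo(uint32_t a, uint32_t b) {
    return a * b;
}

// High 32 bits of the product: what a "multiply high" instruction returns.
inline uint32_t mul_hi(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>(mul_32x32_to_64(a, b) >> 32);
}
```

Hardware that only returns 32 bits per multiply instruction needs both `mul_lo` and `mul_hi` to obtain the whole product, which is where the "2 instructions" count for a full result comes from.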

AMD is a turtle here: it needs 4 PEs (a whole compute core) to do this, so a 2048-core GPU is effectively reduced to 512 cores.
Of course it is 2 instructions there, so 2 cycles.

Fermi is fast here.

How fast is the 680/690 here? Is 32-bit integer multiplication (be it signed or unsigned) a fast instruction on it,
and if so, how many cores does it need to carry it out?

Disclaimer: of course, when I ask for 'cycles' I mean the maximum throughput one could manage to achieve.

For gamers, it seems to me that the 680 and 690 are very BIG releases, really overthrowing AMD and taking the crown back for Nvidia in a very convincing manner.

The idea of throwing more cores into battle is the right choice. Now I hope the instructions I need didn't become slower...

For prime numbers, more bits is better, and all of this is lossless calculation. With such huge prime numbers, using floating point becomes ever more complicated. For factorisation, integers are totally superior, just as they are in computer chess.

Thanks,
Vincent
diep@xs4all.nl
diepchess

#1
Posted 05/01/2012 07:49 PM   
It seems that integer multiply-add (IMAD) on GTX 680 runs 6x slower than in single precision floating point (FFMA). Apparently, only 32 cores out of 192 on each SM can do it.
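
The 6x figure is just the lane ratio 192 / 32. A small sketch of the resulting peak rates, assuming 8 SMs (1536 cores divided by 192 per SM) and a round 1.0 GHz clock; both numbers are illustrative assumptions, not measurements, and the function name is hypothetical.

```cpp
// Peak rate in G instructions per second for a unit with `lanes_per_sm`
// lanes on each of `sm_count` SMs, issuing once per cycle at `clock_ghz`.
constexpr double peak_ginstr_per_s(int sm_count, int lanes_per_sm, double clock_ghz) {
    return sm_count * lanes_per_sm * clock_ghz;
}

// Assumed configuration: 8 SMs at 1.0 GHz.
constexpr double kFfmaPeak = peak_ginstr_per_s(8, 192, 1.0);  // all 192 lanes
constexpr double kImadPeak = peak_ginstr_per_s(8, 32, 1.0);   // only 32 lanes
```

Under those assumptions the integer multiply-add peak comes out at one sixth of the FFMA peak, whatever the actual clock is.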

#2
Posted 05/01/2012 08:00 PM   
[quote name='vvolkov' date='01 May 2012 - 10:00 PM' timestamp='1335902443' post='1403142']
It seems that integer multiply-add (IMAD) on GTX 680 runs 6x slower than in single precision floating point (FFMA). Apparently, only 32 cores out of 192 on each SM can do it.
[/quote]

Thanks for the answer!

When this chip gets produced more carefully, probably clocked a tad lower, and unlocked for Tesla purposes, what ratio and speeds can we expect Nvidia to squeeze out of it for the GPGPU professionals?

The Fermi generation was clocked high in the gamer-card version, so they really had to clock it down a lot. The C2075 here reports 1.15 GHz, but this new generation already runs at 1 GHz, so maybe that gives them a better clock for the Tesla version (compared to the gamer-card clocks)?

By the way, what instruction throughput can you push through these 680/690 generation cards? Please ignore multiply-add, which is maybe the most useless instruction ever for the fastest possible FFTs. Multiplication does matter, but it is not the right instruction for measuring throughput, so in later measurements it should get mixed in at a ratio the hardware can handle on paper. What throughput should result from this? Can it really push through the advertised 1536 * 1G instructions per second (for the instructions that are supposed to have a throughput cost of 1 cycle per instruction)?

Unlocking that potential should give around 0.750+ Tflop in double precision without multiply-add, and counting multiply-add in the theoretical figure, could that give the Tesla incarnation 1.5 Tflop per GPU?

Also, would dual-GPU Teslas be in the planning department, since the 256 bits make it easy to put 2 GPUs on a single card?

So might we expect a 3 Tflop double-precision card in some time (say a year or so from now)?

Is that potential there in this new genius chip?

#3
Posted 05/02/2012 01:49 AM   
I could get 3160 Gflop/s in FFMA. This is 5.495 SIMD instructions per SM per cycle, or 175.9 "scalar instructions" per SM per cycle. You'd expect 192, but I get 16 less.

The throughput of FADD and FMUL, in instructions per cycle, is the same.

Tell me what instructions you care about, I'll check them out.

#4
Posted 05/02/2012 02:36 AM   
[quote name='vvolkov' date='02 May 2012 - 04:36 AM' timestamp='1335926195' post='1403255']
I could get 3160 Gflop/s in FFMA. This is 5.495 SIMD instructions per SM per cycle, or 175.9 "scalar instructions" per SM per cycle. You'd expect 192, but I get 16 less.

The throughput of FADD and FMUL, in instructions per cycle, is the same.

Tell me what instructions you care about, I'll check them out.
[/quote]

That's really great: 175.9 out of a theoretical 192! So no lies by Nvidia there, they really DELIVER! This will kick butt for the gamers!

Many thanks for performing these tests!!

As for what I do here with GPUs, those are very different workloads.

Parameter tuning (non-linear) needs signed 32-bit integers, not floats, and multiplication of them; however, since I modelled things in 20 bits in my chess program, the multiplication needs at least 40 bits of output precision.

So modelling that as 32 * 32 == 64 bits is the easy way out for those matrix calculations (24 * 24 bits is the complicated way, since collecting all the bits then costs extra instructions). These are not generic matrix calculations but a special type, where the way to multiply and add things up is defined by a grammar that gets autogenerated on CPUs. All of this still needs to be set up. Most parameter tuning is totally top secret, even the way it gets done; since I just produce games, what I do is not so secret, and the result is more important than the method used to accomplish it. Forget the known methods: those come from a few kids with a professor title who scribbled something down to look busy, yet what they scribbled won't work for 20k+ parameters. It is all innovative methods, toying with something until it works. When it works, usually someone else gives it a name and grabs the honor, whereas it was already in my lab here, or it was some other guy who in reality invented it...
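
A minimal sketch of why 20-bit values push the multiply to 32 * 32 == 64 bits, assuming a plain dot product as a stand-in for the grammar-defined matrix calculations (which are not shown here). The function name and the flat array layout are hypothetical. The largest product of two signed 20-bit values is 2^38, which overflows a 32-bit result but fits easily in 64 bits.

```cpp
#include <cstddef>
#include <cstdint>

// Dot product of values that each fit in 20 signed bits (-2^19 .. 2^19 - 1).
// Each product needs up to 40 bits, so it is widened to 64 bits before the
// multiply; the 64-bit accumulator leaves room for millions of such terms.
inline int64_t dot_20bit(const int32_t* w, const int32_t* x, std::size_t n) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<int64_t>(w[i]) * x[i];
    }
    return acc;
}
```

Doing the same with 24 * 24 bit multiplies would mean splitting each 40-bit product across several partial results and recombining them, which is the extra bit-collecting work mentioned above.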

The difference between artificial intelligence and the parameter tuning already known is the high accuracy that, as we discovered over the last decade, is needed to rival the human mind. Right now some of the top engines get away with just a few parameters tuned really, really well. My chess engine Diep has a huge number of parameters because it has more knowledge, so it requires a different level of tuning: a far bigger challenge, and with a few CPU cores you won't manage, of course. GPUs really are needed to help out there. Nvidia's Tesla definitely qualifies for that like nothing else does!

This is all integer work though.

Factorisation I do in my spare time, or rather in the idle time of my equipment; it is focused on integers as big as possible. Multiplication is the focus and the bottleneck there.

So preferably 64 x 64 == 128 bits; as GPUs don't have that, 32 * 32 == 64 gets used. Speed of multiplication = speed of your app,
even more so than in what I'm busy setting up.
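
A sketch of how a 64 x 64 == 128 bits product can be assembled from four 32 * 32 == 64 partial products, the schoolbook way. This is an illustration, not the code used in the factorisation project; the `U128` struct and the function name are made up. In CUDA device code the high 64 bits are also available through the `__umul64hi` intrinsic.

```cpp
#include <cstdint>

struct U128 {
    uint64_t lo;  // bits 0..63 of the product
    uint64_t hi;  // bits 64..127 of the product
};

inline U128 mul_64x64_to_128(uint64_t a, uint64_t b) {
    // Split both operands into 32-bit halves, held in 64-bit variables so
    // that each product below is a 32 x 32 -> 64 multiply.
    const uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
    const uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;

    const uint64_t p00 = a_lo * b_lo;  // weight 2^0
    const uint64_t p01 = a_lo * b_hi;  // weight 2^32
    const uint64_t p10 = a_hi * b_lo;  // weight 2^32
    const uint64_t p11 = a_hi * b_hi;  // weight 2^64

    // Sum everything that lands in bits 32..63; at most 3 * (2^32 - 1),
    // so it cannot overflow 64 bits. Its upper half is the carry into hi.
    const uint64_t mid = (p00 >> 32)
                       + static_cast<uint32_t>(p01)
                       + static_cast<uint32_t>(p10);

    U128 r;
    r.lo = (mid << 32) | static_cast<uint32_t>(p00);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

Four multiplies plus the carry handling per 128-bit product is the price of not having a native wide multiply, which is why the speed of the 32-bit multiply instruction matters so much here.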

Most idle time is at the start of a project. Once the chess has been set up, the testing of it really carries on 24/24 with 100% load on the hardware, yet these GPUs are so fast that their impact on the Wagstaff project I run is really huge.

An FFT can be done in 2 ways: with unsigned integers or with the common double-precision floats. Both have their purpose.

Practically, this means adapting the FFT released by Nvidia into something you can use. I have a book here by Crandall and especially Pomerance, which drops a few notes on this; the fastest transform currently in use for double precision is the DWT.
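
To make the integer variant concrete, here is a textbook number-theoretic transform over a prime field. It is neither the transform from Nvidia's library nor anything taken from the book; the prime 998244353, its primitive root 3, and all function names are choices made for this sketch. Every butterfly is a 32 * 32 == 64 bits multiply followed by a reduction, and the result is exact as long as each true output coefficient stays below the prime.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr uint32_t P = 998244353;  // 119 * 2^23 + 1 (assumed for this sketch)
constexpr uint32_t G = 3;          // a primitive root modulo P

inline uint32_t mul_mod(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>(static_cast<uint64_t>(a) * b % P);
}

inline uint32_t pow_mod(uint32_t base, uint32_t exp) {
    uint32_t result = 1;
    for (; exp; exp >>= 1) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
    }
    return result;
}

// In-place transform; a.size() must be a power of two no larger than 2^23.
inline void ntt(std::vector<uint32_t>& a, bool inverse) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {  // bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (std::size_t len = 2; len <= n; len <<= 1) {
        uint32_t w_len = pow_mod(G, (P - 1) / static_cast<uint32_t>(len));
        if (inverse) w_len = pow_mod(w_len, P - 2);  // inverse root of unity
        for (std::size_t i = 0; i < n; i += len) {
            uint32_t w = 1;
            for (std::size_t k = 0; k < len / 2; ++k) {
                const uint32_t u = a[i + k];
                const uint32_t v = mul_mod(a[i + k + len / 2], w);
                a[i + k] = (u + v) % P;            // u + v < 2P < 2^32
                a[i + k + len / 2] = (u + P - v) % P;
                w = mul_mod(w, w_len);
            }
        }
    }
    if (inverse) {
        const uint32_t n_inv = pow_mod(static_cast<uint32_t>(n), P - 2);
        for (uint32_t& x : a) x = mul_mod(x, n_inv);
    }
}

// Product of two non-empty coefficient vectors, coefficients taken modulo P.
inline std::vector<uint32_t> multiply(std::vector<uint32_t> a,
                                      std::vector<uint32_t> b) {
    const std::size_t need = a.size() + b.size() - 1;
    std::size_t n = 1;
    while (n < need) n <<= 1;
    a.resize(n);
    b.resize(n);
    ntt(a, false);
    ntt(b, false);
    for (std::size_t i = 0; i < n; ++i) a[i] = mul_mod(a[i], b[i]);
    ntt(a, true);
    a.resize(need);
    return a;
}
```

For example, `multiply({1, 2, 3}, {4, 5})` computes the coefficients of (1 + 2x + 3x^2)(4 + 5x). A single 31-bit prime limits the size of the coefficients that come out exactly; real big-number code would use larger moduli or several primes, which is again where a wider multiply would help.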

For integers it is all less clear; national gut feelings about big integer transforms seem to overshadow any interest in publishing efficient integer transforms.

I'm also busy designing an O(n log n) type multiplication without using complex math. I feel it's doable.
Yet the computer will play a major role in finding that algorithm for me, so that will require major number crunching, and praying that it finds an algorithm that works.

Once it has something that works, it will probably be easy to rewrite it into something that looks nice.

In the end we all need to advance science, and one of the major bottlenecks for all sciences is fast and accurate multiplication.

Floating point gets used not because it is ideal; in fact it is always asking for trouble because of the backtracking of errors, and especially in quantum mechanics this seems to be a big problem. So, invisible to the public, the really important thing is lossless integers, and their role will become even more important in the future: for really big transforms and matrices, double-precision transforms will always carry a risk of backtracking incorrectly, which is just totally unacceptable if you look at it. Scientists should be busy with the results instead of worrying about errors getting backtracked. So where most shout out loud for more double precision, because a few benchmarks use double precision, that is just the outside view. The real top guys have proven over and over again that lossless calculations are preferred; we have seen on many occasions how new theories got designed, for example in quantum mechanics, to explain a phenomenon that simply wasn't there, because round-off errors had caused it.

So I expect integers to become more important. GPUs do not calculate in 64-bit integers but in 32 bits; obviously, if one has to choose between 32 * 32 == 64 bits and a 53-bit mantissa * 53 == 53 bits in double precision, then double precision IS faster. That is why most use double precision on GPUs and in Nvidia's library. Yet if the GPU had 64 * 64 == 128 bits, most would switch, I bet.
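
A small sketch of what "53 bits mantissa * 53 == 53 bits" means in practice, assuming IEEE 754 doubles with round-to-nearest; the function names are made up. As soon as a product needs more than 53 bits, the double result drops the low bits, while the 32 * 32 == 64 bits integer product keeps all of them.

```cpp
#include <cstdint>

// Exact product: 32 x 32 -> 64 bits, nothing is lost.
inline uint64_t mul_exact(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * b;
}

// Same product computed in double precision and converted back.
// The result is rounded to 53 significant bits before the conversion.
inline uint64_t mul_via_double(uint32_t a, uint32_t b) {
    const double p = static_cast<double>(a) * static_cast<double>(b);
    return static_cast<uint64_t>(p);
}
```

Taking a = b = 2^27 + 1, the exact product is 2^54 + 2^28 + 1, which needs 55 bits; the double version rounds the trailing 1 away. One lost bit per multiply looks harmless, but a long transform chains many such roundings, which is the risk described above.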

If you look at the GPUs, they just do not have a fast 64 * 64 == 128 bits option. They only have 32 * 32 == 64 bits.

No wonder they go for the faster transform then, which CAN backtrack errors, and they usually fall into that trap. Only a few top guys do not.

#5
Posted 05/02/2012 08:24 AM   