GTX 480 / 470 Double Precision Reduced?

Hi, I know this question has popped up before, but I hope that if we ask often enough, someone official from NVIDIA will tell us how it is ;).

Is the double precision performance of the consumer Fermi cards reduced (by 75%) compared to that of the Tesla line?

Best regards
Ceearem

And if so, could it be re-enabled by flipping a bit in the driver? ;)

It seems that gaming cards such as the GTX 480 do not have increased DP performance. Are you sure that NVIDIA advertised such a feature for the GTX 480?

The point is that I found no official statement that the gaming cards have reduced DP performance - they use the Fermi chip, the same one the Tesla and Quadro cards will use. And the architecture of the Fermi chip allows for double precision at half the speed of single precision. So it was never a question of whether the Tesla cards have increased DP performance, but whether the consumer cards have a reduced one. If they have less DP performance than the Tesla cards, that is because it is either disabled through the drivers or through some kind of hardware jumper. So the question remains whether NVIDIA decided to “cripple” the DP performance of the consumer cards in order to have one more advantage for the Tesla cards besides much more memory, higher reliability [I assume here that NVIDIA “hand picks” the chips for their professional cards and tests them more thoroughly than the chips for consumer cards] and ECC support.
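
If anyone wants to check this empirically rather than wait for an official answer, something like the little microbenchmark below should do it. This is just a sketch of my own - the kernel name, launch configuration and iteration count are arbitrary, and a warm-up launch plus error checking would make the numbers more trustworthy. Each thread runs a dependent FMA chain, once in float and once in double; the ratio of the two GFLOPS figures tells you whether DP runs at 1/2 of SP (the full Fermi design) or at something like 1/8 of SP (a 1/4-rate cap).

    // Hypothetical microbenchmark sketch, not an official test.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITERS 4096

    template <typename T>
    __global__ void fma_chain(T *out, T seed)
    {
        T a = seed + (T)threadIdx.x;          // per-thread dependent FMA chain
        T b = seed * (T)1.000001;
        for (int i = 0; i < ITERS; ++i)
            a = a * b + b;                    // 2 flops per iteration
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;
    }

    template <typename T>
    float time_kernel(T *buf)                 // returns elapsed milliseconds
    {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        fma_chain<T><<<1024, 256>>>(buf, (T)0.5);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        return ms;
    }

    int main()
    {
        const int nthreads = 1024 * 256;      // matches the launch configuration
        float  *f = 0;  cudaMalloc((void **)&f, nthreads * sizeof(float));
        double *d = 0;  cudaMalloc((void **)&d, nthreads * sizeof(double));

        const double flops = 2.0 * ITERS * (double)nthreads;   // FMA counts as 2 flops
        printf("SP: %.1f GFLOPS\n", flops / (time_kernel(f) * 1e6));
        printf("DP: %.1f GFLOPS\n", flops / (time_kernel(d) * 1e6));
        // Full-rate Fermi (DP = 1/2 SP) vs. a 1/4-rate cap (DP = 1/8 SP)
        // should be easy to tell apart from these two numbers.

        cudaFree(f);
        cudaFree(d);
        return 0;
    }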

Soooo … tmurray, any comment??

Best regards

Ceearem

I’d like to know it too.

The following is an answer I received a few months ago from James Wang, a technical marketing analyst at NVIDIA:

Q: In the GeForce family, double-precision throughput has been reduced to 25% of the full design. Was this decision made to discourage the use of these products for professional use (where Quadro and Tesla are targeted?) Considering the fused support of single- and double-precision calculations in the CUDA cores, how was this change even applied?

A: Yes, full-speed double precision performance is a feature we reserve for our professional customers. Consumer applications have little use for double precision, so this does not really affect GeForce users. Having differentiated features and pricing is actually fairer for all. Given the option of enabling all professional features on GeForce and having gamers pay for them, or disabling them on GeForce and offering a more compelling price, we feel the latter is the better choice.

Regarding the second part of the question, the architecture is designed to support this kind of configuration.

Argh, too bad. At least now there is a significant feature to drive individual Tesla sales aside from memory size (and ECC). I never saw any compelling reason to put a C1060 into a developer workstation unless you needed 4 GB of memory.

My code continues to avoid double precision (mostly because development started on compute 1.0 devices), and it looks like it will be profitable to continue that trend when possible, if only to target GeForce cards.
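
For what it’s worth, here is a sketch of my own (not anyone’s actual code) of the easiest way to drift back into double precision by accident when you target float-only performance: unsuffixed floating-point literals are doubles in C/C++ and pull the whole expression into DP. On the old compute 1.0 targets the compiler simply demoted doubles to float with a warning, but on Fermi they really execute as double.

    // Hypothetical kernels for illustration only.
    __global__ void scale_slow(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = x[i] * 0.5 + 1.0;     // 0.5 and 1.0 are doubles, so the
                                         // multiply-add runs in DP, with
                                         // float<->double conversions around it
    }

    __global__ void scale_fast(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = x[i] * 0.5f + 1.0f;   // f-suffixed literals keep the whole
                                         // expression in single precision
    }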

Oh shit, I missed that part of their ‘strategy’. I hate them for that :-(, and especially for lying about the supposed additional costs of NOT disabling DP, which the gamers would have to pay for.

Still, for the list price of a C2060 you can have ~5 GTX 480s (with 1/4 DP and 1/2 the memory), so even for pure DP performance it doesn’t necessarily make sense to buy the overpriced C & S products.

The only serious reason would be if the air-cooled GTX cards fail [more than Teslas]. Do they???

I’m counting on a clever driver hack, some day, by someone.

On a separate note, if someone needs support for a summer job dealing with optimization of GPU drivers… :-)

True, but the card price is not the only cost, since you also need the PC to put it in.

That’s a point I would guess is true. First of all, I would definitely think that the professional cards (Tesla and Quadro) have “hand-picked” chips, and I guess they are better tested.

I wouldn’t count on that, since this could probably be implemented with a “hardware jumper”. For the newer Quadro cards this is how they made sure that you cannot use the Quadro drivers (with much better performance in CAD applications etc.) with the consumer products.

I think the box with the CPU etc. is ~$2k, so the multiple cards inside are by far the bigger expense (5+ times more for a 4-card node).

Well, I don’t begrudge them trying to make CUDA sustainable with non-gamer income. The gamer market is running out of steam to fund the R&D for better GPU computing features, and there are not enough other consumer-aimed, compute-heavy tasks to pick up the slack. (I would argue the reception of the GTX 480/470 by the review sites is lukewarm for this reason.) The HPC community is much smaller, so you have to extract more $$$ per card to maintain the same income. If this is what it takes to keep CUDA alive, so be it. (Of course, I’d love for double precision to become a must-have feature for GeForce customers. Whoever releases such applications does all of us a favor.)

Tesla cards tend to run at lower clock rates than the top-of-the-line GeForce, probably in part for this reason. However, even if GeForce is less reliable, you would need the failure rate to be 5x that of the Tesla for that to be cost-effective in a workstation where you don’t need extremely high uptime.

Still getting good mileage with float here. And I must admit, I’m getting a bit agitated by numbers like these: folding on GTX 480. Those last two graphs, ray tracing and folding - for real?

Such is the magic of an L2 cache when your working set of data (or some part of it) can fit inside the cache.
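
For scale (my own back-of-the-envelope, not from the review): GF100’s L2 is 768 KB, so a working set of roughly 196,000 floats (or about 98,000 doubles) can sit entirely in cache, whereas on GT200 the same generic global memory accesses would have gone out to DRAM every time.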

Not to forget the random-access problem in main memory, and/or atomic functions.

So on highly streamed, non-branching double precision code, which is faster: Fermi or the 5xxx?

Will we see benchmark results showing Tesla > 5xxx > GTX 480 for double precision GFLOPS?

It’s a pity AnandTech didn’t include some double precision compute benchmarks, both of raw performance and on more complex problems.

I read in one of the reviews that the Radeon HD 5870 has about 2700 GFLOPS of computational performance in single precision, while the GTX 480 has about 1300. If that’s true, wouldn’t the 5870 beat the crap out of the GTX 480 in every game?
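
(For reference, those figures are just the paper peaks: shader count × shader clock × 2 flops per MAD. The HD 5870 has 1600 stream processors at 850 MHz, so 1600 × 0.85 GHz × 2 = 2720 GFLOPS; the GTX 480 has 480 CUDA cores at a 1401 MHz shader clock, so 480 × 1.401 GHz × 2 ≈ 1345 GFLOPS. Whether a game can actually issue a MAD on every unit every cycle is another question.)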

Yes, except either:

  1. ATI’s drivers are poor

  2. It is very difficult to get anywhere near peak performance from the Evergreen/Cypress architecture

I’d bet on #2. Looking at the architecture, it seems that Cypress is designed explicitly for graphics (hence the 4-way VLIW execution units). And yet, even for games, Cypress is about even with Fermi, though it does seem more efficient (per $ and per watt) for graphics than Fermi.
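
A toy illustration of what I mean by #2 (my own sketch, written as CUDA-style C just to have something concrete; the point applies to whatever language you feed Cypress): a 4- or 5-wide VLIW unit only approaches its peak if each work-item offers that many independent operations per cycle, whereas a dependent chain leaves most of the slots empty.

    // Dependent chain: each MAD needs the previous result, so on a VLIW
    // machine only one slot per bundle does useful work.
    __device__ float dependent_chain(float a, float b, float c, int n)
    {
        for (int i = 0; i < n; ++i)
            a = a * b + c;
        return a;
    }

    // Four independent accumulators give the compiler four MADs per iteration
    // that could, in principle, be packed into one VLIW bundle.
    __device__ float independent_chains(float a0, float a1, float a2, float a3,
                                        float b, float c, int n)
    {
        for (int i = 0; i < n; ++i) {
            a0 = a0 * b + c;
            a1 = a1 * b + c;
            a2 = a2 * b + c;
            a3 = a3 * b + c;
        }
        return a0 + a1 + a2 + a3;
    }

On Fermi’s scalar pipelines the first version is already fine as long as enough threads are in flight, which is part of why its (lower) peak is easier to approach in practice.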

Hard to tell though because ATI/AMD has no decent documentation.

  • Matt

Where is this in writing, other than a forum post?! After pushing how great Fermi would be for CUDA, NVIDIA needs to be honest about the capabilities of the consumer cards. I’m not overly upset by the decision (nor surprised), but this needs to be clear.

I think you can achieve peak FLOPS on ATI; it’s on the NVIDIA cards that you can’t achieve peak.

I found this on a modified SGEMM for ATI:
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=127963&enterthread=y

On the 4000 series they get ~1 TFLOPS, while NVIDIA gets ~375 GFLOPS.

But if I read the relevant threads correctly, this is a very specific example, and you’re not even comparing apples with apples, since in the ATI example the matrices are in a special order.

Also, peak performance in some examples is actually not that important from my point of view. A very important question is how easy it is to program, and how much effort you need to get close to that peak.

Here is a list for Folding@home:

http://www.pcgameshardware.de/aid,667155/F…74&vollbild

And here is a recent OpenCL benchmark from the GTX 480/470 tests at AnandTech:

http://www.anandtech.com/video/showdoc.aspx?i=3783&p=6

So while the ATI cards are in theory much faster than the NVIDIA cards, I think it’s harder to write efficient code for them than for NVIDIA’s GPUs.

Best regards

Ceearem