Why Tesla?

Hi,

We are planning to expand our GPU setup. We now have a GTX 280 and get exceptional performance on my computations. But so far I have run only short simulations, and I plan to run longer ones that can last several days to weeks on GPUs.

I would like to know: why should I get a Tesla? In what aspects can I expect better performance compared to the GTX 280? I would like to justify the extra cost of buying Teslas over GTX 280s.

Any response will be appreciated.

thanks,
Vishu

Tesla has the following advantages:

  • 4 GB of memory per GPU (you can query this yourself; see the sketch below)
  • for the S1070 & S1075: a little more processing power
  • high GPU density per rack space (S1070, and even more so the S1075)
  • more testing of the hardware before it is shipped (as far as I know)

and the following disadvantage:

  • lower memory bandwidth

So I think, unless you are going to put this in a rack, a GTX 280 is the better buy.
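For what it’s worth, you can check the memory size and clock numbers yourself with a quick device query through the CUDA runtime API. A minimal sketch (only standard cudaDeviceProp fields, nothing card-specific assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            // totalGlobalMem is in bytes, clockRate is in kHz
            printf("Device %d: %s\n", d, prop.name);
            printf("  global memory   : %.2f GB\n",
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
            printf("  core clock      : %.0f MHz\n", prop.clockRate / 1000.0);
            printf("  multiprocessors : %d\n", prop.multiProcessorCount);
        }
        return 0;
    }

Run it on a GTX 280 and a Tesla side by side and the 1 GB vs 4 GB difference shows up immediately.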

Thanks!!! I probably won’t rack them; I reckon that as of now I don’t have strong reasons to get Tesla boards, except for the 4 GB of memory, which is something I should check to see whether it is potentially useful for my work.

Someone mentioned hardware stability/performance to me, and I was wondering whether GTX 280s can perform at the same level when running code for days and weeks. Any input on this aspect?

  1. The chip runs cooler than it does in a GTX 280.
  2. More memory.
  3. Much more rigorous testing.
  4. Actual server form-factor (have fun fitting four GTX 280s in a 1U space).
  5. Perf will be better on the higher-clocked S1070 if you’re compute bound, but it will be worse if you’re bandwidth bound (the tradeoff for 4x memory and probably an order of magnitude in reliability). A rough way to check which regime you’re in is sketched below.
  6. 2 6-pins or 1 8-pin required, none of this 6+8 nonsense.

(if you’re running for days or weeks, I wouldn’t trust your average GTX 280 that much simply because they’ve never been tested to do that)
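To expand on point 5: time your real kernel against a plain copy kernel over the same amount of data. If your kernel’s effective bandwidth comes close to the copy kernel’s, you’re bandwidth bound and the GTX 280’s faster memory wins; if it’s far below, you’re probably compute bound and the S1070’s higher clocks help. A minimal sketch (buffer size is arbitrary; pick what fits your card):

    #include <cstdio>
    #include <cuda_runtime.h>

    // A plain copy kernel: its throughput is roughly the best a
    // bandwidth-bound kernel can achieve on a given card.
    __global__ void copyKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 22;                 // 4M floats, 16 MB per buffer
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        copyKernel<<<blocks, threads>>>(d_in, d_out, n);   // warm-up
        cudaEventRecord(start);
        copyKernel<<<blocks, threads>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // Each element is read once and written once, hence the factor of 2.
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("Effective copy bandwidth: %.1f GB/s\n", gbps);
        return 0;
    }

Compute the same effective-bandwidth number for your own kernel (bytes touched divided by kernel time) and compare.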

Assuming your computer has good cooling and power supply, I’ve never seen a problem with the consumer cards. Beware of the extremely overclocked consumer cards, but the ones running at close to the design spec should be fine. My jobs tend to take days, though I’ve chained them up before and kept a few 8800 GTX cards busy for 2 weeks. The longest I’ve ever run my GTX 280 on one job is 4 days. (Note that my code doesn’t keep the GPU 100% busy all the time. It makes large numbers of short calls.)

Reliability is probably the most important issue, and tmurray listed all the relevant points. For some of my papers, I have kept consumer-level cards burning away for a couple of days (similar to seibert’s code: many scripted tests in a row) without problems; they were as reliable as the Quadros that were the equivalent professional-level cards at the time (no Teslas yet). But that’s just a second positive example. If you can get your system vendor to give a proper warranty that the system you buy is certified to keep the GPU busy for ages, then that’s a big plus and should influence your decision. I’ve had some fun discussions with resellers who were obviously not aware of GPU computing and told me “we won’t take the card back if you burn it in such an abusive way” (I burned out one card after two months and they refused to replace it).

Thanks for all the responses above! They help a lot. We do need to run large simulations for long times. Perhaps we will try a mix of GTXs and Teslas in our lab. Yes, I should talk to vendors and ask for details of the warranty!!

That’s the thing that bothers me. No one has actually tested anything that would let them say “GTX 280 is unreliable.” Even NVIDIA apparently “never tested them.” So how can there be ANY meaning to the words “Tesla is more reliable than GTX 280”? We have no hard data on GeForce reliability, and we have no hard data on Tesla reliability. Going by anecdotal evidence, however, GeForces have orders of magnitude more evidence for stability in real-world use than Teslas do. NVIDIA just repackaged the same product and is trying to take advantage of the different price sensitivity of the commercial market, the same as it’s always done with Quadros, playing mind games to reinforce a perceived differentiation.

I should point out, though, that there are various brands that make GeForces, and some are more expensive than others. A high-end brand such as EVGA is a safe choice with an excellent warranty and, yes, thorough testing.

This all could be much, much clearer if NVIDIA would put an MTBF number on the Tesla (S1070/S1075) page. Currently it really is ‘they are tested more thoroughly’, and while I take Tim at his word, my bosses prefer numbers.
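Even a rough MTBF figure would let me do back-of-the-envelope planning for them. (Purely hypothetical numbers for illustration: if a single board had an MTBF of 100,000 hours, a 32-GPU cluster would see a failure roughly every 100,000 / 32 ≈ 3,000 hours, i.e. about once every four months, since the aggregate failure rate scales with the number of boards. Without the per-board number, you can’t even do that much.)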

What exactly is meant by ‘testing’? Are we talking about testing the chips before they are integrated onto a board or testing the finished product before selling it? If we’re speaking of the latter, then wouldn’t that be up to the individual card manufacturers (EVGA, PNY, etc.), and not necessarily up to Nvidia? Maybe I don’t fully understand the relationship between Nvidia and these companies. Perhaps there are some testing guidelines set by Nvidia that the companies doing the actual manufacturing are required to follow.

We’re only exploring CUDA at the moment, so we have a GTX 280 here that we develop on. However, I could see this ‘geforce vs tesla’ discussion becoming an issue that we’re interested in exploring at some point in the future.

Hi,

Would you consider building a cluster of GTX 280s instead of a cluster of Teslas? What host computer would you recommend in each case?

thanks

eyal

For the future I will definitely switch to S1070 hardware. It is nicely scalable in a rack, so for a cluster I would not think for more than a second before arriving at the S1070/S1075. For my current processing needs a single GTX 280 will probably be enough, so for the short run I will likely take a PC with a GTX 280 and make that fit our current form factor (an S1070/S1075 would not fit it anyway).

If you are talking about a cluster with dozens of cards or more, you’re talking about rackmount cases, which pretty much means Tesla. While it is not hard to build or buy a workstation that can accept two GTX 280 cards, it is more difficult to find a high-density rackmount case that can accept the double-width GTX 280, much less power it. At that point, I think it is worth buying the rackmount Tesla product and getting 4 cards per 1U.

(I can also see a market for a 2U system that combines the 4 Tesla with a quad-core CPU and motherboard for a more self-contained product.)

I can see that market too. If NVIDIA would make a quad-core CPU 2U system with configurable amount of RAM and PCI-E v2 x16 connectors to the two switches, it would be a no-brainer for me.

Which leads me to the following question:

Which rack-mounted system would you guys recommend for connecting an S1070 to, preferably DELL, as we have a corporate contract with them? As far as I know, the DELL systems listed all have only x8 PCI-E connections… It looks like the only system with two x16 connectors is the HP one, where one of the slots is taken by a RAID card, I believe.

Take a look at the DELL R5400.

I think you guys are forgetting that you’re not limited to 1U or 2U when mounting in a rack. There are 4U rackmount chassis whose layout is analogous to a desktop case: they use standard ATX boards and take PCI cards perpendicular to their plane. Four GTX 260s should fit (motherboard permitting), and there will be room for a large power supply and good airflow as well. Tesla not necessary.

Except that with a 1U server and 2x S1075, you can get 8 GPUs in 3U of space ;) But I think the main thing is taking something off the shelf with shock numbers and environmentals provided, so a whole validation step can be omitted compared to building something yourself. And I can tell you that in a lot of companies, those processes take a lot of time and money.

That’s true. (Except the part about 8 GPUs in 3U… I don’t think you can put two PCI cards into a 1U server.)

Although, to give an example… the cooling solution on each GeForce (or C1060 for that matter) is massive, dedicated and the culmination of a decade of innovation. (No, really. If you’ve tracked video card cooling devices you’d see how far technology has come. In fact, the once-vibrant aftermarket for GPU heatsinks, which catered to overclockers, has all but disappeared in the past several years because the massive block in which a latest-gen GPU is entombed virtually cannot be improved upon.) By contrast, an S1070 doesn’t have so much as a dedicated fan per card, and I’ve never seen an independent test (much less a dozen) on how it holds up. The S1070 packs 800W per U, which is awe-inspiring and a little scary.

Again, NVIDIA providing assurances about environmentals is nice, and some companies have strict validation policies which they’ll forego given a promise by a vendor. But I’m just not sure there’s real substance to it. It still seems the only objective difference between a GeForce and a Tesla is the price.

In any case, keep in mind how the question of GeForce vs Tesla is centrally tied to scale. If the scale is 1 GPU for a developer, the money is minuscule and the question is small. If the scale is 4 GPUs in a server, the money is small and the question gives brief pause. But the bigger you make the cluster, the more these things start to matter. (Even if you factor in in-house validation, at some scale, probably well under 100 GPUs, the savings will outweigh the costs. The question of Tesla vs GeForce may not be if, but when.)

What’s needed, I think, are some software tools. One tool would be a GPU stress-testing/error-checking program. Another would monitor data such as fan speed and temperature and send it over the network. Neither tool would be new; they’d simply be modeled on software used by enthusiasts and overclockers. Moreover, they’re just as critical for a deployment of GeForces as of Teslas.
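As a starting point for the stress-tester, here’s a minimal sketch of a pattern write/read-back check in CUDA (buffer size and pattern are arbitrary; a real tool would also cycle access patterns and log temperatures alongside):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Write a known pattern into a large buffer, then read it back and
    // count mismatches on the device.
    __global__ void writePattern(unsigned int *buf, int n, unsigned int pattern)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            buf[i] = pattern ^ (unsigned int)i;
    }

    __global__ void checkPattern(const unsigned int *buf, int n,
                                 unsigned int pattern, unsigned int *errors)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            if (buf[i] != (pattern ^ (unsigned int)i))
                atomicAdd(errors, 1u);   // needs compute capability >= 1.1
    }

    int main()
    {
        const int n = 1 << 26;   // 256 MB of unsigned ints; size to your card
        unsigned int *d_buf, *d_errors;
        cudaMalloc(&d_buf, n * sizeof(unsigned int));
        cudaMalloc(&d_errors, sizeof(unsigned int));

        for (unsigned int pass = 0;; ++pass) {      // run until interrupted
            unsigned int pattern = 0xA5A5A5A5u ^ pass;
            cudaMemset(d_errors, 0, sizeof(unsigned int));
            writePattern<<<1024, 256>>>(d_buf, n, pattern);
            checkPattern<<<1024, 256>>>(d_buf, n, pattern, d_errors);

            unsigned int errors = 0;
            cudaMemcpy(&errors, d_errors, sizeof(unsigned int),
                       cudaMemcpyDeviceToHost);
            printf("pass %u: %u errors\n", pass, errors);
            if (errors) return 1;                   // stop on first failure
        }
    }

A marginal board, whether overheating or overclocked too far, tends to show occasional bit errors in a test like this long before it fails outright.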

So you want to know why it is expensive? It has many, many functions, including DX, OpenGL, and scientific computing… I can say it is the first programmable GPU. For example, NVIDIA will later release a DX11 driver.

So you can choose it! :rolleyes: ;) :D