Why use Titan over K20 in non-cluster environment

I am using 2 Titan cards for development in a PC under Windows. For production, I am assessing whether to use Titan or K20, given that the production environment is a single server (not a cluster).

Given that I am not concerned about ECC, that this is a non-cluster environment, and that I could live without RDP, are there any advantages or even disadvantages to taking the K20 option?

N.B. I am not in a position to buy two K20s just to benchmark them.

(Disclaimer: I don’t have a Titan yet, so this is based on my research while evaluating the card.)

One thing to note is that the Titan is closer to the K20X than to the K20. The K20 has only 13 SMXs and a 320-bit memory bus, whereas the K20X has 14 SMXs and a 384-bit memory bus, like Titan. Titan has higher core and memory clock rates, so it is faster than the K20X. According to Anandtech, if you enable full-speed double precision on the Titan in the driver, the core clock rate will be limited due to the larger thermal load. Based on statements in their review, even if the card underclocks itself, it should still be faster than the K20X at double precision as well.
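For reference, something like this (an untested sketch using the standard cudaDeviceProp fields) will print the numbers in question for whatever card is installed, so you can compare directly against the K20/K20X/Titan specs:

```
// Sketch: query SMX count, memory bus width, and clocks for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("Device           : %s\n", prop.name);
    printf("Multiprocessors  : %d\n", prop.multiProcessorCount);   // 13 (K20), 14 (K20X/Titan)
    printf("Memory bus width : %d-bit\n", prop.memoryBusWidth);    // 320 (K20), 384 (K20X/Titan)
    printf("Core clock       : %d MHz\n", prop.clockRate / 1000);
    printf("Memory clock     : %d MHz\n", prop.memoryClockRate / 1000);
    return 0;
}
```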

I think the main limitation of the Titan for your scenario is the performance of the non-TCC driver in Windows. There have been many complaints about launch overhead and kernel launch latency with GeForce cards on Windows. The other major limitation is the number of enabled DMA engines on the card. A Tesla card can do simultaneous bi-directional transfer with its two DMA engines, whereas Titan only has one enabled.

A less concrete issue is the lack of support and testing that GeForce cards receive for 24/7 production CUDA use. I have never worried about this since I use CUDA exclusively in a research context, but I could imagine it being an issue depending on your situation. However, I have no idea what kind of support you get with a Tesla purchase, and whether that is worth the extra 2*$2800 to you.

Seibert, thank you for the feedback. Titan is indeed closer to the K20X, but the K20X is more expensive than the K20 relative to the difference in features between the two.

Launching a kernel on the Titan takes about 15 microseconds (as reported by the Visual Profiler), which to me is not a lot. I am not using DMA (did you mean UVA? That is supported on the Titan).
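For what it's worth, a rough way to estimate that number outside the profiler is to time a batch of empty-kernel launches and average; this is only a sketch, and because the WDDM driver batches launches it gives the average per-launch cost rather than the latency of any single launch:

```
// Sketch: estimate average kernel launch overhead with a CPU timer.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main()
{
    const int N = 10000;

    empty_kernel<<<1, 1>>>();          // warm up / create context
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();           // wait for the whole batch to drain
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("Average launch overhead: %.2f us\n", us / N);
    return 0;
}
```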

I agree on the less-testing point, but aren't gamers heavy users who run the card for hours on end? :)

The DMA engine on the GPU is used whenever you transfer data over the PCI-Express link (cudaMemcpy and friends). GeForce cards can overlap a memory transfer in one direction (host->device or device->host) with kernel execution, but Tesla cards can simultaneously execute a kernel, a host->device transfer, and a device->host transfer thanks to two enabled DMA engines.
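A minimal sketch of that overlap pattern (the kernel is just a placeholder, and in a real pipeline the device->host copy would be pulling results from a previous chunk): with two copy engines the upload and download can be in flight at the same time; with one, they serialize against each other. Pinned host memory and cudaMemcpyAsync are needed for any overlap at all, and cudaDeviceProp::asyncEngineCount reports how many engines are enabled.

```
// Sketch: concurrent H->D copy + kernel in one stream, D->H copy in another.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float* data, int n)    // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Copy engines: %d\n", prop.asyncEngineCount);   // 1 on GeForce, 2 on Tesla

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, bytes);       // pinned host buffers, required for async copies
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Upload and compute in stream s1, download in stream s2; on a two-engine
    // card both transfers can overlap with each other and with the kernel.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s1);
    work<<<(n + 255) / 256, 256, 0, s1>>>(d_in, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFreeHost(h_in);    cudaFreeHost(h_out);
    cudaFree(d_in);        cudaFree(d_out);
    return 0;
}
```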

UVA works on pretty much everything these days, so I agree that is not a discriminating factor.

I only point out the testing bit for full disclosure. The benefit of QA testing is a statistical one, so it is hard to evaluate its value from small sample sizes. Over 6 years of CUDA use, I've had roughly 3 out of 16 GeForce boards used exclusively for CUDA fail, in all cases at least a year after purchase. If the failure of a GPU in your production system is a huge problem, then the lower clock rates and greater testing of a Tesla GPU could be valuable. If failure has a low cost, then GeForce is much more economical. I certainly have not heard of any GeForce vendor refusing to replace a GPU under warranty because it was used for CUDA calculations over a long period of time.