How to disable/enable ECC on C2050?

I have a C2050 card, which has ECC capability, but enabling ECC reduces the available memory to 2.65 GB. I want to disable this feature, but I don’t know how to do it. It is not mentioned in the programming guide, nor in the reference manual.

How can I disable ECC on C2050?

Which OS? Under Windows I believe it is a control panel option.

Sorry I didn’t mention that. I am running on RHEL 5.3 x86_64.

On Linux, nvidia-smi 3.0 now has a new option, -e, for controlling ECC.

yeah, use -e to set the mode, then reboot
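Something like this, for example (the -e flag is the one mentioned above from nvidia-smi 3.0; the -g flag to select the GPU index is my assumption, so check nvidia-smi --help on your driver version):

    nvidia-smi -g 0 -e 0    # disable ECC on GPU 0, takes effect after reboot
    nvidia-smi -g 0 -e 1    # re-enable ECC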

Just to make sure, these are production models, as in we can buy them now?

Hey pacard,

What kind of numbers do you get (GFLOP/sec) for your CUBLAS SGEMM and DGEMM on the Tesla c2050?

Using the drivers included with the Tesla C2050 and NVIDIA’s CUBLAS 3.0, double precision DGEMM performance has increased from 69 GFLOPS to 163 GFLOPS when compared to a Tesla C1060. CUBLAS’s SGEMM performance, however, has dropped from 344 GFLOPS to 307 GFLOPS.
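For anyone who wants to reproduce figures like these, a measurement can be set up along these lines (a sketch using the legacy CUBLAS API shipped with CUDA 3.0; the 4096 size and initial values are arbitrary, and host/device transfers are deliberately left outside the timed region):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas.h>            /* legacy CUBLAS API from CUDA 3.0 */

    int main(void)
    {
        int n = 4096;              /* example size, large enough to saturate the GPU */
        size_t count = (size_t)n * n;
        double *hA = (double *)malloc(count * sizeof(double));
        double *hB = (double *)malloc(count * sizeof(double));
        for (size_t i = 0; i < count; ++i) { hA[i] = 1.0; hB[i] = 0.5; }

        cublasInit();
        double *dA, *dB, *dC;
        cublasAlloc(n * n, sizeof(double), (void **)&dA);
        cublasAlloc(n * n, sizeof(double), (void **)&dB);
        cublasAlloc(n * n, sizeof(double), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(double), hA, n, dA, n);
        cublasSetMatrix(n, n, sizeof(double), hB, n, dB, n);

        /* warm-up call, excluded from timing */
        cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        /* GEMM does 2*n^3 flops; ms*1e6 converts ms to s and flops to GFLOPS */
        printf("DGEMM n=%d: %.1f GFLOPS\n", n, 2.0 * n * n * n / (ms * 1e6));

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        free(hA); free(hB);
        return 0;
    }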

NVIDIA has acknowledged the SGEMM performance issue, though:

  • SGEMM performance on Fermi-based GPU is 30% lower than expected. It will be fixed in 3.1.

Also related: the CULA team will be posting our “plug and play” C2050 benchmarks for our LAPACK libraries very shortly.

So the performance numbers for DGEMM on Tesla are similar to those on GeForce? I thought DP performance for Fermi Tesla was around 4 times better than for Fermi GeForce.

It is, but the memory bandwidth isn’t four times as large (and I am guessing the calculation is still memory bandwidth bound).

I’m starting to conclude that the general obsession with Fermi double precision GFLOPS is unwarranted unless the calculation can make good use of the cache hierarchy. As usual, memory bandwidth can’t grow as fast as floating point throughput.

Oh, so the GEMM GFLOPS performance number includes the cudaMemcpy from CPU to GPU (and GPU to CPU), correct?

I don’t believe so.

The C1060 had roughly 100 GB/s of global memory bandwidth for 80 GFLOPS peak; the C2050 has about 140 GB/s for about 500 GFLOPS peak. On Fermi I would expect that DGEMM will be memory bandwidth bound, just like SGEMM was on the C1060, so all things being equal Fermi should hit half the SGEMM rate in DGEMM. Which looks like about what is happening…
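To put rough numbers on that: a double is twice the size of a float, so at the same 140 GB/s a bandwidth-bound kernel streams half as many doubles per second as floats (about 17.5 billion versus 35 billion). With the same blocking scheme that works out to half the flop rate for DGEMM as for SGEMM, and the 163 vs 307 GFLOPS figures quoted above sit almost exactly on that 1:2 ratio.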

Luckily the new caches make it easier to become computationally limited, though that may require slight reorganizations to really get the big boosts. For example, matrix solvers might do very well breaking into tiles of about 20K or so (so everything stays in L2 cache) and then assembling into larger grids hierarchically. With old-school CUDA, there’s no cache, so you tend to either micro-tile (say 2K or so, so you can use shared mem) or just use raw bandwidth and brute-force the full-size matrix with lots of redundant reads. The caches make things a lot easier… free decent performance for all code, and free excellent performance if you adapt your code to fit your device. This is true on the CPU as well, of course… perhaps more so with its three-level cache as compared to Fermi’s two-level one.
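To illustrate the micro-tiling end of that spectrum, here is a minimal shared-memory tiled SGEMM kernel (an illustrative sketch, not CUBLAS’s actual implementation; it assumes n is a multiple of the tile size):

    #define TILE 16

    /* C = A * B for n x n row-major matrices, one TILE x TILE output tile per block */
    __global__ void tiledSgemm(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            /* each tile of A and B is read from global memory once per block,
               then reused TILE times out of shared memory */
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }

    /* launch with: tiledSgemm<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n); */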

CUBLAS 3.0 is not well optimized for Fermi.

So does anyone have a guess on roughly how much DGEMM performance improvement (over CUBLAS 3.0) we’ll get from an optimized DGEMM GPU routine that fully utilizes the Fermi architecture?

Peak double precision performance is around 500 GFLOPS on the C2050. However, considering that a very highly tuned SGEMM only reaches about 60% of peak performance, I’d estimate that a highly tuned, Fermi-optimized DGEMM will reach about 300 GFLOPS.

What about CUFFT?

That brings up the following question: was this number achieved with the standard setting of 48 KB shared / 16 KB L1 cache, or the other way around? The second option might give better performance for non-Fermi-adapted code.

Not in 3.0, no. I think perf for both should be significantly improved in 3.1.
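On the 48 KB / 16 KB question above: as far as I know the split can be requested per kernel with the runtime API that came with CUDA 3.0 (a sketch; myKernel is a stand-in for whatever kernel you are tuning):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... */ }

    int main(void)
    {
        /* request 16 KB shared / 48 KB L1, which may help non-Fermi-adapted code */
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
        /* the default split is the other way around:
           cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared); */

        float *d;
        cudaMalloc((void **)&d, 256 * sizeof(float));
        myKernel<<<1, 256>>>(d);
        cudaThreadSynchronize();   /* CUDA 3.0-era synchronization call */
        cudaFree(d);
        return 0;
    }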