Prior to running a CUDA-based program on my workstation, I ran the following command to see the state of the GPUs:
nvidia-smi.exe
And this is what was reported:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 31%   45C    P8    22W / 110W |    109MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    0 |
| 26%   36C    P8    20W / 150W |     66MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
I then ran a program that allocates some GPU memory on the K5200 using cudaMalloc, and the program reported an error. After that, I ran nvidia-smi again:
nvidia-smi.exe
This time, I got a strange result. The ECC state for the K5200 (the Volatile Uncorr. ECC column) is listed as ‘2’. This is not one of the modes listed in the man pages for nvidia-smi, and I’d like to know what it means. Does it indicate some kind of unrecoverable ECC error?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 32%   48C    P8    25W / 110W |    135MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    2 |
| 26%   36C    P8    20W / 150W |    132MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
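In case it helps anyone reproduce this, the same column can probably be read in a script instead of from the table. Below is a minimal Python sketch. It assumes the query field ecc.errors.uncorrected.volatile.total (as listed by nvidia-smi --help-query-gpu) is available with this driver; the helper names parse_ecc_csv and read_ecc_counts are made up for illustration, and the sample text is hand-written to mirror the table above, not captured output.

```python
import csv
import io
import subprocess

# Query field names as listed by `nvidia-smi --help-query-gpu`; assumed to be
# available in this driver version.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into (index, name, count) tuples.

    count is an int, or None when the GPU reports no number (e.g. "[N/A]").
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, name, raw = (f.strip() for f in fields)
        count = int(raw) if raw.isdigit() else None
        rows.append((int(index), name, count))
    return rows


def read_ecc_counts(exe="nvidia-smi"):
    """Run nvidia-smi and return the parsed counters (needs an NVIDIA driver)."""
    out = subprocess.run(
        [exe, f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_csv(out)


# Illustrative input only, hand-written to mirror the table above.
sample = "0, Quadro K4200, [N/A]\n1, Quadro K5200, 2\n"
for index, name, count in parse_ecc_csv(sample):
    print(index, name, count)
```

Calling read_ecc_counts() before and after the failing program would show whether the counter changes across the run; on Windows, exe may need to be "nvidia-smi.exe" or a full path.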
My hunch is that the ‘2’ does indicate some kind of error, because the ECC state goes back to ‘0’ if I execute the following command:
nvidia-smi.exe -i 1 -p 0
After running nvidia-smi.exe -i 1 -p 0, running plain nvidia-smi again gives this result, which is what I am used to seeing on my other workstations:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 31%   45C    P8    22W / 110W |    135MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    0 |
| 26%   37C    P8    20W / 150W |    132MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Any insight on what an ECC state of ‘2’ means would be most appreciated.
Thanks.