Strange ECC mode reported by nvidia-smi.exe

Prior to running a CUDA-based program on my workstation, I ran the following command to see the state of the GPUs:

nvidia-smi.exe

And this is what was reported:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 31%   45C    P8    22W / 110W |    109MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    0 |
| 26%   36C    P8    20W / 150W |     66MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Then I ran a program that allocates some GPU memory on the K5200 using cudaMalloc, and the program reported an error. I then ran nvidia-smi again, like so:

nvidia-smi.exe

This time, I got a strange result: the ECC state for the K5200 is listed as ‘2’. This is not one of the modes listed in the man page for nvidia-smi, and I’d like to know what it means. Does it indicate some kind of unrecoverable ECC error?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 32%   48C    P8    25W / 110W |    135MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    2 |
| 26%   36C    P8    20W / 150W |    132MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
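
For reference, the failing program boils down to something like the following simplified sketch (the device index and allocation size here are placeholders, not the exact values from my application; note also that CUDA runtime device ordering can differ from nvidia-smi ordering):

// alloc_test.cu - simplified sketch of the failing allocation (illustrative only)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumes the K5200 enumerates as device 1 in the CUDA runtime.
    cudaError_t err = cudaSetDevice(1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    void *buf = nullptr;
    err = cudaMalloc(&buf, 256 * 1024 * 1024);   // 256 MiB, placeholder size
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("allocation succeeded\n");
    cudaFree(buf);
    return 0;
}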

My hunch is that the ‘2’ does indicate some kind of error because if I execute the following command, the ECC state goes back to ‘0’:

nvidia-smi.exe -i 1 -p 0

After executing the above command, running plain nvidia-smi again gives this result, which matches what I am used to seeing on my other workstations:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.03                 Driver Version: 391.03                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 31%   45C    P8    22W / 110W |    135MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K5200        TCC  | 00000000:23:00.0 Off |                    0 |
| 26%   37C    P8    20W / 150W |    132MiB /  7597MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Any insight on what an ECC state of ‘2’ means would be most appreciated.

Thanks.

It’s not a mode. It’s the number of uncorrectable ECC errors you’ve had.

The RAM on the board may be faulty, or you’ve just had bad luck with some high-energy photons messing up the state of some DRAM cells.

Ah. Looks like I misunderstood what that number meant. Thanks.

As cbuchner says, “Uncorr. ECC” is the number of times the ECC mechanism found an uncorrectable error. The GPU uses a SECDED (single error correct, double error detect) variant of ECC. From your description, it seems the count increased by two in a short amount of time.
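
If it helps to see the single-correct / double-detect behavior in action, here is a toy SECDED example using an extended Hamming(8,4) code: 4 data bits protected by 3 Hamming parity bits plus one overall parity bit. This is purely illustrative; the GPU's actual ECC scheme operates on much wider words and its details are not public.

// secded_demo.cpp - toy SECDED illustration (extended Hamming(8,4))
#include <cstdio>
#include <cstdint>

// Encode 4 data bits into an 8-bit codeword. Bits 1..7 follow the classic
// Hamming layout (parity at positions 1, 2, 4; data at 3, 5, 6, 7);
// bit 0 holds the overall parity over the other seven bits.
static uint8_t encode(uint8_t data)
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   // checks positions 1,3,5,7
    uint8_t p2 = d1 ^ d3 ^ d4;   // checks positions 2,3,6,7
    uint8_t p3 = d2 ^ d3 ^ d4;   // checks positions 4,5,6,7
    uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d1 << 3) |
                           (p3 << 4) | (d2 << 5) | (d3 << 6) | (d4 << 7));
    uint8_t overall = 0;
    for (int i = 1; i < 8; ++i) overall ^= (cw >> i) & 1;
    return cw | overall;
}

// Decode a codeword. Returns 0 = clean, 1 = single-bit error (corrected),
// 2 = double-bit error (detected, NOT correctable).
static int decode(uint8_t cw, uint8_t *data_out)
{
    int s = 0;                                     // Hamming syndrome
    if (((cw >> 1) ^ (cw >> 3) ^ (cw >> 5) ^ (cw >> 7)) & 1) s |= 1;
    if (((cw >> 2) ^ (cw >> 3) ^ (cw >> 6) ^ (cw >> 7)) & 1) s |= 2;
    if (((cw >> 4) ^ (cw >> 5) ^ (cw >> 6) ^ (cw >> 7)) & 1) s |= 4;
    int overall = 0;                               // parity over all 8 bits
    for (int i = 0; i < 8; ++i) overall ^= (cw >> i) & 1;

    int status = 0;
    if (s != 0 && overall != 0)      { cw ^= (uint8_t)(1u << s); status = 1; }
    else if (s != 0 && overall == 0) { status = 2; }  // two flips: detect only
    else if (s == 0 && overall != 0) { status = 1; }  // flip was in bit 0 itself

    *data_out = (uint8_t)(((cw >> 3) & 1) | (((cw >> 5) & 1) << 1) |
                          (((cw >> 6) & 1) << 2) | (((cw >> 7) & 1) << 3));
    return status;
}

int main()
{
    uint8_t out;
    uint8_t cw = encode(0xB);

    int st = decode(cw, &out);
    printf("clean : status %d, data 0x%X\n", st, out);   // expect 0, 0xB

    st = decode(cw ^ 0x20, &out);                        // flip one bit
    printf("1 flip: status %d, data 0x%X\n", st, out);   // expect 1, 0xB

    st = decode(cw ^ 0x28, &out);                        // flip two bits
    printf("2 flip: status %d, data 0x%X\n", st, out);   // expect 2, data unreliable
    return 0;
}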

Normally the rate of uncorrectable ECC errors should be very low: I’d say about one per year for a GPU like yours in continuous operation (based on personal experience, I don’t have hard statistics at hand). You may want to look at the complete ECC error statistics with nvidia-smi, in particular the number of (corrected) single-bit errors. In normal circumstances, the number of single-bit errors should be significantly higher than the number of double-bit errors.
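
If I remember correctly, something along these lines will print the full breakdown for the K5200 (device index 1 on your machine):

nvidia-smi.exe -i 1 -q -d ECC

That output should show both the volatile counts (since the last driver reload / reboot) and the aggregate lifetime counts, each broken down into single-bit and double-bit errors.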

If you see the count increasing further in this manner, and you are not operating this machine in an environment with elevated radiation, for example at high altitude (e.g. in an airplane) or in close vicinity to a man-made radiation source (e.g. an X-ray machine), this may be an indication that the memory on this GPU is faulty. I seem to recall from error statistics on supercomputers that memory errors are also positively correlated with operating temperature, so make sure cooling works well (e.g. unobstructed airflow).

An elevated rate of uncorrectable ECC errors is potentially worrisome, because it means the GPU is known to have computed with corrupted data, which may not be acceptable where the GPU is used in a mission-critical capacity. My memory is hazy, but I believe one can configure the system to halt on uncorrectable ECC errors? That may be preferable in such cases.

Here is a slide deck from an interesting GTC presentation on error statistics from the GPU-accelerated Titan supercomputer at Oak Ridge. While Titan uses an older generation of GPUs (Kepler architecture), the overall lessons from Oak Ridge likely still apply to current architectures, even if the specific numbers will differ:

[url]http://on-demand.gputechconf.com/gtc/2015/presentation/S5566-James-Rogers.pdf[/url]

Thanks for the information. Very helpful!

Hey,

I’d like to revisit this issue. I’m seeing similar behavior on machines other than my development machine. The strange thing I would like to point out is that the uncorrectable ECC error count I see is always ‘2’. Is this expected? It seems suspicious that I see the same count every time this happens, even on different machines.

Weird. How many different GPUs are we talking about? Two, ten, a hundred? How long have the GPUs been in operation? As I stated earlier, an error rate of one uncorrectable error per year might be expected, so if you looked at three two-year-old GPUs, the error counter might show two errors for each.

Assuming it is at least ten, my only speculative hypothesis at this time is that this is a consequence of a burn-in test that checks proper operation of the ECC reporting, either performed by the vendor or by a third party. But I have no idea how a specialized test (using radiation, for example) would provoke precisely two uncorrectable errors.

I think one can reset the ECC error counts with nvidia-smi. When you do that, and check the counts after a while, do you again see precisely two uncorrectable errors reported, across multiple GPUs?
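
(If I remember correctly, the -p switch you used earlier does exactly that: -p 0 resets the volatile counts and -p 1 the aggregate counts; resetting requires administrator privileges.)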