GPU in state where results are not reproducible!

I just noticed that on my development machine (4.0RC2) the results of my program suddenly stopped making any sense at all. I assumed I had introduced a bug etc., but eventually realized that even the reduction SDK example fails to give reproducible results. Example output from 3 separate runs:

GPU result = 2139351770
CPU result = 2139353471

GPU result = 2139348901
CPU result = 2139353471

GPU result = 2139349872
CPU result = 2139353471

This has actually happened once before, both times on GTX 570s about 1-2 months old. When it happened previously I assumed it was simply an unlucky hardware failure, but when the failing card was placed in another machine, everything seemed to be working just fine.

If this isn’t a hardware failure, does anyone have any thoughts as to the cause? Remedies? When it happened with the previous card a soft reset did NOT solve the problem! Any other diagnostic/debugging information that would be helpful?

Edited to add: I just verified that a soft reset did not solve the problem this time either; almost every one of the SDK examples fails, with results not matching the CPU.

Further edited: Power cycling the computer does fix the problem. So then my question is: what could happen that puts the GPU into a state where it silently produces incorrect results and requires a power cycle to fix?

Unfortunately, I don’t have any answers, but can confirm that I see the same thing happening on multiple cards (GTX 470, GTX 580) in several of our GPU boxes under CUDA 3.2 / 260.19.44. After a random amount of usage, GPUs start producing wrong results or simply generating “launch failed” errors when invoking a kernel. A soft reset doesn’t help, but a full power cycle usually seems to make the cards work properly again for a while.

I was hoping that upgrading to CUDA 4.0 once it gets released would help, but apparently this problem is still present in 4.0RC2.

If this can’t be fixed, some kind of way to completely reset a GPU (e.g., something like nvidia-smi --reset-gpu -g 0) without power cycling the whole machine would be very useful in order to work around this.

/Lars

I can confirm this, too. Currently I’m trying to find a trigger that puts my card (GTX 580) into this unpredictable state. All I can say for sure is that tests like “scan” and “alignedTypes” fail, whereas other tests like the bandwidth test or deviceQuery continue to function normally. Power cycling seems to have fixed the problem for an uncertain amount of time. I am using CUDA 4.0RC2 and the 270.40 driver.

EDIT: removed multiple posts

I’ve been getting irreproducible results on GT200 / Fermi with 3.2 from at least one compiler bug and another bug where it is unclear whether it is hardware or software. The first one has been submitted, while for the second I still need to make a repro case.

1st bug: Be careful using “unsigned int” in addressing, as the compiler on rare occasions makes it negative in an intermediate step. Hence the workaround:

somePtr[threadIdx.x + blockIdx.x*asdfasdf ....] = ...  ;

-----workaround-----> 

int address = threadIdx.x + blockIdx.x*asdfasdf .... ; 

somePtr[address] = .... ;
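
To make the idea concrete, here is a minimal sketch of the same workaround in a full kernel; the kernel name, the bounds check and the scale factor are mine, not from the code that actually hit the bug:

__global__ void scaleKernel(float *somePtr, int n, float factor)
{
    // Workaround: accumulate the global index in a signed int first,
    // then use that signed value for the memory access.
    int address = threadIdx.x + blockIdx.x * blockDim.x;

    if (address < n)
        somePtr[address] = somePtr[address] * factor;
}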

2nd bug: Accessing the constant memory space and using that data as input for intrinsic functions such as __sinf() and __cosf():

This caused IRREPRODUCIBLE problems:

__constant__ float c_val[7];

float val = __sinf( c_val[k] * reg_val );

Workaround:

// Load the constant-memory data into registers first and use those in the intrinsic function

float reg_c_val[7];

for (int i = 0; i < 7; i++)
    reg_c_val[i] = c_val[i];

float val = __sinf(reg_c_val[k] * reg_val);

The above is the basic idea of bug #2, but it has so far been very hard to make a simple repro case, so I will submit this once there is time to create a repro…
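
In the meantime, for completeness, here is roughly what the workaround pattern looks like as a self-contained kernel. This is only an illustration of the pattern above, not a repro of the miscompilation; the kernel name, the input array and the indexing are made up:

__constant__ float c_val[7];

__global__ void sinKernel(float *out, const float *in, int n)
{
    // Copy the constant-memory table into thread-local storage first,
    // then feed only those copies to the fast intrinsic.
    float reg_c_val[7];
    for (int i = 0; i < 7; i++)
        reg_c_val[i] = c_val[i];

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
    {
        int k = idx % 7;   // made-up index into the coefficient table
        out[idx] = __sinf(reg_c_val[k] * in[idx]);
    }
}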

Well, it is good to know that I’m not the only one having this problem. Jimmy - are you saying that your code snippets can put the GPU into this state where almost any non-trivial program fails or produces incorrect results, or just that those specific pieces of code don’t produce consistent results? Because I too can’t create a repro case where I can say “do this and you will put your GPU in this state”.

Can anyone from NVIDIA confirm they are aware of this? If at least 3-4 people have seen it, certainly there must be more. I’m running under linux64 for what it’s worth - is that the case for everyone having this problem?

Yes, this type of bug is frustrating indeed, as they consume A LOT of time to track down. Generally I trust the compiler and instead question my own code when there are erroneous results, but a bad compiler / hardware adds another dimension to the debugging :)

I’m not sure; I have yet to produce a trivially simple repro case for this bug (the bug #2 mentioned above). It is not a program failure but rather inconsistent results: let’s say 50-200 elements out of 15 million will be computed incorrectly, varying somewhat between runs.

Initially you would think that this is a race condition caused by, for example, poor synchronization, but in the above example replacing __sinf with sin() would solve the problem at the cost of worse performance. Hence a workaround was needed…

Right - but if you run another program - say the reduction or scan examples from the SDK - do they work correctly? The problem the rest of us are having is that NOTHING works, even programs that should work, and even a soft reset doesn’t fix it - a complete power cycle is needed. The fact that even a soft reset doesn’t fix it suggests that it must be something actually ON the gpu that gets into a bad state and not the cpu side driver. It might be a bug in all of our software that puts the gpu into this state, but if it is, it is far from deterministic.

I should also add that this kind of error is really scary. Silently producing incorrect results is just about the worst possible thing that could ever happen. It would be better if kernel launches at least failed.

Mine runs the SDK examples etc. correctly. Yes, I noticed that we are talking about different issues; I just wanted to whine about other problems ;-)

This has just happened to me again after upgrading to 4.0 Final. As usual, a soft reset doesn’t fix the problem. I’ve gone through the SDK examples to catalog which ones produce incorrect results (I’ve excluded ones that only produce images or use random numbers).

alignedTypes: uint8 passes, all other tests fail

BlackScholes: BlackScholes.cu(173) : cudaSafeCall() Runtime API error 4: unspecified launch failure.

fastWalshTransform: FAILED

FDTD3d: Data error at point (191,108,0) 29.173361 instead of 29.179611 (different runs have different locations of errors)

mergeSort: main.cpp(77) : cudaSafeCall() Runtime API error 4: unspecified launch failure.

radixSortThrust: FAILED

reduction: FAILED with a different GPU result produced each time (as described in my first post)

scan: all of the short arrays (<=1024 elements) pass, all of the large arrays (>=2048) FAIL

transpose: naive, coalesced, optimized and diagonal kernels FAIL, others (simple, shared, coarse, fine) pass

I have a feeling that a deeper inspection of most of the floating point examples would reveal that they are also failing, but not being reported as such due to the EPS comparison.
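
For what it’s worth, here is a sketch of the kind of stricter host-side check I have in mind; gpuResult, cpuResult and the 1e-5 tolerance are placeholders, not names or values taken from the SDK samples:

#include <cstdio>
#include <cmath>

// Count how many elements differ and track the worst deviation, instead of
// relying on a single pass/fail against a generous epsilon.
static int countMismatches(const float *gpuResult, const float *cpuResult, int n)
{
    int mismatches = 0;
    float maxAbsErr = 0.0f;
    for (int i = 0; i < n; i++)
    {
        float err = fabsf(gpuResult[i] - cpuResult[i]);
        if (err > 1e-5f)
            mismatches++;
        if (err > maxAbsErr)
            maxAbsErr = err;
    }
    printf("%d of %d elements differ, max abs error = %g\n", mismatches, n, maxAbsErr);
    return mismatches;
}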

Hi, could you please post the specs of your system?
OS?
Motherboard?
CPU?
GPU?

The GPU getting into a bad state happens so often (about once a week) on our development & test machine that I installed a shell script, sanity_check.sh, that runs the sanity check test in cuda-memtest (http://cudagpumemtest.sourceforge.net/). Any time I get any kind of questionable behavior, I run sanity_check.sh and see whether it passes or not. Sometimes a soft reset clears the problem and sometimes a cold boot is needed.

I’ve seen some really weird behavior on this machine. My favorite was when every single kernel launch ran and completed in ~10 microseconds, but did nothing, not even setting launch failure errors! I was running comparison benchmarks at the time, and was starting to get confused as to why code changes were not changing performance. It wasn’t until the next day when I logged in and ran a different benchmark and got something 10x faster than was normal that I realized something was up.
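
One cheap check I can think of for that particular failure mode (just an idea, not what the benchmarks actually did) is to have every kernel write a sentinel value as its last action and verify it on the host after the launch; myKernel, d_data, n, gridSize and blockSize below are placeholders:

__global__ void myKernel(float *d_data, int n, int *sentinel)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        d_data[i] *= 2.0f;      // placeholder for the real work

    if (i == 0)
        *sentinel = 1234;       // proof that the kernel actually executed
}

// Host side:
int *d_sentinel, h_sentinel = 0;
cudaMalloc(&d_sentinel, sizeof(int));
cudaMemset(d_sentinel, 0, sizeof(int));
myKernel<<<gridSize, blockSize>>>(d_data, n, d_sentinel);
cudaMemcpy(&h_sentinel, d_sentinel, sizeof(int), cudaMemcpyDeviceToHost);
if (h_sentinel != 1234)
    fprintf(stderr, "Kernel appears not to have run at all!\n");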

Another interesting variation on this problem shows up on our production S2050s (but not the S1070s). After running for a “while” (weeks or months), they get in a bad state and attempting to initialize the CUDA context puts the CUDA driver into an infinite loop. We see this with CUDA 3.2, but not 3.0 (haven’t upgraded these systems to 4.0 yet). A cold boot is the only thing that resolves this problem.

So yes, these types of problems do happen all the time to a lot of people and are very frustrating. As someone who has been running CUDA since version 0.8, all I can tell you is that things are a lot better now than they used to be. Back then, the driver was so fragile that a simple out-of-bounds memory write would cause pixels to speckle random colors on the screen and the driver to start doing very strange things. Debugging a CUDA program entailed rebooting after every couple of test runs, as you could never be sure whether the results were incorrect because the GPU was in a bad state or because your code still had the bug.

I am having these problems too.

Palit GeForce GTX 560 Ti 2GB

Ubuntu 11.04

Tried devdriver 270.41.19 and regular 275.09

cuda_memtest (http://cudagpumemtest.sourceforge.net/) returns memory errors which are not repeatable. Several SDK sample codes work (e.g. deviceQuery, nbody), others fail, and others fail and cause artifacts on the screen (blocks of altered pixels that stay until the screen is redrawn). fluidsGL is neat in that it works sometimes, but other times the solution will go unstable. Some of the scan, reduction, etc. samples will almost work but fail with too high an error. Using the devdriver resulted in many “unspecified launch failures”, but after switching to 275.90 I haven’t noticed that.


Hello guys,

Do you have any update on this?

thanks
Mirko

Hey,

I had a similar problem and just found a workaround that works fine for me. Perhaps it’s useful to some of you too.

My task involves calculating with a little pile of data. The problem was that the first 600 values were pretty random, the following 320000 values were fine, and again at the end 40000 random values were generated (only the middle 320000 values were all right and the same for every execution).

My initial configuration for spawning the threads was two-dimensional, namely dim3 ThreadsPerBlock(16,16).
While tinkering with the block and thread sizes I found out that my values are only deterministic when my thread block is one-dimensional,
e.g. ThreadsPerBlock(512,1), roughly as sketched below.
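
The kernel, the width/height names and the actual computation here are made up; the sketch just shows how a 2-D index can be rebuilt from a 1-D block:

__global__ void myKernel(float *d_data, int width, int height)
{
    // Rebuild the 2-D coordinates from a purely one-dimensional thread index.
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < width * height)
    {
        int x = idx % width;
        int y = idx / width;
        d_data[y * width + x] += 1.0f;   // placeholder for the real computation
    }
}

// Host side: a one-dimensional block instead of (16,16)
dim3 ThreadsPerBlock(512, 1);
dim3 Blocks((width * height + 511) / 512, 1);
myKernel<<<Blocks, ThreadsPerBlock>>>(d_data, width, height);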

Hope this solution applies to the problem mentioned above!

I have experienced similar problems on both GTX570 and GTX580 cards.

When a computer with any of the mentioned cards has been up for about 30 days, it starts generating incorrect results. The cuda-memtest and memtestG80 utilities both report a vast amount of memory errors.

When rebooting the computer softly (not power-cycling) the errors persist. Power-cycling is required to temporarily get rid of the errors.

The pattern is:
Card: GTX570, GTX580
Symptom: Incorrect results after long uptime
Temporary fix: Power-cycle

This pattern has been recognized by many others and there is a thread on the subject on the EVGA forums: http://www.evga.com/forums/tm.aspx?m=811877

As far as I understand, NVIDIA has not yet commented on the issue. With this error it is not possible to build reliable GPU systems with modern NVIDIA GPUs.

That’s interesting, you also get errors with the Teslas.

Do you run your Teslas with ECC on or off ?

I do have exactly the same issue on a machine with a Tesla C2050 (ECC disabled) !

When running benchmarks for a long time, it sometimes hangs (the process consumes 100% CPU forever). In general a soft reboot can be enough… if it works! It often fails to soft shutdown because the driver hangs the kernel… A reboot can be needed every day if I use the GPU intensively.

We are having the same problem here.

We encountered this problem today, when we saw numerous SDK example applications failing. We tried running ‘matrixMul’, ‘histogram’ and ‘reduction’. Out of our 16 GPUs, we saw 5 failing devices. A soft reboot did not fix the problem; a hard reboot did. The system has been up and running for a little over one month now. The specifications:

  • 4 machines with an ‘Asus P6T7 WS SuperComputer’ mainboard
  • 4 GTX570 boards in each machine, 16 total
  • Each machine runs Ubuntu 10.04, with:
    • CUDA version: 4.0 V0.2.1221
    • Driver version: NVIDIA UNIX x86_64 Kernel Module 275.21

Our solution is to have a cron job run every night that tests ‘matrixMul’ on all 16 GPUs. If it fails, an email will be sent and we’ll have to reboot the machines manually.
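
For reference, a sketch of a tiny self-contained check in the same spirit; it is not matrixMul, just a made-up pattern test that returns a non-zero exit code on any mismatch or launch failure, so a cron job could act on it:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fillPattern(unsigned int *buf, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = (unsigned int)i ^ 0xA5A5A5A5u;
}

int main()
{
    const size_t n = 1 << 22;                        // 4M words = 16 MB
    unsigned int *d_buf = 0;
    if (cudaMalloc(&d_buf, n * sizeof(unsigned int)) != cudaSuccess)
        return 2;

    fillPattern<<<(unsigned int)((n + 255) / 256), 256>>>(d_buf, n);
    if (cudaDeviceSynchronize() != cudaSuccess)
        return 2;                                    // launch failure -> flag the GPU

    unsigned int *h_buf = new unsigned int[n];
    cudaMemcpy(h_buf, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (h_buf[i] != ((unsigned int)i ^ 0xA5A5A5A5u))
            bad++;

    printf("%lu corrupted words\n", (unsigned long)bad);
    delete[] h_buf;
    cudaFree(d_buf);
    return bad ? 1 : 0;                              // non-zero exit -> send the email
}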

It would be nice to know whether this is a software problem (cuda toolkit or driver) or hardware problem. Is there anything in common between all of us with this problem?

There are known driver / software issues that cause irreproducible results, but the problem most people here seem to be having is a hardware problem, more specifically with Fermi. The issue did not show up on GT200 GPUs, only on the GF100+ architectures.

It’s disturbing that it also appears on the professional cards… Let’s pray for Kepler :D