Deciphering an NVRM: Xid message?
One of our users is running a pretty simple CUDA kernel that very infrequently (perhaps 1 out of every 150 calls) fails with this error message:

Cuda error: rp_many_kernel in file 'rp_cuda.cu' in line 111 : the launch timed out and was terminated.

In addition, there is a message dumped to the syslog whenever this happens:

NVRM: Xid (0003:00): 8, Channel 00000001

Subsequent calls of the same program work fine (until much later when the bug hits again). The error is also non-deterministic: the same program with the same input file will work the next time we run it.

Can anyone explain what the Xid message means? This is not a showstopper for us, since we can discover the failure very easily and rerun, but I'm curious what the failure actually is.

(System info: Athlon64 X2 4600+, Leadtek 8800 GTX, Scientific Linux 4.3 (same as RHEL 4.3), kernel 2.6.9-22.0.1.ELsmp, NVIDIA driver 169.09, CUDA Toolkit 1.1)
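(For anyone else watching for these events, the Xid code can be pulled out of the syslog with a small shell helper. This is just a sketch; the message layout follows the line quoted above, and the log path varies by distribution.)

```shell
# Extract the numeric Xid code from NVRM syslog lines of the form
# "NVRM: Xid (0003:00): 8, Channel 00000001" (format as seen above).
extract_xid() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p'
}

# Typical use (log path is an assumption; check your distribution):
# grep 'NVRM: Xid' /var/log/messages | extract_xid
```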

#1
Posted 02/08/2008 04:37 PM   
Nearly all the Xid error codes are a class of problem, and do not point to a specific problem. Xid:8 generally means that there was some kind of timeout problem, which is unfortunately what you already know.

If you'd like further assistance, please attach a test app which reproduces the problem (along with build & run instructions).

#2
Posted 02/08/2008 04:43 PM   
[quote name='seibert' date='Feb 8 2008, 10:37 AM']One of our users is running a pretty simple CUDA kernel that very infrequently (perhaps 1 out of every 150 calls) fails with this error message:
[/quote]

I have the same problem with one of my kernels. Can you post the code, so we can compare notes to see what is causing this? I've had a bug report filed with NVIDIA for quite a while on this one.

The following things seem to reduce or eliminate the problem (note: perhaps "elimination" just means it happens 1 in a million instead; I've only tested a few million kernel calls with the workarounds in place):

1. Read memory uncoalesced instead of with a texture (huge performance hit)
2. Run fewer blocks (not an option when the problem size is large)
3. Run larger blocks (performance hit for me where block sizes of 64 are optimal)

I'm attaching my minimal reproduction case.

#3
Posted 02/08/2008 05:39 PM   
Ah, that's very interesting. Our kernel is very straightforward actually. The threads make coalesced reads from a C-style array in global memory, do some sincos() calls + some arithmetic, write to another global array in a coalesced way. The call configuration is slightly unusual in that the block size is 128 threads, but there are 12500 blocks.

I'll try to convert our code into a test case that loops until failure, and doesn't depend on external data files. We have a second card (same make and model) installed in a completely different computer with different CPU, motherboard, power supply, and kernel (equiv to RHEL 4.5) so I'll also try to reproduce there.

#4
Posted 02/08/2008 06:33 PM   
Quick followup: We've made the problem go away by inserting a "sleep 2" call between executions of our program. (The program makes only one CUDA call each time it runs.) This brings the time-averaged load on the computer down to 0.67. The CUDA program is so enormously faster than the CPU version that this is an acceptable workaround for us.

Once the card frees up, I'll be able to try generating a simple reproduction case (and testing MisterAnderson42's test case).
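(For reference, the workaround amounts to a wrapper along these lines. The program name and the assumption that a timed-out launch exits with nonzero status are placeholders for illustration.)

```shell
# Run the CUDA program with a 2-second pause between executions,
# retrying whenever a launch fails. The program is assumed to exit
# with nonzero status when the launch times out.
run_with_sleep() {
    prog=$1; shift
    until "$prog" "$@"; do
        echo "launch failed, retrying..." >&2
        sleep 2
    done
    sleep 2   # idle time between runs lowers the GPU duty cycle
}

# Example (program name is hypothetical): run_with_sleep ./rp_cuda input.dat
```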

#5
Posted 02/08/2008 08:00 PM   
MisterAnderson42, how often did it go wrong for you?

I am going to be advising on using CUDA in a real-time system, where failure is not really an option. I would like to make sure that my testing will encounter these cases if they can happen for my kernels. 1 out of 150 calls would not be too bad to reproduce, but I guess I will have to do some 24-hour real-time testing before I can give my final advice (for now I am calling my code from MATLAB, so I am not really stressing the GPU...).

Also, are any of you using the async API? I will have to perform some data massaging on the CPU while the GPU is running, so I will be using the async API. Maybe it helps to not be spinning the CPU waiting for the GPU to finish its work?

#6
Posted 02/08/2008 10:03 PM   
I would like to clarify that I only see this problem in one of my kernels. The other dozen or so I run never give me any trouble, so this is a subtle bug that is hard to trigger. Given that it is rarely mentioned on the forums, I'd guess kernels that trigger it are rare. And I've yet to find any pattern in those that do.

Now, when I have a kernel that does trigger it, I usually see the problem after anywhere from a few hundred calls up to 20,000. Occasionally I see up to 50,000. Kernels are being called about 200 times per second, so this problem shows up in a few seconds to a few minutes.
Edit: forgot to add that the numbers above are for Windows. For whatever reason, the error almost always occurs after a few thousand calls in Linux.

In my experience with this bug, you know it almost right away when you have it.

With the workarounds I mentioned in place, I've run successfully for several hours. I was probably being over-paranoid when I suggested that it might fail after millions of calls. I'm 99% confident that with the modified (slow) kernel, it will run for days. I'll give it a try :)

I'm not using the async API at all; the CPU is just spin-waiting all the time. Maybe there is a connection there. Perhaps my workaround works solely because it slows down the rate of kernel calls. I'll play around with adding some sleeps in my code to see what I can come up with.

#7
Posted 02/08/2008 10:30 PM   
[quote name='MisterAnderson42' date='Feb 8 2008, 05:30 PM']I would like to clarify that I only see this problem in one of my kernels. The other dozen or so I run never give me any trouble, so this is a subtle bug that is hard to trigger. Given that it is rarely mentioned on the forums, I'd guess kernels that trigger it are rare. And I've yet to find any pattern in those that do.
[/quote]

Same here. My day-to-day workhorse program has a far more complex kernel that I call thousands (or even hundreds of thousands) of times, and I have only seen one launch failure ever that I couldn't trace to an actual bug in my code. (This was back in the CUDA 0.8 days)

The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue. The failing kernel, without sleep, normally runs the GPU at a much higher duty cycle than the code I usually run. I've known for a while that the computer case is slightly underpowered and under-ventilated for an 8800 GTX (we have very few PCI-Express machines), so my working hypothesis is heat, rather than some kind of CUDA driver bug. I need to figure out how to monitor the GPU temperature and see if it correlates with failures.

#8
Posted 02/09/2008 01:58 AM   
Okay, I guess that if I have this problem, it will surely show up when converting the calling application to C and running real-time.

#9
Posted 02/09/2008 06:00 AM   
I only know Xid 13, which is like a segmentation fault (wrong memory access, wrong instruction, etc.). I've also seen 10 a few times, but never 8.

Still, that's the class of problems I'd look at: buffer overflows, out of bound memory access, non-aligned memory accesses, and so on.

Another cause of timeouts would be race conditions: a __syncthreads() that is not reached by all of the threads, due to some divergent condition.

These mistakes only sometimes cause an Xid error; other times the result is much more subtle. So this would agree with your 'one in 10000' error rate.

#10
Posted 02/09/2008 11:39 AM   
wumpus, if you are responding to the OP: I agree. I've seen out-of-bounds memory writes in particular cause unspecified launch failures in later kernel calls, and only a reboot would prevent the problem from occurring.

In my own kernel that has this issue, I have checked all those things 10 times over, including running it in emulation mode through valgrind. I haven't been able to solve the bug since August '07.

If you see anything I missed in my minimal reproduction case, I'd be glad to hear it!

#11
Posted 02/09/2008 03:35 PM   
[quote name='seibert' date='Feb 8 2008, 07:58 PM']The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue.  The failing kernel, without sleep, normally runs the GPU at a much higher duty cycle than the code I usually run.  I've known for a while that the computer case is slightly underpowered and under-ventilated for an 8800 GTX (we have very few PCI-Express machines), so my working hypothesis is heat, rather than some kind of CUDA driver bug.  I need to figure out how to monitor the GPU temperature and see if it correlates with failures.
[/quote]
That is something I haven't tried. Once I get a Tesla D870, I'll set it up in an air conditioned server room and see if the failure rates change.

For now, I've launched my app with the workaround in place. It's telling me it should finish in 266 hours. I'm on vacation for the next week, so we'll see how far it gets :)

#12
Posted 02/09/2008 03:58 PM   
Update: After 3 and 1/2 hours and 1.3 million iterations (about 6 different kernel calls per iteration) my long test with the workaround in place failed with unspecified launch failure in a completely different (and very simple!) kernel.

Maybe it is just a cooling/duty cycle issue. But that still doesn't explain why one particular kernel triggers the issue in seconds and another takes hours.

#13
Posted 02/12/2008 12:30 AM   
If it's a hardware problem, one could easily imagine that the particular operation sequence in one kernel tickles things in a way that causes the problem to occur at a much higher incidence rate. In a similar way, there are code sequences for CPUs that have a higher likelihood of causing thermal problems in certain areas of the chip. I read a number of posts on one of the Folding @ Home forums saying that people had fried a number of GPUs by running F@H on them 24/7 for weeks on end. Granted, I suspect they had inadequate cooling. One of the big differences between the Tesla and Quadro cards and the other NVIDIA GPUs is that NVIDIA does the quality control on the Tesla and Quadro themselves, and they are subject to a much higher quality standard than the gaming cards are. I imagine the same is true in the case of the ATI/AMD cards, so the F@H anecdotes about frying cards by running 24/7 may just be symptomatic of whatever QC the ATI/AMD gaming cards had.

While we haven't had any trouble pounding the hell out of our NVIDIA game-oriented cards with CUDA, if we had a large number of them set up in a cluster, I imagine the MTBF on them might be shorter than for the Tesla/Quadro cards. We run our CUDA test machines in an air-conditioned machine room, so this may contribute significantly to our reliability so far. If you guys have kernels that reliably cause problems on your own machines, I would be willing to try running them here and see if the same problems crop up on our machines.

Cheers,
John Stone


[quote name='MisterAnderson42' date='Feb 11 2008, 07:30 PM']Update: After 3 and 1/2 hours and 1.3 million iterations (about 6 different kernel calls per iteration) my long test with the workaround in place failed with unspecified launch failure in a completely different (and very simple!) kernel.

Maybe it is just a cooling/duty cycle issue. But that still doesn't explain one one particular kernel triggers the issue in seconds and another takes hours.
[/quote]

#14
Posted 02/12/2008 04:38 PM   
[quote name='seibert' date='Feb 9 2008, 01:58 AM']...The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue...I need to figure out how to monitor the GPU temperature and see if it correlates with failures....
[/quote]

What is a "normal" working temperature for the Tesla C870? I typically monitor it in Linux with the NVIDIA control panel (run nvidia-settings from a terminal), but I don't think the temperature refresh rate is very high.
Have you found out any better way of monitoring this?
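(One low-tech option is to poll nvidia-settings from a script and log the readings; a sketch below. The verbose query output format, a line like "Attribute 'GPUCoreTemp' (host:0.0): 74.", is an assumption and may differ between driver versions.)

```shell
# Parse the temperature out of nvidia-settings' verbose query output
# (assumed format: "Attribute 'GPUCoreTemp' (host:0.0): 74.").
parse_temp() {
    sed -n 's/.*GPUCoreTemp.*: \([0-9]*\)\./\1/p' | head -n 1
}

# Query the current GPU core temperature (requires a running X server).
gpu_temp() {
    nvidia-settings -q GPUCoreTemp 2>/dev/null | parse_temp
}

# Log one timestamped reading per second:
# while true; do echo "$(date +%T) $(gpu_temp)"; sleep 1; done >> gpu_temp.log
```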

#15
Posted 05/21/2008 11:52 AM   