Deciphering an NVRM: Xid message?
One of our users is running a pretty simple CUDA kernel that very infrequently (perhaps 1 out of every 150 calls) fails with this error message:

Cuda error: rp_many_kernel in file 'rp_cuda.cu' in line 111 : the launch timed out and was terminated.

In addition, there is a message dumped to the syslog whenever this happens:

NVRM: Xid (0003:00): 8, Channel 00000001

Subsequent calls of the same program work fine (until much later when the bug hits again). The error is also non-deterministic: the same program with the same input file will work the next time we run it.

Can anyone explain what the Xid message means? This is not a showstopper for us, since we can discover the failure very easily and rerun, but I'm curious what the failure actually is.

(System info: Athlon64 X2 4600+, Leadtek 8800 GTX, Scientific Linux 4.3 (same as RHEL 4.3), kernel 2.6.9-22.0.1.ELsmp, NVIDIA driver 169.09, CUDA Toolkit 1.1)
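(For anyone else watching for these events, the Xid code can be pulled out of the syslog with a small shell helper. This is just a sketch; the message layout follows the line quoted above, and the log path varies by distribution.)

```shell
# Extract the numeric Xid code from NVRM syslog lines of the form
# "NVRM: Xid (0003:00): 8, Channel 00000001" (format as seen above).
extract_xid() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9]*\),.*/\1/p'
}

# Typical use (log path is an assumption; check your distribution):
# grep 'NVRM: Xid' /var/log/messages | extract_xid
```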

#1
Posted 02/08/2008 04:37 PM   
Nearly all the Xid error codes are a class of problem, and do not point to a specific problem. Xid:8 generally means that there was some kind of timeout problem, which is unfortunately what you already know.

If you'd like further assistance, please attach a test app which reproduces the problem (along with build & run instructions).

#2
Posted 02/08/2008 04:43 PM   
[quote name='seibert' date='Feb 8 2008, 10:37 AM']One of our users is running a pretty simple CUDA kernel that very infrequently (perhaps 1 out of every 150 calls) fails with this error message:
[/quote]

I have the same problem with one of my kernels. Can you post the code, so we can compare notes to see what is causing this? I've had a bug report filed with NVIDIA for quite a while on this one.

The following things seem to reduce or eliminate the problem (note: perhaps "elimination" just means it happens 1 in a million instead; I've only tested a few million kernel calls with the workarounds in place):

1. Read memory uncoalesced instead of with a texture (huge performance hit)
2. Run fewer blocks (not an option when the problem size is large)
3. Run larger blocks (performance hit for me where block sizes of 64 are optimal)

I'm attaching my minimal reproduction case.

#3
Posted 02/08/2008 05:39 PM   
Ah, that's very interesting. Our kernel is very straightforward actually. The threads make coalesced reads from a C-style array in global memory, do some sincos() calls + some arithmetic, write to another global array in a coalesced way. The call configuration is slightly unusual in that the block size is 128 threads, but there are 12500 blocks.

I'll try to convert our code into a test case that loops until failure, and doesn't depend on external data files. We have a second card (same make and model) installed in a completely different computer with different CPU, motherboard, power supply, and kernel (equiv to RHEL 4.5) so I'll also try to reproduce there.

#4
Posted 02/08/2008 06:33 PM   
Quick followup: We've made the problem go away by inserting a "sleep 2" call between executions of our program. (The program makes only one CUDA call each time it runs.) This brings the time-averaged load on the computer down to 0.67. The CUDA program is so enormously faster than the CPU version that this is an acceptable workaround for us.

Once the card frees up, I'll be able to try generating a simple reproduction case (and testing MisterAnderson42's test case).
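(For reference, the workaround amounts to a wrapper along these lines. The program name and the assumption that a timed-out launch exits with nonzero status are placeholders for illustration.)

```shell
# Run the CUDA program with a 2-second pause between executions,
# retrying whenever a launch fails. The program is assumed to exit
# with nonzero status when the launch times out.
run_with_sleep() {
    prog=$1; shift
    until "$prog" "$@"; do
        echo "launch failed, retrying..." >&2
        sleep 2
    done
    sleep 2   # idle time between runs lowers the GPU duty cycle
}

# Example (program name is hypothetical): run_with_sleep ./rp_cuda input.dat
```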

#5
Posted 02/08/2008 08:00 PM   
MisterAnderson42, how often did it go wrong for you?

I am going to be advising on using CUDA in a real-time system, where failure is not really an option. I would like to make sure that my testing will encounter these cases if they can happen for my kernels. 1 out of 150 calls would not be too bad to reproduce, but I guess I will have to do some 24-hour real-time testing before I can give my final advice (for now I am calling my code from MATLAB, so I am not really stressing the GPU...).

Also, are any of you using the async API? I will have to perform some data massaging on the CPU while the GPU is running, so I will be using the async API. Maybe it helps to not be spinning the CPU waiting for the GPU to finish its work?

#6
Posted 02/08/2008 10:03 PM   
I would like to clarify that I only see this problem in one of my kernels. The other dozen or so I run never give me any trouble, so this is a subtle bug that is hard to trigger. Given that it is rarely mentioned on the forums, I'd guess kernels that trigger it are rare. And I've yet to find any pattern in those that do.

Now, when I have a kernel that does trigger it, I usually see the problem after anywhere from a few hundred calls up to 20,000. Occasionally I see up to 50,000. Kernels are being called about 200 times per second, so this problem shows up in a few seconds to a few minutes.
Edit: forgot to add that the numbers above are for Windows. For whatever reason, the error almost always occurs after a few thousand calls in Linux.

In my experience with this bug, you know it almost right away when you have it.

With the workarounds I mentioned in place, I've run successfully for several hours. I was probably being over-paranoid when I suggested that it might fail after millions of calls. I'm 99% confident that with the modified (slow) kernel, it will run for days. I'll give it a try :)

I'm not using the async API at all; the CPU is just spin-waiting all the time. Maybe there is a connection there. Perhaps my workaround works solely because it slows down the rate of kernel calls. I'll play around with adding some sleeps in my code to see what I can come up with.

#7
Posted 02/08/2008 10:30 PM   
[quote name='MisterAnderson42' date='Feb 8 2008, 05:30 PM']I would like to clarify that I only see this problem in one of my kernels. The other dozen or so I run never give me any trouble, so this is a subtle bug that is hard to trigger. Given that it is rarely mentioned on the forums, I'd guess kernels that trigger it are rare. And I've yet to find any pattern in those that do.
[/quote]

Same here. My day-to-day workhorse program has a far more complex kernel that I call thousands (or even hundreds of thousands) of times, and I have only seen one launch failure ever that I couldn't trace to an actual bug in my code. (This was back in the CUDA 0.8 days)

The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue. The failing kernel, without sleep, normally runs the GPU at a much higher duty cycle than the code I usually run. I've known for a while that the computer case is slightly underpowered and under-ventilated for an 8800 GTX (we have very few PCI-Express machines), so my working hypothesis is heat, rather than some kind of CUDA driver bug. I need to figure out how to monitor the GPU temperature and see if it correlates with failures.

#8
Posted 02/09/2008 01:58 AM   
Okay, I guess that if I have this problem, it will surely show up when converting the calling application to C and running real-time.

#9
Posted 02/09/2008 06:00 AM   
I only know Xid 13, which is like a segmentation fault (wrong memory access, wrong instruction, etc.). I've also seen 10 a few times, but never 8.

Still, that's the class of problems I'd look at: buffer overflows, out of bound memory access, non-aligned memory accesses, and so on.

Another cause of timeouts would be race conditions: a __syncthreads() that is not reached by all of the threads, due to some divergent condition.

These mistakes only sometimes cause an Xid error; other times the result is much more subtle. So this would agree with your 'one in 10000' error rate.

#10
Posted 02/09/2008 11:39 AM   
wumpus, if you are responding to the OP: I agree. I've seen out-of-bounds memory writes in particular cause unspecified launch failures in later kernel calls, and only a reboot would prevent the problem from occurring.

In my own kernel that has this issue, I have checked all those things 10 times over, including running it in emulation mode through valgrind. I haven't been able to solve the bug since August '07.

If you see anything I missed in my minimal reproduction case, I'd be glad to hear it!

#11
Posted 02/09/2008 03:35 PM   
[quote name='seibert' date='Feb 8 2008, 07:58 PM']The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue.  The failing kernel, without sleep, normally runs the GPU at a much higher duty cycle than the code I usually run.  I've known for a while that the computer case is slightly underpowered and under-ventilated for an 8800 GTX (we have very few PCI-Express machines), so my working hypothesis is heat, rather than some kind of CUDA driver bug.  I need to figure out how to monitor the GPU temperature and see if it correlates with failures.
[/quote]
That is something I haven't tried. Once I get a Tesla D870, I'll set it up in an air conditioned server room and see if the failure rates change.

For now, I've launched my app with the workaround in place. It's telling me it should finish in 266 hours. I'm on vacation for the next week, so we'll see how far it gets :)

#12
Posted 02/09/2008 03:58 PM   
Update: After 3 and 1/2 hours and 1.3 million iterations (about 6 different kernel calls per iteration) my long test with the workaround in place failed with unspecified launch failure in a completely different (and very simple!) kernel.

Maybe it is just a cooling/duty cycle issue. But that still doesn't explain why one particular kernel triggers the issue in seconds and another takes hours.

#13
Posted 02/12/2008 12:30 AM   
If it's a hardware problem, one could easily imagine that the particular operation sequence in one kernel tickles things in a way that causes the problem to occur at a much higher incidence rate. In a similar way, there are code sequences for CPUs that have a higher likelihood of causing thermal problems in certain areas of the chip. I read a number of posts on one of the Folding @ Home forums saying that people had fried a number of GPUs by running F@H on them 24/7 for weeks on end. Granted, I suspect they had inadequate cooling. One of the big differences between the Tesla and Quadro cards and the other NVIDIA GPUs is that NVIDIA does the quality control on the Tesla and Quadro themselves, and they are subject to a much higher quality standard than the gaming cards are. I imagine the same is true in the case of the ATI/AMD cards, so the F@H anecdotes about frying cards by running 24/7 may just be symptomatic of whatever QC the ATI/AMD gaming cards had.

While we haven't had any trouble pounding the hell out of our NVIDIA game-oriented cards with CUDA, if we had a large number of them set up in a cluster, I imagine the MTBF on them might be shorter than for the Tesla/Quadro cards. We run our CUDA test machines in an air-conditioned machine room, so this may contribute significantly to our reliability so far. If you guys have kernels that reliably cause problems on your own machines, I would be willing to try running them here and see if the same problems crop up on our machines.

Cheers,
John Stone


[quote name='MisterAnderson42' date='Feb 11 2008, 07:30 PM']Update: After 3 and 1/2 hours and 1.3 million iterations (about 6 different kernel calls per iteration) my long test with the workaround in place failed with unspecified launch failure in a completely different (and very simple!) kernel.

Maybe it is just a cooling/duty cycle issue. But that still doesn't explain one one particular kernel triggers the issue in seconds and another takes hours.
[/quote]

#14
Posted 02/12/2008 04:38 PM   
[quote name='seibert' date='Feb 9 2008, 01:58 AM']...The fact that sleep fixes our problem is starting to make me suspect that this is a power and cooling issue...I need to figure out how to monitor the GPU temperature and see if it correlates with failures....
[/quote]

What is a "normal" working temperature for the Tesla C870? I typically monitor it in Linux with the NVIDIA control panel (run nvidia-settings from a terminal), but I don't think the temperature refresh rate is very high.
Have you found out any better way of monitoring this?
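(One low-tech option is to poll nvidia-settings from a script and log the readings; a sketch below. The verbose query output format, a line like "Attribute 'GPUCoreTemp' (host:0.0): 74.", is an assumption and may differ between driver versions.)

```shell
# Parse the temperature out of nvidia-settings' verbose query output
# (assumed format: "Attribute 'GPUCoreTemp' (host:0.0): 74.").
parse_temp() {
    sed -n 's/.*GPUCoreTemp.*: \([0-9]*\)\./\1/p' | head -n 1
}

# Query the current GPU core temperature (requires a running X server).
gpu_temp() {
    nvidia-settings -q GPUCoreTemp 2>/dev/null | parse_temp
}

# Log one timestamped reading per second:
# while true; do echo "$(date +%T) $(gpu_temp)"; sleep 1; done >> gpu_temp.log
```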

#15
Posted 05/21/2008 11:52 AM   