cuda-gdb Error: Failed to suspend device (dev=0, error=10).
Hello everyone,

I would like to get some high-level suggestions or hypotheses about
an odd problem I'm experiencing.

I have a program that is essentially a tree exploration, based on a
recursive call on the host side, where, at each call, a blocking kernel is launched.

The problem is that the execution aborts during a kernel call with
the generic unspecified failure message.
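
A minimal sketch of how each blocking launch could be checked, so that the failing call and its launch number are reported. The kernel `exploreNode`, the function `explore`, and the counter `g_launches` are hypothetical stand-ins, not the code from this post; only the CUDA runtime calls are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static long g_launches = 0;  // number of kernel launches issued so far

// Report the launch number and the CUDA error string, then abort.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "launch %ld: %s failed: %s\n",
                g_launches, what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for the real per-node kernel.
__global__ void exploreNode(int *out, int depth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = depth;
}

// Hypothetical host-side recursion: one blocking launch per tree node.
void explore(int *d_out, int depth, int maxDepth, dim3 grid, dim3 block)
{
    if (depth == maxDepth) return;
    ++g_launches;
    exploreNode<<<grid, block>>>(d_out, depth);
    check(cudaGetLastError(), "kernel launch");          // bad launch configuration
    check(cudaDeviceSynchronize(), "kernel execution");  // errors raised while running
    explore(d_out, depth + 1, maxDepth, grid, block);    // first child
    explore(d_out, depth + 1, maxDepth, grid, block);    // second child
}

int main()
{
    dim3 grid(4), block(64);
    int *d_out = NULL;
    check(cudaMalloc(&d_out, grid.x * block.x * sizeof(int)), "cudaMalloc");
    explore(d_out, 0, 10, grid, block);
    printf("completed %ld launches\n", g_launches);
    check(cudaFree(d_out), "cudaFree");
    return 0;
}
```

Synchronizing after every launch keeps the error attached to the launch that caused it, which matters when the failure only appears after thousands of calls.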

Unfortunately the error is not deterministic: it occurs consistently, but at a different
time step on every run, and only on 2 specific machines (after roughly 10K kernel calls).
The only case in which the problem does not appear is when the kernels are launched with a single block.
The other machines I tested show no errors at all with any grid configuration.

cuda-gdb reports a strange error message (error=10). I did not find any documentation about it,
but I believe it could be related more to the OS/HW than to the program itself.
The code is rather involved, but it has been tested on different systems.
The partial executions are identical up to the kernel crash, so I do not suspect a bug in the code,
and cuda-memcheck reports no errors.

On every machine kernelExecTimeoutEnabled reads 0, and I compiled with -arch=sm_21.
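
For reference, the flag mentioned above can be read through `cudaGetDeviceProperties`. This is a small sketch; the helper name `queryTimeoutFlag` is an assumption introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: store the run-time limit flag of device `dev` in *flag.
// Returns the CUDA error code so the caller can tell a failed query from flag == 0.
cudaError_t queryTimeoutFlag(int dev, int *flag)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err == cudaSuccess) {
        *flag = prop.kernelExecTimeoutEnabled;
    }
    return err;
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        int flag = -1;
        err = queryTimeoutFlag(dev, &flag);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // 1 means the driver may kill long-running kernels on this device.
        printf("device %d: kernelExecTimeoutEnabled=%d\n", dev, flag);
    }
    return 0;
}
```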

The following 2 configurations cause the problem:
Red Hat Enterprise Linux Workstation release 6.2 (Santiago), 12-core Xeon e55645 at 2.4 GHz, 32 GB RAM
Tesla C2075,
nvcc release 4.1, V0.2.1221

openSUSE 11.3 (x86_64), Host: 4-core Xeon e5405 at 2 GHz, 2 GB RAM
Quadro 4000
nvcc release 4.0, V0.2.1221

Every other system works fine; here is an example:

Mandriva Linux release 2011.0 (Official) for x86_64, AMD Opteron 270, 2.01 GHz, 4 GB RAM
GTS 450
nvcc release 4.0, V0.2.1221

Do you have any high-level suggestions?
I am probably missing some setup or configuration issue related to the OS or the cards.

Do you have any details about the error message I get?

Thank you in advance for your comments,
Alessandro

#1
Posted 04/12/2012 11:10 AM   
Hello, I'm getting exactly the same error.
Memcheck says everything is ok, but cuda-gdb just can't finish running my program.

[Launch of CUDA Kernel 15 (migration_A2A_Kernelc<<<(157,1,1),(256,1,1)>>>) on Device 0]
Error: Failed to suspend device (dev=0, error=10).

System:
CentOS 6.2 2.6.32-220.2.1.el6.x86_64, Tesla 2075

SDK:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_Nov_17_17:38:12_PST_2011
Cuda compilation tools, release 4.1, V0.2.1221

Could it be something related only to the Linux kernel? I can't test it on a Windows machine.

Any ideas what could be wrong?

M.

#2
Posted 04/13/2012 12:58 PM   
I have the same problem, and I am posting here to bring this topic back to the top of the list. Hopefully someone can give an answer.

#3
Posted 04/24/2012 03:12 PM   
I am no longer sure about this:
[i]My idea was that it could be caused by staying too long inside the kernel (possible when there is a lot of computation to do).
Indeed, I had this problem, and I fixed it by changing a parameter to shorten a loop in my kernel.
If I run my program without gdb, it appears to be stuck in an infinite loop, but without any message.[/i]

What do you think of this idea?

I also rebooted the computer, and so far I think this is what fixed my (random) problem. But I don't know why.

#4
Posted 04/27/2012 01:48 PM   
[quote name='Dext' date='27 April 2012 - 02:48 PM' timestamp='1335534522' post='1401564']
I am not sure of this now :
[i]I got this idea that it could be caused by a too long stay in the kernel (possible when there is a lot of compute to do).
Indeed, I had this problem and I changed a parameter to reduce a loop in my kernel to fix it.
If I don't run my program with gdb, it seems to be an infinite loop, but not any messages.[/i]

What do you think of this idea?

I also reboot the computer, so far I think that this is what corrected my (random) problem. But I don't know why
[/quote]

In my case, I do have a while loop, but under normal conditions
it iterates only a few times.
It looks like exactly one of the executing blocks stops responding
(checked with printf inside the kernel).
Sometimes the problem also arises before the while loop, between
kernel instantiation and execution (while the other blocks
terminate correctly).

I don't have any kernel timeout set, so I favor a hypothesis
other than a simple infinite loop.
Moreover, the fact that the same kernel sometimes runs and
sometimes does not (after thousands of previous calls)
makes me think of a different problem. My guesses (but I'm really puzzled):

- many kernels are launched and gdb does not acknowledge their termination.
I see that this is rather normal (especially if you call cudaThreadSynchronize
or similar). However, this may be a symptom in conjunction with some specific
hardware-related problem
- some stack related to GPU kernel launches is undersized
- the issue is not strictly related to the kernel thread activity or the program
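
One way to probe the "single unresponsive block" observation, sketched under assumptions: cap the loop and record which block hit the cap. The kernel `boundedLoop`, the `needed` array, and the helper `findStuckBlock` are hypothetical; the real loop condition would replace `iters < target`. This only catches a loop that overruns, so it would not detect a block that stalls before reaching the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block loops `needed[blockIdx.x]` times, but gives up
// after maxIters iterations and records its index in *stuckBlock.
__global__ void boundedLoop(const int *needed, int maxIters, int *stuckBlock)
{
    int target = needed[blockIdx.x];
    int iters = 0;
    while (iters < target) {          // stand-in for the real loop condition
        if (iters >= maxIters) {
            // Only the first block that overruns is recorded (-1 means "none yet").
            if (threadIdx.x == 0) atomicCAS(stuckBlock, -1, (int)blockIdx.x);
            break;
        }
        ++iters;
    }
}

// Returns the index of a block that hit the cap, -1 if none did,
// or -2 if a CUDA call failed.
int findStuckBlock(const int *h_needed, int blocks, int maxIters)
{
    int *d_needed = NULL, *d_stuck = NULL;
    int h_stuck = -1;
    bool ok = cudaMalloc(&d_needed, blocks * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_stuck, sizeof(int)) == cudaSuccess
           && cudaMemcpy(d_needed, h_needed, blocks * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_stuck, &h_stuck, sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        boundedLoop<<<blocks, 64>>>(d_needed, maxIters, d_stuck);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(&h_stuck, d_stuck, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_needed);
    cudaFree(d_stuck);
    return ok ? h_stuck : -2;
}

int main()
{
    int needed[4] = {3, 5, 1000000, 2};  // block 2 is made to overrun on purpose
    printf("stuck block: %d\n", findStuckBlock(needed, 4, 1000));
    return 0;
}
```

If the flag stays at -1 while the kernel still hangs, the loop is probably not the culprit, which would point back toward the hardware or driver hypotheses listed above.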

#5
Posted 04/27/2012 04:12 PM   