Hello everyone,
I would like to get some high-level suggestions or hypotheses about
an odd problem I’m experiencing.
I have a program that is essentially a tree exploration, based on a
recursive call on the host side, where, at each call, a blocking kernel is launched.
The problem is that the execution aborts during a kernel call with
the generic unspecified failure message.
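To make the pattern concrete, here is a minimal sketch of the launch structure, not the actual code. The kernel expandNode, the function explore, the macro CUDA_CHECK, and the tree shape (binary, depth 13) are hypothetical placeholders. Checking both cudaGetLastError and cudaDeviceSynchronize after each launch should report the failure at the launch that caused it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line and the runtime's error string if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s (code %d)\n", __FILE__,        \
                    __LINE__, cudaGetErrorString(err_), (int)err_);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel: each thread increments one element.
__global__ void expandNode(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

static long launches = 0;

// Hypothetical host-side recursion: one blocking kernel per tree node.
void explore(int *d_data, int n, int depth, int maxDepth)
{
    if (depth == maxDepth) return;
    int threads = 128;
    int blocks = (n + threads - 1) / threads;  // more than one block
    expandNode<<<blocks, threads>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    ++launches;
    explore(d_data, n, depth + 1, maxDepth);  // assumed: two children per node
    explore(d_data, n, depth + 1, maxDepth);
}

int main()
{
    const int n = 1 << 16;
    int *d_data = NULL;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(int)));
    explore(d_data, n, 0, 13);  // 2^13 - 1 = 8191 launches in total
    printf("completed %ld launches\n", launches);
    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```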
Unfortunately the error is not deterministic: it occurs consistently on 2 specific machines,
but at a different point in every run (after roughly 10K kernel calls).
The problem disappears only when the kernels are launched with a single block.
The other machines I tested show no errors at all with any grid configuration.
cuda-gdb reports a strange error message (10). I could not find any documentation about it,
but I believe it may be related to the OS/HW rather than to the program itself.
The code is rather involved, but it was tested on different systems.
The partial executions are identical up to the kernel crash, so I do not suspect a bug in the code,
and cuda-memcheck reports no errors.
On every machine kernelExecTimeoutEnabled reads 0, and I compiled with -arch=sm_21.
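For reference, the timeout flag can be read with a small program like the following sketch, which uses only cudaGetDeviceCount and cudaGetDeviceProperties from the runtime API. It also prints the compute capability of each device, which can be compared against the -arch value used at compile time. The output format is my own choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            return 1;
        }
        // kernelExecTimeoutEnabled is 1 when a run-time limit (watchdog)
        // applies to kernels on this device, 0 otherwise.
        printf("device %d: %s, compute capability %d.%d, "
               "kernelExecTimeoutEnabled=%d\n",
               dev, prop.name, prop.major, prop.minor,
               prop.kernelExecTimeoutEnabled);
    }
    return 0;
}
```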
The following 2 configurations cause the problem:
Red Hat Enterprise Linux Workstation release 6.2 (Santiago), 12 core Xeon E5645 at 2.4 GHz, 32 GB RAM
Tesla C2075
nvcc release 4.1, V0.2.1221
openSUSE 11.3 (x86_64), 4 core Xeon E5405 at 2 GHz, 2 GB RAM
Quadro 4000
nvcc release 4.0, V0.2.1221
Every other system works fine (here is an example):
Mandriva Linux release 2011.0 (Official) for x86_64, AMD Opteron 270 at 2.01 GHz, 4 GB RAM
GTS 450
nvcc release 4.0, V0.2.1221
Do you have any high-level suggestions?
I am probably missing some setup or configuration issue related to the OS or the cards.
Do you have any details about the error message I get?
Thank you in advance for your comments,
Alessandro