Errors and Lockups

bdg146psu · September 17, 2008, 2:20pm

System Specs:
Fedora 9 32-bit
Kernel 2.6.25.14-108
EVGA GTX 280
Driver 177.67
CUDA v2.0

I’ve been getting some unpredictable behavior from some of my CUDA programs. In some cases, while a CUDA program is running, xorg suddenly starts taking up a lot of CPU time. This often results in X locking up, which requires a reboot.

One specific incident lately with cufft has me really puzzled. I’m attempting to do about 50,000 1024-point complex FFTs in batch mode. Perhaps this is too many, but I haven’t been able to find any documented limit on fftsize or the number of ffts. The strange thing is that if I run two of these back to back, the first seems to execute fine, but the second gives me the following errors:
cufft: ERROR: root/cuda-stuff/sw/rel/gpgpu/r2.0/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: root/cuda-stuff/sw/rel/gpgpu/r2.0/cufft/src/cufft.cu, line 119
cufft: ERROR: CUFFT_EXEC_FAILED

I attempted to put four simple printfs in my code to see whether it is the cufft plan creation, the cufft execution, or the cufft plan destruction that is causing the errors. When these printfs are placed in my code, X locks up and I must reboot. How can adding these printfs change the behavior from the above error messages to X locking up and needing a reboot?

This unpredictability has me pretty confused and it’s making me wonder if I have some sort of driver issue.

Any help is greatly appreciated,
thanks!

netllama · September 17, 2008, 2:34pm

Please note that Fedora 9 is not currently supported (it will be supported in the next CUDA release).

That said, this could be an X bug, or a bug in places outside of the CUDA driver. You should first verify whether this problem persists if X isn’t running.

If it does, you should also verify that you’re using the latest motherboard BIOS, and generate and attach an nvidia-bug-report.log along with a test app which reproduces the problem.

thanks,
Lonni

tmurray · September 17, 2008, 3:05pm

Is this a factory overclocked GTX 280, by any chance?

bdg146psu · September 17, 2008, 3:09pm

No it isn’t. It’s running at the standard clock rate.

Running without X doesn’t seem to alleviate any of the issues.

I will look into the BIOS version and see if I can get a test app together to send over.

Thanks for the replies!

bdg146psu · September 18, 2008, 8:04pm

I’m having some more trouble with the CUFFT library today.

I realize that Fedora 9 isn’t supported, but I’ve read of others having success with it, so perhaps this is being caused by something else.

I was originally trying to create a C2C FFT plan for a 16384-point FFT. I’m doing 68 of these in a batch.

This is all pre-existing code that was working yesterday. I simply changed the interface to it, so not much should’ve changed. However, I was getting a host of errors: one ‘CUFFT_ALLOC_FAILED’ followed by two ‘CUFFT_INVALID_PLAN’ errors. That led me to think that the plan was the portion that was causing the error. I’ve commented out all of the cufft stuff besides the declaration of the plan and the line that initializes the plan.

cufftHandle plan;
CUFFT_SAFE_CALL( cufftPlan1d(&plan, fftsize, CUFFT_C2C, numFFTs) );

where fftsize = 16384 and numFFTs = 68.
This still results in an error. Only one error though, the ‘CUFFT_ALLOC_FAILED’ error, which is somewhat expected.

I realize that by declaring and initializing a plan, memory is being allocated. However, I should be using less than 9 MB of data, so I find it hard to believe it’s running out of space.
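The "less than 9 MB" figure can be reproduced with a few lines of host code. This is only a sketch: batch_data_bytes is a hypothetical helper (not part of CUFFT), it assumes single-precision complex samples at 8 bytes each (two floats, as in cufftComplex), and it counts only the sample data, not any work space the plan itself may allocate.

```cuda
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not part of CUFFT): bytes needed to hold the samples
// of a batched single-precision C2C transform.
// Assumption: 8 bytes per sample (two 32-bit floats), matching cufftComplex.
static size_t batch_data_bytes(size_t fftsize, size_t num_ffts) {
    const size_t bytes_per_sample = 2 * sizeof(float);
    return fftsize * num_ffts * bytes_per_sample;
}

int main() {
    // Values from the post: fftsize = 16384, numFFTs = 68.
    const size_t bytes = batch_data_bytes(16384, 68);
    printf("%zu bytes (%.2f MiB)\n", bytes, bytes / (1024.0 * 1024.0));
    return 0;
}
```

With the values from the post this comes to 8,912,896 bytes, i.e. 8.5 MiB per buffer, which is small compared with the memory on a GTX 280.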

Any idea what is going on here? Is there a function call that will display the amount of memory allocated or free on the device at run time? I’m wondering if I have a memory leak somewhere. I get the same errors whether X is running or not. Any ideas? Thanks!

Oh, and where is the nvidia-bug-report.log file located? I couldn’t find it… or do I have to generate it somehow?

bdg146psu · September 18, 2008, 8:22pm

OK, well I found cuMemGetInfo(), but I’m apparently missing an include or something, because it’s saying that cuMemGetInfo is undefined. What do I need to include?

EDIT: I’ve included <cuda.h> which seems to be the ticket. It recognized I was passing it regular ‘int’ pointers instead of ‘unsigned int’, so I know it’s seeing the function definition.

The problem is that it’s giving me “undefined reference” errors in the .ii files created when I make my program. It gives me an undefined reference for cuCtxCreate, cuDeviceGet, and cuMemGetInfo. I am following the example here: http://forums.nvidia.com/index.php?showtopic=60073
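A note on the "undefined reference" messages: these come from the linker, not the compiler. The cu* functions belong to the CUDA driver API, which lives in libcuda, so a build that uses them usually needs -lcuda on the link line. As an alternative, here is a minimal sketch that reads free and total device memory through the runtime API instead. It swaps cuMemGetInfo for cudaMemGetInfo and assumes a toolkit recent enough to provide that call with size_t arguments, which may not be the case for the CUDA 2.0 release discussed in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0;
    size_t total_bytes = 0;

    // Runtime-API counterpart of cuMemGetInfo (assumed available in the
    // installed toolkit). The first runtime call creates the context
    // implicitly, so no cuCtxCreate/cuDeviceGet is needed here.
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("free: %zu MiB of %zu MiB\n", free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Calling this before and after plan creation gives a rough picture of how much device memory the plan consumes, and whether memory is leaking between runs.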

bdg146psu · September 19, 2008, 2:41pm

I’ve narrowed the problem a bit. For some reason, when fftsize > 4k, I get the CUFFT_ALLOC_FAILED error, using both pinned and unpinned memory.

Why is it not allowing me to create a cufft plan with greater than a 4096-point fft? Even if I am only doing a few ffts in batch, it still throws an error.

Could this really be an OS issue? How about cuMemGetInfo? I still can’t figure out how to use it, but I think it would be useful to see how much free memory is available, since the error I’m getting implies there are insufficient resources available to create an fft of that size.

bdg146psu · September 19, 2008, 6:48pm

Well, despite the fact that I’m not getting any replies, I may as well update this thread as it may help someone in the future. The error had nothing to do with cufft. I added the following code after every line that uses the device:

cudaError_t err;

err = cudaGetLastError();
if (cudaSuccess != err)
{
    // handle error
}

This showed that the error was occurring before the cufft plan was being initialized. It was giving an “unspecified launch failure” for a cudaThreadSynchronize() call. That still didn’t make much sense to me, but at least it moved my focus away from cufft.

It turns out an index within the kernel I had written was wrong. I had been calculating an array index as [base + (offset/2)]. I later changed the code to use [base + offset] instead, but I missed one spot, so one index still used [base + (offset/2)]. Because I was testing with fftsize < 4k, the error didn’t appear until much later, when I tried an fftsize > 4k. It is still a bit confusing, though: (offset/2) < offset, so [base + (offset/2)] should have stayed within the bounds of the array. I should mention that I was using this index for a read, not a write. I don’t understand why it caused problems, but it did, and now it’s fixed and it works.

At any rate, I do recommend the above code for checking errors for CUDA calls. It may not have attributed the error to my kernel, but at least it was closer to pinpointing the problem than cufft’s default output.
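For anyone who wants to apply this check after every call without repeating it by hand, the pattern can be wrapped in a macro. The following is a minimal sketch, not code from the original program: CHECK_CUDA and the scale kernel are hypothetical names introduced for illustration, and cudaDeviceSynchronize is used in place of cudaThreadSynchronize, which later toolkits deprecated.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file and line of the first failing call.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel; the bounds guard keeps every access inside the array.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data = NULL;

    CHECK_CUDA(cudaMalloc((void **)&d_data, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    CHECK_CUDA(cudaGetLastError());       // errors in the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_data));
    printf("ok\n");
    return 0;
}
```

Kernel launches are asynchronous, so a bad index inside a kernel typically surfaces at the next synchronizing call rather than at the launch. Checking right after each launch and each synchronize keeps the reported location close to the kernel that actually failed, instead of letting the failure show up later in an unrelated library call.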