Program works on only one computer. Why?

Hello, community!
I have a problem, and it is getting me down…

I have a CUDA program that is used for scientific calculations. On my home computer (Win7, GTX 750 Ti, CUDA 7.5) it works well: I can launch 8192 threads, and that is enough for me.

But this program works only on my home computer! On my work computers (Win10, GTX 1060, CUDA 8.0; Win7, GTX 1060, CUDA 7.5) it runs for only a few seconds and then crashes with ‘unspecified launch failure’ or memory errors, and the error is reported at a ‘cudaMemcpy(DeviceToHost)’ call. However, I can successfully run the program with 512 threads…

If I run my program on my home computer with the GTX 1060 taken from the work computer, it works very well; I can even launch 32 768 threads!

I tried compiling the project on the work computer with CUDA 8.0 and 7.5, and nothing changed. All settings are the same. All software is the same. What could be the problem? Could it be that my home computer has been bewitched by my fairy godmother?

The first thing you would want to do is to implement proper CUDA error checking in this program. Check the status of every CUDA API call and every kernel launch. Also, check the status of every host library call such as malloc().

In addition, try running your program under the control of cuda-memcheck, which can find various out-of-bounds issues and some kinds of race condition on the GPU.
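As an illustration of the error checking recommended above, here is a minimal sketch. The macro name `CUDA_CHECK` and the helper `round_trip` are hypothetical and not taken from the original program; the idea is simply that every runtime API call and every host allocation has its status inspected.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original program): abort with file and
// line information if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
do {                                                                  \
    cudaError_t err_ = (call);                                        \
    if (cudaSuccess != err_) {                                        \
        fprintf(stderr, "CUDA error in %s:%d: %s\n",                  \
                __FILE__, __LINE__, cudaGetErrorString(err_));        \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Copies one double to the device and back, checking every call on the way.
double round_trip(double value)
{
    // Host allocations deserve a check as well.
    double *h_buf = (double *)malloc(sizeof(double));
    if (h_buf == NULL) {
        fprintf(stderr, "malloc failed\n");
        exit(EXIT_FAILURE);
    }
    *h_buf = value;

    double *d_buf = NULL;
    double result = 0.0;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, sizeof(double)));
    CUDA_CHECK(cudaMemcpy(d_buf, h_buf, sizeof(double), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(&result, d_buf, sizeof(double), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_buf));
    free(h_buf);
    return result;
}

int main(void)
{
    printf("round trip value: %g\n", round_trip(0.01));
    return 0;
}
```

The same executable can then be run under the memory checker from a command prompt, for example `cuda-memcheck your_app.exe`, where `your_app.exe` is a placeholder for the actual program name.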

Thanks for such a fast answer!

I already check the return status of all CUDA function calls, and I found that all errors vanish if I remove the line in which I call cudaMemcpy(HostToDevice) for one of several variables. All these variables have type ‘double’ and aren’t arrays. That variable is no different from the rest. I don’t understand this situation.

I tried cuda-memcheck, but I don’t remember the result… I will try it again and report back.

Are you able to successfully build and run the example codes that come with CUDA on both computers?

Without seeing the code, it is not possible to give other than very general advice. Use standard debugging techniques (code bisection etc). The root cause could even be in the host code, causing bad data to be passed to the GPU. Unfortunately there isn’t a valgrind implementation for Windows, best I know; otherwise that would be a good tool to try as well.

Make sure to check for kernel status properly, that is, both synchronous and asynchronous errors. E.g.:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

Hello again!

So, I found the problem… In my program I use a step variable of type double whose value is 0.01. If I set this variable to 0.1, 0.05, 0.03, or 0.025, the program works well, but if I set a smaller value, I get an unspecified launch failure. Why can this program work with a value of 0.01 on my home computer, but not on the work computer? I understand that there is some error in my code that I don’t know about, but why can this code be launched successfully on one computer and not on another?

Maybe your home computer has the WDDM TDR timeout disabled (or lengthened) and the work computer does not. Perhaps you are just hitting a WDDM TDR timeout.
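One way to compare the two machines on this point is to ask the runtime whether the device reports a run time limit on kernels. The sketch below uses the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`; the function name `has_kernel_timeout` is hypothetical. The field only says whether a limit exists, not how long it is. On Windows the actual delay is controlled by the `TdrDelay` registry value (2 seconds by default), which this code does not read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports a run time limit on
// kernels (watchdog active), 0 if it does not, and -1 if the query failed.
int has_kernel_timeout(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;
        }
        // Print the facts that matter when comparing two machines.
        printf("device %d: %s, compute capability %d.%d, kernel timeout: %s\n",
               dev, prop.name, prop.major, prop.minor,
               has_kernel_timeout(dev) == 1 ? "enabled" : "disabled");
    }
    return 0;
}
```

Running this on both the home and the work computer shows whether they differ in this respect.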

Thanks for the replies!

I checked TDR and set it to 0 (as I understand it, that turns it off), and then the program started to crash with the error ‘the launch timed out and was terminated’. As far as I can see, the program runs for a few seconds and then crashes with this error… But if I use a smaller number of threads, the program works well…

If a program, on Windows, crashes with this error:

the launch timed out and was terminated

then you have definitely not disabled the TDR mechanism.
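To tell the two failure modes in this thread apart in code rather than by reading the message text, the error code returned after synchronizing can be inspected. This is a sketch with a hypothetical helper `classify_failure`. The enumerators `cudaErrorLaunchTimeout` ("the launch timed out and was terminated") and `cudaErrorLaunchFailure` ("unspecified launch failure") are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: maps the error codes seen in this thread to a hint.
const char *classify_failure(cudaError_t err)
{
    switch (err) {
    case cudaSuccess:
        return "ok";
    case cudaErrorLaunchTimeout:
        return "watchdog timeout: the kernel ran longer than the OS allows";
    case cudaErrorLaunchFailure:
        return "kernel fault: e.g. an out-of-bounds access on the device";
    default:
        return "other error";
    }
}

// Trivial kernel used only to have something to launch.
__global__ void tiny_kernel(int *out)
{
    *out = 42;
}

int main(void)
{
    int *d_out = NULL;
    cudaError_t err = cudaMalloc((void **)&d_out, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    tiny_kernel<<<1, 1>>>(d_out);
    err = cudaGetLastError();            // launch-time (synchronous) errors
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();   // execution (asynchronous) errors
    }
    printf("%s (%s)\n", classify_failure(err), cudaGetErrorString(err));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```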

Sounds to me like the program is violating the hardware constraints, in other words, the compute capability constraints.

Each NVIDIA graphics card has a particular compute capability; CUDA programs have to adapt to these constraints, otherwise the launch will fail.

In case both cards have the same compute capability then you can probably safely ignore what I wrote, otherwise I would investigate this if I were you ! ;) :)

Txbob has already pinpointed the problem correctly.
No need to send off the thread opener in a wrong direction.

This is not certain. Where is the error message coming from?

There is also evidence for the opposite, i.e. for my hypothesis:

"But if I use a smaller number of threads, the program works well…"

This could indicate that too many threads were being used, for example exceeding the maximum thread count constraint.

However this should produce an error message like: “the kernel failed to launch”.
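For comparison, here is a small sketch of what an over-limit launch configuration looks like in practice. It launches an empty kernel once with the device's `maxThreadsPerBlock` and once with one thread more; `noop_kernel` and `try_launch` are hypothetical names. The over-limit launch is rejected immediately by `cudaGetLastError()` with `cudaErrorInvalidConfiguration` ("invalid configuration argument"), which is a different error from both messages reported in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: only the launch configuration matters here.
__global__ void noop_kernel(void)
{
}

// Hypothetical helper: launches one block with the given number of threads
// and returns the first error seen, either at launch or during execution.
cudaError_t try_launch(int threads_per_block)
{
    noop_kernel<<<1, threads_per_block>>>();
    cudaError_t err = cudaGetLastError();   // configuration errors show up here
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();      // execution errors show up here
    }
    return err;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    int at_limit = prop.maxThreadsPerBlock;
    int over_limit = at_limit + 1;
    printf("%d threads per block: %s\n", at_limit,
           cudaGetErrorString(try_launch(at_limit)));
    printf("%d threads per block: %s\n", over_limit,
           cudaGetErrorString(try_launch(over_limit)));
    return 0;
}
```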

Thanks for the replies!

Well, I increased the TDR delay and used a moderate number of threads, and the program works. So, as far as I understand, the problem was TDR. I think the problem is solved. Thanks to all!
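Besides raising the TDR delay, a common alternative is to split one long-running kernel into several shorter launches so that each one finishes well within the limit. The sketch below is illustrative only: `step_kernel` and `run_in_chunks` are hypothetical and stand in for the real computation, which was not posted. It assumes `chunk_iters` is greater than zero.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical, not the poster's code): every thread
// advances its own value by 'iters' increments of size 'step'.
__global__ void step_kernel(double *x, int n, int iters, double step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double v = x[i];
        for (int k = 0; k < iters; ++k) {
            v += step;
        }
        x[i] = v;
    }
}

// Performs total_iters increments per element, split into launches of at
// most chunk_iters each. Returns x[0] afterwards, or -1.0 on any CUDA error.
double run_in_chunks(int n, int total_iters, int chunk_iters, double step)
{
    double *d_x = NULL;
    double first = -1.0;
    if (cudaMalloc((void **)&d_x, n * sizeof(double)) != cudaSuccess) {
        return -1.0;
    }
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    bool ok = cudaMemset(d_x, 0, n * sizeof(double)) == cudaSuccess;
    for (int done = 0; ok && done < total_iters; done += chunk_iters) {
        int left = total_iters - done;
        int iters = left < chunk_iters ? left : chunk_iters;
        step_kernel<<<blocks, threads>>>(d_x, n, iters, step);
        // Each short launch is checked and finished before the next starts.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok && cudaMemcpy(&first, d_x, sizeof(double),
                         cudaMemcpyDeviceToHost) != cudaSuccess) {
        ok = false;
    }
    cudaFree(d_x);
    return ok ? first : -1.0;
}

int main(void)
{
    // 1000 steps of 0.01 per element, done as 10 launches of 100 steps each.
    printf("x[0] = %f\n", run_in_chunks(8192, 1000, 100, 0.01));
    return 0;
}
```

Intermediate state lives in device memory between launches, so the result does not depend on how the work is divided, while each individual launch becomes shorter as `chunk_iters` is reduced.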

Skybuck, I don’t think the problem is the different compute capability, because my GTX 750 Ti has compute capability 5.0 and the GTX 1060 has 6.1. But the program works well on the 750 Ti =)