Strange crashes in __device__ function

We’ve been experiencing some rather strange device function crashes, and I’m hoping that someone might have an idea of what the problem is. We’ve got a pretty complex application, and I’ll try to boil things down to their essence.

First off, we're running CUDA compilation tools, release 7.5, V7.5.17. A very simplified version of the device function we have is as follows:

__device__ void func(structOfArrayType* aStruct, const uint32_t index, const uint64_t var, const float deltaT)
{
    uint64_t &oneVal = aStruct->memArray1[index];
    float &anotherVal = aStruct->memArray2[index];
    
    float yetAnother = aStruct->memArray3[index];
    .
    .
    .

    // And then some computation involving the above variables/references.
}

Initially, we called this function via a function pointer from a kernel (because we need to switch which function gets called at run time); we based this code on the NVIDIA function pointer example, and the call pattern is sketched after the list below. The kernel (and the code that mallocs and copies all of the device-side data) is in one .cu file; the device function is in another. Using the CUDA 6 toolkit on an M2090, this runs fine. Using the 7.5 tools on a K80, it causes “invalid global read”, “address out of bounds” errors. We tried the following:

  1. We understand from the forum that "cross CUmodule functions cannot be reliably called using function pointers". So, we tried calling the device function directly. Still crashed.
  2. We combined the kernel and device functions into a single .cu file, with the cudaMalloc calls still in the second .cu, and called the device function via a pointer (now within the same CUmodule). Still crashed.
  3. Called the function directly. Still crashed.
  4. Updated our machine with the M2090 to the release 7.5 tools (and driver). Now our program crashes on it as well.
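
For reference, the call pattern looks roughly like the sketch below (the typedef, kernel, and host-side names are placeholders for illustration, not our actual code):

typedef void (*funcPtr_t)(structOfArrayType*, uint32_t, uint64_t, float);

// Device-side function pointer, statically initialized to the __device__ function above.
__device__ funcPtr_t d_funcPtr = func;

__global__ void kernel(structOfArrayType* aStruct, uint64_t var, float deltaT,
                       funcPtr_t fp)
{
    uint32_t index = blockIdx.x * blockDim.x + threadIdx.x;
    fp(aStruct, index, var, deltaT);   // indirect call selected at run time
}

// Host side: fetch the device function pointer, then pass it to the kernel.
funcPtr_t h_fp;
cudaMemcpyFromSymbol(&h_fp, d_funcPtr, sizeof(funcPtr_t));
kernel<<<grid, block>>>(d_aStruct, var, deltaT, h_fp);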

And now, for a final few bits of strangeness:

  1. If we add an assert at the start of the device function, something like:
    assert( index < aStruct->scalarMember * aStruct->anotherScalarMember );
    

    the program doesn’t crash.

  2. If we compile with the nvcc -G option, it doesn't crash.
  3. If we change the floats above to doubles, it doesn't crash.
  4. If we remove the references above, instead copying values to local variables and then copying back to global memory at the end of the computation, it doesn't crash (sketched below).
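
To make item 4 concrete, the local-copy variant looks roughly like this (same placeholder member names as above, not our exact code):

__device__ void func(structOfArrayType* aStruct, const uint32_t index, const uint64_t var, const float deltaT)
{
    // Read into registers instead of holding references into global memory.
    uint64_t oneVal     = aStruct->memArray1[index];
    float    anotherVal = aStruct->memArray2[index];
    float    yetAnother = aStruct->memArray3[index];

    // ... the same computation, now on the local copies ...

    // Write the results back to global memory once, at the end.
    aStruct->memArray1[index] = oneVal;
    aStruct->memArray2[index] = anotherVal;
}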

So, does anyone have an idea of what’s going on here? At this point, we’ve pretty much tried systematically eliminating each thing we think might be causing the problem, with no luck. And, of course, other changes that shouldn’t matter fix the problem. Also, we’re doing pretty much exactly the same thing in other code in this program, without any trouble. Thanks!!

  1. Somewhere in the above sequence, you reduced everything to a single file that still crashes.

Keep removing things that don’t seem to be necessary.

Try to reduce that file to fewer than 400, or ideally fewer than 100, lines of code that still crashes. Then post that file; I’ll bet someone can help you.

  1. Try CUDA 8.0 RC. Bugs get fixed all the time.

  2. You’ve already started down the path of debugging with cuda-memcheck, based on this:

“it causes “invalid global read”, “address out of bounds” errors.”

If you compile with -lineinfo, you can narrow this down to a single line of code that is causing that failure. Once you know that line of code, you can use in-kernel printf to inspect the indexing that is causing the out-of-bounds access. -lineinfo gives you the tail of the tiger. Hang onto that tail tightly, and it will lead you to the fault.
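
For example, something along these lines (just an illustrative sketch, reusing the member names from your snippet, with a hypothetical bound check):

#include <cstdio>   // for device-side printf

// Temporary instrumentation: compile with -lineinfo and run under cuda-memcheck,
// then print whenever the index looks out of range for the arrays involved.
__device__ void func(structOfArrayType* aStruct, const uint32_t index, const uint64_t var, const float deltaT)
{
    if (index >= aStruct->scalarMember * aStruct->anotherScalarMember)
        printf("suspect index %u in block %u, thread %u\n", index, blockIdx.x, threadIdx.x);

    // ... rest of the function unchanged ...
}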

These are all very good suggestions. We actually know what line the crash is on. Unfortunately, it’s just a pretty vanilla arithmetic computation. If we retain the references, we get an invalid read, indicating that there’s an issue reading one of the operands. If we remove the references, instead copying items in global memory to local vars (register memory), then we get an invalid write upon saving the result back to global memory. Importantly, this global memory location was a location that we read one of the operands from – but the read didn’t trigger an error.

In addition, the assert() statement doesn’t involve changing any of the other code – exactly the same code then will generate no run-time errors.

All of this leads to the following conclusions:

  1. The problem isn’t an array index out-of-bounds condition. None of the above changes the index or the array, and we can make the error happen or not without changing either (and sometimes the error will not occur for some accesses but will for others).
  2. This looks very much like a compiler/translator/driver bug. Examination of PTX code might allow us to rule out a compiler bug, but it’s not clear that that would have any practical benefit.
  3. If code were being overwritten, I could imagine all kinds of weird things going on. But I assume that it’s not possible to overwrite GPU code from the GPU.

Comments? I don’t think it makes sense to move to non-released code on a production machine, so it seems like the next step would be to downgrade to CUDA 6, which worked fine on our M2090, assuming CUDA 6 supports a K80.

It’s quite possibly a compiler bug. The suggestion to try to reduce it down to a manageable size was with an eye towards enabling you to file a bug, if you wished to.

I wouldn’t spend any time analyzing PTX, as that is not necessarily representative of what the machine actually executes. If there is a compiler issue, it should presumably be possible to discover it with examination of the SASS code, but that could be an exercise that is not for the faint of heart, depending on the level to which you were able to reduce the scope of the problem (as discussed above).

Did you try CUDA 8.0RC? I’m not suggesting you have to move production machines to it, but I think it would be a useful datapoint. If by chance the problem is rectified there, then there would be little point filing a bug. The QA group tests bugs against latest tools, and if the issue does not reproduce, nobody spends any time on it, except to say “use the latest tools”.

From the totality of the symptoms described, it sounds like a compiler bug may be in play here. You may also want to check for undefined, or implementation-defined, C/C++ behavior in the code, as that can be the cause of latent bugs that may then be exposed by compiler changes.
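
As a generic illustration (not something taken from the code in question), type punning through pointer casts is a common example of such latent undefined behavior in device code; CUDA provides intrinsics for reinterpreting bits safely:

__device__ unsigned int bad_bits(float x)
{
    return *reinterpret_cast<unsigned int*>(&x);   // violates strict aliasing: undefined behavior
}

__device__ unsigned int good_bits(float x)
{
    return __float_as_uint(x);                     // well-defined CUDA intrinsic
}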

For a quick experiment, and as a potential workaround while you wait for resolution of your bug report with NVIDIA, I would suggest reducing the PTXAS optimization level. The default is -O3. Try reducing it to a less aggressive setting with -Xptxas -O2, then -Xptxas -O1 if that does not help, and finally -Xptxas -O0. If that makes the issue disappear, it usually does so with only a modest loss of performance, as all the high-level optimizations are still applied by NVVM.