CUDA debugging and profiling problems

I am developing a CUDA program and have run into many problems with it. They are listed starting from the most important.

  1. Nsight / Start CUDA debugging doesn’t work: the debugger doesn’t stop on breakpoints in CUDA kernels. I get messages in the Output window like
    “CUDA context created : 23ecbd80090
    CUDA module loaded: 23ed75fcf00 cudaDe.cu.obj
    CUDA grid launch failed: CUcontext: 2468731158672 CUmodule: 2468924608256 Function: _Z15initCalculationP14TDistrictState
    CUDART error: cudaLaunch returned cudaErrorLaunchFailure
    CUDART error: cudaMemcpy returned cudaErrorLaunchFailure”

  2. NVIDIA Visual Profiler fails to get advanced analysis information about my kernels. The errors are usually “unspecified”, “insufficient data” and so on. The best I was able to get were some individual results when I made the executable exit after a few iterations. Setting an execution timeout in the Profiler settings or pressing Cancel both prevent this data from being collected. Adding
    cudaDeviceSynchronize();
    cudaProfilerStop();
    doesn’t help. Debug/release doesn’t make much difference.

  3. If I start a second instance of the program, host memory usage increases a lot and the program starts consuming CPU. E.g. instead of 500 MB it becomes 10 GB for instance 1 and 18 GB for instance 2 (why 18, not 8 or 16?). “Consuming” here means about 100% of one core. One instance sometimes crashes (unspecified kernel launch error). The program itself doesn’t use much RAM or CPU: below 1 GB on both the GPU and the host.

After several experiments a single instance started to occupy 18 GB, but I believe I saw the existing instance’s memory usage drop back after I closed the second instance.

  4. Compilation of the .cu file is very slow in the debug configuration (several minutes). It slows down after several messages like
    1> ptxas info : Function properties for _ZN74_INTERNAL_52_tmpxft_00002584_00000000_7_cudaDe_compute_61_cpp1_ii_a93c74ca5isnanEf
    1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

  5. CUDA initialization in the .exe is very slow (several minutes). This is fixed if I include “compute_61;sm_61” in the VS project’s CUDA settings.

  6. The compiler doesn’t detect changes in included .h and .cu files: a solution build doesn’t recompile the .cu file.

  7. If I use GPU 1 for my calculations with long-running kernels (1-2 seconds) and increase the TDR timeout, e.g. to 10 seconds, TeamViewer sometimes stops responding. I fixed this by switching the calculations to GPU 2. All the other problems reproduce on it too.

This is a genetic programming task, developed with Visual Studio 2013. I can send you a debug or release exe and a small dataset to run it. I run it on a test machine through TeamViewer: 6-core Intel Core i7-5930K, 32 GB RAM, two GeForce GTX 1080 cards with 8 GB RAM each, Windows 10 Pro 64-bit, DirectX 12, NVIDIA driver 369.30, CUDA Toolkit 8.0.

I’m not very experienced with CUDA (maybe about 2 working months in total), but I am experienced in programming in general (about 15 years). The current version of the program is primitive and far from optimal; I am optimizing it now. It may also contain memory access errors. But I suppose the tools should work with it anyway, or at least report a more specific error.
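For the launch failures in problem 1 and the possible memory access errors mentioned above, one way to narrow things down is to check the result of every runtime call and of each kernel launch. The following is only a minimal sketch, not code from the program in question: `checkCuda`, `CUDA_CHECK` and `fillKernel` are hypothetical names introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): report a failed CUDA call with its location.
// Returns true when the call succeeded.
static bool checkCuda(cudaError_t err, const char* what, const char* file, int line) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what, cudaGetErrorString(err));
    return false;
}
#define CUDA_CHECK(call) checkCuda((call), #call, __FILE__, __LINE__)

// Placeholder kernel standing in for the real one.
__global__ void fillKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;  // bounds check: out-of-range writes are a common cause of launch failures
}

int main() {
    const int n = 1 << 20;
    int* d = nullptr;
    if (!CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)))) return 1;

    fillKernel<<<(n + 255) / 256, 256>>>(d, n);
    // Errors in the launch itself (bad configuration etc.) are reported here.
    bool ok = CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel executes surface at the next synchronizing call.
    ok = CUDA_CHECK(cudaDeviceSynchronize()) && ok;

    CUDA_CHECK(cudaFree(d));
    return ok ? 0 : 1;
}
```

If the error only shows up at `cudaDeviceSynchronize`, the kernel itself faulted while running. In that case, running the executable under the `cuda-memcheck` tool shipped with the toolkit may point at the offending access.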

Gosh, too many questions.

I’ll just pick the ones that I can answer.

Nsight 5.2 debugging doesn’t work well with driver 369.30. It’s a driver issue and should be fixed by the latest driver, 376.33.

You use a 1080, which is a Pascal card, so compute_61;sm_61 should be included in the project. Otherwise CUDA will try to use the code for an older sm version, which takes a lot of time.
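To see which architecture a given machine needs, the compute capability can be queried at run time. This is a small sketch under the assumption that the build uses nvcc’s `-gencode` option; `gencodeFlag` is a hypothetical helper written for this example, and the matching Visual Studio project setting should be checked against the project’s CUDA property pages.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: build the nvcc -gencode option for a compute capability.
static std::string gencodeFlag(int major, int minor) {
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Print the compute capability of every visible GPU and the flag that targets it.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("GPU %d: %s, compute capability %d.%d, nvcc flag: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    gencodeFlag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

When the binary already contains machine code for the reported architecture, the driver does not have to compile the embedded PTX at startup, which is the delay described above.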

You should use software preemption mode to debug on GPU 1, or the TDR will be triggered and the whole display will stop.

Thank you! That’s already a lot of information!

“or the cuda will try to use the old version of sm, this will use a lot of time.”
Yes, I understood that it recompiles from the old version to 6.1, but I’m wondering why it takes so long. Compilation is in principle a linear task, and there are only about 1000 lines of my code. The only aspect that may complicate it is a lot of inline methods, though they are quite simple in principle. To inline or not to inline…

Software preemption was already turned on.

Hi MikhailM,

I’m not sure why it takes so long, but if the GPU finds that there is no matching sm code, it will try to compile the old sm code to GPU binary code, and this costs a lot of time. So I suggest always including the right sm code.

For the TDR issue, I made a mistake: software preemption is always on for GPU 1 debugging and the TDR. I guess that TeamViewer works like VNC, right? Actually, GPU 1 handles both display and CUDA computing, so if your CUDA kernel running on GPU 1 uses a lot of resources, it will slow down the display even without Nsight.

Hope these can help you.

Best Regards
Harry

“For TDR issue, I made a mistake, software preemption is always on for GPU 1”
Ok

“I guess that TeamViewer works like VNC, right?”
I don’t know any details of its implementation. They are both analogues of Remote Desktop, yes.

“it will slow down the display even without the Nsight”
Ah, I will check… I supposed that since GPUs have supported multitasking for a long time, they would handle it nicely by now. I also read that if the number of blocks is not large, they will not necessarily be spread across all multiprocessors; for example, 20 blocks may occupy 10 of 20 multiprocessors. Maybe this was also done to keep more resources free. But what I saw is that two kernels are serialized (maybe because I used one stream? I haven’t worked with GPU streams yet), plus this problem with the display.
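On the stream question: within one process, kernels issued to the same stream run in order, while kernels issued to different non-default streams are allowed to overlap if the GPU has free resources. The sketch below only illustrates that launch pattern; `spin` and `runTwoStreams` are hypothetical names, the grid sizes are arbitrary, and whether the two kernels really overlap on a given card would have to be confirmed in the profiler timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: busy work on a small grid, so one launch cannot fill the GPU by itself.
__global__ void spin(float* out, int iters) {
    float v = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Launch the same kernel into two non-default streams; returns the first error seen.
static cudaError_t runTwoStreams() {
    const int blocks = 4, threads = 128, iters = 1 << 16;
    const size_t bytes = blocks * threads * sizeof(float);
    float *a = nullptr, *b = nullptr;
    cudaStream_t s1 = nullptr, s2 = nullptr;

    cudaError_t err = cudaMalloc(&a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&b, bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&s1);
    if (err == cudaSuccess) err = cudaStreamCreate(&s2);
    if (err == cudaSuccess) {
        // Fourth launch parameter selects the stream; third is dynamic shared memory (none here).
        spin<<<blocks, threads, 0, s1>>>(a, iters);
        spin<<<blocks, threads, 0, s2>>>(b, iters);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for both streams

    if (s1) cudaStreamDestroy(s1);
    if (s2) cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err;
}

int main() {
    cudaError_t err = runTwoStreams();
    std::printf("two-stream run: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

This covers kernels launched from a single process only. How kernels from two separate instances of a program share the GPU is a different matter and is not shown here.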

Thank you very much, harryz_! I installed the latest NVIDIA drivers and now CUDA debugging and Nsight profiling work very well! So problems 1 and 2 are fixed.

I also found that compilation of .cu files in VS is slow when the Generate GPU Debug Info and Generate Line Number Information options are turned on.

Hi MikhailM,

Generating GPU debug info and line info slows down the compilation, as it takes time to generate the pdb file for debugging. You can turn them off when building the release binaries.

And does the profiler issue still exist with the latest driver? Unspecified launch failures are always caused by the driver.

Best Regards
Harry

Hi!

Yes, I understand, but on a computer that is 2-3 times slower, compilation takes 20 seconds at most (there it targets compute capability 2.1), versus several minutes on the faster computer for compute capability 6.0, and several seconds on it without debug info.

The profiler mostly works, even if the program doesn’t exit normally within the timeout. Analyze All in Unguided mode sometimes produces messages like “not enough data” for several sections, but this is a minor problem.
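For anyone who still sees incomplete profiler data, a pattern that may help is to run a bounded number of iterations and call `cudaDeviceReset()` before the process exits, since that call flushes the collected profile data. The sketch below assumes this shape; `step` and `profiledRun` are hypothetical names, and the iteration count is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Placeholder kernel standing in for one iteration of the real computation.
__global__ void step(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Hypothetical helper: run a bounded number of iterations, then flush profile data.
static cudaError_t profiledRun(int iterations) {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMemset(d, 0, n * sizeof(float));

    cudaProfilerStart();  // start of the profiled region
    for (int it = 0; it < iterations && err == cudaSuccess; ++it) {
        step<<<(n + 255) / 256, 256>>>(d, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for all kernels to finish
    cudaProfilerStop();   // end of the profiled region

    cudaFree(d);
    // Destroys the context on the current device and flushes profile data.
    cudaError_t resetErr = cudaDeviceReset();
    return err != cudaSuccess ? err : resetErr;
}

int main() {
    cudaError_t err = profiledRun(10);
    std::printf("profiled run: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Keeping the run short and letting the process exit by itself avoids relying on the profiler’s timeout or Cancel button to end the session.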

Thanks!