I am developing a CUDA program and have run into many problems with it. They are listed below, starting with the most important.
- Nsight / Start CUDA Debugging doesn’t work: the debugger doesn’t stop on breakpoints in CUDA kernels. I get messages like these in the Output window:
“CUDA context created : 23ecbd80090
CUDA module loaded: 23ed75fcf00 cudaDe.cu.obj
CUDA grid launch failed: CUcontext: 2468731158672 CUmodule: 2468924608256 Function: _Z15initCalculationP14TDistrictState
CUDART error: cudaLaunch returned cudaErrorLaunchFailure
CUDART error: cudaMemcpy returned cudaErrorLaunchFailure”
- NVIDIA Visual Profiler fails to collect advanced analysis information about my kernels. The errors are usually “unspecified”, “insufficient data” and so on. The best I was able to get were some individual results when I made the executable exit after a few iterations. Setting an execution timeout in the Profiler settings or pressing Cancel both prevent this data from being collected. Calling
cudaDeviceSynchronize();
cudaProfilerStop();
doesn’t help. Building in Debug or Release doesn’t make much difference.
- If I start a second instance of the program, host memory usage increases a lot and the program starts consuming CPU (about 100% of one core). E.g. instead of 500 MB, memory usage becomes 10 GB for instance 1 and 18 GB for instance 2 (why 18, and not 8 or 16?). One instance sometimes crashes (unspecified kernel launch error). The program itself needs neither much RAM nor much CPU: below 1 GB on both the GPU and the host.
After several experiments a single instance started to occupy 18 GB, but I believe I saw an existing instance’s memory usage drop back after I closed the second instance.
- Compilation of the .cu file is very slow in the Debug configuration (several minutes). It slows down after several messages like
1> ptxas info : Function properties for _ZN74_INTERNAL_52_tmpxft_00002584_00000000_7_cudaDe_compute_61_cpp1_ii_a93c74ca5isnanEf
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
- CUDA initialization in the .exe is very slow (several minutes). This is fixed if I include “compute_61;sm_61” in the VS project’s CUDA settings.
- The compiler doesn’t detect changes in included .h and .cu files: building the solution doesn’t recompile the .cu file.
- If I use GPU 1 for my calculations with long-running kernels (1-2 seconds) and increase the TDR timeout, e.g. to 10 seconds, TeamViewer sometimes stops responding. I fixed this by switching the calculations to GPU 2. All the other problems reproduce on it too.
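To illustrate the switch between cards mentioned in the last item, here is a minimal sketch of how a device could be listed and selected with the CUDA runtime API. The function names `listDevices` and `selectDevice` are hypothetical and not taken from the program described here. CUDA device indices are zero-based, so which index corresponds to "GPU 1" or "GPU 2" is an assumption that has to be checked against the printed list. The printed compute capability can also be compared with the code generation setting of the project (e.g. 6.1 for compute_61/sm_61).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints every CUDA device with its compute capability and whether a
// run time limit (watchdog) applies to kernels on it.
// Returns the number of devices, or 0 if the runtime reports an error.
int listDevices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            std::printf("device %d: %s, compute capability %d.%d, "
                        "kernel time limit: %d\n",
                        i, prop.name, prop.major, prop.minor,
                        prop.kernelExecTimeoutEnabled);
        }
    }
    return count;
}

// Makes the device with the given zero-based index current for the
// calling host thread. Returns false if the index is out of range or
// cudaSetDevice fails.
bool selectDevice(int index) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return false;
    }
    if (index < 0 || index >= count) {
        return false;
    }
    return cudaSetDevice(index) == cudaSuccess;
}
```

In this sketch, `selectDevice` would be called once at startup, before any allocation or kernel launch, so that all later runtime calls go to the chosen card.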
This is a genetic programming task, developed with Visual Studio 2013. I can send you a debug or release exe and a small dataset to run it. I run it on a test machine through TeamViewer: 6-core Intel Core i7-5930K, 32 GB RAM, two GeForce GTX 1080 cards with 8 GB of memory each, Windows 10 Pro 64-bit, DirectX 12, NVIDIA drivers 369.30, CUDA Toolkit 8.0.
I’m not very experienced with CUDA (maybe about 2 working months in total), but I am experienced in programming in general (about 15 years). The current version of the program is primitive and far from optimal; I am optimizing it now. It may also contain memory access errors. But I suppose the tools should work with it anyway, or at least report a more specific error.
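Since errors such as cudaErrorLaunchFailure only surface on a later runtime call, a sketch of per-launch error checking may help narrow down where a memory access error occurs. This is an illustration under assumptions, not code from the program described here: `cudaOk`, `CUDA_OK`, `fillKernel` and `runChecked` are hypothetical names, and the kernel is a trivial stand-in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reports a CUDA runtime error together with its source location.
// Returns true when err is cudaSuccess, false otherwise.
bool cudaOk(cudaError_t err, const char* what, const char* file, int line) {
    if (err == cudaSuccess) {
        return true;
    }
    std::fprintf(stderr, "%s:%d: %s failed: %s (%s)\n", file, line, what,
                 cudaGetErrorName(err), cudaGetErrorString(err));
    return false;
}

#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Hypothetical stand-in kernel: writes value into each of the n elements.
__global__ void fillKernel(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = value;
    }
}

// Launches the kernel and checks the launch and the execution separately.
bool runChecked(int n) {
    if (n <= 0) {
        return false;
    }
    int* d = nullptr;
    if (!CUDA_OK(cudaMalloc(&d, n * sizeof(int)))) {
        return false;
    }
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(d, n, 42);
    // Errors detected at launch time, e.g. an invalid configuration.
    bool ok = CUDA_OK(cudaGetLastError());
    // Errors raised while the kernel was running, e.g. a bad memory access.
    ok = CUDA_OK(cudaDeviceSynchronize()) && ok;
    cudaFree(d);
    return ok;
}
```

Synchronizing after every launch slows the program down, so a check like this is probably best limited to debugging runs. Running the executable under cuda-memcheck is another way to locate out-of-bounds accesses.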