This is not the first time this has occurred, and not the first time it has caused me a great deal of frustration and wasted effort. I am beginning to wonder why one should even bother with sound development practices and use the debugger, when the debugger itself is lazy.
Why does the debugger not halt or terminate on internal errors? In my opinion, such an error is fatal: as fatal as a segmentation fault.
Look at the stack:
poll() at 0x7ffff6a8c8ad
cudbgApiDetach() at 0x7ffff23cbc22
cudbgReportDriverInternalError() at 0x7ffff23c6180
cudbgReportDriverInternalError() at 0x7ffff23c7cff
cuMemGetAttribute_v2() at 0x7ffff232cf8a
cudbgGetAPIVersion() at 0x7ffff24485d8
cudbgGetAPIVersion() at 0x7ffff2448a78
cuMemGetAttribute_v2() at 0x7ffff2368a04
cuMemGetAttribute_v2() at 0x7ffff2331c7c
cuMemGetAttribute_v2() at 0x7ffff2332278
cuMemGetAttribute_v2() at 0x7ffff22a13f2
cuMemGetAttribute_v2() at 0x7ffff22a6685
cuMemcpyDtoDAsync_v2() at 0x7ffff2283759
cudart::cudaApiMemcpyAsync() at 0x43a47a
cudaMemcpyAsync() at 0x463526
Did the debugger halt on registering a driver internal error? No.
Should the debugger have halted? It certainly should have.
Why did the debugger not halt? Good question.
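One stopgap I can suggest, assuming the symbol is resolvable in your build of the tools, is to make the debugger break on the reporting function by hand rather than waiting for it to volunteer:

```
(cuda-gdb) break cudbgReportDriverInternalError
```

This at least converts the silent report into a stop, though it says nothing about why the debugger does not do this by default.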
Below is another instance:
I am truly fortunate to even catch these: at present, I suspend the debugger whenever execution takes longer than expected, knowing that I will likely be greeted by cudbgReportDriverInternalError().
And it is fatal: it destabilizes the device and causes erroneous results.
Conventional error checking, such as cudaGetLastError(), seems to ignore cudbgReportDriverInternalError(), and the debugger certainly does not stop.
I am not sure how one is supposed to debug the causes of cudbgReportDriverInternalError() when cudbgReportDriverInternalError() is hardly reported in the first place.
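To be clear about what "conventional error checking" means here, a minimal sketch (buffer names and sizes are hypothetical): every runtime call is wrapped, and the stream is synchronized and re-checked after the async copy that appears at the bottom of the stack above. Even so, the driver internal error reported inside the debugger does not appear to surface as a cudaError_t.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every runtime call and abort loudly on any reported error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err__ = (call);                                 \
        if (err__ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err__));                     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    const size_t n = 1024;  // hypothetical size
    float *d_a = nullptr, *d_b = nullptr;
    cudaStream_t stream;

    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMalloc(&d_a, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_b, n * sizeof(float)));

    // the kind of async device-to-device copy seen in the stack trace
    CUDA_CHECK(cudaMemcpyAsync(d_b, d_a, n * sizeof(float),
                               cudaMemcpyDeviceToDevice, stream));

    // synchronize, then re-check: still no sign of the internal error
    CUDA_CHECK(cudaStreamSynchronize(stream));
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return 0;
}
```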
poll() at 0x7ffff6a8c8ad
cudbgApiDetach() at 0x7ffff23cbc22
cudbgReportDriverInternalError() at 0x7ffff23c6180
cudbgReportDriverInternalError() at 0x7ffff23c7cff
cuMemGetAttribute_v2() at 0x7ffff232cf8a
cuMemGetAttribute_v2() at 0x7ffff2349896
cuVDPAUCtxCreate() at 0x7ffff229dc26
cuVDPAUCtxCreate() at 0x7ffff229de43
cuLaunchKernel() at 0x7ffff2286cad
cudart::cudaApiLaunch() at 0x43f2a8
cudaLaunch() at 0x468523
I made an effort to raise this internally with NVIDIA as well, but, honestly, NVIDIA's software development team seems 'over-stretched' at present.
I have found some correlation between stream races (races at the stream level) and this occurrence, to the extent that, whenever I encounter such an instance, I double-check my code for potential stream races.
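To illustrate the kind of stream race I check for (kernel names and sizes are hypothetical): one stream consuming a buffer that another stream may still be writing, with no event ordering the two. The sketch below shows the racy launch commented out and the corrected version using cudaStreamWaitEvent().

```cuda
#include <cuda_runtime.h>

__global__ void produce(float* buf) { buf[threadIdx.x] = 1.0f; }
__global__ void consume(const float* buf, float* out) {
    out[threadIdx.x] = buf[threadIdx.x] * 2.0f;
}

int main() {
    float *d_buf = nullptr, *d_out = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(float));
    cudaMalloc(&d_out, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaEvent_t done;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreate(&done);

    produce<<<1, 256, 0, s0>>>(d_buf);

    // racy: s1 may read d_buf while s0 is still writing it
    // consume<<<1, 256, 0, s1>>>(d_buf, d_out);

    // corrected: make s1 wait on s0's work before consuming
    cudaEventRecord(done, s0);
    cudaStreamWaitEvent(s1, done, 0);
    consume<<<1, 256, 0, s1>>>(d_buf, d_out);

    cudaDeviceSynchronize();

    cudaEventDestroy(done);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_buf);
    cudaFree(d_out);
    return 0;
}
```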
The debugger also seems to be building and dumping some trace or log in the background; if that grows too large, this error likewise tends to occur.
Equally, I have found myself wondering whether the debugger is simply poorly equipped to handle massively parallel streams issuing too many asynchronous synchronization calls.