How does single-GPU debugging work on Kepler?

laughingrice · February 28, 2013, 9:12pm

I was wondering how single-GPU debugging works in Kepler environments with CUDA 5.

With CUDA 4.x on Fermi, two cards were required, because the debugger used real breakpoints and stopped the GPU dead, so it couldn't be used for graphics at the same time.

The AMD GDebugger took a different route: it saved the entire kernel state to CPU memory and exited the kernel, so it was running custom kernels for debugging.

Kepler is supposedly able to run multiple processes on the GPU (I'm not fully up to date on the level of support, i.e. whether it can actually partition the GPU). Is the debugger taking that approach (running on only part of the GPU and using real breakpoints), or does it take the GDebugger approach of building custom kernels?

Does anyone have any idea?
I have one computer running a GTX 690 and one running a GTX 640, if it makes a difference.

Sorry if this is answered somewhere and I missed it; I've been away from hard-core GPU computing for a while.

Thanks

seibert · February 28, 2013, 11:29pm

Well, you don’t have to worry about this with your GTX 690, since it is two GPUs. :)

Greg · March 1, 2013, 3:35am

Single-GPU CUDA debugging, implemented in Nsight Visual Studio Edition 2.2 and above, is supported on devices of compute capability 1.1 and above. It works by serializing the execution of kernels and implementing something close to instruction-level pre-emption. This does not require compiler or JIT changes to the assembly code, and the methodology works on non-deterministic kernels. The current implementation does not work with CUDA Dynamic Parallelism but could be adapted to support it in a future release.

Single-GPU graphics debugging, implemented in Nsight Visual Studio Edition 3.0, is based on frame replay. This means that step speed depends on the latency of replaying the frame up to the location of the breakpoint.

Other tools such as NVIDIA Fx Composer modify the kernel or shader program. All threads in the shader/kernel execute to the breakpoint and write out their state into device memory. The debugger then uses a form of replay to emulate run control.
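
To make the replay idea concrete, here is a toy CPU-side model in plain Python. It is only a sketch of the general technique, not how Fx Composer or Nsight is actually implemented: the "kernel" is a list of per-thread instructions, and the names `make_program`, `run_to_breakpoint` and `step` are hypothetical. Running to a breakpoint dumps every thread's state, and a "step" is emulated by replaying from the start to one instruction further.

```python
# Toy model of replay-based run control (illustrative only, no GPU involved).
# Each "instruction" updates one thread's register dictionary in place.

def make_program():
    """Return a tiny hypothetical kernel as a list of instructions."""
    return [
        lambda r, tid: r.update(x=tid),              # 0: x = thread id
        lambda r, tid: r.update(y=r["x"] * 2),       # 1: y = 2 * x
        lambda r, tid: r.update(y=r["y"] + 1),       # 2: y = y + 1
        lambda r, tid: r.update(z=r["x"] + r["y"]),  # 3: z = x + y
    ]

def run_to_breakpoint(program, num_threads, breakpoint):
    """Run every thread from the start up to (not including) `breakpoint`.

    Returns a snapshot of all thread state, standing in for an instrumented
    kernel that writes its state to a buffer and exits, plus the number of
    instructions executed (a rough proxy for replay cost).
    """
    snapshot = []
    executed = 0
    for tid in range(num_threads):
        regs = {}
        for instr in program[:breakpoint]:
            instr(regs, tid)
            executed += 1
        snapshot.append(regs)
    return snapshot, executed

def step(program, num_threads, breakpoint):
    """Emulate a single step by replaying from the start to breakpoint + 1."""
    new_bp = min(breakpoint + 1, len(program))
    snapshot, executed = run_to_breakpoint(program, num_threads, new_bp)
    return new_bp, snapshot, executed

if __name__ == "__main__":
    program = make_program()
    state, cost = run_to_breakpoint(program, num_threads=4, breakpoint=2)
    print(state[3], cost)
    bp, state, cost = step(program, num_threads=4, breakpoint=2)
    print(bp, state[3], cost)
```

In this model every step re-executes the whole prefix of the program, so the cost of a step grows with the distance from the start to the breakpoint. That is the same trade-off as the frame replay latency mentioned above.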