Hi,
cuDNN and cuBLAS kernels are application kernels, not system kernels. System kernels are kernels such as memcpy and memset, exposed mostly for tracing purposes. Even so, they should not be relied upon for tracing: depending on the size of the memory transaction, CUDA may take a different path that does not involve launching a kernel at all, such as using a copy engine. In that case, cuda-gdb will not hit a breakpoint or give any notification. If you need to trace these calls, I recommend placing a CPU breakpoint, or using gdb’s tracepoint feature in cuda-gdb, on cudaMalloc, cudaMemset, etc.
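For example, a minimal session sketch (the exact breakpoint locations hit will depend on how the CUDA runtime is linked into your application):

```
(cuda-gdb) break cudaMalloc
(cuda-gdb) break cudaMemset
(cuda-gdb) run
```

These are host-side breakpoints, so they fire on the API call itself regardless of whether the driver ultimately launches a kernel or uses a copy engine for the transaction.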
That said, I am unable to reproduce the inability to break on a kernel with the simpleCUBLAS example (Example 2) listed here:
http://docs.nvidia.com/cuda/pdf/CUBLAS_Library.pdf
Compiled with:
nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -O3 -Xcompiler -msse -ccbin gcc -m64 -I/usr/local/cuda-8.0/include cublas.cu -o cublas -L/usr/local/cuda-8.0/lib64 -lcublas -lm -lstdc++ -lpthread
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) r
Starting program: ./cublas
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff29fc700 (LWP 25157)]
[New Thread 0x7ffff21fb700 (LWP 25158)]
[New Thread 0x7ffff19fa700 (LWP 25159)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000000c0ac88 in void scal_kernel_val<float, float, 0>(cublasScalParamsVal<float, float>)<<<(1,1,1),(256,1,1)>>> ()
(cuda-gdb) bt
#0 0x0000000000c0ac88 in void scal_kernel_val<float, float, 0>(cublasScalParamsVal<float, float>)<<<(1,1,1),(256,1,1)>>> ()
(cuda-gdb) x/i $pc
=> 0xc0ac88 <_Z15scal_kernel_valIffLi0EEv19cublasScalParamsValIT_T0_E+8>: MOV R1, c[0x0][0x20]
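For reference, a minimal program along these lines is enough to reproduce the session above; this is a sketch rather than the documented simpleCUBLAS source, but a single cublasSscal call makes cuBLAS launch a scaling kernel (scal_kernel_val in the backtrace above) that break_on_launch application will stop on:

```cuda
// Minimal sketch: one cublasSscal call triggers a cuBLAS application
// kernel launch. Error checking is omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 256;
    float h_x[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 2.0f;
    // cuBLAS launches a scal kernel here; cuda-gdb breaks on it with
    // "set cuda break_on_launch application" in effect.
    cublasSscal(handle, n, &alpha, d_x, 1);

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f\n", h_x[0]);

    cublasDestroy(handle);
    cudaFree(d_x);
    return 0;
}
```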
Please keep in mind that break_on_launch is a performance-heavy command: it increases your execution time by an amount that scales with the number of kernels compiled for your system. If you plan to use it, please be patient; cuBLAS is rather large. We are aware of this issue and are working to fix it in a future release.