On the host side, I can call gdb.inferiors()[0].read_memory(data_address, size) to dump process memory using the gdb Python interface. However, when I do the same for device memory, the output is all zeros.
Is this simply not implemented in cuda-gdb, or is there something else I need to call?
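A minimal sketch of what I'm running, entered at the gdb Python prompt (data_address and size here are placeholders taken from the debugged program):

(cuda-gdb) python
>inf = gdb.inferiors()[0]
>buf = inf.read_memory(data_address, size)
>print(buf.tobytes())
>end

For a host pointer this prints the expected bytes; for a device pointer it prints only zero bytes.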
The following workaround allows reading device memory from Python. For example, in this application, breaking before the kernel launch (line 52), 0x10207600000 is the address of the array on the device:
(cuda-gdb) l
47
48 HANDLE_CUDA_ERROR(cudaMalloc((void**)&d, sizeof(int)*N));
49 HANDLE_CUDA_ERROR(cudaEventCreate(&asyncWaitEvent));
50 HANDLE_CUDA_ERROR(cudaMemcpy(d, idata, sizeof(int)*N, cudaMemcpyHostToDevice));
51
52 bitreverse<<<1, N, N*sizeof(int)>>>(d);
53 HANDLE_CUDA_ERROR(cudaGetLastError());
54 HANDLE_CUDA_ERROR(cudaEventRecord(asyncWaitEvent, 0));
55
56 /* Spin on the host while kernel is running */
(cuda-gdb) p d
$3 = (void *) 0x10207600000
(cuda-gdb) python