Ocelot PTX Debugger

For anyone who is interested, I recently added an interactive debugger to the Ocelot PTX Emulator to help me track down bugs in ocelot. It may also be useful to people doing cuda development so I thought I would mention it here. To try it out, pull the latest revision of ocelot from subversion ( svn checkout https://gpuocelot.googlecode.com/svn/trunk/ gpuocelot ), enter the ptx 2.1 branch (gpuocelot/branches/ocelot-ptx2.1) and build ocelot (libtoolize; aclocal; autoconf; automake; sudo make install)

To enable the debugger, turn it on in the config file (configure.ocelot):

trace: {

.

.

.

		debugger: {

			enabled: true,

			kernelFilter: "testShr",

			alwaysAttach: false

		},

.

.

.

}

Enabled determines whether or not the debugger is on, kernelFilter specifies the name of a specific kernel to attach the debugger to, and alwaysAttach will cause the debugger to attach to every kernel that is launched.

Here’s an example of using it.

Launch a CUDA program linked against ocelot.

normal@atom:~/checkout/gpuocelot/branches/ocelot-ptx-2.1$ ./TestCudaSequence

It should attach the debugger to a kernel (here I specified the kernel ‘testShr’). You should see something like this

(ocelot-dbg) Attaching debugger to kernel 'testShr'

(ocelot-dbg)

Get a help message:

(ocelot-dbg) Attaching debugger to kernel 'testShr'

(ocelot-dbg)
(ocelot-dbg) h

_				___	   _.--.

  \`.|\..----...-'`   `-._.-'_.-'`

  /  ' `		 ,	   __.--'   

  )/' _/	 \   `-_,   /		 

  `-'" `"\_  ,_.-;_.-\_ ',	  

	  _.-'_./   {_.'  ; /		

	 {_.-``-'		 {_/		 

This is the Ocelot Interactive PTX Debugger

Commands:

  help	 (h) - Print this message.

  jump	 (j) - Jump current warp to the specified PC.

  remove   (r) - Remove a breakpoint from a specific PC.

  break	(b) - Set a breakpoint at the specified PC.

  print	(p) - Print the value of a memory resource.

   asm  (a) - Print instructions near the specified PC.

   reg  (r) - Print the value of a register.

   mem  (m) - Print the values near an address.

   warp (w) - Print the current warp status.

   pc	   - Print the PC of the current warp.

   loc  (l) - Print the nearest CUDA source line.

  step	 (s) - Execute the next instruction.

  continue (c) - Run until the next breakpoint.

  quit	 (q) - Detach the debugger, resume execution.

You can print out the kernel instructions

(ocelot-dbg) print asm

(0) - mov.u32 %r0, __cuda___cuda_local_var_24552_30_storage16

(1) - mov.u32 %r1, %ctaid.x

(2) - mov.u32 %r2, %ntid.x

(3) - mul.lo.u32 %r3, %r1, %r2

(4) - mov.u32 %r4, %tid.x

(5) - add.u32 %r5, %r3, %r4

(6) - mul.lo.u32 %r6, %r5, 4

(7) - ld.param.u32 %r7, [__cudaparm_testShr_B + 4]

(8) - add.u32 %r8, %r7, %r6

(9) - ld.global.s32 %r9, [%r8 + 0]

Step through the program:

(ocelot-dbg) s   

(0) - mov.u32 %r0, __cuda___cuda_local_var_24552_30_storage16

(ocelot-dbg) s

(1) - mov.u32 %r1, %ctaid.x

(ocelot-dbg) s

(2) - mov.u32 %r2, %ntid.x

(ocelot-dbg) s

(3) - mul.lo.u32 %r3, %r1, %r2

(ocelot-dbg) s

(4) - mov.u32 %r4, %tid.x

(ocelot-dbg) s

(5) - add.u32 %r5, %r3, %r4

(ocelot-dbg) s

(6) - mul.lo.u32 %r6, %r5, 4

(ocelot-dbg) s

(7) - ld.param.u32 %r7, [__cudaparm_testShr_B + 4]

(ocelot-dbg) s

(8) - add.u32 %r8, %r7, %r6

(ocelot-dbg) s

(9) - ld.global.s32 %r9, [%r8 + 0]

View the state of the current warp

CTA ID:			  (0, 0, 0)

Warp ID:			 0

PC:				  23

Context Stack Depth: 3

Active Mask:		 [10101010101010101010101010101010]

View the current register file contents:

(ocelot-dbg) print r

			 THREAD 0		 THREAD 1		 THREAD 2		 THREAD 3		 THREAD 4 

R0				  0				0				0				0				0 

R1				  0				0				0				0				0 

R2				 20			   20			   20			   20			   20 

R3				  0				0				0				0				0 

R4				  0				1				2				3				4 

R5				  0				1				2				3				4 

R6				  0				4				8				c			   10 

R7			b3934c0		  b3934c0		  b3934c0		  b3934c0		  b3934c0 

R8			b3934c0		  b3934c4		  b3934c8		  b3934cc		  b3934d0 

R9				  0				2				4				6				8

View memory (X values are out of range)

(ocelot-dbg) p m b3934b0	

000000000b3934b0 | XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX 

000000000b3934b8 | XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX 

000000000b3934c0 | 0000000200000000 0000000200000000 0000000200000000 0000000200000000 0000000200000000 

000000000b3934c8 | 0000000600000004 0000000600000004 0000000600000004 0000000600000004 0000000600000004 

000000000b3934d0 | 0000000a00000008 0000000a00000008 0000000a00000008 0000000a00000008 0000000a00000008 

000000000b3934d8 | 0000000e0000000c 0000000e0000000c 0000000e0000000c 0000000e0000000c 0000000e0000000c 

000000000b3934e0 | 0000001200000010 0000001200000010 0000001200000010 0000001200000010 0000001200000010 

000000000b3934e8 | 0000001600000014 0000001600000014 0000001600000014 0000001600000014 0000001600000014 

000000000b3934f0 | 0000001a00000018 0000001a00000018 0000001a00000018 0000001a00000018 0000001a00000018 

000000000b3934f8 | 0000001e0000001c 0000001e0000001c 0000001e0000001c 0000001e0000001c 0000001e0000001c

Also, the debugger will automatically attach to programs that segfault. For example:

__global__ void segfault(int* value)

{

	*value = 0;

}

int main()

{

	

	segfault<<<1, 1>>>((int*)0xdeadbeef);

	

	return 0;

}

Will generate the following message:

normal@atom:~/checkout/gpuocelot/branches/ocelot-ptx-2.1$ ./errors 

(ocelot-dbg) Global memory access violation at 0xdeadbeef

(ocelot-dbg) Attaching ocelot debugger.

(ocelot-dbg) Breaking into program now!

This should be in the main ocelot release around september, I’ll also add support for inspecting other memory spaces and the function call stack relatively soon. Any suggestions or feature requests would be welcome.

Greg, look at what you’ve done! You just spoiled today’s announcement of NVIDIA Parallel Nsight’s first release :D

I like the way it is integrated within the trace framework. And the ocelot ASCII-art splash screen is definitely a killer feature.

Did you consider implementing GDB’s remote serial protocol (which would allow debugging from most Unix-based debuggers and IDEs, such as ddd, emacs, eclipse…)?
Also, would it be possible/difficult to parse the debug info produced when compiling with nvcc -G and debug at the source level?

Awesome, ASCII art! :D

All kidding aside, this is extremely cool.

I didn’t time it out, I swear! :)

I think that they are complementary, Ocelot is only able to debug on the emulator for now and Nsight/cuda-gdb can only debug on a GPU.

I have been pleasantly surprised by how useful the trace generation interface has been. We spent about 2 hours designing it to drive an in-house simulator, and have been able to reuse it for most of the debugging tools in ocelot.

Eventually I’d like debugging tools for Ocelot targets to merge with standard tools like gdb or cuda-dbg so that it would be possible to debug on either the emulator or an actual device using the same tool. The implementation that I just added was trying to get as much functionality with as little effort possible (it consists of 1 500 line file that I wrote last night). The plan is to incrementally move towards something more complete, but to get the essential features in as fast as possible.

Awesome, this emulator is right on time!

Nice work, Greg!

That’s a nice succinct framework for adding features.

I was so inspired that last night and tonight, I implemented watchpoints and added them to the integrated debugger. Similar to GDB’s watch points, the user may select regions in global memory, identified by base address, PTX type, and number of elements. Threads writing to these regions trigger a breakpoint which displays information about which threads are writing, the old and new values, then enters the integrated debugger for additional inspection.

To set a watch point, use the command

watch global

where

is in an allocated region of global device memory, such as returned by cudaMalloc()

and

is one of { u8 s8 b8 u16 s16 b16 u32 s32 b32 f32 u64 s64 b64 f64 }.

A watchpoint may be defined as region containing an array of elements using the following syntax:

watch global [ ]

where

is the number of elements in the array.

Here is an example from TestCudaSequence [ ocelot/cuda/tests/kernels/sequence.cu ].

(ocelot-dbg) watch global address 0x16dcbe0 s32[4]

set #1: watch global address 0x16dcbe0 s32[4] - 16 bytes

Watchpoints may be viewed with the ‘list’ command and cleared with the ‘clear’ command as in:

(ocelot-dbg) list

#1 - watch global address 0x16dcbe0 s32[4] - 16 bytes

#2 - watch global address 0xdeadbeef f32[1] - 4 bytes

#3 - watch global address 0x1337 f32[3] - 12 bytes

(ocelot-dbg) 

(ocelot-dbg) 

(ocelot-dbg) clear 2

(ocelot-dbg) list

#1 - watch global address 0x16dcbe0 s32[4] - 16 bytes

#2 - watch global address 0x1337 f32[3] - 12 bytes

(ocelot-dbg) 

(ocelot-dbg) clear

(ocelot-dbg) list

(ocelot-dbg)

Watchpoints result in a breakpoint after store instructions. Old and new values are printed for each thread whose store accesses the watchpoint’s region.

andrew@ ocelot-ptx-2.1$ ./TestCudaSequence 

A_gpu = 0x16dcbe0

(ocelot-dbg) Attaching debugger to kernel 'sequence'

(ocelot-dbg) watch global address 0x16dcbe0 s32[4]

set #1: watch global address 0x16dcbe0 s32[4] - 16 bytes

(ocelot-dbg) continue

st.global.s32 [%r11 + 0], %r7

watchpoint #1 -  CTA (0, 0)

  thread (0, 0, 0) - store to 0x16dcbe0 4 bytes

  old value = -1

  new value = 0

  thread (1, 0, 0) - store to 0x16dcbe4 4 bytes

  old value = -1

  new value = 2

  thread (2, 0, 0) - store to 0x16dcbe8 4 bytes

  old value = -1

  new value = 4

  thread (3, 0, 0) - store to 0x16dcbec 4 bytes

  old value = -1

  new value = 6

break on watchpoint

(ocelot-dbg)

The values reported when a watchpoint is triggered are aligned according to addresses generated by the writing thread(s). For example, assume cudaMalloc() returns an allocation with address 0x1c77be0, but a 2-byte offset is added when defining the starting address of a watched region. Four threads writing to 16 consecutive bytes touch the following 12-byte watchpoint, and so their values are printed.

(ocelot-dbg) watch global address 0x1c77be2 s32[3]

set #1: watch global address 0x1c77be2 s32[3] - 12 bytes

(ocelot-dbg) continue

st.global.s32 [%r11 + 0], %r7

watchpoint #1 -  CTA (0, 0)

  thread (0, 0, 0) - store to 0x1c77be0 4 bytes

  old value = -1

  new value = 0

  thread (1, 0, 0) - store to 0x1c77be4 4 bytes

  old value = -1

  new value = 2

  thread (2, 0, 0) - store to 0x1c77be8 4 bytes

  old value = -1

  new value = 4

  thread (3, 0, 0) - store to 0x1c77bec 4 bytes

  old value = -1

  new value = 6

break on watchpoint

(ocelot-dbg)

Previously, I had considered implementing this functionality as a stand-alone trace generator with an API for registering addresses, but an interactive approach like ocelot-dbg seems much more user-friendly.

Some additional enhancements that would be within a reasonable scope include:

    [*] referencing PTX objects by symbol - this would include device globals, variables in shared memory, and parameters

    [*] address spaces beyond global: shared, texture, local, param

    [*] retrieving allocation addresses from debug symbols in the host-side application

Feel free to suggest others or to express your interest for a particular feature.

Committed to r626 of gpuocelot/branches/ocelot-ptx-2.1