Ocelot PTX Debugger
For anyone who is interested, I recently added an interactive debugger to the Ocelot PTX Emulator to help me track down bugs in Ocelot. It may also be useful to people doing CUDA development, so I thought I would mention it here. To try it out, pull the latest revision of Ocelot from subversion ( svn checkout [url="https://gpuocelot.googlecode.com/svn/trunk/"]https://gpuocelot.googlecode.com/svn/trunk/[/url] gpuocelot ), enter the PTX 2.1 branch (gpuocelot/branches/ocelot-ptx-2.1), and build Ocelot (libtoolize; aclocal; autoconf; automake; sudo make install).
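
For reference, the full autotools build sequence from inside the branch directory is sketched below; the ./configure and make steps are assumptions based on the standard autotools workflow, so check Ocelot's build notes for any required configure flags.

[code]# Run from inside the ocelot-ptx-2.1 branch directory. The ./configure and
# make steps are assumed from the standard autotools workflow and may need
# project-specific flags.
libtoolize
aclocal
autoconf
automake
./configure
make
sudo make install[/code]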

To enable the debugger, turn it on in the config file (configure.ocelot):

[code]trace: {
	.
	.
	.
	debugger: {
		enabled: true,
		kernelFilter: "testShr",
		alwaysAttach: false
	},
	.
	.
	.
}[/code]

enabled determines whether the debugger is on, kernelFilter specifies the name of a specific kernel to attach the debugger to, and alwaysAttach causes the debugger to attach to every kernel that is launched.
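
For example, flipping alwaysAttach in the same block makes the debugger attach to every launched kernel, regardless of the filter (a sketch reusing only the keys shown above):

[code]debugger: {
	enabled: true,
	kernelFilter: "testShr",
	alwaysAttach: true
}[/code]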

Here's an example of using it.

Launch a CUDA program linked against Ocelot:
[code]normal@atom:~/checkout/gpuocelot/branches/ocelot-ptx-2.1$ ./TestCudaSequence[/code]

It should attach the debugger to a kernel (here I specified the kernel 'testShr'). You should see something like this:

[code](ocelot-dbg) Attaching debugger to kernel 'testShr'
(ocelot-dbg)[/code]

Get a help message:

[code](ocelot-dbg) h

_ ___ _.--.
\`.|\..----...-'` `-._.-'_.-'`
/ ' ` , __.--'
)/' _/ \ `-_, /
`-'" `"\_ ,_.-;_.-\_ ',
_.-'_./ {_.' ; /
{_.-``-' {_/

This is the Ocelot Interactive PTX Debugger

Commands:
help (h) - Print this message.
jump (j) - Jump current warp to the specified PC.
remove (r) - Remove a breakpoint from a specific PC.
break (b) - Set a breakpoint at the specified PC.
print (p) - Print the value of a memory resource.
asm (a) - Print instructions near the specified PC.
reg (r) - Print the value of a register.
mem (m) - Print the values near an address.
warp (w) - Print the current warp status.
pc - Print the PC of the current warp.
loc (l) - Print the nearest CUDA source line.
step (s) - Execute the next instruction.
continue (c) - Run until the next breakpoint.
quit (q) - Detach the debugger, resume execution.[/code]

You can print out the kernel's instructions:

[code](ocelot-dbg) print asm
(0) - mov.u32 %r0, __cuda___cuda_local_var_24552_30_storage16
(1) - mov.u32 %r1, %ctaid.x
(2) - mov.u32 %r2, %ntid.x
(3) - mul.lo.u32 %r3, %r1, %r2
(4) - mov.u32 %r4, %tid.x
(5) - add.u32 %r5, %r3, %r4
(6) - mul.lo.u32 %r6, %r5, 4
(7) - ld.param.u32 %r7, [__cudaparm_testShr_B + 4]
(8) - add.u32 %r8, %r7, %r6
(9) - ld.global.s32 %r9, [%r8 + 0][/code]
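
For context, a kernel of roughly the following shape would compile to PTX resembling the listing above (each thread computes its global index and loads B[i]). This is a hypothetical reconstruction, not the actual testShr source from Ocelot's test suite:

[code]// Hypothetical sketch only -- the real testShr kernel in Ocelot's tests may
// differ. Each thread computes i = ctaid.x * ntid.x + tid.x and loads B[i]
// (instructions 1 through 9 above); the shift itself is a guess from the name.
__global__ void testShr(int* A, int* B)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	A[i] = B[i] >> 1; // the 'shr' presumably under test
}[/code]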

Step through the program:

[code](ocelot-dbg) s
(0) - mov.u32 %r0, __cuda___cuda_local_var_24552_30_storage16
(ocelot-dbg) s
(1) - mov.u32 %r1, %ctaid.x
(ocelot-dbg) s
(2) - mov.u32 %r2, %ntid.x
(ocelot-dbg) s
(3) - mul.lo.u32 %r3, %r1, %r2
(ocelot-dbg) s
(4) - mov.u32 %r4, %tid.x
(ocelot-dbg) s
(5) - add.u32 %r5, %r3, %r4
(ocelot-dbg) s
(6) - mul.lo.u32 %r6, %r5, 4
(ocelot-dbg) s
(7) - ld.param.u32 %r7, [__cudaparm_testShr_B + 4]
(ocelot-dbg) s
(8) - add.u32 %r8, %r7, %r6
(ocelot-dbg) s
(9) - ld.global.s32 %r9, [%r8 + 0][/code]

View the state of the current warp:

[code]CTA ID:              (0, 0, 0)
Warp ID:             0
PC:                  23
Context Stack Depth: 3
Active Mask:         [10101010101010101010101010101010][/code]
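
The active mask shows which threads of the warp are executing the current instruction; the alternating bits above mean only every other thread is active at this point. As a hypothetical illustration (unrelated to testShr), a branch taken only by even-numbered threads produces exactly this kind of mask:

[code]// Hypothetical divergence example: inside the if-body only the threads with
// an even index are active, so the warp's active mask alternates 1s and 0s.
__global__ void divergent(int* out)
{
	int i = threadIdx.x;
	if ((i & 1) == 0)
	{
		out[i] = i; // inspecting the warp here would show a half-active mask
	}
}[/code]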

View the current register file contents (values are printed in hexadecimal):

[code](ocelot-dbg) print r
     THREAD 0   THREAD 1   THREAD 2   THREAD 3   THREAD 4
R0          0          0          0          0          0
R1          0          0          0          0          0
R2         20         20         20         20         20
R3          0          0          0          0          0
R4          0          1          2          3          4
R5          0          1          2          3          4
R6          0          4          8          c         10
R7    b3934c0    b3934c0    b3934c0    b3934c0    b3934c0
R8    b3934c0    b3934c4    b3934c8    b3934cc    b3934d0
R9          0          2          4          6          8[/code]

View memory (values printed as X are out of range):

[code](ocelot-dbg) p m b3934b0
000000000b3934b0 | XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
000000000b3934b8 | XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
000000000b3934c0 | 0000000200000000 0000000200000000 0000000200000000 0000000200000000 0000000200000000
000000000b3934c8 | 0000000600000004 0000000600000004 0000000600000004 0000000600000004 0000000600000004
000000000b3934d0 | 0000000a00000008 0000000a00000008 0000000a00000008 0000000a00000008 0000000a00000008
000000000b3934d8 | 0000000e0000000c 0000000e0000000c 0000000e0000000c 0000000e0000000c 0000000e0000000c
000000000b3934e0 | 0000001200000010 0000001200000010 0000001200000010 0000001200000010 0000001200000010
000000000b3934e8 | 0000001600000014 0000001600000014 0000001600000014 0000001600000014 0000001600000014
000000000b3934f0 | 0000001a00000018 0000001a00000018 0000001a00000018 0000001a00000018 0000001a00000018
000000000b3934f8 | 0000001e0000001c 0000001e0000001c 0000001e0000001c 0000001e0000001c 0000001e0000001c[/code]

Also, the debugger will automatically attach to programs that segfault. For example, this program:

[code]__global__ void segfault(int* value)
{
	*value = 0;
}

int main()
{
	segfault<<<1, 1>>>((int*)0xdeadbeef);

	return 0;
}[/code]

will generate the following message when run:

[code]normal@atom:~/checkout/gpuocelot/branches/ocelot-ptx-2.1$ ./errors
(ocelot-dbg) Global memory access violation at 0xdeadbeef
(ocelot-dbg) Attaching ocelot debugger.
(ocelot-dbg) Breaking into program now![/code]

This should be in the main Ocelot release around September. I'll also add support for inspecting other memory spaces and the function call stack relatively soon. Any suggestions or feature requests would be welcome.

#1
Posted 07/21/2010 02:31 PM   
Greg, look at what you've done! You just spoiled today's announcement of [url="http://blogs.nvidia.com/ntersect/2010/07/parallel-nsight-powerful-new-solution-for-windows-gpgpu-development.html"]NVIDIA Parallel Nsight's first release[/url] :D

I like the way it is integrated within the trace framework. And the ocelot ASCII-art splash screen is definitely a killer feature.

Did you consider implementing GDB's remote serial protocol (which would allow debugging from most Unix-based debuggers and IDEs, such as ddd, emacs, eclipse...)?
Also, would it be possible/difficult to parse the debug info produced when compiling with nvcc -G and debug at the source level?
#2
Posted 07/21/2010 04:08 PM   
Awesome, ASCII art! :D

All kidding aside, this is extremely cool.

#3
Posted 07/21/2010 04:44 PM   
[quote name='Sylvain Collange' post='1091302' date='Jul 21 2010, 05:08 PM']Greg, look at what you've done! You just spoiled today's announcement of [url="http://blogs.nvidia.com/ntersect/2010/07/parallel-nsight-powerful-new-solution-for-windows-gpgpu-development.html"]NVIDIA Parallel Nsight's first release[/url] :D[/quote]

I didn't time it out, I swear! :)

I think that they are complementary: Ocelot is only able to debug on the emulator for now, and Nsight/cuda-gdb can only debug on a GPU.

[quote name='Sylvain Collange' post='1091302' date='Jul 21 2010, 05:08 PM']I like the way it is integrated within the trace framework. And the ocelot ASCII-art splash screen is definitely a killer feature.[/quote]

I have been pleasantly surprised by how useful the trace generation interface has been. We spent about 2 hours designing it to drive an in-house simulator, and have been able to reuse it for most of the debugging tools in Ocelot.
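
For anyone curious what plugging into that interface looks like, here is a rough sketch of a trace generator that just counts dynamic instructions. The header paths and method names (trace::TraceGenerator, initialize(), event(), finish(), ocelot::addTraceGenerator()) are written from memory, so check them against the headers under ocelot/trace/interface and ocelot/api/interface before relying on this:

[code]// Rough sketch only: header paths, class names, and signatures are recalled
// from memory and may not match the current Ocelot headers exactly.
#include <ocelot/trace/interface/TraceGenerator.h>
#include <ocelot/api/interface/ocelot.h>
#include <iostream>

class InstructionCounter : public trace::TraceGenerator
{
public:
	unsigned long long count;

	// called once before the kernel begins executing
	void initialize(const executive::ExecutableKernel& kernel)
	{
		count = 0;
	}

	// called once per dynamic instruction executed on the emulator
	void event(const trace::TraceEvent& event)
	{
		++count;
	}

	// called after the kernel finishes
	void finish()
	{
		std::cout << "executed " << count << " instructions\n";
	}
};

int main()
{
	InstructionCounter counter;
	ocelot::addTraceGenerator(counter); // register before launching kernels
	// ... launch CUDA kernels here as usual; each launch will be traced ...
	return 0;
}[/code]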

[quote name='Sylvain Collange' post='1091302' date='Jul 21 2010, 05:08 PM']Did you consider implementing GDB's remote serial protocol (which would allow debugging from most Unix-based debuggers and IDEs, such as ddd, emacs, eclipse...)?
Also, would it be possible/difficult to parse the debug info produced when compiling with nvcc -G and debug at the source level?[/quote]

Eventually I'd like debugging tools for Ocelot targets to merge with standard tools like gdb or cuda-gdb so that it would be possible to debug on either the emulator or an actual device using the same tool. The implementation that I just added was trying to get as much functionality with as little effort as possible (it consists of a single 500-line file that I wrote last night). The plan is to incrementally move towards something more complete, but to get the essential features in as fast as possible.
#4
Posted 07/21/2010 06:40 PM   
Awesome, this emulator is right on time!

#5
Posted 07/21/2010 09:37 PM   
Nice work, Greg!

That's a nice succinct framework for adding features.

I was so inspired that, last night and tonight, I implemented watchpoints and added them to the integrated debugger. Similar to GDB's watchpoints, the user may select regions in global memory, identified by base address, PTX type, and number of elements. Threads writing to these regions trigger a breakpoint which displays which threads are writing along with the old and new values, and then enters the integrated debugger for additional inspection.

To set a watchpoint, use the command
[i]watch global address <address> <ptx-type>[/i]
where
[i]<address>[/i] is in an allocated region of global device memory, such as one returned by cudaMalloc(),
and
[i]<ptx-type>[/i] is one of { u8 s8 b8 u16 s16 b16 u32 s32 b32 f32 u64 s64 b64 f64 }.

A watchpoint may be defined as a region containing an array of elements using the following syntax:
[i]watch global address <address> <ptx-type>[<element-count>][/i]
where
[i]<element-count>[/i] is the number of elements in the array.

Here is an example from TestCudaSequence [ [i]ocelot/cuda/tests/kernels/sequence.cu[/i] ].

[code](ocelot-dbg) watch global address 0x16dcbe0 s32[4]
set #1: watch global address 0x16dcbe0 s32[4] - 16 bytes[/code]

Watchpoints may be viewed with the 'list' command and cleared with the 'clear' command as in:

[code](ocelot-dbg) list
#1 - watch global address 0x16dcbe0 s32[4] - 16 bytes
#2 - watch global address 0xdeadbeef f32[1] - 4 bytes
#3 - watch global address 0x1337 f32[3] - 12 bytes
(ocelot-dbg)
(ocelot-dbg)
(ocelot-dbg) clear 2
(ocelot-dbg) list
#1 - watch global address 0x16dcbe0 s32[4] - 16 bytes
#2 - watch global address 0x1337 f32[3] - 12 bytes
(ocelot-dbg)
(ocelot-dbg) clear
(ocelot-dbg) list
(ocelot-dbg)[/code]

Watchpoints result in a breakpoint after store instructions. Old and new values are printed for each thread whose store accesses the watchpoint's region.

[code]andrew@ ocelot-ptx-2.1$ ./TestCudaSequence
A_gpu = 0x16dcbe0
(ocelot-dbg) Attaching debugger to kernel 'sequence'
(ocelot-dbg) watch global address 0x16dcbe0 s32[4]
set #1: watch global address 0x16dcbe0 s32[4] - 16 bytes
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
thread (0, 0, 0) - store to 0x16dcbe0 4 bytes
old value = -1
new value = 0
thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
old value = -1
new value = 2
thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
old value = -1
new value = 4
thread (3, 0, 0) - store to 0x16dcbec 4 bytes
old value = -1
new value = 6
break on watchpoint
(ocelot-dbg)[/code]
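
For reference, a kernel along these lines would produce the stores shown above (values 0, 2, 4, and 6 written over an initial fill of -1). This is a hypothetical sketch; the real sequence.cu under ocelot/cuda/tests/kernels may differ:

[code]// Hypothetical sketch consistent with the watchpoint output above; the actual
// sequence.cu test kernel may differ. A_gpu is assumed to have been filled
// with -1 on the host before the launch.
__global__ void sequence(int* A, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < N)
	{
		A[i] = 2 * i; // thread i stores 2*i, hitting the watched region
	}
}[/code]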

The values reported when a watchpoint is triggered are aligned according to addresses generated by the writing thread(s). For example, assume cudaMalloc() returns an allocation with address 0x1c77be0, but a 2-byte offset is added when defining the starting address of a watched region. Four threads writing to 16 consecutive bytes touch the following 12-byte watchpoint, and so their values are printed.

[code](ocelot-dbg) watch global address 0x1c77be2 s32[3]
set #1: watch global address 0x1c77be2 s32[3] - 12 bytes
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
thread (0, 0, 0) - store to 0x1c77be0 4 bytes
old value = -1
new value = 0
thread (1, 0, 0) - store to 0x1c77be4 4 bytes
old value = -1
new value = 2
thread (2, 0, 0) - store to 0x1c77be8 4 bytes
old value = -1
new value = 4
thread (3, 0, 0) - store to 0x1c77bec 4 bytes
old value = -1
new value = 6
break on watchpoint
(ocelot-dbg)[/code]
Previously, I had considered implementing this functionality as a stand-alone trace generator with an API for registering addresses, but an interactive approach like ocelot-dbg seems much more user-friendly.

Some additional enhancements that would be within a reasonable scope include:
[list]
[*] referencing PTX objects by symbol - this would include __device__ globals, variables in shared memory, and parameters
[*] address spaces beyond global: shared, texture, local, param
[*] retrieving allocation addresses from debug symbols in the host-side application
[/list]

Feel free to suggest others or to express your interest for a particular feature.

Committed to r626 of gpuocelot/branches/ocelot-ptx-2.1

#6
Posted 07/23/2010 04:16 AM   