Remote debugging on TX2 with Nsight fails

xliu · June 9, 2017, 10:19am

hi,
I’m trying out Nsight Eclipse Edition on my host PC (x86_64 Ubuntu VM) to remote debug CUDA programs on TX2. I got as far as starting the remote CUDA program from Nsight, but I failed to debug it.

It first prints the log below, which looks normal for a remote debug session.

Last login: Fri Jun  9 18:00:20 2017 from 192.168.31.26
echo $PWD'>'
/bin/sh -c "cd \"/home/nvidia/test/Debug\";export NVPROF_TMPDIR=\"/tmp\";\"/usr/local/cuda-8.0/bin/cuda-gdbserver\" --cuda-use-lockfile=0 :2345 \"/home/nvidia/test/Debug/test\"";exit
nvidia@tegra-ubuntu:~$ echo $PWD'>'
/home/nvidia>
nvidia@tegra-ubuntu:~$ /bin/sh -c "cd \"/home/nvidia/test/Debug\";export NVPROF_TMPDIR=\"/tmp\";\"/usr/local/cuda-8.0/bin/cuda-gdbserver\" --cuda-use-lockfile=0 :2345 \"/home/nvidia/test/Debug/test\"";exit
Process /home/nvidia/test/Debug/test created; pid = 7172
Listening on port 2345
Remote debugging from host 192.168.31.26

Then, after freezing at this stage for a really long time (a few minutes), it finally switches to the debug perspective and prints this log, which I don’t understand.

Coalescing of the CUDA commands output is off.
$1 = 0xff
The target endianness is set automatically (currently little endian)

I read some earlier topics about this. Some said that debugging on the GPU requires a dual-GPU target device, because the GPU currently in use cannot be halted. Is that why I failed to debug? If not, what other reasons could there be?

AastaLLL · June 12, 2017, 2:54am

Hi,

Did you follow this page to set up Nsight?

Also, could you check whether SSH works properly inside the Ubuntu VM?

xliu · June 12, 2017, 3:55am

hi AastaLLL,

SSH must be working, because I successfully started a CUDA sample on the target from the host. As for the guide, yes, I followed most of it. I skipped the cross-compiler configuration and chose synchronized project mode.

AastaLLL · June 13, 2017, 5:43am

Hi,

Did you install the host CUDA toolkit via JetPack?
Could you share your host CUDA version?

Another possible cause is some kind of traffic shaper configuration.
Could you use Debug Run and check whether there are more error logs?

xliu · June 13, 2017, 7:26am

Yes, I installed everything through JetPack-L4T-3.0-linux-x64.run. The CUDA version is V8.0.62.

Here is the console output of a remote run in the Debug profile. This is a cross-compiled project using the CUDA sample code matrixMul.

Last login: Tue Jun 13 15:16:44 2017 from 192.168.31.80
echo $PWD'>'
/bin/sh -c "cd \"/home/nvidia/test/Debug\";export LD_LIBRARY_PATH=\"/usr/local/cuda-8.0/lib64\":\${LD_LIBRARY_PATH};\"/home/nvidia/test/Debug/test\"";exit
nvidia@tegra-ubuntu:~$ echo $PWD'>'
/home/nvidia>
nvidia@tegra-ubuntu:~$ /bin/sh -c "cd \"/home/nvidia/test/Debug\";export LD_LIBRARY_PATH=\"/usr/local/cuda-8.0/lib64\":\${LD_LIBRARY_PATH};\"/home/nvidia/test/Debug/test\"";exit
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GP10B" with compute capability 6.2

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 1.57 GFlop/s, Time= 83.316 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
logout

xliu · June 13, 2017, 7:55am

hi AastaLLL,

I also tested remote debugging from the command line with cuda-gdb. Although very slow, it works.

On the remote target:

/usr/local/cuda/bin/cuda-gdbserver :8080 test
Process test created; pid = 3565
Listening on port 8080

On the host machine:

xliu@ubuntu:~/work/cuda/test/Debug$ cuda-gdb test 
NVIDIA (R) CUDA Debugger
8.0 release
Portions Copyright (C) 2007-2016 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/xliu/work/cuda/test/Debug/test...done.
(cuda-gdb) target remote 192.168.31.37:8080
Remote debugging using 192.168.31.37:8080

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
0x0000007fb7fd2d80 in ?? ()
(cuda-gdb) 
(cuda-gdb) b matrixMul.cu:364
Breakpoint 1 at 0x404654: file ../src/matrixMul.cu, line 364.
(cuda-gdb) c
Continuing.
warning: Could not load shared library symbols for 8 libraries, e.g. /lib/aarch64-linux-gnu/librt.so.1.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?

Breakpoint 1, main (argc=1, argv=0x7fffffef78) at ../src/matrixMul.cu:364
364	    if (checkCmdLineFlag(argc, (const char **)argv, "help") ||
(cuda-gdb) p argc
$1 = 1
(cuda-gdb) next

376	    int devID = 0;
(cuda-gdb) 
378	    if (checkCmdLineFlag(argc, (const char **)argv, "device"))
(cuda-gdb) p devID
$2 = 0
(cuda-gdb)
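
The "Could not load shared library symbols" warning in the session above is printed by GDB itself, and it names two settings: `set solib-search-path` and `set sysroot`. As a minimal sketch, assuming a copy of the target's library directories is kept on the host, the script below writes a GDB command file that points `set sysroot` at that copy. `SYSROOT_DIR`, `GDB_CMDS`, and `sync_sysroot` are hypothetical names chosen for this example, and the directories to copy are an assumption based on the path in the warning. The user and address are the ones that appear in the logs in this thread.

```shell
#!/bin/sh
# Sketch only: SYSROOT_DIR and GDB_CMDS are hypothetical names for this example.
SYSROOT_DIR="${SYSROOT_DIR:-$HOME/tx2-sysroot}"
GDB_CMDS="${GDB_CMDS:-./tx2-sysroot.gdb}"
TARGET="${TARGET:-nvidia@192.168.31.37}"   # user and address as seen in the logs

# Copy the target's library directories to the host, keeping their full paths
# below SYSROOT_DIR. Defined but not called here, because it needs the board.
sync_sysroot() {
    mkdir -p "$SYSROOT_DIR" || return 1
    for dir in /lib/aarch64-linux-gnu /usr/lib/aarch64-linux-gnu; do
        rsync -a --copy-links --relative "$TARGET:$dir" "$SYSROOT_DIR/" || return 1
    done
}

# Write a GDB command file that makes library lookups use the local copy.
cat > "$GDB_CMDS" <<EOF
set sysroot $SYSROOT_DIR
EOF

echo "wrote $GDB_CMDS"
```

After running `sync_sysroot` once, the file could be passed on the host with `cuda-gdb -x ./tx2-sysroot.gdb test`, followed by the same `target remote 192.168.31.37:8080` as above. Whether this removes the warning or changes the speed on this setup has not been verified in this thread.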

AastaLLL · June 14, 2017, 9:04am

Hi,

From your comment, do you mean that you can successfully run the application sometimes?

xliu · June 14, 2017, 9:23am

hi AastaLLL,

Using Nsight Eclipse Edition, I can ALWAYS run a remote application, as shown in post #5. But I can NEVER debug a remote application, as stated in post #1.

Using command line cuda-gdb on host and cuda-gdbserver on TX2, I can debug, as shown in post #6.

The goal is to use Nsight to do remote debugging.

AastaLLL · June 15, 2017, 7:18am

Hi,

Thanks for the clarification, and sorry for my previous misunderstanding.
We are discussing this internally and will update you later.

AastaLLL · June 15, 2017, 9:09am

By the way, could you also check this topic?
https://devtalk.nvidia.com/default/topic/777599/jetson-tk1/permission-denied-when-remote-debugging-on-tk1/post/5167376/#5167376

xliu · June 19, 2017, 3:40am

hi AastaLLL,

I’ve read the topic. I never saw permission related logs, and I can run the app remotely, so I don’t think it’s the same issue.

How is your internal discussion going?

AastaLLL · June 19, 2017, 4:58am

Hi,

Thanks for your feedback.
We are still checking this issue. Please wait for our update.

Thanks.

AastaLLL · June 20, 2017, 2:12am

Hi,

We need more information to find the root cause. Please provide the following.
You can attach files using the attachment button at the upper right corner of a posted comment.

1. cuda-gdb traces:
While the debug session hangs, go to the Nsight EE console view → click the drop-down of the TV-like icon → select the gdb traces option → copy the traces from the console view.
2. Screenshot of the Nsight EE debug perspective
3. Nsight EE log: $workspace/.metadata/.log

xliu · June 20, 2017, 3:18am

hi,
Here’s the info. Please let me know if you need more

workspace-log.txt (10.9 KB)
gdb-trace.txt (32.3 KB)
remote-shell.txt (27.4 KB)

AastaLLL · June 20, 2017, 3:48am

Thanks.

I will update you if we have more information or need more logs.

AastaLLL · June 21, 2017, 2:10am

Hi,

From the attached cuda-gdb traces, we see that a breakpoint is set on the main function but is never hit. Hence the program keeps running in Nsight.
We are still tracking this.

Also, how do you connect to the device (Wi-Fi or Ethernet)?
Please try connecting over Ethernet (if you are not already) and also try waiting for a longer time.

Thanks.

xliu · June 21, 2017, 3:29am

hi,

This time I waited longer, and it finally hit the breakpoint after at least 30 minutes. I have attached the gdb trace.

Why does it take so long? I was using Ethernet.

Both my PC and the TX2 board are on the same LAN, connected to a Wi-Fi router’s LAN ports.
gdb-trace-final.txt (39.7 KB)

xliu · June 21, 2017, 5:54am

It seems most of the time is spent loading symbols from shared libraries. Is it possible to disable loading these symbols?
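
For reference, stock GDB has a setting for this, `set auto-solib-add off`, and a `sharedlibrary` command for loading symbols on demand afterwards. A minimal sketch, assuming cuda-gdb (based on GDB 7.6.2 per the banner earlier in this thread) accepts the same commands: the script below writes them to a GDB command file. `GDB_CMDS` is a hypothetical file name chosen for this example, and `librt` is taken from the warning in the earlier command-line session.

```shell
#!/bin/sh
# Sketch only: GDB_CMDS is a hypothetical file name chosen for this example.
GDB_CMDS="${GDB_CMDS:-./no-auto-solib.gdb}"

# Write a GDB command file that turns off automatic loading of
# shared-library symbols. Lines starting with '#' are GDB comments.
cat > "$GDB_CMDS" <<'EOF'
# Do not read symbols from shared libraries as they are loaded.
set auto-solib-add off
# Symbols can still be loaded on demand later, for example:
#   sharedlibrary librt
EOF

echo "wrote $GDB_CMDS"
```

From the command line this could be used as `cuda-gdb -x ./no-auto-solib.gdb test`, followed by `target remote 192.168.31.37:8080`. Whether Nsight's own debug launch can be made to use such a file, and whether CUDA kernel debugging still works with automatic loading turned off, is not confirmed anywhere in this thread.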

AastaLLL · June 21, 2017, 6:13am

I think symbol loading is essential.
I will check whether it is possible to bypass it.

Thanks for the feedback.