nvprof is too slow

Hi,

I use these nvprof options
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace

with my mpirun command and the run takes about 14 minutes

When I try these options, nvprof runs for hours and still produces no output:
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2 \
--aggregate-mode on --metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

I want to get information on the memory bandwidth. Any suggestions? Thanks.

YAH

BTW, I’m using 4 nodes over OpenMPI. I’m just trying to profile one host. I have 2 K40 GPUs. I see that there is some activity

==28627== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
2017-05-24 08:06:28.368

but I don't see much movement:

09:23:53 up 22:27, 1 user, load average: 2.45, 2.17, 2.25
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/5 10.31.39.9 09:21 0.00s 0.01s 0.00s w

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   40C    P0    77W / 235W | 11249MiB / 11439MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   34C    P0    78W / 235W | 11183MiB / 11439MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     28627    C   ...ueue/may_2017_nvprof_performance/xhpl_GPU 11145MiB |
|    1     28626    C   ...ueue/may_2017_nvprof_performance/xhpl_GPU 11081MiB |
+-----------------------------------------------------------------------------+

thanks

YAH

I tried again, this time with fewer options; it failed after 30 minutes:

nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2 \
--metrics l2_utilization,texture_utilization

[cn3005:11625] *** Process received signal ***
[cn3005:11626] *** Process received signal ***
[cn3005:11625] Signal: Bus error (7)
[cn3005:11625] Signal code: Non-existant physical address (2)
[cn3005:11625] Failing at address: 0x7ff367224000
[cn3005:11626] Signal: Bus error (7)
[cn3005:11626] Signal code: Non-existant physical address (2)
[cn3005:11626] Failing at address: 0x7ff348ae3000
[cn3005:11625] [ 0] [cn3005:11626] [ 0] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11626] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5559407]
[cn3005:11626] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe55fcd0e]
[cn3005:11626] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b)[0x7fffe55fcf3b]
[cn3005:11626] [ 4] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11625] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5569407]
[cn3005:11625] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe560cd0e]
[cn3005:11625] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b

do you guys have any suggestions?

thanks,

yah

Hi, yah

It will take longer if you want to collect more metrics.

Here are several questions I need you to answer:

  1. Are you running it with mpirun -np 4 -host $hostname,$slavename1,$slavename2,$slavename3 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./XXX ?

  2. Which toolkit are you using?

  3. If possible, can you send me the sample you used so I can reproduce the issue?

  4. You also said: "I'm just trying to profile one host. I have 2 K40 GPUs." What do you mean? Do you not need to use mpirun?

hi,

Thanks for the reply.

I'm using CUDA 7.5.

Here is my MPI command:
mpirun -v -np NUM_MPI_PROCS --hostfile host.GPUs --mca btl_openib_want_fork_support 1 --mca btl openib,self --bind-to BIND --mca btl_openib_eager_limit EAGER_VALUE --mca btl_openib_max_send_size EAGER_VALUE runHPL.sh

Here is the important portion of the runHPL.sh script. The script below works fine:

case ${lrank} in
[0])
#uncomment next line to set GPU affinity of local rank 0
export CUDA_VISIBLE_DEVICES=0
#uncomment next line to set CPU affinity of local rank 0
numactl --cpunodebind=0 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-api-trace --print-gpu-trace \
$HPL_DIR/xhpl_GPU
;;
[1])
#uncomment next line to set GPU affinity of local rank 1
export CUDA_VISIBLE_DEVICES=1
#uncomment next line to set CPU affinity of local rank 1
numactl --cpunodebind=1 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace \
$HPL_DIR/xhpl_GPU
;;
esac

When I add these options to runHPL.sh:
--metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

I let it run for hours as opposed to just 15 minutes, and I don't see any output. I run on 4 nodes, 2 GPUs per node.

thanks,

yah

Hi, yah

Thanks for the info.

As you said, the process started, but you didn't get results for hours.
I suspect this is a problem specific to this sample.

Have you tried profiling other samples to collect these metrics, like 0_Simple/simpleMPI in the SDK? Does that also take a long time?

Also, I think you could try not to request so many metrics at one time; reduce the list and see what happens.
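For example, here is a rough sketch of what the nvprof line inside runHPL.sh could look like if you split the list into smaller groups, one group per run (the grouping and output file names below are only illustrations; the first group keeps just the DRAM metrics, since memory bandwidth is your goal):

# Run 1: DRAM bandwidth metrics only (illustrative grouping and output names)
numactl --cpunodebind=0 nvprof -o dram.%h.%p.%q{OMPI_COMM_WORLD_RANK} \
--metrics dram_utilization,dram_read_throughput,dram_write_throughput,dram_read_transactions,dram_write_transactions \
$HPL_DIR/xhpl_GPU

# Run 2: global load/store metrics in a separate run
numactl --cpunodebind=0 nvprof -o gldst.%h.%p.%q{OMPI_COMM_WORLD_RANK} \
--metrics gld_efficiency,gld_throughput,gst_efficiency,gst_throughput \
$HPL_DIR/xhpl_GPU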

PS: The latest toolkit has already been updated to 8.0.

Thanks for the suggestions. I'll start small: I'll test it on one node and then work my way up to 4.

yah

I have a similar problem.
I'm running the HPCG benchmark on one node with two M60 GPUs.
The command is:

mpirun -np 2 nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

Without nvprof it takes about 2 minutes to finish.
With nvprof I am already waiting for 3 hours and it is still working.

I also (as Yah) need information about memory bandwidth usage.

BTW, the following command doesn’t have the problem and finishes fast:

mpirun -np 2 nvprof --annotate-mpi openmpi ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

The --metrics option slows down profiling dramatically.

I do not see much correlation with the number of metrics: adding even a single metric after --metrics makes nvprof run tens or even hundreds of times slower.
It is not specific to MPI applications.

Hi,

I also notice that the program is very slow when profiling it with nvprof. I run with:

eventSet=tex0_cache_sector_queries,tex1_cache_sector_queries,tex2_cache_sector_queries,tex3_cache_sector_queries
nvprof --events $eventSet --log-file nvoutput_%p.csv --csv python3 main.py

That is actually only 4 events.

Why is the program so slow?

I face the same issue. I am running nvprof on a Jetson Xavier AGX, using it on a neural-network inference script with the -o and --analysis-metrics options to export to the Visual Profiler. The script has been running for more than 12 hours! Previously I used nvprof to export the timeline of this script without any issues.
Is it useful to specify the kernel option? How long is profiling all the metrics estimated to take? Thanks in advance for any guidance.

NVIDIA Visual Profiler and nvprof use CUPTI for providing tracing and profiling information. All the profiling limitations mentioned in the CUPTI section Profiling Overhead apply to Visual Profiler and nvprof.

Listing a few of those for quick reference:

  1. Profiling tools serialize all the kernels in the application, thus profiling may significantly change the overall performance characteristics of the application.
  2. When all the requested events or metrics cannot be collected in a single pass due to hardware or software limitations, the kernel or application is replayed multiple times for collection.
  3. Software events and metrics are expensive because they are collected using kernel instrumentation, which makes them more costly to collect than hardware events and metrics.

It is suggested to limit the scope of profiling to a small set of kernels. This can be achieved using the nvprof --kernels option. Refer to the Profiling Scope section of the Profiler User Guide for more details.
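For example, a minimal sketch of the syntax (the kernel-name filter "myKernel" and the application name "./your_app" are placeholders, not names taken from your run):

# Collect all analysis metrics, but only for kernels whose name matches "myKernel"
nvprof --kernels myKernel --analysis-metrics -o kernel_analysis.nvvp ./your_app

# Or collect just a couple of metrics for that kernel
nvprof --kernels myKernel --metrics dram_read_throughput,dram_write_throughput -o kernel_metrics.prof ./your_app

The --kernels option changes the scope of the subsequent --events/--metrics options, so the expensive kernel replay should only happen for the kernels that match the filter.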

Thanks for the reply. I figured this out and used the timeline to select which kernel I wanted more metrics on. Then I profiled all metrics for just that kernel. Good to know that this was the right direction.