nvprof shows DRAM throughput greater than theoretically possible
I am measuring application performance with nvprof on an M60 (actually on an Amazon g3.4xlarge instance with only one GPU, which is half of an M60 board).
I have CUDA 9 installed.
The command I used for profiling looks like the following:
nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some arguments>

And a similar command for dram_read_throughput. The commands produce log files in CSV format.

The data I see in these log files confuses me.

Theoretical DRAM throughput for the M60 is about 160 GB/s. In the log files, however, I see that for some kernels the throughput is on the order of TB/s.
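For reference, the ~160 GB/s figure follows from the memory spec; the clock and bus width below are my assumptions taken from public M60 spec sheets, not from the logs:

```python
# Back-of-the-envelope peak DRAM bandwidth for one M60 GPU.
# Assumed specs (public spec sheets): GDDR5 at ~2505 MHz
# (~5010 MT/s effective) on a 256-bit bus.
effective_rate = 5.01e9      # memory transfers per second
bus_width_bits = 256
peak_gb_per_s = effective_rate * bus_width_bits / 8 / 1e9
print(round(peak_gb_per_s, 1))  # ~160.3 GB/s
```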

Does this mean that L1 or L2 caches are used?

Here are some lines from the log files:
"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s



By the way, if I don't use the --replay-mode application option, profiling a program that runs for less than a minute takes hours.

#1
Posted 12/27/2017 02:10 AM   
Hi, peterbryz

Thanks for reporting this.

Regarding the metric value, this is an issue.
It would help if you could provide us with your app (in this case, is python tf_cnn_benchmarks.py also the command to run it?). If you agree, I can send you a link to upload the related files.


Regarding the problem below:
> By the way, if I don't use --replay-mode application option, profiling a program that runs less than a minute takes hours.


One potential reason is a large GPU memory footprint, as we need to save and restore device memory for each kernel replay. In those cases application replay performs better. These details are documented on the docs portal as:
"In "application replay" mode, nvprof re-runs the whole application instead of replaying each kernel, in order to collect all events/metrics. In some cases this mode can be faster than kernel replay mode if the application allocates large amount of device memory."

#2
Posted 12/28/2017 03:32 AM   
Hi, Veraj,

Thank you for your reply.
I am profiling the latest HPCG benchmark http://www.hpcg-benchmark.org/software/index.html and the TensorFlow HP benchmark https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.

How many metrics can nvprof collect in one run without replaying?

#3
Posted 12/30/2017 12:18 PM   
Hi, pyotr777

Thanks for the info.
We'll check whether we can reproduce it on our side.

If there is any update, I will let you know.



Best Regards

VeraJ

#4
Posted 01/02/2018 02:47 AM   
Hi, pyotr777


I have prepared a Tesla M60 + CUDA 9.0.176 and downloaded http://www.hpcg-benchmark.org/software/view.html?id=254



But I fail to run it:

root@devtools-qa72:~/hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17# LD_LIBRARY_PATH=/opt/pgi/linux86-64/2017/mpi/openmpi/lib:$LD_LIBRARY_PATH ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

start of application (8 OMP threads)...
2018-01-02 18:40:01.531

Problem setup...
Setup time: 0.608166 sec
Killed


If I use a GV100, the sample can run. So which command are you using on the Tesla M60?

#5
Posted 01/02/2018 10:43 AM   
Hi, Veraj,

For HPCG, try using a smaller problem size:
cp hpcg.dat_128x128x128_60 hpcg.dat
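(In case the file contents are unclear: in HPCG 3.x, hpcg.dat is typically a short text file whose third line gives the local problem dimensions nx ny nz and whose fourth gives the target run time in seconds, which the _128x128x128_60 suffix presumably encodes, roughly like:)

```
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
128 128 128
60
```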


To run HPCG without profiling, it should be enough to run the executable as you did.
For profiling I use something like this:
$ nvprof --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17


You can use installation scripts for HPCG on a new (cloud) Ubuntu machine:

https://github.com/pyotr777/mlbenchmarks/blob/master/HPCG/


Please also check the TensorFlow HP benchmark, as profiling works even worse for it.

You could try the following command:
~/benchmarks/scripts/tf_cnn_benchmarks$ nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64  --model=resnet50
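If it helps with reproducing the runs, the invocations in this thread can be wrapped in a small script. This is just a sketch: the helper name and argument handling are my own, and it uses only the nvprof flags already shown above:

```python
import subprocess

DRAM_METRICS = "dram_read_throughput,dram_utilization,dram_write_throughput"

def nvprof_cmd(app_cmd, log_file, metrics=DRAM_METRICS):
    """Build an nvprof command line (application replay, CSV log)."""
    return ["nvprof", "--replay-mode", "application", "--csv",
            "--log-file", log_file, "--metrics", metrics] + app_cmd

cmd = nvprof_cmd(
    ["python", "tf_cnn_benchmarks.py", "--num_gpus=1",
     "--batch_size=64", "--model=resnet50"],
    "nvprof_dram.log",
)
print(" ".join(cmd))
# subprocess.run(cmd)  # run only on a machine with CUDA/nvprof installed
```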

#6
Posted 01/05/2018 06:25 AM   
Hi, pyotr777

I can reproduce the issue using the TensorFlow HP benchmark:

"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",452,"dram_write_throughput","Device Memory Write Throughput",0.000000B/s,43576.681300GB/s,1020.850923GB/s

I will report this to the dev team and have them check it.


Thanks again for reporting this!

#7
Posted 01/08/2018 10:49 AM   
Hi, Veraj,

Thank you!
May I expect an update to nvprof soon?

Peter

#8
Posted 01/09/2018 01:35 AM   
Hi, pyotr777

What do you mean by an update to nvprof? Do you want us to share a fixed nvprof with you separately?

I'm afraid the dev team will fix this in a later toolkit release; it will not be backported to 9.0.

#9
Posted 01/09/2018 02:49 AM   
Hi, Veraj,

> What do you mean update to nvprof, you want us to share a fixed nvprof to you seperately ?

Nope. I'm looking forward to an updated version of nvprof.

Peter

#10
Posted 01/11/2018 03:04 AM   
Oh, that depends on the CUDA toolkit release schedule.
I'm sorry I do not have the exact info.

#11
Posted 01/11/2018 03:13 AM   