nvprof shows DRAM throughput greater than theoretically possible

I am measuring application performance with nvprof on an M60 (actually on an Amazon g3.4xlarge instance, which has a single GPU, i.e. half of an M60 board).
I have CUDA 9 installed.
The command I use for profiling looks like this:

nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some arguments>

I use a similar command for dram_read_throughput. The commands produce log files in CSV format.
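For example, the read-throughput run looks like this (only the metric name and, in my case, the log file name change):

nvprof --replay-mode application --csv --log-file nvprof_dram_read_throughput.log --metrics dram_read_throughput python tf_cnn_benchmarks.py <some arguments>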

The data I see in these log files confuses me.

The theoretical DRAM throughput for the M60 is about 160 GB/s. In the log files, however, I see that for some kernels the throughput is on the order of TB/s.

Does this mean that L1 or L2 caches are used?

Here are some lines from the log files:

"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s

By the way, if I don’t use the --replay-mode application option, profiling a program that normally runs in under a minute takes hours.

Hi, peterbryz

Thanks for reporting this.

Regarding the metric values, this is indeed an issue.
It would help if you could provide us with your app (in this case, is python tf_cnn_benchmarks.py also the command to run it?). If you agree, I can send you a link to upload the related files.

Regarding the problem below:
By the way, if I don’t use the --replay-mode application option, profiling a program that normally runs in under a minute takes hours.

One potential reason is a large GPU memory footprint, since we need to save and restore device memory for each kernel replay. In such cases, application replay performs better. This is documented on the docs portal as follows:
“In “application replay” mode, nvprof re-runs the whole application instead of replaying each kernel, in order to collect all events/metrics. In some cases this mode can be faster than kernel replay mode if the application allocates large amount of device memory.”
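
For example (with ./my_app as a placeholder for your application), the two modes are selected like this:

Kernel replay (the default; device memory is saved and restored around each replayed kernel):
nvprof --metrics dram_write_throughput ./my_app

Application replay (the whole application is re-run instead):
nvprof --replay-mode application --metrics dram_write_throughput ./my_app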

Hi, Veraj,

Thank you for your reply.
I am profiling the latest HPCG benchmark (http://www.hpcg-benchmark.org/software/index.html) and the TensorFlow HP benchmark (https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks).

How many metrics can nvprof collect in one run without replaying?

Hi, pyotr777

Thanks for the info.
We’ll check if we can reproduce on our side.

If there is any update, I will let you know.

Best Regards

VeraJ

Hi, pyotr777

I have prepared a Tesla M60 with CUDA 9.0.176 and downloaded http://www.hpcg-benchmark.org/software/view.html?id=254

But I failed to run it.

root@devtools-qa72:~/hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17# LD_LIBRARY_PATH=/opt/pgi/linux86-64/2017/mpi/openmpi/lib:$LD_LIBRARY_PATH ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

start of application (8 OMP threads)…
2018-01-02 18:40:01.531

Problem setup…
Setup time: 0.608166 sec
Killed

If I use a GV100, the sample runs fine. So which command are you using on the Tesla M60?

Hi, Veraj,

For HPCG, try using a smaller problem size:

cp hpcg.dat_128x128x128_60 hpcg.dat
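
For reference, hpcg.dat_128x128x128_60 should be a standard HPCG input file, roughly like the sketch below: two free-form header lines, then the grid dimensions nx ny nz, then the run time in seconds.

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
128 128 128
60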

To run HPCG without profiling, it should be enough to just run the executable the way you did.
For profiling I use something like this:

$ nvprof --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

You can use installation scripts to set up HPCG on a new (cloud) Ubuntu machine.

Please also check the TensorFlow HP benchmark, as profiling behaves even worse with it.

You could try the following command:

~/benchmarks/scripts/tf_cnn_benchmarks$ nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64  --model=resnet50
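
If you also want the CSV logging and application replay from my original runs, the full command would look something like this (the log file name is arbitrary):

nvprof --replay-mode application --csv --log-file nvprof_dram.log --metrics dram_read_throughput,dram_utilization,dram_write_throughput python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50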

Hi, pyotr777

I can reproduce the issue using the TensorFlow HP benchmark:

"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",452,"dram_write_throughput","Device Memory Write Throughput",0.000000B/s,43576.681300GB/s,1020.850923GB/s

I will report this to the dev team and ask them to check.

Thanks again for reporting this!

Hi, Veraj,

Thank you!
May I expect an update to nvprof soon?

Peter

Hi, pyotr777

What do you mean by an update to nvprof? Do you want us to share a fixed nvprof with you separately?

I’m afraid the dev team will fix this in a later toolkit release; it will not be backported to 9.0.

Hi, Veraj,

What do you mean by an update to nvprof? Do you want us to share a fixed nvprof with you separately?

No. I’m looking forward to an updated version of nvprof.

Peter

Oh, that depends on the CUDA toolkit release schedule.
I’m sorry, I do not have the exact info.