nvprof shows DRAM throughput greater than theoretically possible
Hi, peterbryz

Thanks for reporting this.

Regarding the metric value, yes, it is an issue.
It would be better if you could provide us with your app (in this case, is python tf_cnn_benchmarks.py also the command you use to run it?). If you agree, I can send you the link to upload the related files.


Regarding the problem below:
"By the way, if I don't use the --replay-mode application option, profiling a program that runs less than a minute takes hours."


One potential reason is a large GPU memory footprint, as we need to save and restore device memory for each kernel replay. In those cases, application replay performs better. These details are documented on the docs portal as:
"In "application replay" mode, nvprof re-runs the whole application instead of replaying each kernel, in order to collect all events/metrics. In some cases this mode can be faster than kernel replay mode if the application allocates large amount of device memory."

#1
Posted 12/28/2017 03:32 AM   
Hi, pyotr777

Thanks for the info.
We'll check if we can reproduce it on our side.

If there is any update, I will let you know.



Best Regards

VeraJ

#2
Posted 01/02/2018 02:47 AM   
Hi, pyotr777


I have prepared a Tesla M60 + CUDA 9.0.176 setup and downloaded http://www.hpcg-benchmark.org/software/view.html?id=254



But I fail to run it:

root@devtools-qa72:~/hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17# LD_LIBRARY_PATH=/opt/pgi/linux86-64/2017/mpi/openmpi/lib:$LD_LIBRARY_PATH ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

start of application (8 OMP threads)...
2018-01-02 18:40:01.531

Problem setup...
Setup time: 0.608166 sec
Killed


If I use a GV100, the sample runs fine. So which command are you using on the Tesla M60?
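
(A side note on the "Killed" output: on Linux this usually means the process was terminated by the out-of-memory killer rather than failing inside the benchmark itself. If that is what happened, reducing the HPCG problem size may help; assuming this CUDA build reads the standard hpcg.dat input file, a smaller local grid would look like:

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
64 64 64
60

where the third line gives the local grid dimensions nx ny nz and the fourth line the target run time in seconds.)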

#3
Posted 01/02/2018 10:43 AM   
Hi, pyotr777

I can reproduce the issue using the TensorFlow HP benchmark:

"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",452,"dram_write_throughput","Device Memory Write Throughput",0.000000B/s,43576.681300GB/s,1020.850923GB/s

I will forward this to the dev team and have them take a look.


Thanks again for reporting this!

#4
Posted 01/08/2018 10:49 AM   
Hi, pyotr777

What do you mean by "update to nvprof"? Do you want us to share a fixed nvprof with you separately?

I'm afraid the dev team will fix this in a later toolkit release, not backport it to 9.0.

#5
Posted 01/09/2018 02:49 AM   
Oh, that depends on the CUDA toolkit release schedule.
I'm sorry, I do not have the exact info.

#6
Posted 01/11/2018 03:13 AM   