Hi, peterbryz
Thanks for reporting this.
Regarding the metric value, it is indeed an issue.
It would be better if you could provide us with your app (in this case, the exact command you use to run python tf_cnn_benchmarks.py). If you agree, I can send you a link to upload the related files.
For the problem below:
"By the way, if I don't use the --replay-mode application option, profiling a program that runs less than a minute takes hours."
One of the potential reasons is a large GPU memory footprint, since we need to save and restore device memory for each kernel replay. In such cases, application replay performs better. This is documented on the docs portal as:
"In "application replay" mode, nvprof re-runs the whole application instead of replaying each kernel, in order to collect all events/metrics. In some cases this mode can be faster than kernel replay mode if the application allocates large amount of device memory."
Hi, pyotr777
I have prepared a Tesla M60 with CUDA 9.0.176 and downloaded http://www.hpcg-benchmark.org/software/view.html?id=254, but I fail to run it:
root@devtools-qa72:~/hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17# LD_LIBRARY_PATH=/opt/pgi/linux86-64/2017/mpi/openmpi/lib:$LD_LIBRARY_PATH ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17
start of application (8 OMP threads)...
2018-01-02 18:40:01.531
Problem setup...
Setup time: 0.608166 sec
Killed
If I use a GV100, the sample runs. So which command are you using on the Tesla M60?
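As a side note, a bare "Killed" message after setup often means the Linux OOM killer terminated the process. One way to confirm (a sketch; assumes kernel-log access, which the root session above has):

```shell
# If the OOM killer intervened, the kernel log records which process
# it killed and how much memory it was using at the time.
dmesg | grep -iE 'out of memory|oom-killer|killed process'
```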
Hi, pyotr777
I can reproduce the issue using the TensorFlow HP benchmark:
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",452,"dram_write_throughput","Device Memory Write Throughput",0.000000B/s,43576.681300GB/s,1020.850923GB/s
I will report this to the dev team and have them check.
Thanks for reporting this again!
Hi, pyotr777
What do you mean by updating nvprof? Would you like us to share a fixed nvprof with you separately?
I'm afraid the dev team will fix this in a later toolkit release, not backport it to 9.0.
Thanks for the info.
We'll check if we can reproduce on our side.
If there is any update, I will let you know.
Best Regards
VeraJ
I'm sorry I do not have the exact info.