I am measuring application performance with nvprof on a Tesla M60 (actually on an Amazon g3.4xlarge instance, which exposes a single GPU, i.e. one half of an M60 board).
I have CUDA 9 installed.
The profiling command I used looks like this:
nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some arguments>
I ran a similar command for dram_read_throughput. Each command produces a log file in CSV format.
The data I see in these log files confuses me.
The theoretical DRAM throughput for the M60 is about 160 GB/s per GPU. In the log files, however, I see that for some kernels the reported throughput is on the order of TB/s.
Does this mean that L1 or L2 caches are used?
Here are some lines from the log files:
"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s
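To make the mismatch concrete, here is a minimal Python sketch that parses rows like the ones above and reports how far the Max column exceeds the roughly 160 GB/s peak. It is only an illustration under a few assumptions: the kernel names are shortened to the hypothetical placeholders kernel_a and kernel_b, the helper names to_gbps and rows_over_peak are made up for this example, and it assumes that any non-CSV preamble lines in the real log start with "==" and can be skipped.

```python
import csv

# Approximate theoretical DRAM bandwidth per GPU quoted in the post (assumption).
PEAK_GBPS = 160.0

# Conversion factors from the unit suffixes seen in the log to GB/s.
UNIT_TO_GBPS = {"B/s": 1e-9, "KB/s": 1e-6, "MB/s": 1e-3, "GB/s": 1.0, "TB/s": 1e3}

# The two rows from the post, with kernel names shortened to placeholders.
SAMPLE_LOG = "\n".join([
    '"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"',
    '"Tesla M60 (0)","kernel_a",13,"dram_write_throughput",'
    '"Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s',
    '"Tesla M60 (0)","kernel_b",6,"dram_read_throughput",'
    '"Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s',
])


def to_gbps(value):
    """Convert a string such as '108.548296MB/s' to a float in GB/s."""
    # Try longer suffixes first so 'MB/s' is not mistaken for 'B/s'.
    for unit in sorted(UNIT_TO_GBPS, key=len, reverse=True):
        if value.endswith(unit):
            return float(value[:-len(unit)]) * UNIT_TO_GBPS[unit]
    raise ValueError("unrecognised throughput value: %r" % value)


def rows_over_peak(log_text, peak_gbps=PEAK_GBPS):
    """Return (kernel, metric, max_gbps, ratio) for rows whose Max exceeds the peak."""
    # Skip blank lines and lines starting with '==' (assumed to be non-CSV preamble).
    lines = [ln for ln in log_text.splitlines() if ln and not ln.startswith("==")]
    flagged = []
    for row in csv.DictReader(lines):
        max_gbps = to_gbps(row["Max"])
        if max_gbps > peak_gbps:
            flagged.append((row["Kernel"], row["Metric Name"], max_gbps, max_gbps / peak_gbps))
    return flagged


if __name__ == "__main__":
    for kernel, metric, max_gbps, ratio in rows_over_peak(SAMPLE_LOG):
        print("%s %s: max %.1f GB/s (%.1fx the assumed peak)" % (kernel, metric, max_gbps, ratio))
```

Both sample rows are flagged, since their Max values are tens of times larger than the assumed peak.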
By the way, if I don’t use the --replay-mode application option, profiling a program that normally runs in under a minute takes hours.