I need to profile the time consumed by host-to-device and device-to-host memory copies, but I only get a simple result containing a single timeline instead of multiple threads. The application is a complicated TensorFlow image-captioning model developed by Google, and nvvp indicates that no GPU is used. However, when I check with the command “nvidia-smi” while the application is running, it shows that the GPU is in use.
Another simple matrix-multiplication application written in TensorFlow gives detailed information, so I wonder why this happens.
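Once a GPU trace is available (for example from nvprof's --print-gpu-trace option, optionally with --csv), a small script can total the copy times per direction and also show whether any kernels were captured at all. This is a minimal sketch with a hypothetical helper name. It assumes the rows have already been parsed into (name, duration in ms) pairs and that copies carry nvprof's "[CUDA memcpy HtoD]" / "[CUDA memcpy DtoH]" labels. Every other "[CUDA ...]" row (DtoD copies, memsets) is lumped together, and everything else is counted as a kernel, so a missing "kernel" entry suggests no launches were recorded.

```python
from collections import Counter

# Assumed nvprof GPU-trace labels for host/device copies; verify against your own output.
COPY_LABELS = {
    "[CUDA memcpy HtoD]": "HtoD",
    "[CUDA memcpy DtoH]": "DtoH",
}


def summarize_gpu_trace(rows):
    """Classify GPU-trace rows and total their durations.

    rows: iterable of (name, duration_ms) pairs (CSV parsing not shown).
    Returns {category: (count, total_ms)}, where category is "HtoD", "DtoH",
    "other_mem" (any other "[CUDA ...]" row, e.g. DtoD or memset) or "kernel".
    """
    counts = Counter()
    totals = Counter()
    for name, duration_ms in rows:
        name = name.strip()
        if name in COPY_LABELS:
            category = COPY_LABELS[name]
        elif name.startswith("[CUDA "):
            category = "other_mem"
        else:
            # Anything that is not a copy/memset row is treated as a kernel launch.
            category = "kernel"
        counts[category] += 1
        totals[category] += duration_ms
    return {c: (counts[c], totals[c]) for c in counts}
```

With hypothetical rows such as ("[CUDA memcpy HtoD]", 1.5) and ("[CUDA memcpy HtoD]", 0.5), the result contains "HtoD": (2, 2.0).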
Thanks for your reply.
It seems that when running this complicated model, only API calls can be profiled. The model does run on the GPU when it is not being profiled, so I guess the detailed information is simply not accessible, because the profile result only shows a timeline of calls to the so-called ‘cuDevicePrimaryCtxRetain’, and I don't understand exactly what that is.
Is it possible that the run doesn't actually trigger any kernel launches?
You can use nvprof --kernels XXX --analysis-metrics -o analysis.nvprof ./application to generate a result file, then import it into nvvp to check whether any results were collected.
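For a Python application such as a TensorFlow model, the same command can be assembled and launched from a script. This is a minimal sketch: build_nvprof_command and run_nvprof are hypothetical helper names, and XXX remains a placeholder for the kernel-name filter. The --kernels option narrows the scope of the subsequent metric collection to matching kernels, which should help keep --analysis-metrics replay time manageable on a large model.

```python
import shlex
import subprocess


def build_nvprof_command(app_argv, kernel_filter=None, output="analysis.nvprof"):
    """Return the argv list for an nvprof run that collects analysis metrics.

    kernel_filter corresponds to XXX in "--kernels XXX"; with None the
    --kernels option is omitted and metrics are collected for all kernels.
    """
    cmd = ["nvprof"]
    if kernel_filter is not None:
        cmd += ["--kernels", kernel_filter]
    cmd += ["--analysis-metrics", "-o", output]
    return cmd + list(app_argv)


def run_nvprof(app_argv, **kwargs):
    """Run nvprof (must be on PATH); the output file can then be imported into nvvp."""
    cmd = build_nvprof_command(app_argv, **kwargs)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

For the TensorFlow case, app_argv would be something like ["python", "your_model.py"] (hypothetical script name), and the resulting analysis.nvprof file is imported into nvvp as described above.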