I suggest using nvvp (the NVIDIA Visual Profiler) to get what you need.
It is a GUI tool based on nvprof.
For details, see "Profiler :: CUDA Toolkit Documentation".
Sorry that the previous answer did not satisfy you. Here is what I found after checking with the developers:
In the next CUDA Toolkit release, we are planning an nvprof enhancement to support combined metric and trace output. Note, however, that kernel execution is serialized while metrics are collected, so the kernel execution times reported in that mode will not be accurate.
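
Until then, one possible workaround is to profile in two separate passes: one run with GPU tracing only, for timings, and one run with metrics only, for counter values. Below is a minimal sketch of that idea, not an official recipe. It assumes nvprof is on your PATH; the application name `./my_app` and the helper `build_nvprof_commands` are hypothetical, and the metric names are only examples (check `nvprof --query-metrics` for what your GPU supports).

```python
import shlex
import subprocess
import sys


def build_nvprof_commands(app_argv, metrics):
    """Return (trace_cmd, metrics_cmd) as argv lists for two separate nvprof runs.

    Hypothetical helper for illustration; it only assembles command lines.
    """
    if not app_argv:
        raise ValueError("app_argv must name the application to profile")
    if not metrics:
        raise ValueError("at least one metric name is required")
    # Pass 1: GPU trace only, so kernel timings are not distorted by metric collection.
    trace_cmd = ["nvprof", "--print-gpu-trace"] + list(app_argv)
    # Pass 2: metrics only; ignore any kernel timings reported by this run.
    metrics_cmd = ["nvprof", "--metrics", ",".join(metrics)] + list(app_argv)
    return trace_cmd, metrics_cmd


if __name__ == "__main__":
    # Assumed application path and example metric names.
    trace_cmd, metrics_cmd = build_nvprof_commands(
        ["./my_app"], ["achieved_occupancy", "ipc"]
    )
    for cmd in (trace_cmd, metrics_cmd):
        print(" ".join(shlex.quote(part) for part in cmd))
        # Only launch nvprof when explicitly asked to with --run.
        if "--run" in sys.argv[1:]:
            subprocess.run(cmd, check=True)
```

Take kernel durations from the first run and metric values from the second, then match them up by kernel name. This only works well if the application launches the same kernels in both runs.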