How to use CUPTI to get average instruction execution time?

I would like to get the average instruction execution time. I think I need to use CUPTI to do this (if it is even possible).

I compiled and ran four of the CUPTI examples (callback_metric, callback_timestamp, pc_sampling, sass_source_map).

I also read through CUPTI.pdf and looked through cupti.h, cupti_events.h, and cupti_metrics.h.

The sass_source_map example came closest to what I needed. I was able to correlate the SASS instructions
(using nvdisasm) back to the source code (I happened to need that). I can now see the number
of times that each instruction is executed, but I also need the average duration.

Any ideas how to do this?

–Bob

Device Name: TITAN V
SOURCE_LOCATOR SrcLctrId 2, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 1
FUCTION functionId 1, moduleId 9, name _Z9transposePfPKf
INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 0
notPredOffthread_inst_executed 0, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 10
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

SOURCE_LOCATOR SrcLctrId 3, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 14
INSTRUCTION_EXECUTION srcLctr 3, corr 202, functionId 1, pc 20
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 3, corr 202, functionId 1, pc 30
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

SOURCE_LOCATOR SrcLctrId 4, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 15
INSTRUCTION_EXECUTION srcLctr 4, corr 202, functionId 1, pc 40
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496
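
For post-processing output like the above, here is a minimal Python sketch that parses the two-line INSTRUCTION_EXECUTION records and divides thread_inst_executed by inst_executed. It assumes the text layout shown in this post, and it assumes that inst_executed counts warp-level executions while thread_inst_executed counts thread-level executions. The function names and dictionary keys are made up for illustration. The pc value is kept as an opaque string, since the output does not say which number base it uses.

```python
import re

# Patterns for the two-line records printed by the sample output above.
HEADER = re.compile(
    r"INSTRUCTION_EXECUTION srcLctr (\d+), corr (\d+), functionId (\d+), pc (\w+)"
)
COUNTS = re.compile(
    r"notPredOffthread_inst_executed (\d+), "
    r"thread_inst_executed (\d+), inst_executed (\d+)"
)


def parse_instruction_records(text):
    """Return one dict per INSTRUCTION_EXECUTION record found in text."""
    records = []
    pending = None  # header seen, counts line not yet seen
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            pending = {"srcLctr": int(header.group(1)), "pc": header.group(4)}
            continue
        counts = COUNTS.search(line)
        if counts and pending is not None:
            pending["not_pred_off"] = int(counts.group(1))
            pending["thread_inst"] = int(counts.group(2))
            pending["warp_inst"] = int(counts.group(3))
            records.append(pending)
            pending = None
    return records


def threads_per_warp_execution(record):
    """Average thread-level executions per warp-level execution (assumed meaning)."""
    if record["warp_inst"] == 0:
        return 0.0
    return record["thread_inst"] / record["warp_inst"]


# First two records copied from the output above.
SAMPLE = """\
INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 0
notPredOffthread_inst_executed 0, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 10
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496
"""

if __name__ == "__main__":
    for rec in parse_instruction_records(SAMPLE):
        # 15872 / 496 = 32.0 for both records
        print(rec["pc"], rec["warp_inst"], threads_per_warp_execution(rec))
```

These counts say how often each instruction ran, not how long it took, which is why the duration is still missing.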

What do you mean by execution time: latency or throughput?

Well, I would have accepted the average number of clock cycles to execute the instruction.
I assume that would include any latency.

"I would like to get the average instruction execution time."
Are you looking for the average per instruction or the average for a kernel across all instructions?

The average per instruction.

Do you have an idea that would get me the data I want?

Hello … NVIDIA … Could someone please answer my question?

We do not support any metric for average execution time per instruction.

But you can look at the PC sampling feature, which gives the number of samples for each instruction along with the stall reasons. Using this information, you can pinpoint the portions of your kernel that introduce latency and the reason for that latency.

This is supported on GPU devices with compute capability 5.2 and higher (excluding mobile devices).
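
As a rough illustration of how the per-instruction sample counts could be combined with the execution counts from sass_source_map, the Python sketch below sums the samples per pc across stall reasons and divides by inst_executed. This is not a supported CUPTI metric. It rests on the assumption that samples are taken at a roughly uniform rate, so that the result is only a relative weight (samples per warp-level execution) and not a cycle count. The tuple layout, the stall-reason labels, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def samples_per_execution(pc_samples, warp_exec_counts):
    """Relative weight per pc: total PC samples / warp-level executions.

    pc_samples       -- iterable of (pc, stall_reason, samples) tuples
                        (hypothetical layout, not a CUPTI structure)
    warp_exec_counts -- dict mapping pc to its inst_executed count
    """
    totals = defaultdict(int)
    for pc, _stall_reason, samples in pc_samples:
        totals[pc] += samples  # sum over all stall reasons for this pc
    return {
        pc: totals[pc] / count
        for pc, count in warp_exec_counts.items()
        if count > 0  # skip instructions that never executed
    }


if __name__ == "__main__":
    # Made-up sample counts; only the inst_executed value 496 comes from the thread.
    pc_samples = [
        ("0", "stall_a", 30),
        ("0", "stall_b", 10),
        ("10", "stall_a", 5),
    ]
    warp_exec_counts = {"0": 496, "10": 496}
    for pc, weight in samples_per_execution(pc_samples, warp_exec_counts).items():
        print(pc, weight)
```

A larger weight only suggests that warps were observed at that instruction more often per execution; turning it into cycles would need the sampling period and further assumptions.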

For CUPTI, refer to "CUPTI :: CUDA Toolkit Documentation"; for Visual Profiler, refer to "Profiler :: CUDA Toolkit Documentation".

I’ve already explored the callback_metric, callback_timestamp, sass_source_map and pc_sampling examples.
I wish people would stop trying to predict what I want to do with the data. I’m not interested in using the pc_sampling data to identify reasons for latency. Is there any sort of surrogate for the average execution time per instruction using CUPTI? I realize there is no direct metric for what I am looking for. I was hoping CUPTI would help me derive it indirectly (if need be).