How to use CUPTI to get average instruction execution time?

bz1 · March 14, 2018, 5:16am

I would like to get the average instruction execution time. I think I need to use CUPTI to do this (if it is even possible).

I compiled and ran four of the CUPTI examples (callback_metric, callback_timestamp, pc_sampling, sass_source_map).

I also read through CUPTI.pdf and looked through cupti.h, cupti_events.h, and cupti_metrics.h.

The sass_source_map example came closest to what I needed. I was able to correlate the SASS instructions (using nvdisasm) back to the source code, which I happened to need. I can now see the number of times each instruction is executed, but I also need the average duration.

Any ideas how to do this?

–Bob

Device Name: TITAN V
SOURCE_LOCATOR SrcLctrId 2, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 1
FUCTION functionId 1, moduleId 9, name _Z9transposePfPKf
INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 0
notPredOffthread_inst_executed 0, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 10
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

SOURCE_LOCATOR SrcLctrId 3, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 14
INSTRUCTION_EXECUTION srcLctr 3, corr 202, functionId 1, pc 20
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 3, corr 202, functionId 1, pc 30
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496

SOURCE_LOCATOR SrcLctrId 4, File C:/Projects/cupti_sass2src/cupti_sass2src/kernel.cu Line 15
INSTRUCTION_EXECUTION srcLctr 4, corr 202, functionId 1, pc 40
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496
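
For reference, the counters in this listing can be post-processed with a few lines of script. The following is only a sketch in Python: it assumes each record is printed as the two-line pair shown above, and the helper name `parse_records` is made up for illustration. It parses the first two records and divides thread_inst_executed by inst_executed, which gives the average number of threads per warp-level execution (15872 / 496 = 32 here). It says nothing about duration.

```python
import re

# First two records copied from the listing above; each record spans two lines.
OUTPUT = """\
INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 0
notPredOffthread_inst_executed 0, thread_inst_executed 15872, inst_executed 496

INSTRUCTION_EXECUTION srcLctr 2, corr 202, functionId 1, pc 10
notPredOffthread_inst_executed 15872, thread_inst_executed 15872, inst_executed 496
"""

# Assumption: the output format is exactly as pasted in this post.
RECORD = re.compile(
    r"INSTRUCTION_EXECUTION srcLctr (\d+), corr (\d+), functionId (\d+), pc (\w+)\s+"
    r"notPredOffthread_inst_executed (\d+), thread_inst_executed (\d+), inst_executed (\d+)"
)


def parse_records(text):
    """Return one dict per INSTRUCTION_EXECUTION record found in text."""
    records = []
    for match in RECORD.finditer(text):
        src, _corr, _func, pc, not_pred_off, thread_exec, warp_exec = match.groups()
        records.append({
            "srcLctr": int(src),
            "pc": pc,  # kept as printed, no base is assumed
            "not_pred_off": int(not_pred_off),
            "thread_inst_executed": int(thread_exec),
            "inst_executed": int(warp_exec),
        })
    return records


records = parse_records(OUTPUT)
for r in records:
    # Thread-level executions divided by warp-level executions.
    threads_per_warp = r["thread_inst_executed"] / r["inst_executed"]
    print(f"pc {r['pc']}: {r['inst_executed']} warp executions, "
          f"{threads_per_warp:.0f} threads per warp execution")
```
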

BulatZiganshin · March 14, 2018, 9:52pm

What do you mean by execution time: latency or throughput?

bz1 · March 15, 2018, 1:13am

Well, I would have accepted the average number of clock cycles to execute the instruction.
I assume that would include any latency.

Sanjiv.Satoor · March 15, 2018, 4:53pm

"I would like to get the average instruction execution time."
Are you looking for the average per instruction or the average for a kernel across all instructions?

bz1 · March 15, 2018, 9:43pm

The average per instruction.

Do you have an idea that would get me the data I want?

bz1 · March 16, 2018, 1:40am

Hello … NVIDIA … Could someone please answer my question?

Sanjiv.Satoor · March 16, 2018, 8:12am

We do not support any metric for average execution time per instruction.

However, you can look at the PC sampling feature, which gives the number of samples for each instruction, broken down by stall reason. Using this information you can pinpoint the portions of your kernel that introduce latency and the reason for that latency.

This is supported on GPU devices with compute capability 5.2 and higher (excluding mobile devices).

For CUPTI, refer to "CUPTI :: CUDA Toolkit Documentation"; for Visual Profiler, refer to "Profiler :: CUDA Toolkit Documentation".

bz1 · March 20, 2018, 6:03pm

I’ve already explored the callback_metric, callback_timestamp, sass_source_map and pc_sampling examples.
I wish people would stop trying to predict what I want to do with the data. I’m not interested in using the pc_sampling data to identify reasons for latency. Is there any sort of surrogate for the average execution time per instruction using CUPTI? I realize there is no direct metric for what I am looking for. I was hoping CUPTI would help me derive it indirectly (if need be).
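
One candidate surrogate, sketched below in Python under several assumptions, would be to divide the PC sampling sample count for each instruction by its inst_executed count from sass_source_map. If samples are taken at a fixed period, the number of samples landing on a PC should grow with the total time warps spend at that PC, so samples per execution is at best a relative indicator of time per execution (stalls included), not a cycle count. The function name `samples_per_execution`, the tuple layout, and the sample counts are all hypothetical; only the inst_executed value 496 comes from the output earlier in this thread.

```python
from collections import defaultdict


def samples_per_execution(pc_samples, inst_executed):
    """Return {pc: total samples / warp-level executions}.

    pc_samples: iterable of (pc, stall_reason, samples) tuples, a hypothetical
        flattening of PC sampling records.
    inst_executed: {pc: warp-level execution count}.
    PCs with no samples or a zero execution count are omitted.
    """
    total = defaultdict(int)
    for pc, _stall_reason, samples in pc_samples:
        total[pc] += samples  # sum over all stall reasons for this PC
    return {
        pc: total[pc] / count
        for pc, count in inst_executed.items()
        if count > 0 and pc in total
    }


# Hypothetical sample counts, invented for illustration only.
pc_samples = [
    (0x20, "memory_dependency", 90),
    (0x20, "none", 10),
    (0x30, "none", 20),
]
# Execution counts in the shape reported by sass_source_map
# (assumption: the pc values it prints are hexadecimal offsets).
inst_executed = {0x20: 496, 0x30: 496, 0x40: 496}

ratios = samples_per_execution(pc_samples, inst_executed)
for pc in sorted(ratios):
    print(f"pc {pc:#x}: {ratios[pc]:.4f} samples per execution")
```

The ratios are only comparable within one kernel launch profiled at one sampling period, and both data sets would need to come from the same kernel and launch configuration. Instructions that are rarely sampled will give noisy values.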