CUPTI mapping of SM to instance

Sorry for what is most likely a newbie question, but I’m trying to understand how to associate per-instance event/metric counters with a particular SM. I have multiple long-running kernels executing simultaneously. I know which SM(s) each kernel is executing on using “mov.u32 %0, %smid;” from within the kernel. I’ve been through the CUPTI API docs and haven’t seen any way to correlate an instance to an SMID. Any pointers greatly appreciated.
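
For reference, the SMID read looks something like this (a minimal sketch; the kernel and launch around it are just illustrative, and note the doubled %% required by CUDA inline-PTX syntax):

#include <cstdio>
#include <cuda_runtime.h>

// Read the SM ID of the calling thread via the %smid special register.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block reports which SM the block was scheduled on.
__global__ void report_smid()
{
    if (threadIdx.x == 0)
        printf("block %d ran on SM %u\n", blockIdx.x, get_smid());
}

int main()
{
    report_smid<<<4, 512>>>();   // 4 thread blocks of 512 threads
    cudaDeviceSynchronize();
    return 0;
}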

" understand how to associate per instance event/metric counters with a particular SM"

Could you explain exactly what per-instance event/metric counters mean, and why they are important to your case?

For example, here is the IPC metric (output of nvidia-smi --query-metrics for Tesla)

ipc_instance: Instructions executed per cycle for a single multiprocessor

I am using CUPTI to retrieve the IPC metric per instance (i.e. per SM). From the data I’ve gathered, there is no straightforward mapping of “instance” to “SM ID”. Here is what I am observing for my two kernels, each of which launches 4 thread blocks of 512 threads:

Metric ipc_instance:
Instance 0: 0.000000
Instance 1: 0.000000
Instance 2: 0.154155
Instance 3: 0.000000
Instance 4: 0.000000
Instance 5: 0.336802
Instance 6: 0.000000
Instance 7: 0.153201
Instance 8: 0.340822
Instance 9: 0.000000
Instance 10: 0.153438
Instance 11: 0.343361
Instance 12: 0.000000
Instance 13: 0.156066
Instance 14: 0.341185

Kernel mappings reported by my kernels:
“response” thread block ran on SM 14
“response” thread block ran on SM 13
“response” thread block ran on SM 12
“response” thread block ran on SM 11
“request” thread block ran on SM 10
“request” thread block ran on SM 9
“request” thread block ran on SM 8
“request” thread block ran on SM 7

Based on these results, the four thread blocks of one kernel (IPC roughly 0.15) ran on instances [2, 7, 10, 13], and the four thread blocks of the other kernel (IPC roughly 0.34) ran on instances [5, 8, 11, 14].

I could determine the correlation with enough experimentation; however, I would expect this to be queryable via the CUPTI or driver APIs.

How do you intend to use the data? How would it be useful?

You also do not know in what order the instances were seated; perhaps 2, 7, 10, 13 were seated before 5, 8, 11, 14, yielding them some advantage. Your thoughts?

I’m guessing you meant:

nvprof --query-metrics

AFAIK some metrics are per-SM, whereas others are gathered from multiple SMs and are a statistical estimate of the behavior of your code across the device. AFAIK there is no way to identify the specific SM from which these measurements were derived, but I may be wrong. That is probably a question for Greg @ NV. He comes by these forums occasionally, perhaps weekly.

Using “instance” when you mean “SM” or “threadblock” certainly confused me at first.

Hi, yes, you are correct: that output came from nvprof, sorry about that. And my understanding is the same as yours: some metrics can be calculated per SM, some per DRAM channel, etc.

The term “instance” is used by the CUPTI library specifically. When profiling a metric that can be captured on a per SM basis, there are the same number of instances as there are SMs (but the numbering scheme is different, which is the gist of my question). When profiling a metric that can be captured on a per DRAM channel basis, there are an equivalent number of CUPTI “instances” (i.e. 6 channels/instances on my Tesla).

A little more context here: my “long-running” kernels run forever, rendering the “normal” profiling tools mostly useless (the kernel and application replay scenarios used by the profiler aren’t possible in my use case).
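
In case it helps others, here is roughly what the collection side looks like. This is a simplified sketch rather than my exact code: it reads the raw inst_executed event per instance using the legacy CUPTI Events API (rather than the composed ipc metric), with continuous collection mode making reads possible while kernels run, and error handling trimmed to a macro:

#include <cstdio>
#include <vector>
#include <unistd.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti.h>

#define CHECK_CUPTI(call)                                         \
    do {                                                          \
        CUptiResult _s = (call);                                  \
        if (_s != CUPTI_SUCCESS) {                                \
            const char *msg; cuptiGetResultString(_s, &msg);      \
            fprintf(stderr, "CUPTI error: %s\n", msg);            \
            return 1;                                             \
        }                                                         \
    } while (0)

int main()
{
    cudaFree(0);                       // create the primary context
    CUcontext ctx; cuCtxGetCurrent(&ctx);
    CUdevice dev; cuDeviceGet(&dev, 0);

    // Continuous mode (Tesla-only) lets us read counters without
    // kernel or application replay.
    CHECK_CUPTI(cuptiSetEventCollectionMode(ctx,
        CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS));

    CUpti_EventGroup group;
    CHECK_CUPTI(cuptiEventGroupCreate(ctx, &group, 0));

    CUpti_EventID instExecuted;
    CHECK_CUPTI(cuptiEventGetIdFromName(dev, "inst_executed", &instExecuted));
    CHECK_CUPTI(cuptiEventGroupAddEvent(group, instExecuted));

    // Ask for one value per domain instance (i.e. per SM for SM events).
    uint32_t profileAll = 1;
    CHECK_CUPTI(cuptiEventGroupSetAttribute(group,
        CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES,
        sizeof(profileAll), &profileAll));
    CHECK_CUPTI(cuptiEventGroupEnable(group));

    uint32_t instanceCount = 0;
    size_t sz = sizeof(instanceCount);
    CHECK_CUPTI(cuptiEventGroupGetAttribute(group,
        CUPTI_EVENT_GROUP_ATTR_INSTANCE_COUNT, &sz, &instanceCount));

    // ... launch the long-running kernels asynchronously here ...

    sleep(1);   // let the kernels run for a while
    std::vector<uint64_t> values(instanceCount);
    size_t bytes = values.size() * sizeof(uint64_t);
    CHECK_CUPTI(cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                         instExecuted, &bytes, values.data()));
    for (uint32_t i = 0; i < instanceCount; ++i)
        printf("instance %u: %llu\n", i, (unsigned long long)values[i]);

    CHECK_CUPTI(cuptiEventGroupDisable(group));
    CHECK_CUPTI(cuptiEventGroupDestroy(group));
    return 0;
}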

I needed the info, so I wrote a test case to determine the mapping. For reference, here is the mapping it found for a Tesla K40m (a sketch of the test case itself follows the table):

SMID     CUPTI Instance
----     --------------
  0            0
  1            3
  2            6
  3            9
  4           12
  5            1
  6            4
  7            7
  8           10
  9           13
 10            2
 11            5
 12            8
 13           11
 14           14
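
For anyone curious, the test case worked roughly like this. It is a simplified sketch of the idea rather than my verbatim code, and it assumes the block scheduler seats one block per SM when the block count equals the SM count (worth verifying by checking that the recorded SMIDs are unique). Each block spins for a duration proportional to its block index, so ranking the per-instance inst_executed deltas recovers which instance belongs to which recorded SMID:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti.h>

static CUpti_EventGroup group;
static CUpti_EventID instExecuted;
static uint32_t instanceCount;

// Per-instance read of inst_executed (error checking omitted for brevity).
static std::vector<uint64_t> read_instances()
{
    std::vector<uint64_t> v(instanceCount);
    size_t bytes = v.size() * sizeof(uint64_t);
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                             instExecuted, &bytes, v.data());
    return v;
}

__global__ void distinct_work(unsigned int *smids)
{
    unsigned int s;
    asm("mov.u32 %0, %%smid;" : "=r"(s));
    if (threadIdx.x == 0)
        smids[blockIdx.x] = s;
    // Block b spins ~(b+1) time units, so each SM retires a
    // distinctive number of instructions.
    long long target = 10000000LL * (blockIdx.x + 1);
    long long start = clock64();
    while (clock64() - start < target) { }
}

int main()
{
    const int numSMs = 15;                       // Tesla K40m
    cudaFree(0);                                 // create the primary context
    CUcontext ctx; cuCtxGetCurrent(&ctx);
    CUdevice dev; cuDeviceGet(&dev, 0);

    // Same event-group setup as the earlier sketch, trimmed down.
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGetIdFromName(dev, "inst_executed", &instExecuted);
    cuptiEventGroupAddEvent(group, instExecuted);
    uint32_t profileAll = 1;
    cuptiEventGroupSetAttribute(group,
        CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES,
        sizeof(profileAll), &profileAll);
    cuptiEventGroupEnable(group);
    size_t sz = sizeof(instanceCount);
    cuptiEventGroupGetAttribute(group,
        CUPTI_EVENT_GROUP_ATTR_INSTANCE_COUNT, &sz, &instanceCount);

    unsigned int *smids;
    cudaMallocManaged(&smids, numSMs * sizeof(*smids));

    std::vector<uint64_t> before = read_instances();
    distinct_work<<<numSMs, 32>>>(smids);   // one block per SM (verify SMIDs are unique!)
    cudaDeviceSynchronize();
    std::vector<uint64_t> after = read_instances();

    // The instance with the k-th smallest counter growth corresponds to the
    // block with the k-th smallest workload, i.e. blockIdx k and SMID smids[k].
    std::vector<int> order(numSMs);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return (after[a] - before[a]) < (after[b] - before[b]);
    });
    for (int k = 0; k < numSMs; ++k)
        printf("SMID %2u -> CUPTI instance %d\n", smids[k], order[k]);

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
    return 0;
}

Matching sorted deltas to sorted workloads avoids having to pin blocks to specific SMs, which CUDA doesn’t let you do.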