GPU utilization of *completed* processes?

I have a cluster of systems with GPUs running SGE. I would like
to produce some graphs for how utilized the GPUs of the whole
cluster are.

I was hoping to hook a job epilogue script into SGE in order to
record accounting information for the just-completed GPU process
(which was a child of the still-running SGE job process that
will run my hook). But I don’t see how to get this “historic”
information off the GPU.

My ‘plan B’ is to poll ‘nvidia-smi -q -d ACCOUNTING’ to get a
snapshot of the utilization of processes running on the GPUs,
say at 60s intervals, log that information and then have a
second script collate and aggregate the logs to produce a result
(e.g. “last month, the 24 GPUs in the cluster were 99% utilized”).

I’ve got the beginnings of a script to do this (pasted at
report-gpu-utilization - Pastebin.com), but polling doesn’t seem
accurate or elegant. Does anybody know how to get this kind of
information for a completed process? Thanks!
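
For reference, the poller is roughly this shape (a simplified sketch
rather than the pasted script; it reads whole-GPU utilization via
--query-gpu instead of parsing the -q -d ACCOUNTING output, and the
log path is just a placeholder):

#!/bin/bash
# Append one per-GPU utilization sample to a CSV log every 60 seconds.
# Field names are the ones listed by nvidia-smi --help-query-gpu.
LOG=/var/tmp/gpu-util.csv   # placeholder path
while true; do
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory \
        --format=csv,noheader >> "$LOG"
    sleep 60
done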

Study the nvidia-smi manpage and also nvidia-smi --help.

--query-accounted-apps may be what you are looking for. Study:

nvidia-smi --help-query-accounted-apps

Something like this (the first command enables GPU accounting mode, which needs root):

sudo nvidia-smi -am 1

nvidia-smi --query-accounted-apps=gpu_name,pid --format=csv
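
Roughly, an epilogue could just dump whatever is in the accounting
buffer to a log and leave the aggregation to a later pass. A sketch
(not tested with SGE; the log path is a placeholder and the field
list is taken from --help-query-accounted-apps):

#!/bin/bash
# Assumes accounting mode was already enabled with 'sudo nvidia-smi -am 1'.
# Appends one CSV row per accounted (i.e. already completed) compute process.
LOG=/var/tmp/gpu-accounting.csv   # placeholder path
nvidia-smi --query-accounted-apps=gpu_uuid,pid,gpu_utilization,mem_utilization,max_memory_usage,time \
    --format=csv,noheader >> "$LOG"

The accounting buffer keeps entries for processes that have already
exited, which is why an epilogue can still read them; if you want each
dump to start from a clean slate, I believe nvidia-smi -caa (run as
root) clears the accumulated entries.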

You may also want to set persistence mode on the GPU(s) in question; I’m not sure whether it’s required. It was set on my GPU when I ran the experiment, and with a bit of experimentation you can quickly discover whether it matters.