I have a cluster of systems with GPUs running SGE. I would like
to produce some graphs of how heavily the GPUs across the whole
cluster are utilized.
I was hoping to hook a job epilogue script into SGE in order to
record accounting information for the just-completed GPU process
(which was a child of the still-running SGE job process that
will run my hook). But I don't see how to get this "historic"
information off the GPU.
My ‘plan B’ is to poll ‘nvidia-smi -q -d ACCOUNTING’ to get a
snapshot of the utilization of processes running on the GPUs,
say at 60s intervals, log that information and then have a
second script collate and aggregate the logs to produce a result
(e.g. “last month, the 24 GPUs in the cluster were 99% utilized”).
I've got the beginnings of a script to do this (pasted at
Pastebin as "report-gpu-utilization"), but polling doesn't seem
accurate or elegant. Does anybody know how to pull utilization
information for a process that has already completed? Thanks!
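To show what I mean by the collate-and-aggregate step, here is a minimal sketch. It assumes accounting mode has been enabled on each GPU (`nvidia-smi -am 1`, needs root) — if I understand it correctly, the accounted-apps list, unlike the default process list, retains entries for processes that have already exited — and that the poller logs lines from something like `nvidia-smi --query-accounted-apps=pid,gpu_utilization,mem_utilization,time --format=csv,noheader,nounits` (check `nvidia-smi --help-query-accounted-apps` for the exact field names your driver supports). The sample data and field order below are made up for illustration:

```python
import csv
import io
from statistics import mean

# Made-up sample of polled accounting output; in the real script this
# would be read from the per-node log files instead.
SAMPLE = """\
4321, 87, 45, 600000
4398, 92, 51, 540000
"""

def parse_accounted_apps(text):
    """Turn the CSV lines into (pid, gpu_util %, mem_util %, runtime ms) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:  # skip blank lines in the log
            continue
        pid, gpu, memu, ms = (int(f) for f in fields)
        rows.append((pid, gpu, memu, ms))
    return rows

def mean_gpu_utilization(rows):
    """Simplest possible aggregate: average per-process GPU utilization."""
    return mean(r[1] for r in rows)

if __name__ == "__main__":
    rows = parse_accounted_apps(SAMPLE)
    print(f"{len(rows)} accounted processes, "
          f"mean GPU utilization {mean_gpu_utilization(rows):.1f}%")
```

A real aggregation would of course weight by runtime and bucket by month/GPU rather than taking a flat mean, but the parsing shape is the same.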