Different index definition in nvml & CUDA runtime?

Hi fellow programmers,

I am working in a multi-GPU server.
I found that, for the same GPU index “1”, “nvmlDeviceGetHandleByIndex” points me to one GPU while “cudaGetDeviceProperties” gives me another.
Is this expected?

Thanks,
Gengpu

Yes, it’s expected.

Thanks.

So what is the best way to find the “NextFreeGPU”?
After finding a free GPU through an NVML API such as “nvmlDeviceGetComputeRunningProcesses”, should I use the pciDeviceId or busId to match it against the CUDA runtime?

Certainly that is one method, yes.
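
For example, something along these lines should work (an untested sketch; link against the CUDA runtime and libnvidia-ml). nvmlPciInfo_t carries a bus-id string that cudaDeviceGetByPCIBusId can translate directly into a CUDA runtime ordinal:

```
// Sketch: map an NVML device index to its CUDA runtime index via the
// PCI bus id string (error handling trimmed).
#include <cstdio>
#include <nvml.h>
#include <cuda_runtime.h>

int cudaIndexForNvmlIndex(unsigned int nvmlIndex)
{
    nvmlDevice_t dev;
    nvmlPciInfo_t pci;
    if (nvmlDeviceGetHandleByIndex(nvmlIndex, &dev) != NVML_SUCCESS) return -1;
    if (nvmlDeviceGetPciInfo(dev, &pci) != NVML_SUCCESS) return -1;

    // pci.busId is a string like "0000:81:00.0"; the CUDA runtime can
    // translate it into its own device ordinal.
    int cudaIndex = -1;
    if (cudaDeviceGetByPCIBusId(&cudaIndex, pci.busId) != cudaSuccess) return -1;
    return cudaIndex;
}

int main()
{
    nvmlInit();
    for (unsigned int i = 0; i < 4; ++i)
        printf("NVML %u -> CUDA %d\n", i, cudaIndexForNvmlIndex(i));
    nvmlShutdown();
    return 0;
}
```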

You can also use the CUDA_VISIBLE_DEVICES environment variable along with the nvml/(driver) ordering to parcel out GPUs programmatically.
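
For instance, a launcher (or the worker itself, before its first CUDA call) could do something like the sketch below. One caveat: by default the CUDA runtime orders devices “fastest first”, so if you want the visible-device indices to line up with NVML’s PCI ordering you can also set CUDA_DEVICE_ORDER=PCI_BUS_ID on CUDA versions that support it.

```
// Sketch: restrict this process to a single GPU chosen by some external
// policy. The variable must be set before the first CUDA runtime call,
// otherwise it has no effect.
#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    // Suppose a parent launcher decided this process should use device "2"
    // in the driver/NVML ordering; hide everything else before CUDA starts.
    setenv("CUDA_VISIBLE_DEVICES", argc > 1 ? argv[1] : "2", 1);

    int n = 0;
    cudaGetDeviceCount(&n);   // now reports only the exposed device(s)
    printf("visible CUDA devices: %d\n", n);
    cudaSetDevice(0);         // device 0 is the single GPU we exposed
    return 0;
}
```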

It’s not clear to me what you’re trying to accomplish.
What counts as a “FreeGPU” is not well defined.
Are you building or working on a job scheduler?

I think it’s better to be explicit about the assignment of GPUs, rather than trying to infer what a “FreeGPU” is from ComputeRunningProcesses. It seems to me that such an inferential method could easily be exposed to race conditions.

Hi bro, thanks again for the reply.

My goal:

I have a 4-GPU server (all Tesla K40), and I may have more than 30 calculation processes running. Each of those processes may need exclusive use of one GPU; it may use it for a few minutes and then release it.

To avoid the hassle of introducing a separate “GPU scheduler”, I hope to let each process decide by itself which GPU is “free” (i.e. no other process is running on it, determined with something like “nvmlDeviceGetComputeRunningProcesses”). If no free GPU is available, the process will just wait.

One more question:
How can one process “lock” a GPU? Let’s say a process hopes to lock one GPU for 5 minutes, during which it makes many rounds of quick kernel launches. I think “cudaSetDevice” alone will not do it?

Thanks a lot for your help.
Gengpu

You can change the GPU “Compute Mode” from Default to “Exclusive Process” using the nvidia-smi tool. (start with nvidia-smi --help). When in Exclusive Process mode, a process will “own” the GPU it creates a context on, until it releases/destroys the context.

Other processes that come along later will not be able to create a context on that GPU. I don’t think this really solves the problem, though, because when another process tries to create a context on an “in-use” GPU, its API calls will simply fail. You could probably build an inferential mechanism around this somehow, but I think it’s going to be weak and difficult.
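
For illustration, the kind of probe meant here might look like the untested sketch below. It assumes the GPUs were already put in EXCLUSIVE_PROCESS mode (e.g. with nvidia-smi -c EXCLUSIVE_PROCESS, run as an administrator); a failed context creation on an owned GPU typically comes back as cudaErrorDevicesUnavailable.

```
// Sketch: claim any GPU that is currently free when all GPUs are in
// EXCLUSIVE_PROCESS mode. Context creation fails on a GPU that another
// process already owns, so we just try them in turn.
#include <cstdio>
#include <cuda_runtime.h>

int grabAnyFreeGpu()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        cudaError_t err = cudaFree(0);  // forces context creation on device d
        if (err == cudaSuccess)
            return d;                   // this process now owns GPU d
        cudaGetLastError();             // clear the error and try the next GPU
    }
    return -1;                          // every GPU is currently owned
}

int main()
{
    printf("grabbed CUDA device %d\n", grabAnyFreeGpu());
    return 0;
}
```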

I doubt you’re going to find an easy solution without building some kind of scheduler/manager into your app. I don’t think it would be hard to keep a scoreboard of which GPUs are in use.
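
As one possible shape for such a scoreboard, here is a minimal sketch assuming a Linux box, with made-up /tmp lock-file paths and helper names: one advisory lock file per GPU, claimed non-blockingly. Holding the lock for the life of the process also gives the “lock the GPU for 5 minutes” behaviour, since the lock is only released when the descriptor is closed or the process exits.

```
// Sketch: per-GPU "scoreboard" using advisory file locks (flock, Linux).
#include <cstdio>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Try to claim one of nGpus GPUs; returns the claimed index and leaves the
// lock file descriptor open in *fdOut, or returns -1 if all are busy.
int claimGpu(int nGpus, int* fdOut)
{
    char path[64];
    for (int g = 0; g < nGpus; ++g) {
        snprintf(path, sizeof(path), "/tmp/gpu%d.lock", g);
        int fd = open(path, O_CREAT | O_RDWR, 0666);
        if (fd < 0) continue;
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {   // got it, without blocking
            *fdOut = fd;
            return g;
        }
        close(fd);                                  // someone else holds it
    }
    return -1;
}

void releaseGpu(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

A worker would call claimGpu() at startup, map the claimed index to a CUDA device (by whatever ordering the processes agree on), do its few minutes of work, then call releaseGpu(); if claimGpu() returns -1 it sleeps and retries.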

Thanks.
I did some further research and found that my jobs can share the GPUs, which means I don’t need a perfect scheduler; some overlap is fine.

In the end I’ll do it this way (a rough sketch of the flow is below):
1. nvmlDeviceGetComputeRunningProcesses to tell which GPU has zero compute processes running on it (NVML index)
2. nvmlDeviceGetPciInfo to get that GPU’s PCI location
3. cudaGetDeviceProperties to find which CUDA device matches that PCI location (CUDA index)
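
For what it’s worth, here is a rough, untested sketch of those three steps (the name findFreeCudaDevice is just made up for illustration). As discussed above, the check-then-use window is still open to races, which is acceptable here since the jobs can share a GPU.

```
// Sketch: return the CUDA runtime index of the first GPU with no compute
// processes on it, or -1 if none is free right now (error handling trimmed).
#include <cstdio>
#include <nvml.h>
#include <cuda_runtime.h>

int findFreeCudaDevice()
{
    unsigned int nNvml = 0;
    nvmlDeviceGetCount(&nNvml);
    int nCuda = 0;
    cudaGetDeviceCount(&nCuda);

    for (unsigned int i = 0; i < nNvml; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        // Step 1: with infoCount = 0, NVML returns NVML_SUCCESS only when
        // there are no compute processes on this GPU.
        unsigned int nProcs = 0;
        if (nvmlDeviceGetComputeRunningProcesses(dev, &nProcs, NULL) != NVML_SUCCESS)
            continue;                               // GPU busy (or query failed)

        // Step 2: PCI location of the free GPU on the NVML side.
        nvmlPciInfo_t pci;
        if (nvmlDeviceGetPciInfo(dev, &pci) != NVML_SUCCESS) continue;

        // Step 3: find the CUDA device at the same PCI location.
        for (int c = 0; c < nCuda; ++c) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, c);
            if ((unsigned int)prop.pciDomainID == pci.domain &&
                (unsigned int)prop.pciBusID    == pci.bus    &&
                (unsigned int)prop.pciDeviceID == pci.device)
                return c;                           // CUDA index of the free GPU
        }
    }
    return -1;
}

int main()
{
    nvmlInit();
    printf("free CUDA device: %d\n", findFreeCudaDevice());
    nvmlShutdown();
    return 0;
}
```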