Failing to launch CUDA-MPS

Hi everyone.

I am trying to work with CUDA-MPS, but I am not able to start the CUDA-MPS server. I have an application that creates 4 client MPI processes. Both the CUDA-MPS server and the clients are on the same machine. The machine has two GPUs: a K20c with compute capability 3.5 and a GTX 580 with compute capability 2.0. I want to launch the 4 MPI processes on the K20c GPU, which has device ID 0. Following the CUDA-MPS documentation, I execute the following steps:

  1. export CUDA_VISIBLE_DEVICES=0
  2. nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  3. nvidia-cuda-mps-control -d

Once the previous commands have been executed, I launch my application with the 4 client MPI processes using the command mpirun -np 4 ./simpleMPI 8. My application finishes without errors, but the CUDA-MPS log file shows the following lines:

[2015-10-23 10:59:02.787 Other 32742] MPS server failed to start
[2015-10-23 10:59:02.787 Other 32742] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher Tesla/Quadro GPU.
[2015-10-23 10:59:02.787 Other 32742] MPS is not supported on multi-GPU configurations. Please use CUDA_VISIBLE_DEVICES to select the device on which the MPS server should be run.

Why does this occur?

Many thanks in advance

  1. Which CUDA version are you using?

  2. As a diagnostic step, after performing this:

export CUDA_VISIBLE_DEVICES=0

run the deviceQuery sample code and confirm that only the K20c is reported (see the sketch below). I expect that to be the case given your description, but this is a useful confirmation.

  3. Please provide the output of the following command on your system:
nvidia-smi
  4. When you run the following commands as you have indicated:
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

are you doing so with root privilege?
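
For item 2, here is a minimal sketch of that check, assuming the deviceQuery sample has already been built under the default samples path (adjust the path to wherever your copy of deviceQuery lives):

export CUDA_VISIBLE_DEVICES=0
# with the mask above, deviceQuery should report exactly one device, the K20c
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery | grep -E "^Device [0-9]+:"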

I just ran a simple test, and it seems to be working for me. I have a RHEL 6.2 node with CUDA 7.0 that has 3 GPUs in it:

$ nvidia-smi
Fri Oct 23 05:58:41 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075         Off  | 0000:03:00.0     Off |                    0 |
| 30%   51C    P0     0W / 225W |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  NVS 310             Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   42C    P0    N/A /  N/A |      3MiB /   511MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 0000:82:00.0     Off |                    0 |
| 23%   38C    P0    65W / 235W |     23MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1              C   Not Supported                                         |
+-----------------------------------------------------------------------------+
$

Note that the only cc3.5+ GPU here is the K40c, and also note that all GPUs are currently in Default compute mode. Also note that the K40c GPU is enumerated here as device 2, but if I were to check the enumeration under CUDA (for example, by running deviceQuery), it would be enumerated as device 0. This distinction is important in the following discussion.
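
If you want to line the two orders up explicitly, comparing PCI bus IDs works, since those are stable across both tools. This is just a sketch using standard nvidia-smi query options (and again assuming a built deviceQuery at the usual samples path), not something from the output above:

# nvidia-smi enumeration order, with each GPU's PCI bus ID
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
# deviceQuery prints PCI bus/location information per CUDA device,
# so matching on bus ID tells you which CUDA device ID corresponds
# to which nvidia-smi device ID
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery | grep -i pci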

Following the MPS instructions here:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

for the use case covered in section 5.1.1 (multi-user setup), I created the following scripts:

start_as_root.bash:

#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

Note that in the above script, I am restricting the CUDA device to 0 (corresponding to the CUDA enumeration order), but the device I select for modification of the compute mode is device 2 (corresponding to the nvidia-smi enumeration order).

stop_as_root.bash:

#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT

The above two scripts are used to start and stop the MPS server control daemon.
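
If you want to confirm that the control daemon is actually up before launching any clients, the MPS control interface can be queried; get_server_list is one of the control commands listed in the MPS documentation (the exact output may vary by CUDA version, so treat this as a sketch):

# ask the running control daemon for the PIDs of any active MPS servers;
# an empty list is fine (a server is spawned on the first client connection),
# but if the daemon is not running this will not return a list at all
echo get_server_list | nvidia-cuda-mps-control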

I also have a script to run the test.

test.bash:

#!/bin/bash
/usr/lib64/openmpi/bin/mpirun -n 2 simpleMPI/simpleMPI

When I run the following sequence, everything seems to work correctly:

$ su
Password:
# ./start_as_root.bash
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./test.bash
Running on 2 nodes
Average of square roots is: 0.667279
PASSED
$ su
Password:
# ./stop_as_root.bash
Set compute mode to DEFAULT for GPU 0000:82:00.0.
All done.
# exit
exit
$

As a proof point, we can observe what happens if I run the test script with the GPU set to EXCLUSIVE_PROCESS mode but the daemon not running:

$ su
Password:
# nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./test.bash
Running on 2 nodes
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 10
Test FAILED
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 10.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$

Other notes:

  1. The enumeration order of nvidia-smi does not depend on the CUDA runtime and follows the PCI enumeration order. The enumeration order of the CUDA runtime follows a heuristic that generally tries to order the “most powerful” GPU first (see the sketch after these notes for one way to make the two orders match).

  2. On my node I was using OpenMPI, as that is conveniently installed as part of the RHEL 6.2 distribution. I copied the contents of the

/usr/local/cuda-7.0/samples/0_Simple/simpleMPI

directory to a local directory, then built the code with the following command:

nvcc -o simpleMPI -I/usr/include/openmpi-x86_64 -I/usr/local/cuda/samples/common/inc -L/usr/lib64/openmpi/lib -lmpi_cxx  simpleMPI.cpp simpleMPI.cu
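
Regarding note 1: if the mismatch between the two enumeration orders keeps tripping you up, newer CUDA versions (CUDA 7.0 and later, as far as I know) honor a CUDA_DEVICE_ORDER environment variable that makes the CUDA runtime enumerate GPUs in PCI bus order, so the CUDA device IDs line up with the nvidia-smi IDs. A sketch of what my start script would look like in that case:

#!/bin/bash
# the following must be performed with root privilege
# force the CUDA runtime to enumerate in PCI bus order (matches nvidia-smi);
# supported in CUDA 7.0 and later, as far as I know
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES="2"   # "2" now refers to the same GPU in both tools
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d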

Hi txbob! I have done all the steps you advised, but it does not seem to work. I am posting the output of the steps below.

The output of my nvidia-smi

Sat Oct 24 13:11:34 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.29     Driver Version: 340.29         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 580     Off  | 0000:03:00.0     N/A |                  N/A |
| 43%   50C    P0    N/A /  N/A |      4MiB /  1535MiB |     N/A   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 38%   46C    P0    56W / 225W |     11MiB /  4799MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

The K20c GPU has ID 1 according to nvidia-smi, right? Based on that ID, I have modified your bash script:

start_as_root.bash

export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 1 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

If I execute the deviceQuery executable, only the K20c GPU is detected by CUDA (the K20c has ID 0 for CUDA). As root, I execute this script, and its output is:

Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:04:00.0.
All done.

I have copied the SDK simpleMPI code to a local directory and compiled it with the Makefile provided with the example. Note: my CUDA version is 6.5. As a non-root user, I execute the example with the command:

mpirun -n 2 ./simpleMPI

But the output is:

mpirun -n 2 ./simpleMPI
Running on 2 nodes
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 2
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 2
Test FAILED
Test FAILED
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 3717 on
node mistral exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[mistral:03715] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[mistral:03715] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

What is in the cuda-mps log file now?

What is the result of running:

uname -a

?

The cuda-mps log files show:

server.log

[2015-10-24 14:13:39.529 Other  3798] New client 3999 connected
[2015-10-24 14:13:39.549 Other  3798] New client 4001 connected
[2015-10-24 14:13:39.551 Other  3798] New client 4002 connected
[2015-10-24 14:13:40.660 Other  3798] New client 4002 connected
[2015-10-24 14:13:40.660 Other  3798] New client 4001 connected
[2015-10-24 14:13:40.662 Other  3798] Client 4001 disconnected
[2015-10-24 14:13:40.662 Other  3798] Client 4002 disconnected
[2015-10-24 14:13:40.666 Other  3798] Client 4002 disconnected
[2015-10-24 14:13:40.666 Other  3798] Client 4001 disconnected
[2015-10-24 14:13:40.669 Other  3798] Client 3999 disconnected
[2015-10-24 14:13:49.367 Other  3798] Waiting for current clients to finish
[2015-10-24 14:13:49.367 Other  3798] Exit

control.log

[2015-10-24 14:13:39.529 Control  3794] Accepting connection...
[2015-10-24 14:13:39.529 Control  3794] NEW CLIENT 3999 from user 1005: Server already exists
[2015-10-24 14:13:39.549 Control  3794] Accepting connection...
[2015-10-24 14:13:39.549 Control  3794] NEW CLIENT 4001 from user 1005: Server already exists
[2015-10-24 14:13:39.551 Control  3794] Accepting connection...
[2015-10-24 14:13:39.551 Control  3794] NEW CLIENT 4002 from user 1005: Server already exists
[2015-10-24 14:13:40.660 Control  3794] Accepting connection...
[2015-10-24 14:13:40.660 Control  3794] NEW CLIENT 4002 from user 1005: Server already exists
[2015-10-24 14:13:40.660 Control  3794] Accepting connection...
[2015-10-24 14:13:40.660 Control  3794] NEW CLIENT 4001 from user 1005: Server already exists
[2015-10-24 14:13:49.367 Control  3794] Accepting connection...
[2015-10-24 14:13:49.367 Control  3794] NEW UI
[2015-10-24 14:13:49.367 Control  3794] Cmd:quit
[2015-10-24 14:13:49.948 Control  3794] Server 3798 exited with status 0
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.cc
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.cb
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.ca
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c9
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c8
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c7
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c6
[2015-10-24 14:13:49.949 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c5
[2015-10-24 14:13:49.950 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c4
[2015-10-24 14:13:49.950 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c3
[2015-10-24 14:13:49.950 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c2
[2015-10-24 14:13:49.950 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c1
[2015-10-24 14:13:49.950 Control  3794] Removed Shm file at /dev/shm/cuda.shm.ed6.c0

When I execute uname -a the output is:

Linux mistral 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Your control.log should have had only two “NEW CLIENT” messages in it. Likewise, your server.log should have had only two “client connected” and two “client disconnected” messages in it.

It seems to me that something else is running.

Try:

  1. Reboot the machine.
  2. Delete the MPS log files (see the sketch below).
  3. Rerun the test.
  4. Paste the new log files here.
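
A sketch of steps 2 and 3, assuming the MPS logs are in the default location of /var/log/nvidia-mps (adjust the path if your control.log and server.log live elsewhere) and reusing the start and test scripts from earlier in the thread:

# step 2: remove the old MPS logs so the next run starts from a clean slate
# (default log location is assumed here)
rm -f /var/log/nvidia-mps/control.log /var/log/nvidia-mps/server.log
# step 3: restart the daemon (as root) and rerun the test as a normal user
./start_as_root.bash
./test.bash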

Hi txbob, I have rebooted my machine, deleted the MPS log files, and rerun the test. The log files show:

server.log

[2015-10-26 08:19:59.257 Other  1553] Start
[2015-10-26 08:19:59.551 Other  1553] New client 1552 connected
[2015-10-26 08:19:59.742 Other  1553] New client 1567 connected
[2015-10-26 08:19:59.742 Other  1553] New client 1566 connected
[2015-10-26 08:20:00.941 Other  1553] New client 1567 connected
[2015-10-26 08:20:00.941 Other  1553] New client 1566 connected
[2015-10-26 08:20:00.943 Other  1553] Client 1567 disconnected
[2015-10-26 08:20:00.943 Other  1553] Client 1566 disconnected
[2015-10-26 08:20:00.956 Other  1553] Client 1566 disconnected
[2015-10-26 08:20:00.956 Other  1553] Client 1567 disconnected
[2015-10-26 08:20:00.966 Other  1553] Client 1552 disconnected
[2015-10-26 08:20:12.913 Other  1553] Waiting for current clients to finish
[2015-10-26 08:20:12.913 Other  1553] Exit

control.log

[2015-10-26 08:18:07.551 Control  1434] Start
[2015-10-26 08:19:57.328 Control  1434] Accepting connection...
[2015-10-26 08:19:57.328 Control  1434] NEW CLIENT 1552 from user 1005: Server is not ready, push client to pending list
[2015-10-26 08:19:57.329 Control  1434] Starting new server 1553 for user 1005
[2015-10-26 08:19:59.550 Control  1434] Accepting connection...
[2015-10-26 08:19:59.550 Control  1434] NEW SERVER 1553: Ready
[2015-10-26 08:19:59.741 Control  1434] Accepting connection...
[2015-10-26 08:19:59.741 Control  1434] NEW CLIENT 1567 from user 1005: Server already exists
[2015-10-26 08:19:59.741 Control  1434] Accepting connection...
[2015-10-26 08:19:59.741 Control  1434] NEW CLIENT 1566 from user 1005: Server already exists
[2015-10-26 08:20:00.941 Control  1434] Accepting connection...
[2015-10-26 08:20:00.941 Control  1434] NEW CLIENT 1567 from user 1005: Server already exists
[2015-10-26 08:20:00.941 Control  1434] Accepting connection...
[2015-10-26 08:20:00.941 Control  1434] NEW CLIENT 1566 from user 1005: Server already exists
[2015-10-26 08:20:12.913 Control  1434] Accepting connection...
[2015-10-26 08:20:12.913 Control  1434] NEW UI
[2015-10-26 08:20:12.913 Control  1434] Cmd:quit
[2015-10-26 08:20:13.473 Control  1434] Server 1553 exited with status 0
[2015-10-26 08:20:13.473 Control  1434] Removed Shm file at /dev/shm/cuda.shm.611.cc
[2015-10-26 08:20:13.474 Control  1434] Removed Shm file at /dev/shm/cuda.shm.611.cb
[2015-10-26 08:20:13.474 Control  1434] Removed Shm file at /dev/shm/cuda.shm.611.ca
[2015-10-26 08:20:13.474 Control  1434] Removed Shm file at /dev/shm/cuda.shm.611.c9
[2015-10-26 08:20:13.474 Control  1434] Removed Shm file at /dev/shm/cuda.shm.611.c8

It’s very strange, because I launch the test with only 2 processes, yet the control.log file shows more clients. Could it be the CUDA version? In this document http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf the way the MPS daemon is started and the application is launched is different if you are not using CUDA 7.0, but I don't completely understand the steps described there.

Hi txbob, I have upgraded to CUDA 7.0 and your steps now work fine for me. The MPS configuration for earlier CUDA versions looks more difficult. Now the log files show:

server.log

[2015-10-26 15:20:57.072 Other  1995] Start
[2015-10-26 15:20:57.304 Other  1995] New client 1993 connected
[2015-10-26 15:20:57.304 Other  1995] New client 1994 connected
[2015-10-26 15:20:57.407 Other  1995] Client 1993 disconnected
[2015-10-26 15:20:57.414 Other  1995] Client 1994 disconnected
[2015-10-26 15:21:06.321 Other  1995] Waiting for current clients to finish
[2015-10-26 15:21:06.321 Other  1995] Exit

control.log

[2015-10-26 15:20:47.866 Control  1989] Start
[2015-10-26 15:20:55.715 Control  1989] Accepting connection...
[2015-10-26 15:20:55.715 Control  1989] NEW CLIENT 1994 from user 1005: Server is not ready, push client to pending list
[2015-10-26 15:20:55.715 Control  1989] Starting new server 1995 for user 1005
[2015-10-26 15:20:55.715 Control  1989] Accepting connection...
[2015-10-26 15:20:55.715 Control  1989] NEW CLIENT 1993 from user 1005: Server is not ready, push client to pending list
[2015-10-26 15:20:57.303 Control  1989] Accepting connection...
[2015-10-26 15:20:57.303 Control  1989] NEW SERVER 1995: Ready
[2015-10-26 15:21:06.320 Control  1989] Accepting connection...
[2015-10-26 15:21:06.320 Control  1989] NEW UI
[2015-10-26 15:21:06.321 Control  1989] Cmd:quit
[2015-10-26 15:21:06.737 Control  1989] Server 1995 exited with status 0

Thank you for your help :-)