Problem running a parallel CUDA process in AMBER

Hi,

I am trying to run an application of the AMBER molecular dynamics program on 2 CUDA cards as a parallel process. My OS is Ubuntu 10.04.4 LTS. When I checked for CUDA-capable devices using lspci | grep -i nvidia, I got:

lspci | grep -i nvidia

14:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)
15:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)

The output of nvcc -V is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

When I ran nvidia-smi, I got:

+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.84   Driver Version: 304.84         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2090              | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    78W / 225W |   0%    9MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2090              | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    77W / 225W |   0%    9MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

So I guess all the CUDA-capable devices in the machine are being detected.

When I run an AMBER application using a single GPU card (pmemd.cuda), the process runs successfully. The output of nvidia-smi is:

Sat Jan 30 01:23:11 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.84   Driver Version: 304.84         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2090              | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0   184W / 225W |  26% 1396MB / 5375MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2090              | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    78W / 225W |   0%   10MB / 5375MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      4566  pmemd.cuda                                          1383MB  |
+-----------------------------------------------------------------------------+

But when I try to run the process in parallel using 2 GPU cards (pmemd.cuda.MPI), I get this error message:

cudaGetDeviceCount failed no CUDA-capable device is detected
cudaGetDeviceCount failed no CUDA-capable device is detected
rank 1 in job 5 mambo_35283 caused collective abort of all ranks
exit status of rank 1: return code 255

I posted the problem to the AMBER mailing list, but since no reply came, I guess the source of the problem lies somewhere in the CUDA installation.

What could be wrong here?

Thanks

What is the complete MPI command line you are using to launch the AMBER executable?

The command line I am using to run the CUDA MPI process is:

mpirun -np 2 pmemd.cuda.MPI -O -i input_file -p mut.prmtop -c restart_file -o test.out -r test.rst -x test.crd

Before running this command, I set the environment variable CUDA_VISIBLE_DEVICES to 0,1:

export CUDA_VISIBLE_DEVICES=0,1

You might want to try creating a machine/host file that explicitly calls out that both MPI ranks are to be launched on the same node:

mpirun -hostfile ~/.amber.hosts.2 ...

where your ~/.amber.hosts.2 is something like:

localhost
localhost
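
It is also worth confirming that CUDA_VISIBLE_DEVICES actually reaches both MPI ranks, since some launchers do not forward exported variables to the processes they start. A minimal sketch like the following (the file name env_check.c and the mpicc build line are assumptions about your setup) prints what each rank sees:

/* env_check.c (name assumed): each MPI rank reports the
 * CUDA_VISIBLE_DEVICES value it actually received from the launcher. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("rank %d: CUDA_VISIBLE_DEVICES=%s\n",
           rank, visible ? visible : "(unset)");

    MPI_Finalize();
    return 0;
}

Build it with something like mpicc env_check.c -o env_check and launch it with the same mpirun/machinefile arguments you use for pmemd.cuda.MPI; if a rank prints (unset), the variable is being dropped by the launcher rather than by CUDA.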

Thanks, I tried this, but I am still getting the same error, now repeated twice like this:

cudaGetDeviceCount failed no CUDA-capable device is detected
cudaGetDeviceCount failed no CUDA-capable device is detected
rank 1 in job 10  mambo_35283   caused collective abort of all ranks
  exit status of rank 1: return code 255 
rank 0 in job 10  mambo_35283   caused collective abort of all ranks
  exit status of rank 0: return code 255

I used this command:

mpirun -machinefile ~/.amber.host.2 -np 2 pmemd.cuda.MPI -O -i input_file -p prm.prmtop -c restart.rst

I must mention that some time back I was able to run the CUDA MPI process successfully. This problem popped up suddenly and I am unable to figure out what is wrong. The CUDA and kernel versions seem consistent, all drivers are in place, and nothing seems to be wrong with the nvidia-smi and deviceQuery output.

So the deviceQuery output looks OK?

What happens if you run deviceQuery as an MPI job with one process?
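
If you don't have the SDK deviceQuery binary handy, a minimal stand-in like this (the file name and build paths are assumptions; adjust them to your CUDA and MPI install) makes the same cudaGetDeviceCount call that pmemd.cuda.MPI makes at startup:

/* mpi_devcount.c (name assumed): each rank asks the CUDA runtime how many
 * devices it can see, which is the call that is failing for pmemd.cuda.MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, ndev = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaError_t err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess)
        printf("rank %d: cudaGetDeviceCount failed: %s\n",
               rank, cudaGetErrorString(err));
    else
        printf("rank %d: %d CUDA device(s) visible\n", rank, ndev);

    MPI_Finalize();
    return 0;
}

Something like mpicc mpi_devcount.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -o mpi_devcount should build it (paths assumed); run it first with mpirun -np 1 and then with -np 2 to see whether the failure only appears when a second rank is launched.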