Building the cluster (integrating John with CUDA support and OpenMPI) was straightforward with only one minor glitch.
I am operating with:
R21 (release), REVISION: 5.0, GCID: 7273100, BOARD: ardbeg, EABI: hard, DATE:
Wed Jun 8 04:19:09 UTC 2016
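That string is the L4T release banner (the first line of /etc/nv_tegra_release on a stock image); to check what your own board is running:

head -n 1 /etc/nv_tegra_release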
To run John across the cluster after compiling it (I am NFS-mounting the john “run” directory so that every node in the cluster can read, write, and execute its contents), you must add a new directory to the loader's library search path.
Normally that is fine, but there is an error in the shared libraries shipped with L4T 21.5.
I compiled and ran john in standalone (single node) mode and everything went well. When I tried to execute the binaries under the OpenMPI framework (via mpirun), they spewed errors about a library that could not be found. I updated /etc/ld.so.conf.d/nvidia-tegra.conf to add this:
/usr/lib/arm-linux-gnueabihf/tegra
added below
/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib
However when I ran ldconfig, it produced an error!
root@gpu02:/etc/ld.so.conf.d# ldconfig
/sbin/ldconfig.real: /usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib/libcudnn.so.6.5 is not a symbolic link
root@gpu02:/etc/ld.so.conf.d# cd /usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ls -l *cudnn*
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 9308614 Apr 26 21:49 libcudnn_static.a
The problem is that libcudnn.so and libcudnn.so.6.5 were installed as full copies of the library (note the identical sizes above) instead of symbolic links to the real versioned file, libcudnn.so.6.5.48, which is what ldconfig expects. You have to nuke the two incorrectly installed copies and recreate them as symlinks, i.e.
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# rm libcudnn.so libcudnn.so.6.5
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ln -s libcudnn.so.6.5.48 libcudnn.so.6.5
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ln -s libcudnn.so.6.5.48 libcudnn.so
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ls -l *cudnn*
lrwxrwxrwx 1 root root 18 May 25 01:02 libcudnn.so -> libcudnn.so.6.5.48
lrwxrwxrwx 1 root root 18 May 25 01:02 libcudnn.so.6.5 -> libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 9308614 Apr 26 21:49 libcudnn_static.a
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ldconfig
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib#
No errors!
Adding the new path to ld.so.conf solved the problem of the shared library not being found, and the other cluster nodes were then able to execute john normally.
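Each Jetson has its own local copy of CUDA, so if the other nodes show the same cuDNN packaging quirk you can push the symlink fix out in one pass instead of repeating it by hand. A rough sketch only, assuming passwordless root ssh and the gpu01 through gpu04 hostnames used in the tests below (gpu02 was already fixed above):

LIBDIR=/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib
for node in gpu01 gpu03 gpu04; do
    ssh root@$node "cd $LIBDIR &&
        rm libcudnn.so libcudnn.so.6.5 &&
        ln -s libcudnn.so.6.5.48 libcudnn.so.6.5 &&
        ln -s libcudnn.so.6.5.48 libcudnn.so &&
        ldconfig"
done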
The status output below is from a run against an MD5 hash. (The “Forwarding signal 10” line is mpirun passing SIGUSR1 through to the john processes, which makes each rank print a status line.)
mpirun: Forwarding signal 10 to job
1 0g 0:00:21:49 57.16% 2/3 (ETA: 19:56:44) 0g/s 67059p/s 67059c/s 67059C/s Bakenttnekab2..Haafssfaah2
3 0g 0:00:21:49 74.18% 2/3 (ETA: 15:47:58) 0g/s 65704p/s 65704c/s 65704C/s Novelet?..Outrepasserons?
2 0g 0:00:21:49 37.14% 2/3 (ETA: 20:17:18) 0g/s 54131p/s 54131c/s 54131C/s sudsy?..toatoa?
4 0g 0:00:21:49 47.60% 2/3 (ETA: 20:04:24) 0g/s 59701p/s 59701c/s 59701C/s Dortohg..Nerace
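For the record, the launch would have looked roughly like this, run from the NFS-mounted john run directory and reusing the machinefile from the MPI tests below; the hash file name and the --format flag are my own placeholders, not copied from the session:

mpirun -n 4 -machinefile /master/mpi_tests/machinefile ./john --format=raw-md5 md5-hashes.txt

# from another terminal; mpirun forwards the signal and each rank prints a status line
kill -USR1 <pid-of-mpirun>

Below, a round of the basic MPI plumbing tests, looped by a little stress script: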
ubuntu@gpu01:/master/mpi_tests$ ./stress
- true
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
gpu02 19:51:01 up 23:31, 1 user, load average: 0.28, 0.26, 0.20
gpu01 19:51:01 up 6:02, 6 users, load average: 0.85, 0.48, 0.32
gpu03 15:51:01 up 5:47, 2 users, load average: 0.46, 0.62, 0.46
gpu04 19:51:00 up 15:02, 1 user, load average: 0.18, 0.24, 0.22
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
Hello, Cluster! Python process 1 of 4 on gpu01.
Hello, Cluster! Python process 2 of 4 on gpu02.
Hello, Cluster! Python process 3 of 4 on gpu03.
Hello, Cluster! Python process 4 of 4 on gpu04.
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/cpi
Process 1 of 4 is on gpu01
Process 2 of 4 is on gpu02
Process 4 of 4 is on gpu04
Process 3 of 4 is on gpu03
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.002959
- true
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
gpu02 19:51:03 up 23:31, 1 user, load average: 0.34, 0.27, 0.21
gpu03 15:51:03 up 5:47, 2 users, load average: 0.46, 0.62, 0.46
gpu01 19:51:03 up 6:02, 6 users, load average: 0.94, 0.51, 0.33
gpu04 19:51:03 up 15:02, 1 user, load average: 0.33, 0.27, 0.22
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
Hello, Cluster! Python process 2 of 4 on gpu02.
Hello, Cluster! Python process 1 of 4 on gpu01.
Hello, Cluster! Python process 3 of 4 on gpu03.
Hello, Cluster! Python process 4 of 4 on gpu04.
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/cpi
Process 1 of 4 is on gpu01
Process 2 of 4 is on gpu02
Process 3 of 4 is on gpu03
Process 4 of 4 is on gpu04
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.003526
- true
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
gpu02 19:51:05 up 23:31, 1 user, load average: 0.34, 0.27, 0.21
gpu01 19:51:05 up 6:02, 6 users, load average: 0.94, 0.51, 0.33
gpu03 15:51:05 up 5:47, 2 users, load average: 0.82, 0.70, 0.48
gpu04 19:51:05 up 15:02, 1 user, load average: 0.33, 0.27, 0.22
- mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
^Cmpirun: killing job...
ubuntu@gpu01:/master/mpi_tests$
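The stress script itself is nothing fancy; judging from the trace it just loops the three test jobs until interrupted. A reconstruction (not the original file):

#!/bin/sh -x
# loop the basic MPI test jobs until interrupted (^C)
MF=/master/mpi_tests/machinefile
while true; do
    mpirun -n 4 -machinefile $MF /master/mpi_tests/system
    mpirun -n 4 -machinefile $MF /master/mpi_tests/helloworld.py
    mpirun -n 4 -machinefile $MF /master/mpi_tests/cpi
done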
ETA: I see I need to get ntpd running on the nodes; gpu03's clock is four hours off in the output above… sigh.
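On these Ubuntu 14.04 based L4T images that should just be, on each node:

apt-get install ntp    # installs and starts ntpd; adjust /etc/ntp.conf as needed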