MPI (MPICH2) and GTX 580: using MPI with CUDA

I’m afraid I’m probably not going to like the answer to this, but…

We are developing a model which uses CUDA kernels to accelerate certain calculations. Eventually we will need to scale this to run across a large number of EC2 GPU compute cluster instances, but for now I am trying to use a two-node (three-GPU) Windows-based cluster as a test bed for the code. In one machine (running Server 2008 R2) we have a GTX 570 plus a Tesla C2050 (in TCC mode), and in the other machine (Windows 7) we have a 3 GB GTX 580. We can run a simple Hello World exe (i.e. no CUDA) across both hosts using mpiexec -hosts 2 server 5 win7 5 \someshare\mpiHW.exe without a hitch. However, trying to run the simpleMPI example from the SDK results in error 38 (cudaErrorNoDevice) on the Windows 7 host - i.e. it does not recognise the GTX 580. The code itself runs on the Windows 7 machine without a hitch, just not via MPI.
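In case it helps anyone debugging a similar setup, here is a minimal per-rank device check I put together (a stand-alone sketch, not the SDK sample) that makes it easy to see which rank is hitting the error and on which host. It assumes only the CUDA runtime and MPI headers.

// Minimal per-rank device check: each rank reports its host and what the
// CUDA runtime sees there. A rank that reports an error here is the one
// returning error 38 (cudaErrorNoDevice).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME] = {0};
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    printf("rank %d on %s: cudaGetDeviceCount -> %s (%d device(s))\n",
           rank, host, cudaGetErrorString(err), count);

    MPI_Finalize();
    return 0;
}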

So… has NVIDIA restricted MPI to the Tesla TCC driver? If so, I will seriously have to consider whether I should be developing code for this platform!

Hopefully someone will confirm it is more likely to be an error in our MPI setup because they have this working with GeForce cards across multiple nodes - here’s hoping.

It is not an MPI problem; it is related to the WDDM driver model in Windows 7.
When you are running through a remote connection, you will not be able to access the GPU unless you use TCC mode (or go back to XP…) or do something similar to what is explained in this post (The Official NVIDIA Forums | NVIDIA).
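For what it’s worth, you can also check from code which driver model each device is using: cudaDeviceProp has a tccDriver field that is 1 under the TCC driver and 0 under WDDM. A minimal sketch (plain CUDA runtime, nothing else assumed):

// Report the driver model of each visible device. tccDriver is 1 when the
// device is running the TCC driver, 0 when it is on WDDM (the case that
// fails when running through a remote session on Windows 7).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices visible from this session\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s, driver model: %s\n",
               i, prop.name, prop.tccDriver ? "TCC" : "WDDM");
    }
    return 0;
}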

It will work just fine in Linux.

Thanks for your help. That makes sense.

As we will be using Linux GPU clusters on EC2, I guess the simplest solution would be to switch to Linux for this test cluster (I have no real preference other than that I like to use the VS2010 IDE). I’ll check out the thread you suggested and then decide.

We use MPICH2 (version 1.4.1p1) for our Linux GPU cluster (4 PCs with 4x GTX 580 each) and it works like a charm.
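With several GPUs per node, one detail that is easy to trip over is mapping MPI ranks to devices. A common approach (sketched below; the naming scheme is just an illustration, not tied to any particular cluster) is to derive a local rank from the processor name and pass it to cudaSetDevice:

// Sketch: map MPI ranks to GPUs on a multi-GPU node. Ranks that report the
// same processor name are numbered locally, and each picks a device based
// on that local index.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME] = {0};
    int len = 0;
    MPI_Get_processor_name(name, &len);

    // Gather all hostnames so each rank can count how many lower-numbered
    // ranks share its node -- that count is its local rank.
    char* all = new char[size * MPI_MAX_PROCESSOR_NAME];
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    int local_rank = 0;
    for (int i = 0; i < rank; ++i)
        if (strcmp(&all[i * MPI_MAX_PROCESSOR_NAME], name) == 0)
            ++local_rank;
    delete[] all;

    int devices = 0;
    cudaGetDeviceCount(&devices);
    cudaSetDevice(local_rank % devices);

    printf("rank %d (local %d) on %s -> GPU %d of %d\n",
           rank, local_rank, name, local_rank % devices, devices);

    MPI_Finalize();
    return 0;
}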

FYI: if you use MVAPICH2 1.8rc1, you can MPI_Send/MPI_Recv directly from device pointers. And when the send/recv is between devices on the same host, it will automatically use CUDA IPC to transfer data directly from GPU to GPU without staging through the host. This can provide significant performance gains.
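A rough sketch of what that looks like, assuming a CUDA-enabled MVAPICH2 build (and, if I remember correctly, MV2_USE_CUDA=1 set at run time); the message size and tag here are arbitrary:

// Sketch of a CUDA-aware send/recv: with a CUDA-enabled MVAPICH2 build the
// device pointer d_buf can be passed straight to MPI_Send/MPI_Recv, and the
// library handles the device-to-device (or IPC) transfer itself.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 // arbitrary message size (floats)
    float* d_buf = 0;
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    // device pointer
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             // device pointer
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}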

Thanks for confirming this will work with Linux. Actually, I have just been given the opportunity to purchase a couple of M2050s going cheap - I guess these will work in TCC mode without resorting to a Linux OS installation. As we need to run some simulations in a hurry for a grant proposal, I will probably forgo that pleasure for now if at all possible ;-)