Using all 4 GPUs in an S1070 from a multi-core CPU? How?

Hi, I’m trying to speed up my program by using multiple GPUs. My current program looks like this:

main()
{
    ...
    CPU_func()
    ...
}

CPU_func()
{
    for (i = 0; i < n; i++)
    {
        ...
        GPU_function(a[i], b[i], c[i], ...)
        ...
    }
    GPU_collect(res)
    ...
}

GPU_function(a, b, c, ...)
{
    memcpy a, b, c, ... to GPU_a, GPU_b, GPU_c, ...
    do stuff on the GPU
}

GPU_collect(res)
{
    memcpy GPU_res to res
}

I don’t want to use MPI in the main program because later on I might need MPI for running my program on clusters of GPUs, so I’m wondering what’s the easiest way of spreading the same jobs to different GPUs without using MPI?

Thank you very much.

You will have to spawn multiple threads in some way. I’m not terribly familiar with MPI, but if you can use pthreads within a single MPI process, that will work. Start as many threads as GPUs, and have each one call cudaSetDevice with a different number.
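A minimal sketch of that approach, assuming the CUDA runtime API and pthreads (gpu_worker and the way the data is split are just placeholders, not code from the original program):

#include <pthread.h>
#include <cuda_runtime.h>

/* Hypothetical per-GPU worker: bind this host thread to one device,
   then do that device's share of the work. */
static void *gpu_worker(void *arg)
{
    int device = *(int *)arg;
    cudaSetDevice(device);   /* one host thread per GPU context */
    /* ... allocate, copy and launch kernels for this device's slice of the data ... */
    return NULL;
}

int main(void)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);   /* 4 on an S1070 */

    pthread_t threads[8];
    int ids[8];

    /* one host thread per GPU */
    for (int i = 0; i < num_devices; i++)
    {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpu_worker, &ids[i]);
    }
    for (int i = 0; i < num_devices; i++)
        pthread_join(threads[i], NULL);

    return 0;
}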

That doesn’t make much sense. In the CUDA multi-GPU paradigm, each GPU requires a context, and each context must be bound to an independent host CPU thread. There will probably be at least some requirement that those threads can communicate with one another. In a cluster of multi-GPU nodes, each GPU requires a context, each context must be bound to an independent thread, and the threads probably have to communicate with one another to some degree. The requirements are effectively the same. You can use MPI in both cases (probably with almost the same MPI code). On a local node, the message passing is probably happening via some fast shared-memory IPC, and between nodes it is happening over the wire, but your code doesn’t have to care.

You could, of course, use some other host CPU threading mechanism at a node level, but you don’t have to. MPI will (and does) work well in that sort of situation.

I see. My concern is that the way people construct GPU clusters is that they have, say, 30 nodes, where each node is an 8-core CPU (with shared memory) connected to an S1070 (which has 4 GPUs), so I don’t know how to use MPI to handle this kind of two-level parallelization, because the node ID will be the same for the 4 GPUs associated with the same CPU node.

You launch one MPI process per GPU context - each process has a unique node number and its own GPU context. It doesn’t matter whether you have four MPI processes with four GPU contexts on a single machine, or many spread across several cluster nodes. Each is uniquely identified in the MPI communicator.

The only downside in the single-host case is that running four processes with MPI messaging is a little more CPU and resource intensive than running four host threads using a lightweight host thread manager (like pthreads). But in this context it really isn’t going to make much difference.
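On a single host the mapping can be as simple as using the MPI rank as the device number, something like this (a minimal sketch; it assumes all the processes run on one machine and ignores error checking):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   /* one MPI process per GPU context */

    /* ... per-process GPU work plus MPI messaging ... */

    MPI_Finalize();
    return 0;
}

Across several hosts the global rank is no longer enough on its own, since each node only sees its own four devices, so the processes sharing a node have to agree on a local numbering.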

I see, I’ll try that, thank you very much

I’ve tried this, but I have difficulty specifying a unique node number for each core, since they all have the same IP address, and I’m using MPICH2.

This function will assign a different card to each MPI process.

/* Globals and the string comparator the function relies on
   (added here so the snippet is self-contained) */
static int first_time = 1;
static int myrank;

static int stringCmp(const void *a, const void *b)
{
    return strncmp((const char *)a, (const char *)b, MPI_MAX_PROCESSOR_NAME);
}

void assignDeviceToProcess()
{
    char host_name[MPI_MAX_PROCESSOR_NAME];
    char (*host_names)[MPI_MAX_PROCESSOR_NAME];
    int n, namelen, color, rank, nprocs;
    size_t bytes;
    MPI_Comm nodeComm;

    /* Check if the device has already been assigned */
    if (first_time)
    {
        first_time = 0;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Get_processor_name(host_name, &namelen);

        /* Every process broadcasts its host name so that all processes see the full list */
        bytes = nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]);
        host_names = (char (*)[MPI_MAX_PROCESSOR_NAME]) malloc(bytes);
        strcpy(host_names[rank], host_name);

        for (n = 0; n < nprocs; n++)
        {
            MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n, MPI_COMM_WORLD);
        }

        qsort(host_names, nprocs, sizeof(char[MPI_MAX_PROCESSOR_NAME]), stringCmp);

        /* Give each distinct host name its own colour */
        color = 0;
        for (n = 0; n < nprocs; n++)
        {
            if (n > 0 && strcmp(host_names[n - 1], host_names[n])) color++;
            if (strcmp(host_name, host_names[n]) == 0) break;
        }

        /* Split the world communicator by host; the per-node rank becomes the device number */
        MPI_Comm_split(MPI_COMM_WORLD, color, 0, &nodeComm);
        MPI_Comm_rank(nodeComm, &myrank);

        free(host_names);

        printf("Assigning device %d to process on node %s rank %d\n", myrank, host_name, rank);

        /* Assign device to MPI process */
        cudaSetDevice(myrank);
    }
}
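Called once right after MPI_Init it would look something like this (just a sketch, error checking omitted):

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    assignDeviceToProcess();   /* each rank now owns one GPU on its own host */

    /* ... CUDA allocations, kernels and MPI messaging ... */

    MPI_Finalize();
    return 0;
}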

Thanks first.

But my program aborts the first time I call MPI_Comm_rank, so I don’t know if your solution will work. I was wondering if I should change my makefile.

To be more specific.

In my program (F90), the first few lines are:

write(0,*) "before init"
call MPI_INIT_(ierr)
write(0,*) "before comm rank"
call MPI_COMM_RANK(MPI_COMM_WORLD, i_node, ierr)
write(0,*) "before comm size"
call MPI_COMM_SIZE(MPI_COMM_WORLD, n_node, ierr)

and I ran it with the following command line:

/mpich/dir/bin/mpiexec -machinefile file1 -n 2 myprogram

where file1 contains

mymachinename

mymachinename

I got the following error message

before init

before init

before comm rank

before comm rank

Fatal error in MPI_Comm_rank: Invalid communicator, error stack:

MPI_Comm_rank(107): MPI_Comm_rank(comm=0x5b, rank=0x7fff7b5f6a04) failed

MPI_Comm_rank(65).: Invalid communicator[cli_0]: aborting job:

Fatal error in MPI_Comm_rank: Invalid communicator, error stack:

MPI_Comm_rank(107): MPI_Comm_rank(comm=0x5b, rank=0x7fff7b5f6a04) failed

MPI_Comm_rank(65).: Invalid communicator

Fatal error in MPI_Comm_rank: Invalid communicator, error stack:

MPI_Comm_rank(107): MPI_Comm_rank(comm=0x5b, rank=0x7ffff2898ca4) failed

MPI_Comm_rank(65).: Invalid communicator[cli_1]: aborting job:

Fatal error in MPI_Comm_rank: Invalid communicator, error stack:

MPI_Comm_rank(107): MPI_Comm_rank(comm=0x5b, rank=0x7ffff2898ca4) failed

MPI_Comm_rank(65).: Invalid communicator

rank 0 in job 8 tesla0.stanford.edu_46491 caused collective abort of all ranks

exit status of rank 0: return code 1

make: *** [record1.H] Error 1

Are you confident your MPI setup works? Can you build and run a simple MPI “hello world”:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, length;
    char name[BUFSIZ];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);

    printf("%s: hello world from process %d of %d\n", name, rank, size);

    MPI_Finalize();
    return 0;
}

which should run and give you something like this:

[avidday@n0007 ~]$ mpiexec -np 6 src/mpihello

n0007: hello world from process 0 of 6

n0007: hello world from process 1 of 6

n0006: hello world from process 2 of 6

n0006: hello world from process 3 of 6

n0001: hello world from process 5 of 6

n0001: hello world from process 4 of 6

Looks like MPI works for C, but not for Fortran.

I’ll investigate, thank you very much.

When I ran this function, my program failed:

[gpu4:30972] *** Process received signal ***

[gpu4:30972] Signal: Segmentation fault (11)

[gpu4:30972] Signal code:  (128)

[gpu4:30972] Failing at address: (nil)

[gpu4:30973] *** Process received signal ***

[gpu4:30973] Signal: Segmentation fault (11)

[gpu4:30973] Signal code:  (128)

[gpu4:30973] Failing at address: (nil)

[gpu4:30972] [ 0] /lib64/libpthread.so.0 [0x2b1bc2482e60]

[gpu4:30972] [ 1] ./simpleMPI(cstring_cmp+0x7) [0x404465]

[gpu4:30972] [ 2] /lib64/libc.so.6 [0x2b1bc26c1287]

[gpu4:30972] [ 3] /lib64/libc.so.6(qsort+0x140) [0x2b1bc26c1510]

[gpu4:30972] [ 4] ./simpleMPI(main+0x18a) [0x403ee2]

[gpu4:30972] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b1bc26ac8a4]

[gpu4:30972] [ 6] ./simpleMPI(__gxx_personality_v0+0x81) [0x403ca9]

[gpu4:30972] *** End of error message ***

[gpu4:30973] [ 0] /lib64/libpthread.so.0 [0x2b067049be60]

[gpu4:30973] [ 1] ./simpleMPI(cstring_cmp+0x7) [0x404465]

[gpu4:30973] [ 2] /lib64/libc.so.6 [0x2b06706da287]

[gpu4:30973] [ 3] /lib64/libc.so.6(qsort+0x140) [0x2b06706da510]

[gpu4:30973] [ 4] ./simpleMPI(main+0x18a) [0x403ee2]

[gpu4:30973] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b06706c58a4]

[gpu4:30973] [ 6] ./simpleMPI(__gxx_personality_v0+0x81) [0x403ca9]

[gpu4:30973] *** End of error message ***

mpirun noticed that job rank 0 with PID 30972 on node gpu4 exited on signal 11 (Segmentation fault).

1 additional process aborted (not shown)

I’ve made another function:

// Returns true if a device was assigned to the calling MPI process.
// nodeCount (the number of MPI processes) is currently unused.
bool SetCudaDevice2(int nodeCount, int nodeID)
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // Hard-coded for two GPUs per host; nodeID % device_count would generalise.
    int set_device = nodeID % 2;

    if (cudaSetDevice(set_device) != cudaSuccess)
    {
        return false;
    }
    return true;
}

where nodeCount is the number of MPI processes and nodeID is the rank of the calling MPI process.
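Called from the MPI program it would be used something like this (a rough sketch, assuming the rank and size come straight from MPI_COMM_WORLD):

int rank, nprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

if (!SetCudaDevice2(nprocs, rank))
{
    fprintf(stderr, "rank %d: cudaSetDevice failed\n", rank);
    MPI_Abort(MPI_COMM_WORLD, 1);
}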