CUDA and MPI combination

Hi,

I am trying to achieve parallelism using MPI and CUDA.

I am distributing array elements to several processes using MPI, and the sum of the array elements is to be calculated on the GPU.

I have this kernel.cu file

#include <stdio.h>

__global__ void add(int *devarray, int *devsum)
{
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        devsum = devsum + devarray[index];
}

extern "C"
void run_kernel(int *array, int nelements)
{
        int *devarray, *sum, *devsum;

        cudaMalloc((void**) &devarray, sizeof(int)*nelements);
        cudaMalloc((void**) &devsum, sizeof(int));
        cudaMemcpy(devarray, array, sizeof(int)*nelements, cudaMemcpyHostToDevice);
        add<<<2, 3>>>(devarray, devsum);
        cudaMemcpy(sum, devsum, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d", sum);
        cudaFree(devarray);
        cudaFree(sum);
}

When I compile this program I get the warning "variable sum is used before its value is set." What is wrong here? Can someone tell me?

And my MPI code is:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>
#include <string.h>

#define ARRAYSIZE 2000
#define MASTER    0

int *data;

int main (int argc, char *argv[])
{
  int numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize, namelen;
  int mysum;
  long sum;
  char myname[MPI_MAX_PROCESSOR_NAME];
  MPI_Status status;
  double start, stop, time;
  double totaltime;
  FILE *fp;
  char line[128];
  char element;
  int n;
  int k=0;

  /***** Initializations *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  MPI_Get_processor_name(myname, &namelen);
  printf("MPI task %d has started on host %s...\n", taskid, myname);
  chunksize = (ARRAYSIZE / numtasks);
  tag2 = 1;
  tag1 = 2;

  data = malloc(ARRAYSIZE * sizeof(int));

  /***** Master task only ******/
  if (taskid == MASTER){
    // read the integers from file
    fp = fopen("integers.dat", "r");
    if (fp != NULL) {
      sum = 0;
      while (fgets(line, sizeof line, fp) != NULL) {
        fscanf(fp, "%d", &data[k]);
        sum = sum + data[k];
        k++;
      }
    }
    printf("Initialized array sum = %d\n", sum);

    /* Send each task its portion of the array - master keeps 1st part */
    offset = chunksize;
    for (dest=1; dest<numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_INT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }

    /* Master does its part of the work */
    offset = 0;
    run_kernel(data[offset + chunksize], chunksize);  // perform operation on GPU

    /* Wait to receive results from each task */
    for (i=1; i<numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2, MPI_COMM_WORLD, &status);
    }

    /* Get final sum and print sample results */
    MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);
    printf("\n*** Final sum= %d ***\n", sum);
  }

  if (taskid > MASTER) {
    /* Receive my portion of array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2, MPI_COMM_WORLD, &status);
    run_kernel(data[offset+chunksize], chunksize);

    /* Send my results back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_INT, MASTER, tag2, MPI_COMM_WORLD);
    MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  } /* end of non-master */
}

The file integers.dat contains 2000 integers. I am distributing them to several processes, and each process then calculates its sum on the GPU.

When I run the code I get this error:

mpirun -np 4 mpicudacomb

MPI task 0 has started on host node4

MPI task 1 has started on host node4

MPI task 3 has started on host node4

MPI task 2 has started on host node4

Initialized array sum = 9061

Sent 500 elements to task 1 offset= 500

Sent 500 elements to task 2 offset= 1000

Sent 500 elements to task 3 offset= 1500

147399408152775360--------------------------------------------------------------------------

mpirun has exited due to process rank 1 with PID 4538 on

node node4 exiting without calling “finalize”. This may

have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).


[node4:04537] *** Process received signal ***

[node4:04537] Signal: Segmentation fault (11)

[node4:04537] Signal code: Address not mapped (1)

[node4:04537] Failing at address: 0xc

Can anyone help me with this?

Thanks

Hi,

There are quite a few problems in your code; I will list the most obvious ones in the order they appear and will probably miss some:

    [*] It's a detail and it has nothing to do with CUDA, but MPI_Send/MPI_Recv should be replaced here by MPI_Scatter(v)/MPI_Gather(v), whose very purpose is to do exactly what you're trying to achieve, in a much more effective way (see the sketch right after this list);

    [*] You transmit a chunk of data to each process and then launch the kernel on data the process hasn't got ("run_kernel(data[offset + chunksize],chunksize);");

    [*] You pass a value to the processing function instead of a pointer to the data. Your call should become "run_kernel(&data[offset],chunksize);";

    [*] You allocate and pass a result variable, sum, to the kernel; you retrieve it afterwards for printing, but never transmit it back to the main function, even though you supposedly use it for the global reduction with MPI_Reduce. Moreover, you fail to initialise this local sum variable, whose uninitialised value is then used inside your kernel;

    [*] Your kernel itself is supposed to be a reduction, but it won't work as written, since reductions in CUDA are much more complex than this. You'd definitely benefit from having a look at the reduction sample in the SDK or at some proper CUDA training material (a minimal sketch follows after the summary paragraph below);

    [*] The allocated data array should be freed before leaving the main function. This is not mandatory, since the system will handle it, but it is good practice to clean up before leaving.
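To make the first point concrete, here is a minimal MPI_Scatter/MPI_Reduce sketch. It is my own toy example rather than a fix of your program: compute_partial_sum is a hypothetical stand-in for whatever per-rank work you do (a CPU loop or a GPU call), and it assumes ARRAYSIZE is divisible by the number of ranks.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAYSIZE 2000

/* hypothetical per-rank worker: a plain CPU sum here */
static long compute_partial_sum(const int *chunk, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += chunk[i];
    return s;
}

int main(int argc, char *argv[])
{
    int numtasks, taskid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    int chunksize = ARRAYSIZE / numtasks;   /* assumes an even split */
    int *fulldata = NULL;
    int *chunk = malloc(chunksize * sizeof(int));

    if (taskid == 0) {
        /* only the root holds (and fills) the full array */
        fulldata = malloc(ARRAYSIZE * sizeof(int));
        for (int i = 0; i < ARRAYSIZE; i++)
            fulldata[i] = 1;   /* stand-in for reading integers.dat */
    }

    /* the root hands each rank its chunk; every rank (root included) gets one */
    MPI_Scatter(fulldata, chunksize, MPI_INT,
                chunk,    chunksize, MPI_INT, 0, MPI_COMM_WORLD);

    long mysum = compute_partial_sum(chunk, chunksize);

    /* combine all the partial sums on the root */
    long total = 0;
    MPI_Reduce(&mysum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (taskid == 0)
        printf("*** Final sum = %ld ***\n", total);

    free(chunk);
    free(fulldata);
    MPI_Finalize();
    return 0;
}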

All in all, I would recommend that you first write a simple, valid MPI-only version of your reduction code, and in parallel a simple working GPU-only version of it. When both are OK, you'll be ready to merge them into a working MPI+GPU version of your reduction.
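And for the reduction point, here is one rough sketch of the usual shared-memory, block-level pattern, just to show the shape of it. This is my own simplified example, not the SDK sample: the names block_sum and run_reduction are mine, the per-block partial sums are combined on the host for simplicity, and a production version would be considerably more refined.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   /* must be a power of two for this tree reduction */

/* each block reduces up to BLOCK_SIZE elements into one partial sum */
__global__ void block_sum(const int *in, int *blockSums, int n)
{
    __shared__ int sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;   /* pad the last block with zeros */
    __syncthreads();

    /* tree reduction within the block */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

extern "C"
int run_reduction(const int *array, int nelements)
{
    int nblocks = (nelements + BLOCK_SIZE - 1) / BLOCK_SIZE;
    int *d_in, *d_blockSums;
    int *h_blockSums = (int *) malloc(nblocks * sizeof(int));
    int total = 0;

    cudaMalloc((void **) &d_in, nelements * sizeof(int));
    cudaMalloc((void **) &d_blockSums, nblocks * sizeof(int));
    cudaMemcpy(d_in, array, nelements * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<nblocks, BLOCK_SIZE>>>(d_in, d_blockSums, nelements);

    cudaMemcpy(h_blockSums, d_blockSums, nblocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    for (int b = 0; b < nblocks; b++)   /* final combination on the host */
        total += h_blockSums[b];

    cudaFree(d_in);
    cudaFree(d_blockSums);
    free(h_blockSums);
    return total;
}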

BTW, I'm just amazed your code compiles at all…

Hi,

Thanks for your prompt reply and help.

I actually want to use send and receive for now; in a later version I will use the Scatter and Gather functions of MPI.

This is my MPI code, which distributes the array and calculates the sum on the CPU.

#include "mpi.h"

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#define  ARRAYSIZE	200000000

#define  MASTER		0

int  data[ARRAYSIZE];

int main(int argc, char* argv[])

{

int   numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize, namelen; 

int mysum;

long sum;

int update(int myoffset, int chunk, int myid);

char myname[MPI_MAX_PROCESSOR_NAME];

MPI_Status status;

double start = 0.0, stop = 0.0, time = 0.0;

double totaltime;

FILE *fp;

char line[128];

char element;

int n;

int k=0;

/***** Initializations *****/

MPI_Init(&argc, &argv);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Comm_rank(MPI_COMM_WORLD,&taskid); 

MPI_Get_processor_name(myname, &namelen);

printf ("MPI task %d has started on host %s...\n", taskid, myname);

chunksize = (ARRAYSIZE / numtasks);

tag2 = 1;

tag1 = 2;

/***** Master task only ******/

if (taskid == MASTER){

fp=fopen("integers.dat", "r");

  if(fp != NULL){

   sum = 0;

   while(fgets(line, sizeof line, fp)!= NULL){

    fscanf(fp,"%d",&data[k]);

    sum = sum + data[k]; // calculate sum to verify later on

    k++;

   }

  }

/* Send each task its portion of the array - master keeps 1st part */

offset = chunksize;

for (dest=1; dest<numtasks; dest++) {

MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);

MPI_Send(&data[offset], chunksize, MPI_INT, dest, tag2, MPI_COMM_WORLD);

printf("Sent %d elements to task %d offset= %d\n",chunksize,dest,offset);

offset = offset + chunksize;

}

/* Master does its part of the work */

offset = 0;

  mysum = update(offset, chunksize, taskid);

/* Wait to receive results from each task */

for (i=1; i<numtasks; i++) {

source = i;

MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);

MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2,MPI_COMM_WORLD, &status);

}

/* Get final sum and print sample results */  

MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);

printf("\n*** Final sum= %d ***\n",sum);

}  /* end of master section */

/***** Non-master tasks only *****/

if (taskid > MASTER) {

/* Receive my portion of array from the master task */

source = MASTER;

MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);

MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2,MPI_COMM_WORLD, &status);

mysum = update(offset, chunksize, taskid);

/* Send my results back to the master task */

dest = MASTER;

MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);

MPI_Send(&data[offset], chunksize, MPI_INT, MASTER, tag2, MPI_COMM_WORLD);

MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);

} /* end of non-master */

MPI_Finalize();

}   

int update(int myoffset, int chunk, int myid) {

int i,j; 

int mysum = 0;

for(i=myoffset; i < myoffset + chunk; i++) {

mysum = mysum + data[i];

}

printf("Task %d has sum = %d\n",myid,mysum);

return(mysum);

}

This code works fine. Now I want to perform the addition done in the update function on the GPUs.

So I am trying to mix CUDA code into it. I did have a look at the reduction code provided with CUDA. Is there any way that I can link that code with this MPI code?

Also, the function run_kernel calls the kernel function in my kernel.cu file mentioned above.

My latest kernel.cu is:

#include <stdio.h>

__global__ void add(int *devarray, int *devsum)
{
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        devsum = devsum + devarray[index];
}

extern "C"
int * run_kernel(int array[], int nelements)
{
        int *devarray, *sum, *devsum;

        printf("\nrun_kernel called..............");
        cudaMalloc((void**) &devarray, sizeof(int)*nelements);
        cudaMalloc((void**) &devsum, sizeof(int));
        cudaMemcpy(devarray, array, sizeof(int)*nelements, cudaMemcpyHostToDevice);
        add<<<2, 3>>>(devarray, devsum);
        cudaMemcpy(sum, devsum, sizeof(int), cudaMemcpyDeviceToHost);
        printf(" the sum is %d", sum);
        cudaFree(devarray);
        return sum;
}

Here is my output when I run the above code:

MPI task 0 has started on host

MPI task 1 has started on host

MPI task 2 has started on host

MPI task 3 has started on host

Initialized array sum 9061Sent 500 elements to task 1 offset= 500

Sent 500 elements to task 2 offset= 1000

Sent 500 elements to task 3 offset= 1500

[ecm-c-l-207-004:04786] *** Process received signal ***

run_kernel called…

[node4:04786] Signal: Segmentation fault (11)

[node4:04786] Signal code: Invalid permissions (2)

[node4:04786] Failing at address: 0x8049828

[node4:04786] [ 0] [0xaf440c]

[node4:04786] [ 1] /usr/lib/libcuda.so(+0x13a0f6) [0xfa10f6]

[node4:04786] [ 2] /usr/lib/libcuda.so(+0x146912) [0xfad912]

[node4:04786] [ 3] /usr/lib/libcuda.so(+0x148094) [0xfaf094]

[node4:04786] [ 4] /usr/lib/libcuda.so(+0x13ca50) [0xfa3a50]

[node4:04786] [ 5] /usr/lib/libcuda.so(+0x11863c) [0xf7f63c]

[node4:04786] [ 6] /usr/lib/libcuda.so(+0x11d167) [0xf84167]

[node4:04786] [ 7] /usr/lib/libcuda.so(cuMemcpyDtoH_v2+0x64) [0xf74014]

[node4:04786] [ 8] /usr/local/cuda/lib/libcudart.so.4(+0x2037b) [0xcbe37b]

[node4:04786] [ 9] /usr/local/cuda/lib/libcudart.so.4(cudaMemcpy+0x230) [0xcf1360]

[node4:04786] [10] mpi_array(run_kernel+0x135) [0x8049559]

[node4:04786] [11] mpi_array(main+0x2f2) [0x8049046]

[node4:04786] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x2fece6]

[node4:04786] [13] mpi_array() [0x8048cc1]

[node4:04786] *** End of error message ***

Kernel returns sum 134530992 time taken by process 1 to recieve elements and caluclate own sum is = 0.276339 seconds

run_kernel called…

devsum is 3211264

the sum is 134532992

Kernel returns sum 134532992 time taken by process 2 to recieve elements and caluclate own sum is = 0.280452 seconds

run_kernel called…

devsum is 3211264

the sum is 134534992

Kernel returns sum 134534992 time taken by process 3 to recieve elements and caluclate own sum is = 0.285010 seconds


mpirun noticed that process rank 0 with PID 4786 on node ecm-c-l-207-004.uniwa.uwa.edu.au exited on signal 11 (Segmentation fault).

Any help would be appreciated

Thanks

Hi,
your MPI code looks fine so far. Now try to put together a working CUDA code that does a reduction on a chunk of memory on its own (well, "sequential CUDA code" is a contradiction in terms, since CUDA code is parallel by essence, so let's say a non-MPI CUDA code).
Something that just has an array of int in memory, passes it to CUDA for a reduction, and gets back the correct result. Once you've got this, you should have everything you need.
One comment regarding your current code: the MPI part is OK, whereas the reduction kernel you gave is all wrong, but it shouldn't trigger any error from the system. It should just give a wrong result (as long as your data chunks are longer than 6 elements, which is the case according to the printed messages). Your 3 workers do the reduction and send back their results, whereas process #0 crashes inside the kernel call. You should check what is different in the call made by task #0 compared to the others.
HTH
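
To make that concrete, here is a rough sketch of such a non-MPI test driver. It assumes a host-callable GPU reduction with the shape of the run_reduction sketch I gave earlier in this thread (any working reduction with that signature will do); the point is simply to check the GPU result against a CPU sum on known data before MPI enters the picture.

#include <stdio.h>
#include <stdlib.h>

/* assumed to come from a separate .cu file, e.g. the block_sum / run_reduction
 * sketch earlier in this thread, compiled with nvcc and linked in */
extern int run_reduction(const int *array, int nelements);

int main(void)
{
    const int n = 2000;
    int *array = malloc(n * sizeof(int));
    long cpu_sum = 0;

    /* fill with known values so the result can be verified */
    for (int i = 0; i < n; i++) {
        array[i] = i % 10;
        cpu_sum += array[i];
    }

    int gpu_sum = run_reduction(array, n);

    printf("CPU sum = %ld, GPU sum = %d -> %s\n",
           cpu_sum, gpu_sum, (cpu_sum == (long) gpu_sum) ? "OK" : "MISMATCH");

    free(array);
    return 0;
}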