need help

Hello!

How can I parallelize this function with CUDA?

__global__ void mul(int *res, int *NUM1, int *NUM2, int w, int e)
{
    int i, j, carry, temp;

    for (i = 0; i < e; i++)
    {
        carry = 0;

        for (j = 0; j < w; j++)
        {
            temp = NUM2[i] * NUM1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - (carry * 10);
        }

        res[i + j] = carry;
    }
}

NUM1 and NUM2 are 1D arrays.

Hello,

First, I have to point out that you are reading res[i+j] before it is set, so res has to be zeroed before the kernel runs. Second, you have the line res[i+j]=carry; after the inner loop. If the code you provided is correct, it is not going to be efficient on the GPU, because the value of carry at step j depends on its value at step j-1. There is also a problem with the (i+j) index. On the CPU there is an implicit order in which i and j are executed. For example, i=5 and j=12 access res[17]. So do i=6 and j=11, i=7 and j=10, and so on. On the CPU we know that i=5 is executed before i=6; GPU threads give no such ordering guarantee.
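To make the point about res[i+j] concrete, here is a minimal host-side sketch of running the kernel exactly as posted. It assumes digits are stored least significant first (which the i+j indexing suggests) and that res holds w+e entries. The wrapper name mul_on_gpu_serial is hypothetical, and the <<<1, 1>>> launch is an assumption that reflects the loops being serial.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The kernel from the question, unchanged apart from layout.
__global__ void mul(int *res, int *NUM1, int *NUM2, int w, int e)
{
    int i, j, carry, temp;
    for (i = 0; i < e; i++) {
        carry = 0;
        for (j = 0; j < w; j++) {
            temp = NUM2[i] * NUM1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - (carry * 10);
        }
        res[i + j] = carry;
    }
}

// Hypothetical host wrapper (not from the thread).
// Assumes digits are least significant first; error checking omitted for brevity.
std::vector<int> mul_on_gpu_serial(const std::vector<int> &num1,
                                   const std::vector<int> &num2)
{
    int w = (int)num1.size();
    int e = (int)num2.size();
    std::vector<int> res(w + e, 0);
    if (w == 0 || e == 0) return res;

    int *d1, *d2, *dres;
    cudaMalloc(&d1, w * sizeof(int));
    cudaMalloc(&d2, e * sizeof(int));
    cudaMalloc(&dres, (w + e) * sizeof(int));
    cudaMemcpy(d1, num1.data(), w * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, num2.data(), e * sizeof(int), cudaMemcpyHostToDevice);

    // res[i+j] is read before it is written, so it must start at zero.
    cudaMemset(dres, 0, (w + e) * sizeof(int));

    // A single thread: the loops as written cannot be split across threads.
    mul<<<1, 1>>>(dres, d1, d2, w, e);

    // Blocking copy on the default stream; it waits for the kernel to finish.
    cudaMemcpy(res.data(), dres, (w + e) * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dres);
    return res;
}
```

For example, passing {3, 2, 1} and {5, 4} (that is, 123 and 45) should give {5, 3, 5, 5, 0}, which reads as 05535 from the most significant digit down.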

So i depends on i-1, and within each i, j depends on j-1, so I think it is not possible to make it parallel (assuming that I understood the algorithm correctly).
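The dependency described above comes from the carries, not from the digit products themselves. One possible restructuring, which is not from the posts above and changes the algorithm rather than parallelizing the loops as posted, is to split the work into two phases: each thread sums all products for one result position k = i+j, and the carries are resolved afterwards in a single serial pass. The names column_sums and multiply_digits are hypothetical, and the sketch assumes digits are stored least significant first.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: one thread per result position k. Each thread sums every product
// NUM2[i] * NUM1[j] with i + j == k and writes only col[k], so no thread
// depends on another thread's output.
// Note: a column sum is at most 81 * min(w, e); very long inputs would need
// a wider type than int.
__global__ void column_sums(int *col, const int *NUM1, const int *NUM2,
                            int w, int e)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= w + e) return;

    int sum = 0;
    for (int i = 0; i < e; i++) {
        int j = k - i;
        if (j >= 0 && j < w) sum += NUM2[i] * NUM1[j];
    }
    col[k] = sum;
}

// Hypothetical host wrapper. Phase 2 (carry propagation) stays serial and is
// done on the host here. Error checking omitted for brevity.
std::vector<int> multiply_digits(const std::vector<int> &num1,
                                 const std::vector<int> &num2)
{
    int w = (int)num1.size();
    int e = (int)num2.size();
    int n = w + e;
    std::vector<int> res(n, 0);
    if (w == 0 || e == 0) return res;

    int *d1, *d2, *dcol;
    cudaMalloc(&d1, w * sizeof(int));
    cudaMalloc(&d2, e * sizeof(int));
    cudaMalloc(&dcol, n * sizeof(int));
    cudaMemcpy(d1, num1.data(), w * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, num2.data(), e * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    column_sums<<<blocks, threads>>>(dcol, d1, d2, w, e);

    // Blocking copy on the default stream; it waits for the kernel to finish.
    cudaMemcpy(res.data(), dcol, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dcol);

    // Phase 2: carries still flow from low digits to high digits.
    int carry = 0;
    for (int k = 0; k < n; k++) {
        int t = res[k] + carry;
        carry = t / 10;
        res[k] = t % 10;
    }
    return res;
}
```

With the inputs {3, 2, 1} and {5, 4} the column sums are {15, 22, 13, 4, 0}, and the carry pass turns them into {5, 3, 5, 5, 0}, the same digits the serial version produces. Whether this is faster in practice depends on the input sizes; for short numbers the transfer overhead is likely to dominate.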