need help
Hello!
How can I parallelize this function with CUDA?

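[code]__global__ void mul(int *res, int *NUM1, int *NUM2, int w, int e)
{
    int i, j, carry, temp;
    // Outer loop: one pass per element of NUM2 (e elements).
    for (i = 0; i < e; i++)
    {
        carry = 0;
        // Inner loop: multiply NUM2[i] by each element of NUM1 (w elements).
        for (j = 0; j < w; j++)
        {
            // res[i+j] is read here, so res must already hold valid values.
            temp = NUM2[i] * NUM1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - (carry * 10);
        }
        // j == w at this point, so the final carry goes to res[i+w].
        res[i + j] = carry;
    }
}[/code]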


NUM1 and NUM2 are 1D arrays.

#1
Posted 05/05/2012 06:21 PM   
Hello,

First, I have to point out that you are using res[i+j] before it is set. Second, you have the line res[i+j]=carry; after the inner loop. If the code you provided is correct, it is not going to be very efficient on the GPU, because carry at step j appears to depend on the previous step, j-1. You also appear to have a problem with the (i+j) index. On the CPU there is an implicit order in which i and j are executed. For example, with i=5 and j=12 the result is written to index i+j=17. But i=6 and j=11, i=7 and j=10, and so on access the same index. On the CPU we know that i=5 will be executed before i=6.

So iteration i depends on i-1, and for a given i, step j depends on j-1, so I think it is not possible to make it parallel (assuming that I understood the algorithm correctly).
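
To make the index overlap concrete, here is a small host-side sketch (plain C++, not a kernel). It assumes the arrays hold base-10 digits with the least significant digit first, which is how I read the posted loop. The names mul_reference and writers_of are made up for this illustration, and the sizes w=13 and e=8 used below are only picked so that the i=5, j=12 case exists.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Same arithmetic as the posted loop, run sequentially on the host.
// Assumption: digits are base 10, least significant digit first.
std::vector<int> mul_reference(const std::vector<int>& num1,
                               const std::vector<int>& num2) {
    const int w = static_cast<int>(num1.size());
    const int e = static_cast<int>(num2.size());
    // res[i+j] is read before it is written, so it has to start at zero.
    std::vector<int> res(w + e, 0);
    for (int i = 0; i < e; ++i) {
        int carry = 0;  // carry links step j to step j-1
        for (int j = 0; j < w; ++j) {
            int temp = num2[i] * num1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - carry * 10;
        }
        res[i + w] = carry;  // same slot as res[i+j] after the inner loop
    }
    return res;
}

// Lists every (i, j) pair whose inner-loop step reads and writes res[k].
// More than one pair means those steps must run in a fixed order.
std::vector<std::pair<int, int>> writers_of(int k, int w, int e) {
    assert(k >= 0 && k < w + e);
    std::vector<std::pair<int, int>> pairs;
    for (int i = 0; i < e; ++i) {
        int j = k - i;
        if (j >= 0 && j < w) {
            pairs.push_back({i, j});
        }
    }
    return pairs;
}
```

With w=13 and e=8, writers_of(17, 13, 8) gives the pairs (5,12), (6,11) and (7,10), the same ones as in the example above. If each of those steps ran in a different thread without ordering, they would race on res[17].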

#2
Posted 05/05/2012 07:27 PM   