
How can this function be parallelized with CUDA?

[code]__global__ void mul(int *res, int *NUM1, int *NUM2, int w, int e)
{
    int i, j, carry, temp;

    for (i = 0; i < e; i++)
    {
        carry = 0;
        for (j = 0; j < w; j++)
        {
            // reads res[i + j], so res must be zero-initialized beforehand
            temp = NUM2[i] * NUM1[j] + res[i + j] + carry;
            carry = temp / 10;
            res[i + j] = temp - (carry * 10);
        }
        res[i + j] = carry;  // j == w here, so this writes res[i + w]
    }
}[/code]

NUM1 and NUM2 are 1-D arrays.


First, I have to point out that you are reading res[i+j] before it has been set, so res must be zero-initialized before the kernel runs. Second, you have the line res[i+j] = carry; after the inner loop. Even if the code as provided is correct, it will not be efficient on the GPU, because the carry variable creates a dependence across the j iterations. You also appear to have a problem with the (i+j) index. On the CPU there is an implicit order in which the i and j iterations execute. For example, i=5 and j=12 writes the result at i+j=17; at the same time, i=6 and j=11, i=7 and j=10, and so on all target that same location. On the CPU we know that i=5 executes before i=6, so these writes cannot collide; run in parallel, they would.

So iteration i depends on iteration i-1, and within a given i each j depends on j-1, so I do not think the loop can be parallelized as written (assuming I understood the algorithm correctly).
