weird problem

Problem: v1 and v2 are two vectors and dv is a double value; I want to compute v1[i] -= dv * v2[i] for every i.

This should be quite a simple problem. My CPU version is:

for (i = 0; i < N; i++) v1[i] -= dv * v2[i];

My CUDA version is:

__global__ void vec_sub_mul_kernel(double *T, double *S, double V)
{
    /* one element per thread: T[i] -= S[i] * V */
    T[blockIdx.x * blockDim.x + threadIdx.x] -= S[blockIdx.x * blockDim.x + threadIdx.x] * V;
}

void vec_sub_mul(double *tgt, double *src, double dv, int n)
{
    /* d_vx, d_vy, grid_dim and block_dim are globals set up elsewhere */
    cudaMemset(d_vx, 0, n * sizeof(double)); /* redundant: overwritten by the copies below */
    cudaMemset(d_vy, 0, n * sizeof(double));

    cudaMemcpy(d_vx, src, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vy, tgt, n * sizeof(double), cudaMemcpyHostToDevice);
    /* the shared-memory argument is unused by the kernel */
    vec_sub_mul_kernel<<<grid_dim, block_dim, block_dim.x * sizeof(double)>>>(d_vy, d_vx, dv);
    cudaMemcpy(tgt, d_vy, n * sizeof(double), cudaMemcpyDeviceToHost);
}
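
(Those globals are never shown in the post; a minimal sketch of what their setup might look like follows. The names match the code above, but the 256-thread block size and the init function are assumptions, not the poster's code:)

double *d_vx, *d_vy; /* device copies of src and tgt */
dim3 grid_dim, block_dim;

void vec_sub_mul_init(int n)
{
    block_dim = dim3(256); /* assumed block size */
    grid_dim = dim3(n / 256); /* the kernel has no bounds check, so n must be a multiple of 256 */
    cudaMalloc((void **)&d_vx, n * sizeof(double));
    cudaMalloc((void **)&d_vy, n * sizeof(double));
}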

These two are supposed to give exactly the same results. However, for some reason they sometimes agree and sometimes don't; the difference is not very big, though.

PS: to get the best performance out of the GPU, is my CUDA code good? If not, how can I improve it?

I bet you're not compiling with -arch sm_13 or sm_20, so your double-precision math is being demoted to single precision. That would explain your "the difference is not very big" symptom.

As an aside, your kernel will be much slower than the CPU. You’re completely PCIe limited… the transfer overhead will kill you.
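
(For what it's worth, if the vectors already live in GPU memory, the whole update is a single DAXPY with alpha = -dv and needs no PCIe traffic at all. A sketch using the legacy cuBLAS API; d_vy and d_vx are assumed to be resident device pointers, and cublasInit() is assumed to have been called once at startup:)

#include <cublas.h>

/* v1 -= dv * v2, entirely on the device: y = alpha * x + y with alpha = -dv */
void vec_sub_mul_device(double *d_vy, double *d_vx, double dv, int n)
{
    cublasDaxpy(n, -dv, d_vx, 1, d_vy, 1);
}

(Either way the point stands: the copies, not the arithmetic, dominate the time.)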

Do this on the host – provided you don’t have data in GPU memory already… :)

SPWorley: I did compile with sm_13, otherwise the difference would be huge. And by "PCIe limited" you mean the memory copies, right?

Crankie: If I could do this on the host, I wouldn't have needed to post this topic, right? :)

Any other opinions?

Can you be specific about the magnitude of the differences when they occur? Depending on how you are compiling your host code, it might be using x87 FPU instructions, which compute at 80-bit precision internally and round on store. That could produce small differences between the 64-bit calculations on the device and those on the host for a non-trivial number of elements.
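
(If that turns out to be the cause, one quick test is to force the host compiler to use SSE2 arithmetic instead of x87, so it rounds to 64 bits at every step the way the GPU does. With gcc on 32-bit x86 that would be something like:)

g++ -msse2 -mfpmath=sse -o xxx xxx.cpp

(On x86_64, SSE math is already the default, so a 64-bit build would rule this explanation out.)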

avidday: for the host code I simply compiled it with:

g++ -o xxx xxx.cpp

For the GPU code I compiled it with:

nvcc -arch=sm_13 -o xxx xxx.cu

And the results, after many, many rounds of computation, look like this:

CPU results          GPU results
92096229.070354      92096231.872298
3771338.827391       3608260.193571
95.167448            95.159868
80741.074339         77249.741954
…                    …

If this difference is caused by 80-bit (CPU) vs. 64-bit (GPU) arithmetic, how can I solve it? And thank you for the reply!
