Weird problem: CPU and CUDA results differ slightly
Problem: v1 and v2 are two vectors and dv is a double; I want to compute v1[i] -= dv * v2[i];

This should be quite a simple problem. My CPU version is:
for (i = 0; i < N; i++) v1[i] -= dv * v2[i];

My CUDA version is:

// One thread per element: T[i] -= S[i] * V.
__global__ void vec_sub_mul_kernel(double *T, double *S, double V)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    T[i] -= S[i] * V;
}

// d_vx, d_vy, grid_dim and block_dim are globals set up elsewhere.
void vec_sub_mul(double *tgt, double *src, double dv, int n)
{
    // These memsets are redundant: the copies below overwrite both buffers.
    cudaMemset(d_vx, 0, n * sizeof(double));
    cudaMemset(d_vy, 0, n * sizeof(double));

    cudaMemcpy(d_vx, src, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vy, tgt, n * sizeof(double), cudaMemcpyHostToDevice);
    // The third launch parameter requests dynamic shared memory that the
    // kernel never uses.
    vec_sub_mul_kernel<<<grid_dim, block_dim, block_dim.x * sizeof(double)>>>(d_vy, d_vx, dv);
    cudaMemcpy(tgt, d_vy, n * sizeof(double), cudaMemcpyDeviceToHost);
}
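
(d_vx, d_vy, grid_dim and block_dim are globals set up elsewhere; that setup code is not shown. Below is a minimal sketch of the kind of setup the snippet above seems to rely on. The block size of 256 is a guess, and because the kernel has no bounds check, n has to be a multiple of the block size.)

double *d_vx = 0, *d_vy = 0;   // device copies of src and tgt (globals)
dim3 grid_dim, block_dim;      // launch configuration (globals)

void vec_sub_mul_init(int n)
{
    block_dim = dim3(256);               // assumed block size
    grid_dim  = dim3(n / block_dim.x);   // assumes n is a multiple of 256;
                                         // otherwise the kernel needs an
                                         // "if (i < n)" guard
    cudaMalloc((void**)&d_vx, n * sizeof(double));
    cudaMalloc((void**)&d_vy, n * sizeof(double));
}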


These two are supposed to give exactly the same results. However, for some reason they sometimes agree and sometimes don't; the difference is not very big, though.

PS: to get the best performance out of the GPU, is my CUDA code any good? If not, how can I improve it?

#1
Posted 11/13/2010 07:34 PM   
I bet you're not compiling with -arch sm_13 or sm_20, so double precision math is being done as single precision. That would explain your "the difference is not very big" symptom.

As an aside, your kernel will be much slower than the CPU. You're completely PCIe limited; the transfer overhead will kill you.
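
A rough way to see this (a sketch, reusing the d_vx/d_vy buffers, launch configuration and n from the code above; not code from this thread) is to time the copies and the kernel separately with CUDA events:

cudaEvent_t t0, t1, t2, t3;
cudaEventCreate(&t0); cudaEventCreate(&t1);
cudaEventCreate(&t2); cudaEventCreate(&t3);

cudaEventRecord(t0);
cudaMemcpy(d_vx, src, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_vy, tgt, n * sizeof(double), cudaMemcpyHostToDevice);
cudaEventRecord(t1);
vec_sub_mul_kernel<<<grid_dim, block_dim>>>(d_vy, d_vx, dv);
cudaEventRecord(t2);
cudaMemcpy(tgt, d_vy, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaEventRecord(t3);
cudaEventSynchronize(t3);

float copy_in_ms, kernel_ms, copy_out_ms;
cudaEventElapsedTime(&copy_in_ms, t0, t1);
cudaEventElapsedTime(&kernel_ms, t1, t2);
cudaEventElapsedTime(&copy_out_ms, t2, t3);
// For a kernel this small, expect copy_in_ms + copy_out_ms to dwarf kernel_ms.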

#2
Posted 11/13/2010 10:15 PM   
[quote name='SPWorley' date='14 November 2010 - 03:45 AM' timestamp='1289686537' post='1146153']
I bet you're not compiling with -arch sm_13 or sm_20, so double precision math is being done as single precision.
That would explain your "the difference is not very big." symptom.

As an aside, your kernel will be much slower than the CPU. You're completely PCIe limited.. the transfer overhead will kill you.
[/quote]


Do this on the host -- provided you don't already have the data in GPU memory... :)

#3
Posted 11/15/2010 05:53 AM   
SPWorley: I did compile with sm_13; otherwise the difference would be huge. And by "PCIe limited" you mean the memory copies, right?

Crankie: If I could do this on the host, I wouldn't have had to post this topic, right? :)

Any other opinions?

#4
Posted 11/15/2010 02:32 PM   
Can you be specific about the magnitude of the differences when they occur? Depending on how you are compiling your host code, it might be using x87 FPU instructions, which are internally 80-bit with rounding. That could produce small differences between the 64-bit calculations on the device and on the host for a non-trivial number of elements.
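
If that turns out to be the cause, one thing to try (this suggestion is mine, not something established in the thread; check the flags against your gcc version) is to force the host to do 64-bit SSE arithmetic instead of using the 80-bit x87 unit:

g++ -mfpmath=sse -msse2 -o xxx xxx.cpp

On x86-64 builds, SSE scalar math is already the default, so this mainly matters for 32-bit builds.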

#5
Posted 11/15/2010 02:49 PM   
avidday: for the host code I simply compiled with:
g++ -o xxx xxx.cpp

For the GPU code I compiled with:

nvcc -arch=sm_13 -o xxx xxx.cu

And the results, after many rounds of computation, look like this:

CPU results --------------- GPU results
92096229.070354 --------------- 92096231.872298
3771338.827391 --------------- 3608260.193571
95.167448 --------------- 95.159868
80741.074339 --------------- 77249.741954
... ...

If this difference is caused by 80-bit (CPU) vs. 64-bit (GPU) precision, how can I solve it? And thank you for the reply!
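
One check that might help narrow it down (a sketch; max_rel_diff is a helper I made up, not something from the code above): compare the CPU loop and vec_sub_mul after a single update rather than after many rounds. A single v1[i] -= dv * v2[i] step should agree essentially to the last bit or two on both sides; if it already differs by more, the problem is not just accumulated rounding.

#include <cmath>
#include <vector>

// Hypothetical helper: largest relative difference between two result vectors.
double max_rel_diff(const std::vector<double> &a, const std::vector<double> &b)
{
    double worst = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        double denom = std::fabs(a[i]) > 0.0 ? std::fabs(a[i]) : 1.0;
        double d = std::fabs(a[i] - b[i]) / denom;
        if (d > worst) worst = d;
    }
    return worst;
}

// Usage idea: run one update of v1 on the CPU, run one vec_sub_mul on a copy,
// then print max_rel_diff(cpu_v1, gpu_v1).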

#6
Posted 11/15/2010 03:46 PM   