Coalesced memory access and global memory load/store efficiency with complex arithmetics in CUDA

I had been profiling the simple kernel
[code]
__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;
if (i < N1 && j < N2) {
double a = pow(-1.0, (i+j)&1);
data[j*blockDim.x*gridDim.x+i].x *= a;
data[j*blockDim.x*gridDim.x+i].y *= a;
}
}
[/code]
and discovered that, due to the "struct" nature of double2, I had 50% global memory load/store efficiency.
The following solution was suggested at [url]http://stackoverflow.com/questions/14246592/coalesced-memory-access-and-global-memory-load-store-efficiency-with-complex-ari[/url]
[code]
__global__ void fftshift_2D(double *data, int N1, int N2)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;
if (i < N1 * 2 && j < N2) {
double a = pow(-1.0, (i / 2 + j)&1);
data[j*blockDim.x*gridDim.x+i] *= a;
}
}
[/code]
which re-established 100% global memory load/store efficiency, but is much slower.
Anyone has an explanation for that?
Thank you very much in advance.

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;

if (i < N1 && j < N2) {
double a = pow(-1.0, (i+j)&1);

__global__ void fftshift_2D(double *data, int N1, int N2)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;

if (i < N1 * 2 && j < N2) {
double a = pow(-1.0, (i / 2 + j)&1);
data[j*blockDim.x*gridDim.x+i] *= a;
}
}

which re-established 100% global memory load/store efficiency, but is much slower.

Double precision pow is not very fast. Especially if you are testing on a GeForce card.
On a Geforce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I'm not sure.
Note that in the second kernel you will compute twice as many pows as the first one...
Try removing the pow to see what effect that has.
Also it looks like you are using pow simply to toggle between -1 and 1?! Using an if statement would be waaay faster. You could even directly flip the sign bit of the floating point number if you were feeling fancy.

Double precision pow is not very fast. Especially if you are testing on a GeForce card.

On a Geforce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I'm not sure.

Note that in the second kernel you will compute twice as many pows as the first one...

Try removing the pow to see what effect that has.

Also it looks like you are using pow simply to toggle between -1 and 1?! Using an if statement would be waaay faster. You could even directly flip the sign bit of the floating point number if you were feeling fancy.

Thank you very much eelsen. Two good observations in your reply:
1) pow is slow;
2) the second kernel will compute twice as many pows as the first one.
So, I changed the
[code]
double a = pow(-1.0, (i / 2 + j)&1);
[/code]
instruction to
[code]
double a = 1-2*((i/2+j)&1);
[/code]
to toggle between -1 and +1.
Unfortunately, and surprisingly, the second kernel is still slower and becomes even slower as compared to the first one as long as the size of the problem increases.

Thank you very much eelsen. Two good observations in your reply:

1) pow is slow;
2) the second kernel will compute twice as many pows as the first one.

So, I changed the

double a = pow(-1.0, (i / 2 + j)&1);

instruction to

double a = 1-2*((i/2+j)&1);

to toggle between -1 and +1.

Unfortunately, and surprisingly, the second kernel is still slower and becomes even slower as compared to the first one as long as the size of the problem increases.

Hmm...you should basically have a memcpy kernel now. What is the actual bandwidth being achieved by the two kernels (in GB/sec)? What card are you using? How many elements are you processing and what is the block size? And what is the result of the bandwidth test from the SDK on your card?

Hmm...you should basically have a memcpy kernel now. What is the actual bandwidth being achieved by the two kernels (in GB/sec)? What card are you using? How many elements are you processing and what is the block size? And what is the result of the bandwidth test from the SDK on your card?

Eelsen, thank you very much for your further questions.
Here are the results of the SDK bandwidth test:
[code]
Device 0: GeForce GT 540M
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3820.1
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3730.1
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 23871.5
[bandwidthTest.exe] test results...
PASSED
[/code]
Here are the performance of the two kernels:
[b]double2[/b]
Execution time: 451 microseconds
Blocksize: 16x16
Gridsize: 32x32
Global Store Throughput: 17.91Gb/2
Global Load Throughput: 17.86Gb/2
[b]double2[/b]
Execution time: 916 microseconds
Blocksize: 16x16
Gridsize: 64x32
Global Store Throughput: 5.63Gb/2
Global Load Throughput: 5.62Gb/2
In both the cases:
1) I'm compiling with "Maximize Speed [/O2]" and "Favour Fast Code [/Ot]";
2) data is a 512 elements array, while N1=N2=256.
Please, note that the Gridsizes are different since I'm performing a different access to the global memory for the two kernels: in one case I fetch double2 data, while in the other case double data.

Eelsen, thank you very much for your further questions.

Here are the results of the SDK bandwidth test:

Device 0: GeForce GT 540M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3820.1

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3730.1

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 23871.5

[bandwidthTest.exe] test results...
PASSED

Here are the performance of the two kernels:

double2

Execution time: 451 microseconds
Blocksize: 16x16
Gridsize: 32x32
Global Store Throughput: 17.91Gb/2
Global Load Throughput: 17.86Gb/2

double2
Execution time: 916 microseconds
Blocksize: 16x16
Gridsize: 64x32
Global Store Throughput: 5.63Gb/2
Global Load Throughput: 5.62Gb/2

In both the cases:
1) I'm compiling with "Maximize Speed [/O2]" and "Favour Fast Code [/Ot]";
2) data is a 512 elements array, while N1=N2=256.

Please, note that the Gridsizes are different since I'm performing a different access to the global memory for the two kernels: in one case I fetch double2 data, while in the other case double data.

I would recommend you make the blockSize (32, 8) or (32, 16) so that the x dimension is the warpsize.
The timing numbers are also interesting. The double2 kernel takes about half the time as the double kernel yet has 3 times the bandwidth. Since they are reading the same amount of data, this is impossible.
Also the bandwidth numbers themselves don't compute:
(array is 256x256, double2 is 16 bytes, read / write)
256 * 256 * 16 * 2 = 2MB
2MB / 451 usec = 4.65 GB/sec
2MB / 916 usec = 2.29 GB/sec
I would recommend simply timing the kernel yourself and calculating the performance figures since the profiler numbers don't seem to make much sense.

I would recommend you make the blockSize (32, 8) or (32, 16) so that the x dimension is the warpsize.

The timing numbers are also interesting. The double2 kernel takes about half the time as the double kernel yet has 3 times the bandwidth. Since they are reading the same amount of data, this is impossible.

Also the bandwidth numbers themselves don't compute:

(array is 256x256, double2 is 16 bytes, read / write)

Hi Eelsen,
I apologize. In my previous post, I have mentioned wrong numbers. Actually, N1=N2=512.
I have timed the two kernels by myself.
double2
Execution time: 450 microseconds (coinciding with the Visual Profiler)
double
Execution time: 679 microseconds (which, surprisingly, is less that the indication of the Visual Profiler)
I do not understand why do you count for read/write all together (you multiply by 2). I'm independently assessing the Global Read and Store Throughputs. Could you please explain better?
Could the difference be due to the fact that
[code]
double a = 1-2*((i/2+j)&1);
[/code]
is performed twice for the "double" kernel?
Thanks.

Hi Eelsen,
I apologize. In my previous post, I have mentioned wrong numbers. Actually, N1=N2=512.

I have timed the two kernels by myself.

double2
Execution time: 450 microseconds (coinciding with the Visual Profiler)
double
Execution time: 679 microseconds (which, surprisingly, is less that the indication of the Visual Profiler)

I do not understand why do you count for read/write all together (you multiply by 2). I'm independently assessing the Global Read and Store Throughputs. Could you please explain better?

Yes, but unfortunately nothing changed in the double case timing when changing the blockSize from (16,16) to (32,8).
As one more piece of information. I took the two timings by removing the calculation of the constant a from both the kernels. These are the results:
double: 372us
double2: 436us
So, the double2 kernel now takes then longest time. Should I then conclude, from this result, that the calculation of the value of a is the main problem? If so, I wonder what is the main issue there. The computation of a double precision number or the access to the register containing it?

Yes, but unfortunately nothing changed in the double case timing when changing the blockSize from (16,16) to (32,8).

As one more piece of information. I took the two timings by removing the calculation of the constant a from both the kernels. These are the results:

double: 372us
double2: 436us

So, the double2 kernel now takes then longest time. Should I then conclude, from this result, that the calculation of the value of a is the main problem? If so, I wonder what is the main issue there. The computation of a double precision number or the access to the register containing it?

I have just made another experiment. I calculate the value of a "on the fly" without storing it in a register and I obtain the following timing:
double: 690us
double2: 438us
Note that in the double2 case, the value of a is
[code]
1-2*((i+j)&1)
[/code]
while in the double case, the value of a is
[code]
1-2*((i/2+j)&1)
[/code]
(there is an extra integer division /). If I remove, just for testing purposes, the division in the double kernel, the double timing drops to 400us.

and discovered that, due to the "struct" nature of double2, I had 50% global memory load/store efficiency.

The following solution was suggested at http://stackoverflow.com/questions/14246592/coalesced-memory-access-and-global-memory-load-store-efficiency-with-complex-ari

which re-established 100% global memory load/store efficiency, but is much slower.

Anyone has an explanation for that?

Thank you very much in advance.

wwww.orangeowlsolutions.com

On a Geforce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I'm not sure.

Note that in the second kernel you will compute twice as many pows as the first one...

Try removing the pow to see what effect that has.

Also it looks like you are using pow simply to toggle between -1 and 1?! Using an if statement would be waaay faster. You could even directly flip the sign bit of the floating point number if you were feeling fancy.

1) pow is slow;

2) the second kernel will compute twice as many pows as the first one.

So, I changed the

instruction to

to toggle between -1 and +1.

Unfortunately, and surprisingly, the second kernel is still slower and becomes even slower as compared to the first one as long as the size of the problem increases.

wwww.orangeowlsolutions.com

Here are the results of the SDK bandwidth test:

Here are the performance of the two kernels:

double2Execution time: 451 microseconds

Blocksize: 16x16

Gridsize: 32x32

Global Store Throughput: 17.91Gb/2

Global Load Throughput: 17.86Gb/2

double2Execution time: 916 microseconds

Blocksize: 16x16

Gridsize: 64x32

Global Store Throughput: 5.63Gb/2

Global Load Throughput: 5.62Gb/2

In both the cases:

1) I'm compiling with "Maximize Speed [/O2]" and "Favour Fast Code [/Ot]";

2) data is a 512 elements array, while N1=N2=256.

Please, note that the Gridsizes are different since I'm performing a different access to the global memory for the two kernels: in one case I fetch double2 data, while in the other case double data.

wwww.orangeowlsolutions.com

The timing numbers are also interesting. The double2 kernel takes about half the time as the double kernel yet has 3 times the bandwidth. Since they are reading the same amount of data, this is impossible.

Also the bandwidth numbers themselves don't compute:

(array is 256x256, double2 is 16 bytes, read / write)

256 * 256 * 16 * 2 = 2MB

2MB / 451 usec = 4.65 GB/sec

2MB / 916 usec = 2.29 GB/sec

I would recommend simply timing the kernel yourself and calculating the performance figures since the profiler numbers don't seem to make much sense.

I apologize. In my previous post, I have mentioned wrong numbers. Actually, N1=N2=512.

I have timed the two kernels by myself.

double2

Execution time: 450 microseconds (coinciding with the Visual Profiler)

double

Execution time: 679 microseconds (which, surprisingly, is less that the indication of the Visual Profiler)

I do not understand why do you count for read/write all together (you multiply by 2). I'm independently assessing the Global Read and Store Throughputs. Could you please explain better?

Could the difference be due to the fact that

is performed twice for the "double" kernel?

Thanks.

wwww.orangeowlsolutions.com

As one more piece of information. I took the two timings by removing the calculation of the constant a from both the kernels. These are the results:

double: 372us

double2: 436us

So, the double2 kernel now takes then longest time. Should I then conclude, from this result, that the calculation of the value of a is the main problem? If so, I wonder what is the main issue there. The computation of a double precision number or the access to the register containing it?

wwww.orangeowlsolutions.com

double: 690us

double2: 438us

Note that in the double2 case, the value of a is

while in the double case, the value of a is

(there is an extra integer division /). If I remove, just for testing purposes, the division in the double kernel, the double timing drops to 400us.

wwww.orangeowlsolutions.com