Coalesced memory access and global memory load/store efficiency with complex arithmetic in CUDA
I had been profiling the simple kernel

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i + j) & 1);

        data[j * blockDim.x * gridDim.x + i].x *= a;
        data[j * blockDim.x * gridDim.x + i].y *= a;
    }
}


and discovered that, due to the "struct" nature of double2, I had 50% global memory load/store efficiency.

The following solution was suggested at http://stackoverflow.com/questions/14246592/coalesced-memory-access-and-global-memory-load-store-efficiency-with-complex-ari

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        double a = pow(-1.0, (i / 2 + j) & 1);
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}


which re-established 100% global memory load/store efficiency, but is much slower.

Does anyone have an explanation for that?

Thank you very much in advance.

wwww.orangeowlsolutions.com

#1
Posted 01/10/2013 04:05 PM   
Double precision pow is not very fast. Especially if you are testing on a GeForce card.

On a GeForce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I'm not sure.

Note that in the second kernel you will compute twice as many pows as the first one...

Try removing the pow to see what effect that has.

Also, it looks like you are using pow simply to toggle between -1 and 1?! Using an if statement would be way faster. You could even flip the sign bit of the floating-point number directly if you were feeling fancy.
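
Something along these lines (just a sketch of the two alternatives; the helper names below are made up, not from your code):

// Branch-based toggle: pick -1.0 or +1.0 with a simple select instead of pow().
__device__ inline double toggle(int i, int j)
{
    return ((i + j) & 1) ? -1.0 : 1.0;
}

// "Fancy" variant: flip only the IEEE-754 sign bit of x, no multiply needed.
__device__ inline double flip_sign(double x)
{
    unsigned long long bits = (unsigned long long)__double_as_longlong(x);
    bits ^= 1ULL << 63;                       // toggle the sign bit
    return __longlong_as_double((long long)bits);
}

With flip_sign you would call it only for the odd (i+j) elements and leave the others untouched.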

#2
Posted 01/10/2013 04:27 PM   
Thank you very much eelsen. Two good observations in your reply:

1) pow is slow;
2) the second kernel will compute twice as many pows as the first one.

So, I changed the

double a = pow(-1.0, (i / 2 + j)&1);


instruction to

double a = 1-2*((i/2+j)&1);


to toggle between -1 and +1.

Unfortunately, and surprisingly, the second kernel is still slower, and it becomes even slower relative to the first one as the problem size increases.
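
For reference, the second kernel with that substitution looks like this (a sketch, assuming nothing else changed):

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        double a = 1 - 2 * ((i / 2 + j) & 1);   // +1 or -1, no pow()
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}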

wwww.orangeowlsolutions.com

#3
Posted 01/12/2013 09:40 PM   
Hmm...you should basically have a memcpy kernel now. What is the actual bandwidth being achieved by the two kernels (in GB/sec)? What card are you using? How many elements are you processing and what is the block size? And what is the result of the bandwidth test from the SDK on your card?

#4
Posted 01/14/2013 04:53 PM   
Eelsen, thank you very much for your further questions.

Here are the results of the SDK bandwidth test:

Device 0: GeForce GT 540M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3820.1

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3730.1

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 23871.5

[bandwidthTest.exe] test results...
PASSED


Here is the performance of the two kernels:

double2

Execution time: 451 microseconds
Blocksize: 16x16
Gridsize: 32x32
Global Store Throughput: 17.91 GB/s
Global Load Throughput: 17.86 GB/s

double

Execution time: 916 microseconds
Blocksize: 16x16
Gridsize: 64x32
Global Store Throughput: 5.63 GB/s
Global Load Throughput: 5.62 GB/s

In both cases:
1) I'm compiling with "Maximize Speed [/O2]" and "Favour Fast Code [/Ot]";
2) data is a 512-element array, while N1=N2=256.

Please note that the grid sizes are different because the two kernels access global memory differently: in one case I fetch double2 data, in the other double data.
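
For reference, here is a sketch of the two launch configurations I am using (the kernel and variable names below are just illustrative):

// Common block size (as reported above).
dim3 block(16, 16);

// double2 kernel: one thread per complex element -> N1 x N2 threads.
dim3 grid_double2((N1 + block.x - 1) / block.x,
                  (N2 + block.y - 1) / block.y);
fftshift_2D_double2<<<grid_double2, block>>>(d_data_double2, N1, N2);

// double kernel: one thread per scalar -> 2*N1 x N2 threads,
// so the x dimension of the grid doubles.
dim3 grid_double((2 * N1 + block.x - 1) / block.x,
                 (N2 + block.y - 1) / block.y);
fftshift_2D_double<<<grid_double, block>>>(d_data_double, N1, N2);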

wwww.orangeowlsolutions.com

#5
Posted 01/16/2013 08:20 AM   
I would recommend you make the blockSize (32, 8) or (32, 16) so that the x dimension is the warp size.

The timing numbers are also interesting. The double2 kernel takes about half the time as the double kernel yet has 3 times the bandwidth. Since they are reading the same amount of data, this is impossible.

Also the bandwidth numbers themselves don't compute:

(array is 256x256, double2 is 16 bytes, read / write)

256 * 256 * 16 * 2 = 2MB

2MB / 451 usec = 4.65 GB/sec
2MB / 916 usec = 2.29 GB/sec

I would recommend simply timing the kernel yourself and calculating the performance figures since the profiler numbers don't seem to make much sense.
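
Something like the following would do (a sketch; the kernel, grid, block, and pointer names are placeholders):

// Time the kernel with CUDA events and compute the effective bandwidth.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
fftshift_2D<<<grid, block>>>(d_data, N1, N2);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

// Each double2 element (16 bytes) is read once and written once.
double bytes = (double)N1 * N2 * sizeof(double2) * 2.0;
double gbPerSec = bytes / (ms * 1.0e-3) / 1.0e9;
printf("time = %.3f ms, effective bandwidth = %.2f GB/s\n", ms, gbPerSec);

cudaEventDestroy(start);
cudaEventDestroy(stop);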

#6
Posted 01/16/2013 04:46 PM   
Hi Eelsen,
I apologize; in my previous post I mentioned the wrong numbers. Actually, N1=N2=512.

I have timed the two kernels by myself.

double2
Execution time: 450 microseconds (coinciding with the Visual Profiler)
double
Execution time: 679 microseconds (which, surprisingly, is less than what the Visual Profiler indicates)

I do not understand why you count reads and writes together (you multiply by 2). I am assessing the global load and store throughputs independently. Could you please explain?

Could the difference be due to the fact that

double a = 1-2*((i/2+j)&1);


is performed twice for the "double" kernel?

Thanks.

wwww.orangeowlsolutions.com

#7
Posted 01/18/2013 03:18 PM   
Those extra couple of operations shouldn't matter. Did you try changing the blockSize to (32, 8)? My feeling is that that is the culprit.

#8
Posted 01/18/2013 06:30 PM   
Yes, but unfortunately nothing changed in the double case timing when changing the blockSize from (16,16) to (32,8).

As one more piece of information, I took the two timings after removing the calculation of the constant a from both kernels. These are the results:

double: 372us
double2: 436us

So, the double2 kernel now takes the longer time. Should I then conclude from this result that the calculation of the value of a is the main problem? If so, I wonder what the main issue is: the computation of a double-precision number or the access to the register containing it?

wwww.orangeowlsolutions.com

#9
Posted 01/20/2013 09:13 AM   
I have just made another experiment: I calculate the value of a "on the fly", without storing it in a register, and I obtain the following timings:

double: 690us
double2: 438us

Note that in the double2 case, the value of a is

1-2*((i+j)&1)


while in the double case, the value of a is

1-2*((i/2+j)&1)


(there is an extra integer division). If I remove the division from the double kernel, just for testing purposes, its timing drops to 400us.
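
For completeness, a sketch of a variant that keeps the toggle but avoids the signed division (i is never negative here, so i / 2 can be written as a shift, which skips the rounding fix-up the compiler emits for signed division):

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        double a = 1 - 2 * (((i >> 1) + j) & 1);   // i >> 1 == i / 2 for i >= 0
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}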

wwww.orangeowlsolutions.com

#10
Posted 01/20/2013 09:27 AM   