Double precision pow is not very fast. Especially if you are testing on a GeForce card.
On a GeForce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I’m not sure.
Note that in the second kernel you will compute twice as many pows as the first one…
Try removing the pow to see what effect that has.
Also, it looks like you are using pow simply to toggle between -1 and 1?! An if statement (or plain arithmetic) would be way faster. You could even flip the sign bit of the floating-point number directly if you were feeling fancy.
Thank you very much eelsen. Two good observations in your reply:
pow is slow;
the second kernel will compute twice as many pows as the first one.
So, I changed the
double a = pow(-1.0, (i / 2 + j)&1);
instruction to
double a = 1-2*((i/2+j)&1);
to toggle between -1 and +1.
Unfortunately, and surprisingly, the second kernel is still slower, and it becomes even slower relative to the first one as the problem size increases.
Hmm…you should basically have a memcpy kernel now. What is the actual bandwidth being achieved by the two kernels (in GB/sec)? What card are you using? How many elements are you processing and what is the block size? And what is the result of the bandwidth test from the SDK on your card?
Eelsen, thank you very much for your further questions.
Here are the results of the SDK bandwidth test:
Device 0: GeForce GT 540M
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3820.1
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3730.1
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 23871.5
[bandwidthTest.exe] test results...
PASSED
Here is the performance of the two kernels:
double2
Execution time: 451 microseconds
Blocksize: 16x16
Gridsize: 32x32
Global Store Throughput: 17.91 GB/s
Global Load Throughput: 17.86 GB/s
double
Execution time: 916 microseconds
Blocksize: 16x16
Gridsize: 64x32
Global Store Throughput: 5.63 GB/s
Global Load Throughput: 5.62 GB/s
In both the cases:
I’m compiling with “Maximize Speed [/O2]” and “Favour Fast Code [/Ot]”;
data is a 512-element array, while N1=N2=256.
Please note that the Gridsizes differ because the two kernels access global memory differently: one fetches double2 data, the other double data.
I would recommend you make the blockSize (32, 8) or (32, 16) so that the x dimension is the warpsize.
The timing numbers are also interesting. The double2 kernel takes about half the time of the double kernel, yet reports 3 times the bandwidth. Since both read the same amount of data, this is impossible.
Also, the bandwidth numbers themselves don’t compute (the array is 256x256, a double2 is 16 bytes, and each element is read and written once).
Hi Eelsen,
I apologize. In my previous post, I mentioned the wrong numbers; actually, N1=N2=512.
I have timed the two kernels by myself.
double2
Execution time: 450 microseconds (coinciding with the Visual Profiler)
double
Execution time: 679 microseconds (which, surprisingly, is less than the indication of the Visual Profiler)
I do not understand why you count reads and writes together (you multiply by 2). I’m assessing the Global Load and Store Throughputs independently. Could you please explain?
Yes, but unfortunately nothing changed in the double case timing when changing the blockSize from (16,16) to (32,8).
As one more piece of information. I took the two timings by removing the calculation of the constant a from both the kernels. These are the results:
double: 372us
double2: 436us
So, the double2 kernel now takes the longer time. Should I conclude from this result that the calculation of the value of a is the main problem? If so, I wonder what the main issue is: the computation of a double precision number, or the access to the register containing it?