64-bit scan/reduction for min element (double). Two example implementations, but unexpected timing

Ran an experiment with two implementations of a 64-bit double scan/reduction which returns the minimum element along with the index of its FIRST occurrence (there may be multiple occurrences of the min value).

I also calculate the GB/s for each implementation as well as the average running time.

Both implementations attempt coalesced reads from the global double array, but use two different methods.

The first uses the standard

for (idx = threadIdx.x + blockIdx.x * blockDim.x; idx < num_elements; idx += blockDim.x * gridDim.x) ...

method (a grid-stride loop), which limits the number of thread blocks to exactly fill the SMs.
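Roughly, the pattern is this (a minimal sketch of my own for illustration, not the pastebin code; names like block_min_first and min_first_gridstride are made up here): each thread strides through the array keeping its best (value, index) pair, the block reduces those candidates in shared memory, and one candidate per block is written out. Strict less-than comparisons plus a lower-index tie-break keep the FIRST occurrence of the minimum.

#include <cfloat>

// Block-wide reduction of (value, index) candidates in shared memory.
// Ties are broken toward the lower index so the first occurrence wins.
// Assumes a power-of-two block size (256 in the runs below).
__device__ void block_min_first(double *s_val, int *s_idx)
{
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            double ov = s_val[threadIdx.x + s];
            int    oi = s_idx[threadIdx.x + s];
            if (ov < s_val[threadIdx.x] ||
                (ov == s_val[threadIdx.x] && oi < s_idx[threadIdx.x])) {
                s_val[threadIdx.x] = ov;
                s_idx[threadIdx.x] = oi;
            }
        }
        __syncthreads();
    }
}

// Grid-stride kernel, launched with just enough blocks to fill the SMs
// (e.g. 104 blocks of 256 threads on a K20c).
__global__ void min_first_gridstride(const double *d_in, int n,
                                     double *d_block_val, int *d_block_idx)
{
    __shared__ double s_val[256];   // assumes blockDim.x == 256
    __shared__ int    s_idx[256];

    double best = DBL_MAX;
    int    bidx = -1;
    for (int idx = threadIdx.x + blockIdx.x * blockDim.x; idx < n;
         idx += blockDim.x * gridDim.x) {
        double v = d_in[idx];
        if (v < best) { best = v; bidx = idx; }   // strict < keeps the first hit per thread
    }

    s_val[threadIdx.x] = best;
    s_idx[threadIdx.x] = bidx;
    __syncthreads();
    block_min_first(s_val, s_idx);

    if (threadIdx.x == 0) {
        d_block_val[blockIdx.x] = s_val[0];
        d_block_idx[blockIdx.x] = s_idx[0];
    }
}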

The second method uses templates to calculate the work per thread somewhat dynamically and uses that value in the kernel (very similar to the method that Jimmy P uses in his reduction code).
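Again as a rough sketch of my own (not Jimmy P's actual code or what is in the pastebin): the per-thread work count is a compile-time template parameter, the grid is sized as ceil(num_elements / (threads * WORK_PER_THREAD)), and the inner loop can be unrolled. Reads stay coalesced because adjacent threads touch adjacent addresses on every pass.

// Reuses block_min_first() from the sketch above.
template <int WORK_PER_THREAD>
__global__ void min_first_templated(const double *d_in, int n,
                                    double *d_block_val, int *d_block_idx)
{
    __shared__ double s_val[256];   // assumes blockDim.x == 256
    __shared__ int    s_idx[256];

    double best = DBL_MAX;
    int    bidx = -1;
    int    base = blockIdx.x * blockDim.x * WORK_PER_THREAD + threadIdx.x;

    #pragma unroll
    for (int i = 0; i < WORK_PER_THREAD; i++) {
        int idx = base + i * blockDim.x;          // coalesced across the block
        if (idx < n) {
            double v = d_in[idx];
            if (v < best) { best = v; bidx = idx; }
        }
    }

    s_val[threadIdx.x] = best;
    s_idx[threadIdx.x] = bidx;
    __syncthreads();
    block_min_first(s_val, s_idx);

    if (threadIdx.x == 0) {
        d_block_val[blockIdx.x] = s_val[0];
        d_block_idx[blockIdx.x] = s_idx[0];
    }
}

For instance, with WORK_PER_THREAD = 128 and 256 threads the grid works out to ceil(16,777,343 / (256 * 128)) = 513 blocks.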

What surprised me was that the latter implementation was about 10% faster than the supposedly ‘optimal’ former implementation.

For example, for an array of 16,777,343 elements this is the output (averaged over 100 reps):

Num Elements= 16777343

Starting CPU..

CPU time= 879.198 , CPU ans= -987654 , index= 9997

CPU throughput= 15.266 GB/s

Starting GPU0..

Num threads= 256 , total num blocks= 104

GPU time0= 82.505 , GPU ans= -987654 , index= 9997

GPU throughput0= 162.68 GB/s

Starting GPU1..

Num threads= 256 , total num blocks= 513

GPU time2= 76.2341 , GPU ans2= -987654 , index2= 9997

GPU throughput1= 176.061 GB/s

timing is in microseconds.

The CPU used is an Intel Core i7-3770K (4 cores, 3.9 GHz), and the GPU is a Tesla K20c, running under Windows 7 64-bit.

Here is the sample code:

http://pastebin.com/W2Nn3yUB

The first implementation starts on line #241, while the second starts on line #306.

What is even more interesting is that the ‘faster’ second implementation has two parts, while the first does not (it uses __threadfence() instead). Maybe that is the issue.
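In case it helps clarify what I mean by that, this is roughly the pattern the __threadfence() version follows at the end of the kernel (a sketch with made-up names, not the actual pastebin code): each block publishes its candidate, __threadfence() makes that write visible device-wide, and an atomic “ticket” counter lets the last block to retire combine the per-block candidates.

// Retirement counter; must be zero before each launch.
__device__ unsigned int retirement_count = 0;

// Tail executed after the in-block reduction in the single-kernel version.
// block_best / block_best_idx are this block's candidate (thread 0's values).
__device__ void finish_with_threadfence(double *d_block_val, int *d_block_idx,
                                        double *d_out_val, int *d_out_idx,
                                        double block_best, int block_best_idx)
{
    if (threadIdx.x == 0) {
        d_block_val[blockIdx.x] = block_best;
        d_block_idx[blockIdx.x] = block_best_idx;
        __threadfence();                              // make the partials visible device-wide
        unsigned int ticket = atomicInc(&retirement_count, gridDim.x);
        if (ticket == gridDim.x - 1) {                // last block to retire finishes the job
            double best = DBL_MAX;
            int    bidx = -1;
            for (int b = 0; b < gridDim.x; b++) {     // serial pass over the small partial array
                double v = d_block_val[b];
                if (v < best || (v == best && d_block_idx[b] < bidx)) {
                    best = v;
                    bidx = d_block_idx[b];
                }
            }
            *d_out_val = best;
            *d_out_idx = bidx;
            retirement_count = 0;                     // reset for the next launch
        }
    }
}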

Maybe I made a mistake in calculating the correct maximum number of 256-thread blocks for the K20c. As I understand it, it has 13 SMs, each of which can handle 2048 resident threads.
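For what it is worth, that works out to 13 SMs * 2048 threads/SM = 26,624 threads, and 26,624 / 256 = 104 blocks, which is where the “total num blocks= 104” above comes from. A small sketch of deriving it from the device properties instead of hard-coding it (assuming the standard runtime API; blocks_to_fill_gpu is just a name I made up):

#include <cuda_runtime.h>

// Number of thread blocks that exactly fills the GPU at a given block size.
// K20c: 13 SMs * 2048 threads/SM = 26624 threads; 26624 / 256 = 104 blocks.
int blocks_to_fill_gpu(int threads_per_block)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return (prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor) / threads_per_block;
}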

As far as I recall, the last architecture which could achieve full memory throughput while running just enough thread blocks to fill the GPU was sm_13. On Fermi (sm_2x) up to 20x “over-subscription” is needed to achieve optimal memory throughput, although memory throughput is still quite good with a smaller number of thread blocks per launch.

The exact Fermi behavior depends on the computational density of the kernel: kernels with lots of computation require a lesser degree of over-subscription for optimal memory throughput. To keep things simple, I started using configurations with 65520 thread blocks where possible. That number divides evenly by SM counts of 13, 14, 15, and 16, giving a roughly equal load across SMs.
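(For reference, 65520 works out to 65520 / 13 = 5040, 65520 / 14 = 4680, 65520 / 15 = 4368, and 65520 / 16 = 4095 blocks per SM, so each of those SM counts gets a whole number of blocks.)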

I am not exactly sure what the Kepler behavior is in this regard, as for my own CUDA codes I simply carried forward the Fermi setups with the large number of thread blocks launched. However, I assume that there is still a need for a fairly high thread block count to achieve optimal memory throughput. Your second code has about 5x over-subscription; it could be interesting to try even higher ratios to see whether memory throughput can be improved further.
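(For concreteness: the second code's 513 blocks versus the 104 blocks that just fill the K20c gives 513 / 104 ≈ 4.9, hence the “about 5x” figure.)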

Great! This answers my question exactly, so I will try a few new approaches based on this information.

Thanks

Did some more testing, very interesting…

The latter (Jimmy P) approach is more efficient even when num_blocks and threads are the same, as can be seen in this output:

Num Elements= 16777343

Starting CPU..

CPU time= 892.475 , CPU ans= -987654 , index= 9997

CPU throughput= 15.0389 GB/s

Starting GPU0..

Num threads= 256 , total num blocks= 513

GPU time0= 81.1387 , GPU ans= -987654 , index= 9997

GPU throughput0= 165.419 GB/s

Starting GPU1..

Num threads= 256 , total num blocks= 513

GPU time2= 76.2283 , GPU ans2= -987654 , index2= 9997

GPU throughput1= 176.075 GB/s

Next I will replace the __threadfence() with another kernel, and see if that makes a difference.
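One shape that experiment could take (my own sketch again, not something already in the pastebin): drop the __threadfence() tail and finish with a tiny second launch such as min_first_final_pass<<<1, 256>>>(...), reusing the block_min_first() helper from the earlier sketch.

// Second pass: a single block reduces the per-block (value, index) partials.
__global__ void min_first_final_pass(const double *d_block_val, const int *d_block_idx,
                                     int num_blocks,
                                     double *d_out_val, int *d_out_idx)
{
    __shared__ double s_val[256];   // assumes blockDim.x == 256
    __shared__ int    s_idx[256];

    double best = DBL_MAX;
    int    bidx = -1;
    for (int b = threadIdx.x; b < num_blocks; b += blockDim.x) {
        double v = d_block_val[b];
        int    i = d_block_idx[b];
        if (v < best || (v == best && i < bidx)) { best = v; bidx = i; }
    }

    s_val[threadIdx.x] = best;
    s_idx[threadIdx.x] = bidx;
    __syncthreads();
    block_min_first(s_val, s_idx);

    if (threadIdx.x == 0) {
        *d_out_val = s_val[0];
        *d_out_idx = s_idx[0];
    }
}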