Speedy general reduction sum code (~88.5% of peak), updated for Kepler! __shfl() etc.

Hi everyone,

I would like to share an update to some reduction sum code that I shared on these forums in 2010.

Update notes:

  • Uses shuffle-instruction (__shfl) warp reduction on compute capability 3.0+ cards.

  • Increased from 64 threads per block to 64x4 = 256 threads per block (needed for Kepler).

  • Theoretical max bandwidth is now read from the device properties (a sketch of that computation follows below).
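A minimal sketch of that computation, assuming the standard formula (it reproduces the 177.408 GB/s printed for the GTX 480 below):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// memoryClockRate is in kHz and memoryBusWidth in bits; the factor 2
// accounts for GDDR's double data rate.
double peak_GBps = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6;
printf("%s @ %.3f GB/s\n", prop.name, peak_GBps);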

Performance tests ( perc. column is percentage of theoretical bandwidth utilized ).
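(The columns follow from a single pass over the data: [GB/s] should be 4·N bytes divided by the kernel time, and [perc] that value over the theoretical peak. E.g. for the first Fermi row below: 4 × 1048576 bytes / 35.7 µs ≈ 117.5 GB/s, i.e. ≈ 66% of 177.408 GB/s.)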

Fermi (GTX480):

GeForce GTX 480 @ 177.408 GB/s

 N           [GB/s]   [perc]   [usec]    test
 1048576     117.63   66.30    35.7      Pass
 2097152     136.59   76.99    61.4      Pass
 4194304     145.20   81.85    115.5     Pass
 8388608     151.14   85.19    222.0     Pass
 16777216    154.58   87.13    434.1     Pass
 33554432    156.07   87.97    860.0     Pass
 67108864    156.77   88.36    1712.3    Pass
 134217728   157.13   88.57    3416.7    Pass

Non-base 2 tests!

 N           [GB/s]   [perc]   [usec]    test
 14680102    154.16   86.90    380.9     Pass
 14680119    154.21   86.92    380.8     Pass
 18875600    152.14   85.76    496.3     Pass
 7434886     103.91   58.57    286.2     Pass
 13324075    142.84   80.52    373.1     Pass
 15764213    147.03   82.88    428.9     Pass
 1850154     46.67    26.30    158.6     Pass
 4991241     96.37    54.32    207.2     Pass

Some tests on my laptop:

Kepler (GT 650m, GK107):

GeForce GT 650M @ 64.000 GB/s

 N           [GB/s]   [perc]   [usec]    test
 1048576     37.34    58.35    112.3     Pass
 2097152     40.44    63.19    207.4     Pass
 4194304     42.94    67.09    390.7     Pass
 8388608     44.25    69.14    758.3     Pass
 16777216    45.08    70.44    1488.6    Pass
 33554432    45.48    71.06    2951.4    Pass
 67108864    47.10    73.59    5699.4    Pass
 134217728   48.89    76.39    10981.8   Pass

Non-base 2 tests!

 N           [GB/s]   [perc]   [usec]    test
 14680102    48.15    75.23    1219.6    Pass
 14680119    48.13    75.20    1220.1    Pass
 18875600    48.07    75.12    1570.5    Pass
 7434886     41.39    64.67    718.5     Pass
 13324075    46.72    73.00    1140.8    Pass
 15764213    47.37    74.02    1331.1    Pass
 1850154     26.44    41.31    279.9     Pass
 4991241     39.41    61.58    506.6     Pass

As you can see, the GTX 480 performs very near peak bandwidth (88.5% of peak), while the GT 650M doesn't reach quite as high (76.39% of peak). I guess its similarities with GF104 with respect to ILP make this style of architecture a bit harder to reach peak on; for example, the SDK bandwidth test only reaches about 32 GB/s (50%) on my GT 650M.

Here is a comparison on my laptop vs. the SDK reduction sum example:

Bandwidth:

[attachment=25852:bandwidth_GT650m.png]

Speedup:

[attachment=25853:speedup_VS_NV_GT650m.png]

I hope all my numbers add up and that it runs well on other machines as well :)

Running the source code produces the test printout shown above.

Source code:

[attachment=25854:reduction_main.cu]

What kind of performance do you get if you simply use an atomic add? According to the Kepler whitepaper, atomic add to a shared address should be about 11 times faster than on Fermi. However, I'm not sure if this is only true for Tesla, or also for GeForce.

I had to do a little fiddling to get this to compile on Linux (removed the pause() function because there is no fgetchar(), and it looks like you only half-finished get_clock for the non-Win32 folks :) ).

GTX 580 (sm_20)

GeForce GTX 580 @ 192.384 GB/s 

 N           [GB/s]   [perc]    [usec]    test
 1048576     131.86   68.54 %   31.8      Pass
 2097152     143.22   74.45 %   58.6      Pass
 4194304     150.28   78.11 %   111.6     Pass
 8388608     158.76   82.52 %   211.4     Pass
 16777216    161.62   84.01 %   415.2     Pass
 33554432    163.76   85.12 %   819.6     Pass
 67108864    164.75   85.64 %   1629.3    Pass
 134217728   165.31   85.93 %   3247.7    Pass

Non-base 2 tests! 

 N           [GB/s]   [perc]    [usec]    test
 14680102    160.68   83.52 %   365.5     Pass
 14680119    160.51   83.43 %   365.8     Pass
 18875600    159.62   82.97 %   473.0     Pass
 7434886     109.15   56.73 %   272.5     Pass
 5392180     140.80   73.19 %   153.2     Pass
 3449291     92.44    48.05 %   149.3     Pass
 11351121    122.78   63.82 %   369.8     Pass
 2455799     60.63    31.51 %   162.0     Pass

GTX 680 (sm_20)

GeForce GTX 680 @ 192.256 GB/s 

 N           [GB/s]   [perc]    [usec]    test
 1048576     110.93   57.70 %   37.8      Pass
 2097152     126.95   66.03 %   66.1      Pass
 4194304     137.98   71.77 %   121.6     Pass
 8388608     144.59   75.21 %   232.1     Pass
 16777216    148.17   77.07 %   452.9     Pass
 33554432    150.00   78.02 %   894.8     Pass
 67108864    150.97   78.52 %   1778.1    Pass
 134217728   151.70   78.90 %   3539.1    Pass

Non-base 2 tests! 

 N           [GB/s]   [perc]    [usec]    test
 14680102    147.87   76.91 %   397.1     Pass
 14680119    148.05   77.01 %   396.6     Pass
 18875600    146.84   76.38 %   514.2     Pass
 7434886     107.20   55.76 %   277.4     Pass
 5392180     131.54   68.42 %   164.0     Pass
 3449291     92.75    48.24 %   148.8     Pass
 11351121    119.28   62.04 %   380.6     Pass
 2455799     63.87    33.22 %   153.8     Pass

GTX 680 (sm_30):

GeForce GTX 680 @ 192.256 GB/s 

 N           [GB/s]   [perc]    [usec]    test
 1048576     111.52   58.01 %   37.6      Pass
 2097152     127.78   66.46 %   65.7      Pass
 4194304     138.73   72.16 %   120.9     Pass
 8388608     144.78   75.31 %   231.8     Pass
 16777216    148.20   77.08 %   452.8     Pass
 33554432    148.80   77.40 %   902.0     Pass
 67108864    151.29   78.69 %   1774.3    Pass
 134217728   151.66   78.89 %   3539.9    Pass

Non-base 2 tests! 

 N           [GB/s]   [perc]    [usec]    test
 14680102    148.15   77.06 %   396.4     Pass
 14680119    148.30   77.13 %   396.0     Pass
 18875600    147.13   76.53 %   513.2     Pass
 7434886     107.64   55.99 %   276.3     Pass
 5392180     132.27   68.80 %   163.1     Pass
 3449291     93.22    48.49 %   148.0     Pass
 11351121    119.36   62.08 %   380.4     Pass
 2455799     64.11    33.35 %   153.2     Pass

Unfortunately, it looks like the shuffle instruction is no help for this problem.

Edit: On further reflection, that isn't surprising, since the reduction completely saturates the memory bandwidth once the array gets big enough. The time spent doing the reduction within the block is mostly negligible.

Sorry about that! I didn't do any testing on a Linux machine before posting, just made some optimistic assumptions :P

I will post an update addressing this when I have time…

Yes, the final per-warp reduction is essentially negligible since each warp first accumulates something like ~2048 elements. But I do remember seeing some marginal improvements (~0.3%) on my laptop.

Interesting results on the GTX 680! I'm a bit disappointed that it doesn't reach quite as high as Fermi. Do you think there's any chance that GK104 has partition camping issues? I know this was supposed to have been addressed for Fermi, but I believe it was still visible with GF100.

Anyways, I'm reaching higher bandwidth utilization doing this reduction sum than I am running the bandwidth test, so it's not too bad :)

These numbers for the GTX 680 are pretty consistent with what I see in the bandwidth test:

Device 0: GeForce GTX 580

 Quick Mode

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			159881.0

----

Device 3: GeForce GTX 680

 Quick Mode

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)	Bandwidth(MB/s)

   33554432			149909.3

The 680 really seems to have slightly less usable bandwidth than the 580.

Yeah, I wonder if that correlates strongly with the higher-clocked but narrower memory bus (256-bit @ 6 Gbps vs. 384-bit @ 4 Gbps).
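(Interestingly, both buses work out to essentially the same theoretical figure: 256/8 = 32 bytes × 6 GT/s = 192 GB/s on the 680 vs. 384/8 = 48 bytes × 4 GT/s ≈ 192 GB/s on the 580, so the gap in usable bandwidth must come from efficiency rather than the raw rate.)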

Out of curiosity, I took a look at the speed of doing reductions with atomic functions. While not as fast as the above reduction implementation, it is not as bad as you might think. I compared a very simple implementation that uses both shared and global atomic functions:

__global__ void shared_reduce(int size, float *in, float *out)
{
    __shared__ float sum;

    int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0)
        sum = 0.0f;
    __syncthreads();

    // Each thread accumulates a grid-strided slice of the input.
    float local_sum = 0.0f;
    for (int i = start_idx; i < size; i += stride) {
        local_sum += in[i];
    }

    // Combine per-thread sums with a shared-memory atomic, then issue
    // one global atomic per block.
    atomicAdd(&sum, local_sum);
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(out, sum);
}

to a very short version that only uses global atomic functions:

__global__ void global_reduce(int size, float *in, float *out)
{
    int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Each thread accumulates a grid-strided slice, then issues a single
    // global atomic with its partial sum.
    float local_sum = 0.0f;
    for (int i = start_idx; i < size; i += stride) {
        local_sum += in[i];
    }

    atomicAdd(out, local_sum);
}

Here’s the result. (Disclaimer: I was in a hurry, so I wrote my benchmark runner in PyCUDA, but duplicated the testing methodology used above):

GeForce GTX 580 @ 192.384 GB/s

Reduce function: shared

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     63.881     33.20    65.7      Pass
 2097152     103.666    53.89    80.9      Pass
 4194304     115.521    60.05    145.2     Pass
 8388608     122.676    63.77    273.5     Pass
 16777216    126.570    65.79    530.2     Pass
 33554432    128.281    66.68    1046.3    Pass
 67108864    128.619    66.86    2087.1    Pass
 134217728   128.487    66.79    4178.4    Pass

----------------------------------------

Reduce function: global

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     15.132     7.87     277.2     Pass
 2097152     27.228     14.15    308.1     Pass
 4194304     45.301     23.55    370.3     Pass
 8388608     67.612     35.14    496.3     Pass
 16777216    89.711     46.63    748.1     Pass
 33554432    106.965    55.60    1254.8    Pass
 67108864    118.224    61.45    2270.6    Pass
 134217728   124.512    64.72    4311.8    Pass

=====================================

GeForce GTX 680 @ 192.256 GB/s

Reduce function: shared

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     61.490     31.98    68.2      Pass
 2097152     98.796     51.39    84.9      Pass
 4194304     117.150    60.93    143.2     Pass
 8388608     128.847    67.02    260.4     Pass
 16777216    132.613    68.98    506.1     Pass
 33554432    132.515    68.93    1012.8    Pass
 67108864    135.473    70.46    1981.5    Pass
 134217728   132.728    69.04    4044.9    Pass

----------------------------------------

Reduce function: global

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     54.813     28.51    76.5      Pass
 2097152     80.746     42.00    103.9     Pass
 4194304     105.048    54.64    159.7     Pass
 8388608     123.176    64.07    272.4     Pass
 16777216    131.892    68.60    508.8     Pass
 33554432    135.714    70.59    989.0     Pass
 67108864    134.765    70.10    1991.9    Pass
 134217728   133.599    69.49    4018.5    Pass

This was after some not-entirely-rigorous testing of different block and grid sizes, and settling on 128 threads per block, 128 blocks as working pretty well for both cards.

The interesting thing to note is that with the GTX 680, shared memory atomics are less important for reductions on medium-sized arrays. Or to put it another way: if you are in a hurry, you can write a really simple-minded reduction and be in the “good enough” zone. :)

How does reduction with atomic functions compare to reduction with the thrust library?
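(For reference, the thrust version I have in mind is essentially a one-liner; a sketch, assuming d_in is a raw device pointer holding size floats:)

#include <thrust/reduce.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>

// Sketch: sum 'size' floats living at raw device pointer d_in.
float thrust_sum(const float *d_in, int size)
{
    thrust::device_ptr<const float> p(d_in);
    return thrust::reduce(p, p + size, 0.0f, thrust::plus<float>());
}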

And if you throw in the warp reduce function from Jimmy’s code:

__device__ float warp_reduce_registers(float myVal)
{
    int warpIndex = threadIdx.x % 32;

    // __shfl wraps srcLane modulo the warp width, so this rotate-and-add
    // pattern leaves every lane holding the full warp sum.
    myVal += __shfl(myVal, warpIndex + 16);
    myVal += __shfl(myVal, warpIndex + 8);
    myVal += __shfl(myVal, warpIndex + 4);
    myVal += __shfl(myVal, warpIndex + 2);
    myVal += __shfl(myVal, warpIndex + 1);

    return myVal;
}

then you have absolutely no reason to use shared memory at all to do a reduction with atomics on the GTX 680:

GeForce GTX 680 @ 192.256 GB/s

Reduce function: shared

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     63.503     33.03    66.0      Pass
 2097152     124.261    64.63    67.5      Pass
 4194304     131.950    68.63    127.1     Pass
 8388608     134.957    70.20    248.6     Pass
 16777216    134.649    70.04    498.4     Pass
 33554432    133.369    69.37    1006.4    Pass
 67108864    135.977    70.73    1974.1    Pass
 134217728   132.816    69.08    4042.2    Pass

----------------------------------------

Reduce function: global

 N           [GB/sec]   [perc]   [usec]    [test]
 1048576     68.324     35.54    61.4      Pass
 2097152     129.274    67.24    64.9      Pass
 4194304     136.257    70.87    123.1     Pass
 8388608     138.322    71.95    242.6     Pass
 16777216    138.197    71.88    485.6     Pass
 33554432    136.471    70.98    983.5     Pass
 67108864    134.998    70.22    1988.4    Pass
 134217728   133.727    69.56    4014.7    Pass
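For concreteness, this is roughly how the warp reduce slots into global_reduce (a sketch, not the exact code that produced these numbers). Since the shuffle pattern above wraps, every lane ends up with the warp total, and one lane per warp can issue the atomic:

__global__ void global_reduce_shfl(int size, float *in, float *out)
{
    int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float local_sum = 0.0f;
    for (int i = start_idx; i < size; i += stride)
        local_sum += in[i];

    // Every lane now holds the warp total (see warp_reduce_registers above).
    local_sum = warp_reduce_registers(local_sum);

    // One global atomic per warp instead of one per thread.
    if (threadIdx.x % 32 == 0)
        atomicAdd(out, local_sum);
}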

I'm not too surprised that the atomics perform so well given that most of the computation time is actually spent in the for loop (which is probably also true in seibert's tests). It does seem to confirm that Kepler atomics have a serious advantage over Fermi. It's nice to see that you can simplify the code a lot and still get good performance! :)

@wanderine
I'm curious why the big interest in using atomics in this case when it's not needed?

I tried the thrust library and it was much slower than my own simple code.

I want to calculate 30 different values at each position of a 3D grid (like 128 x 128 x 128) and then sum the 30 values separately (i.e. 30 sums). If I use atomics, the code would be a single kernel instead of the three kernels I use now (calculate values and sum over x, sum over y, sum over z).
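Something like this minimal sketch, where compute_value() is just a placeholder for the 30 quantities computed at each position:

// compute_value() is a hypothetical placeholder for the real per-voxel math.
__device__ float compute_value(const float *grid, int i, int v)
{
    return grid[i] * (v + 1);  // dummy computation for illustration
}

__global__ void grid_sums(int nvoxels, const float *grid, float *sums)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float local[30];
    for (int v = 0; v < 30; ++v)
        local[v] = 0.0f;

    // Each thread accumulates its grid-strided share of all 30 sums...
    for (int i = idx; i < nvoxels; i += stride)
        for (int v = 0; v < 30; ++v)
            local[v] += compute_value(grid, i, v);

    // ...then commits them with 30 global atomics (sums holds 30 elements,
    // zeroed before launch).
    for (int v = 0; v < 30; ++v)
        atomicAdd(&sums[v], local[v]);
}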

Sounds like a good place to try atomics, then. The amount of data being moved around is small enough that the overhead of launching new kernels probably swamps any benefit of doing a proper reduction without atomics.

Here is a result update for the GTX Titan. It runs at ~245 GB/s, which is roughly 85% bandwidth utilization!

Here is a snippet from my machine:

GeForce GTX TITAN @ 288.400 GB/s

 N           [GB/s]   [perc]   [usec]    test
 1048576     121.02   41.96    34.7      Pass
 2097152     156.65   54.32    53.6      Pass
 4194304     189.86   65.83    88.4      Pass
 8388608     213.77   74.12    157.0     Pass
 16777216    226.42   78.51    296.4     Pass
 33554432    237.23   82.26    565.8     Pass
 67108864    242.16   83.97    1108.5    Pass
 134217728   244.46   84.76    2196.2    Pass

 Non-base 2 tests!

 N           [GB/s]   [perc]   [usec]    test
 14680102    230.39   79.88    254.9     Pass
 14680119    233.64   81.01    251.3     Pass
 18875600    230.38   79.88    327.7     Pass
 7434886     168.89   58.56    176.1     Pass
 13324075    218.40   75.73    244.0     Pass
 15764213    224.44   77.82    280.9     Pass
 1850154     83.06    28.80    89.1      Pass
 4991241     154.65   53.62    129.1     Pass

Wow!

Yeah the GTX Titan is a real beast! :)

It seems to draw some benefit from the warp shuffle!

Jimmy,

Thanks for posting.

Rather than a sum reduction, how about a scan which returns the max/min of an array and the corresponding index of that max/min? Something like thrust::max_element(). How would one implement that type of scan using __shfl(), with or without atomics?

Also why is your device-to-device bandwidth so high? Is that also dependent on the PCI-e version?

No problem.

I implemented such a min/max scan a long time ago for Fermi and earlier architectures (hence no shuffle) in the post at [1]. It would need to be optimized and verified again for the Kepler architecture. And I would do it without using atomics (although that is an option).
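For Kepler, the warp-level building block could use shuffles instead of shared memory. A rough, untested sketch of a warp max that carries the index along (__shfl_down requires sm_30):

__device__ void warp_max_with_index(float &val, int &idx)
{
    // Standard shuffle-down tree; out-of-range lanes read back their own
    // value, so after the loop lane 0 holds the warp's maximum value and
    // the index it came from.
    for (int offset = 16; offset > 0; offset /= 2) {
        float otherVal = __shfl_down(val, offset);
        int   otherIdx = __shfl_down(idx, offset);
        if (otherVal > val) {
            val = otherVal;
            idx = otherIdx;
        }
    }
}

Per-warp results would then still need to be combined across the block, e.g. via shared memory or atomics.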

Also why is your device-to-device bandwidth so high? Is that also dependent on the PCI-e version?

It is independent of the PCI-e version as I'm only measuring D2D transfers, i.e. device RAM to device RAM. The GTX Titan can theoretically do 288 GB/s. The code is optimized for high bandwidth utilization.

[1] Finding Minimum in array - CUDA Programming and Performance - NVIDIA Developer Forums

Oh ok. I have a 680 2GB and a K20c, so an optimized version of the scan would be cool. Will play with that this weekend.

I had already written my own versions, one which used atomics and one using shared memory. But I think your version is better so thanks for the link.

When I do a bandwidth test I get a device-to-device speed of 161737.7 MB/s, so I guess that is ok.

Do you have a link to some of your related projects on Github or Google code? I would be very interested in seeing more of your work.

Here is my GitHub page, but some of that code is old and needs to be updated. I have only been working with GPUs for 5 months:

https://github.com/OlegKonings

In particular I would like to further optimize the CUDA Floyd Warshall algorithm. Any ideas there?

Many Thanks,

Oleg

Since I work for a GPU computing consultancy in Sweden I actually quite rarely get time to collaborate on different code projects.

Here are some things I’ve worked on before that I can share:

  • Benchmarking and checking feasibility for a very large suite of radar signal processing algorithms: http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf

  • A fast batched LU decomposition solver

I expect to be able to release the fast sorter later this year.