More efficient would likely be this: stuff the value into the high bits of an integer, the lane index into the lowest five bits, then use the regular shuffle-based maximum search. After the maximum has been computed, extract maximum value and index. In case of multiple lanes holding the maximum value, this would return the index of the highest lane holding that value. If you want the lowest lane index instead, store 32-lane_idx into the low-order five bits.
This assumes you can somehow easily compress your ‘float’ data into 27 bits, which may or may not be possible. If the range of the data is limited, the easiest way would probably be to remove the sign and some exponent bits to achieve the compression.
This will return the highest lane index in case there are multiple matches.
You might want to convince yourself that your max.f32 (or max.ftz.f32) and setp.eq.f32 (or setp.eq.ftz.f32) are going to work with your input data. @njuffa can probably describe a situation where the reduction wouldn’t match a lane value?
I was thinking about NaN’s and/or accidentally mixing in .ftz’s. I think there is nothing to worry about if you don’t do anything dumb in PTX. :)
I found some macro that I can use in CUDA code to access the bfind PTX instruction
// __bind(unsigned int i): Find the most significant bit in a 32/64 number (PTX).
__device__ __forceinline__ int __bfind(unsigned int i) const { int b; asm volatile("bfind.u32 %0, %1;" : "=r"(b) : "r"(i)); return b; }
I’ve already got some an implementation of a butterfly reduction.
I don’t know how ballot() performance compares to a predicated smem store + broadcast load.
I’ve used both approaches in my code though!
Trivia: pre-Maxwell the highest lane would “win” the smem store. That appears to no longer be the case.
One other point for @cbuchner1 – I implied you needed PTX but, as I’m sure you’ve determined, you can implement your reduction entirely with intrinsics.
I believe the vote instruction is the fastest returning instruction on the device. On Maxwell it only requires 2 clocks to execute.
I recently had to write some similar code. I had each thread in the warp calculating a lookup table entry in shared memory, but some of the threads could generate empty entries and I wanted the lookup table holes filled in by shifting things left as needed. So I used a vote/popc combination with some bit mask logic.
I almost went with a complicated warp shuffle approach but then realized vote was all I needed. Handy little instruction.