According to the pseudocode of this double-buffered scan:
1: for d = 1 to log2 n do
2:   for all k in parallel do
3:     if k >= 2^d then
4:       x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
5:     else
6:       x[out][k] = x[in][k]
Based on line 4, the bug does indeed exist, as you described above.
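For anyone who wants to check the fix, here is a minimal serial Python sketch of the corrected double-buffered (Hillis-Steele) scan, with the guard changed to k >= 2^(d-1) so it matches the offset used in the addition. This is a simulation of the parallel loop, not CUDA code, and it assumes a power-of-two input length:

```python
from math import log2

def naive_scan(x):
    """Inclusive scan, double-buffered, serial simulation.

    Corrected guard: k >= 2^(d-1), matching the offset 2^(d-1)
    used in the addition (the published pseudocode tests 2^d).
    Assumes len(x) is a power of two.
    """
    n = len(x)
    a, b = list(x), [0] * n            # ping-pong (in/out) buffers
    for d in range(1, int(log2(n)) + 1):
        offset = 2 ** (d - 1)
        for k in range(n):             # "for all k in parallel"
            if k >= offset:
                b[k] = a[k - offset] + a[k]
            else:
                b[k] = a[k]
        a, b = b, a                    # swap buffers for next pass
    return a

print(naive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```

With the published guard (k >= 2^d) the first pass would leave element 1 unsummed, which is exactly the off-by-one being discussed.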
That document (a published chapter as part of a published book) is unlikely to be changed.
Anyone who wants to implement a prefix sum can do so based on the material in that chapter, but should not do so blindly without testing it (which is true of any other provided code as well).
If you’re interested in a fast, high-performance prefix sum, you’re advised to use Thrust or CUB anyway, not the material there, which is incomplete and provided for learning purposes. (For example, as written it only works within a threadblock, not device-wide.)
If you want to file a bug with NVIDIA, register as a developer at developer.nvidia.com and file the bug through the portal there.
FYI, Harris, the author of that chapter, mentioned in a post:
cuda - CONFLICT_FREE_OFFSET macro used in the parallel prefix algorithm from GPU Gems 3 - Stack Overflow
“I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA.”
You can find it in the post above.
Thank you, I’ve found an interesting link on Stack Overflow…
My interest in prefix sums comes from the need to implement a recurrence equation y[n] = a_0 x[n] + a_1 x[n-1] + … + b_1 y[n-1] + b_2 y[n-2] + … Now I’ve found a more efficient approach than a prefix sum over matrices: partially unroll the equation (separating the ranges of x and y) and perform two FFTs.
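For reference, the “prefix sum over matrices” formulation mentioned above can be sketched in plain Python. This is a serial illustration with hypothetical coefficients (a second-order recurrence y[n] = a0·x[n] + b1·y[n-1] + b2·y[n-2]); the point is that each step is a 3x3 companion-matrix multiply, and since matrix multiplication is associative, a real GPU version could compute the prefix products with a parallel scan:

```python
def direct(a0, b1, b2, x):
    # Reference: evaluate y[n] = a0*x[n] + b1*y[n-1] + b2*y[n-2] directly
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(a0 * xn + b1 * y1 + b2 * y2)
    return y

def matmul(A, B):
    # 3x3 matrix product (the associative "sum" operator of the scan)
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def scan_form(a0, b1, b2, x):
    # State s[n] = (y[n], y[n-1], 1); each input sample contributes one
    # companion matrix M_n with s[n] = M_n @ s[n-1].  The running prefix
    # product P_n = M_n @ ... @ M_0 applied to s[-1] yields y[n].
    P = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity
    s0 = (0.0, 0.0, 1.0)                                      # zero initial state
    y = []
    for xn in x:
        M = [[b1, b2, a0 * xn],
             [1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0]]
        P = matmul(M, P)   # serially here; a parallel scan in practice
        y.append(P[0][0] * s0[0] + P[0][1] * s0[1] + P[0][2] * s0[2])
    return y
```

Both functions produce the same sequence; the matrix form is what a scan over matrices parallelizes, and it is the baseline the FFT approach above is being compared against.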