Bugged code in website

Hi all,
I’ve found a bug in the code published on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html.

The example Example 39-1(Naive Scan Algorithm), at row 16 should be

temp[pout*n+thid] = temp[pin*n+thid]+temp[pin*n+thid - offset];

, because otherwise it would get the partial sum of two iterations before…same on the relative pdf (Parallel Prefix Sum
(Scan) with CUDA).

In addition,I think that at row 14 it would be better to write

pin = 1 - pin;

It’s more clear,in addition is avoided an execution dependency with previous line.

According to the pseudo code of this double buffered scan:
1: for d = 1 to log2 n do
2: for all k in parallel do
3: if k >= 2^d then
4: x[out][k] = x[in][k – 2 d-1] + x[in][k]
5: else
6: x[out][k] = x[in][k]

Based on line 4, the bug indeed existed as your described above.

Thanks for your positive reply…Can someone tell me how to report this bug?in order to avoid headaches to guys who want to implement prefix sum

That document (a published chapter as part of a published book) is unlikely to be changed.

Anyone who wants to implement a prefix sum can do so based on the material in that chapter but should not do so blindly without testing it (which is true for any other provided code, as well.)

If you’re interested in a fast high performance prefix sum, you’re advised to use thrust or cub anyway, not the material there, which is incomplete and provided for learning purposes. (For example, it only works within a threadblock, as written, not device-wide).

If you want to file a bug with nvidia, register as a developer at developer.nvidia.com and file the bug using the portal there.

Thanks for your reply, I’ve sent the bug.

I’ve used that code as a base in order to execute a recurrence equation

FYI, the author Harris who wrote this article mentioned in a post:
[url]cuda - CONFLICT_FREE_OFFSET macro used in the parallel prefix algorithm from GPU Gems 3 - Stack Overflow
quote “I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA.”
You can find it in the post above.

Thank you, I’ve found an interesting link inside stackoverflow…

My interest in prefix sum come from the need to implement a recurrence equation y[n]=a_0x[n]+a_1x[n-1]…+b_1y[n-1]+b_2y[n-2]…now I’ve found a more efficient approach than prefix sum over matricies: partially unroll the equation(separating range of X and Y) and perform two FFT