The document says that when threads are launched, one thread starts on each clock cycle.
So if I create 512 threads, the last thread starts 512 clocks after the first.
It also says that __syncthreads() was introduced to deal with this delay.
Is this true?
I had believed that, up to the limit of 32 warps, the 32 threads of a warp are executed in the same clock phase.
Now I understand the mystery.
For a beginner like me, it is very difficult to judge whether a document is outdated or obsolete.
Some web sites state that the maximum number of threads in a block is 512.
I hope everyone will specify their hardware, driver, SDK, and OS.
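
Since limits such as the maximum number of threads per block seem to differ between documents, it may be safer to ask the device itself instead of trusting a web page. Below is a minimal sketch of how I think this can be done with the CUDA runtime API. It assumes the CUDA toolkit is installed and the file is compiled with nvcc; the values printed depend entirely on the GPU and driver in the machine, so I am not listing any expected output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime how many CUDA-capable devices are visible.
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Driver and runtime versions are useful to quote when posting a question.
    int driverVersion = 0;
    int runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    std::printf("Driver version: %d, runtime version: %d\n",
                driverVersion, runtimeVersion);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }

        // These limits come from the hardware, not from any document.
        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    dev, prop.name, prop.major, prop.minor);
        std::printf("  warpSize                    = %d\n", prop.warpSize);
        std::printf("  maxThreadsPerBlock          = %d\n", prop.maxThreadsPerBlock);
        std::printf("  maxThreadsPerMultiProcessor = %d\n",
                    prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

Posting this output together with the OS name should cover most of the details I asked for above.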