sleep
Is there a command that would allow a CUDA thread inside a block to sleep for a given time? Can't seem to find such a command, but I thought that I should ask here before implementing it on my own.

#1
Posted 04/15/2012 02:12 PM   
Keep in mind that threads do not execute independently (but rather in groups of 32), so it does not entirely make sense to put a single thread to sleep. Do you want to make a warp sleep for some amount of time?

#2
Posted 04/15/2012 03:06 PM   
No, I only needed one thread to sleep, so it can change a flag once in a set period of time. So if I make a thread 'sleep' by using a while loop it'll block the entire warp?

#3
Posted 04/15/2012 03:20 PM   
I'm not sure whether the compiler will force the other threads to stall after the loop until the first thread finishes and catches up. That level of detail is not documented, but the underlying hardware can't run threads completely independently. (It can fake independence well enough that divergence is possible, but it is not like a multicore CPU.)

#4
Posted 04/16/2012 01:11 PM   
If you have a thread in a warp or half-warp that is looping until some condition, all the other 31 or 15 threads of the same warp or half-warp will follow the same execution path, and thus none of them will proceed until your first one has exited its loop.
Depending on the NVIDIA GPU you use, it might be 32 or 16 threads (warp or half-warp) that follow this pattern, so it's a great loss of execution time, especially on a low-end GPU.

You might consider dedicating just one thread in the block to handle this condition, provided you ensure you already have enough threads to fully exploit the SM on which it runs.
To keep this thread from consuming as much CUDA-core time as it otherwise would, depending on your kernel, you can play with register-dependency latency or memory-read latency (the latter with some impact on memory bandwidth, which is limited): in the first case, modify a register that you then use as a source operand in the next instruction, and repeat; in the second case, read the allocated global memory at random locations, and the latency will literally bring this thread to a halt while the other threads of the same block run at full speed :)
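A minimal sketch of the busy-wait idea discussed above, assuming compute capability 2.0+ for `clock64()`; the kernel name, cycle count, and flag protocol are illustrative, not from the thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: thread 0 of the block spins for roughly `cycles` SM clock
// cycles, then raises a flag. clock64() reads the per-SM cycle
// counter, so this is a busy-wait, not a true sleep, and (as the
// replies note) the rest of thread 0's warp stalls along with it.
__global__ void timedFlag(volatile int *flag, long long cycles)
{
    if (threadIdx.x == 0) {
        long long start = clock64();
        while (clock64() - start < cycles)
            ;                       // spin until the interval elapses
        *flag = 1;                  // signal completion
        __threadfence();            // make the write visible device-wide
    }
    // The other 31 threads of this warp fall through here, but the
    // warp cannot retire until thread 0 finishes its loop.
}

int main()
{
    int *d_flag, h_flag = 0;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice);
    timedFlag<<<1, 32>>>(d_flag, 1000000LL);  // ~1e6 cycles; wall time is GPU-dependent
    cudaDeviceSynchronize();
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("flag = %d\n", h_flag);
    cudaFree(d_flag);
    return 0;
}
```

To keep the spinning thread from stalling useful work, give it a warp of its own (e.g. make the block size a multiple of 32 plus one extra warp), as suggested above. On much later GPUs (compute capability 7.0+), `__nanosleep()` provides a genuine per-thread sleep.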

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#5
Posted 04/19/2012 05:22 PM   