sleep
Is there a command that would allow a CUDA thread inside a block to sleep for a given time? Can't seem to find such a command, but I thought that I should ask here before implementing it on my own.

#1
Posted 04/15/2012 02:12 PM   
Keep in mind that threads do not execute independently (but rather in groups of 32), so it does not entirely make sense to put a single thread to sleep. Do you want to make a warp sleep for some amount of time?

#2
Posted 04/15/2012 03:06 PM   
No, I only needed one thread to sleep, so it can change a flag once in a set period of time. So if I make a thread 'sleep' by using a while loop it'll block the entire warp?

#3
Posted 04/15/2012 03:20 PM   
I'm not sure whether the compiler will force the other threads to stall after the loop until the first thread finishes and catches up. That level of detail is not documented, but the underlying hardware can't run threads completely independently. (It can fake independence well enough that divergence is possible, but it is not like a multicore CPU.)

#4
Posted 04/16/2012 01:11 PM   
If you have a thread in a warp or half-warp that is looping until some condition, all the other 31 or 15 threads of the same warp or half-warp will follow the same execution path, and thus none of them will proceed until your first one has exited its loop.
Depending on the NVIDIA GPU you use, it might be 32 or 16 threads (warp or half-warp) that follow this pattern, so it's a great loss of execution time, especially on a low-end GPU.

You might consider dedicating just one thread in the block to handle this condition, provided you ensure you already have enough threads to fully exploit the SM on which it runs.
To keep this thread from consuming as much CUDA-core time as it otherwise would, depending on your kernel, you can play with register-dependency latency or memory-read latency (the latter with some impact on memory bandwidth, which is limited): in the first case, modify a register that you then use as a source operand in the next instruction, and repeat; in the second case, read the allocated global memory at random locations, and the latency will literally bring this thread to a halt while the other threads of the same block run at full speed :)
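A minimal sketch of the busy-wait idea discussed above, assuming compute capability 2.0+ for `clock64()`; the kernel name, cycle count, and flag protocol are illustrative, not from the thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: thread 0 of the block spins for roughly `cycles` SM clock
// cycles, then raises a flag. clock64() reads the per-SM cycle
// counter, so this is a busy-wait, not a true sleep, and (as the
// replies note) the rest of thread 0's warp stalls along with it.
__global__ void timedFlag(volatile int *flag, long long cycles)
{
    if (threadIdx.x == 0) {
        long long start = clock64();
        while (clock64() - start < cycles)
            ;                       // spin until the interval elapses
        *flag = 1;                  // signal completion
        __threadfence();            // make the write visible device-wide
    }
    // The other 31 threads of this warp fall through here, but the
    // warp cannot retire until thread 0 finishes its loop.
}

int main()
{
    int *d_flag, h_flag = 0;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice);
    timedFlag<<<1, 32>>>(d_flag, 1000000LL);  // ~1e6 cycles; wall time is GPU-dependent
    cudaDeviceSynchronize();
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("flag = %d\n", h_flag);
    cudaFree(d_flag);
    return 0;
}
```

To keep the spinning thread from stalling useful work, give it a warp of its own (e.g. make the block size a multiple of 32 plus one extra warp), as suggested above. On much later GPUs (compute capability 7.0+), `__nanosleep()` provides a genuine per-thread sleep.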

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#5
Posted 04/19/2012 05:22 PM   