GPU Blocks and Threads tuning on Jetson TK1

KapilMehta · January 25, 2016, 10:28am

On the Jetson TK1, I have a basic question: how many blocks and threads can I launch with CUDA 6.5?

I am working with an image of size 1280*1024. I set threadsperblock to (32,32) and blocks to (40,32) (calculated as image_width/32 = 1280/32 = 40 and image_height/32 = 1024/32 = 32). Now if I launch the kernel as

Kernel<<<blocks, threadsperblock>>>(arguments to be passed to the kernel)

My program hangs and doesn’t give any output…

Now if I change blocks to (5,5), it runs successfully but processes only part of the image. So with blocks set to (40,32), am I exceeding the maximum number of blocks?

Or what is the proper way to tune blocks and threads?

Any input is highly appreciated…

BulatZiganshin · January 25, 2016, 11:42am

Try it with a trivial kernel that does nothing or fills just one array element.
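A minimal sketch of such a test, assuming a single-channel 8-bit 1280*1024 image. The names fill_one, CHECK and d_img are made up for illustration. It also prints the device's launch limits and checks the launch status, because a launch that requests more resources than the device allows fails silently unless you ask for the error:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Trivial kernel: exactly one thread writes exactly one element.
__global__ void fill_one(unsigned char *img)
{
    if (blockIdx.x == 0 && blockIdx.y == 0 &&
        threadIdx.x == 0 && threadIdx.y == 0)
        img[0] = 255;
}

int main()
{
    const int width = 1280, height = 1024;   // image size from the question

    // Ask the device for its limits instead of guessing them.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dim: %d x %d x %d\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dim:  %d x %d x %d\n", prop.maxGridSize[0],
           prop.maxGridSize[1], prop.maxGridSize[2]);

    unsigned char *d_img = NULL;
    CHECK(cudaMalloc((void **)&d_img, width * height));
    CHECK(cudaMemset(d_img, 0, width * height));

    dim3 threadsPerBlock(32, 32);            // 32 * 32 = 1024 threads per block
    dim3 blocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,   // 40
                (height + threadsPerBlock.y - 1) / threadsPerBlock.y);  // 32

    fill_one<<<blocks, threadsPerBlock>>>(d_img);
    CHECK(cudaGetLastError());        // reports an invalid launch configuration
    CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel runs

    unsigned char first = 0;
    CHECK(cudaMemcpy(&first, d_img, 1, cudaMemcpyDeviceToHost));
    printf("img[0] = %d\n", first);

    CHECK(cudaFree(d_img));
    return 0;
}
```

If this runs with (40,32) blocks of (32,32) threads and the real kernel does not, the number of blocks is probably not the problem. More likely candidates are the kernel itself (for example, it may need too many registers per thread to fit 1024 threads in one block) or indexing outside the image.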

KapilMehta · February 4, 2016, 8:04am

Correct…

But what is the optimal way of launching threads?

Is it better to create as many threads as there are pixels (1280 * 1024)? Or should we write the kernel so that each thread handles more than one pixel, which reduces the total number of threads launched?

BulatZiganshin · February 4, 2016, 10:22am

AFAIK, thread launch is essentially free, but every thread has to set up some variables. For example, each thread may need to compute

// one thread per pixel: every thread computes its own address
ptr = array + x*1024 + y;
*ptr = 0;

whereas when you process multiple pixels in a single thread, the code will be

// several pixels per thread: the base address is computed once, then incremented
ptr = array + x*1024;
for (i = 0; i < max_y; i++)
  *ptr++ = 0;

So the usual advice is to start with 1 thread = 1 pixel to keep the code simple, and switch to loops if you need a bit more speed.
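To make the two options concrete, here is a rough sketch of both, assuming a single-channel 8-bit image stored row by row. The names clear_per_pixel, clear_strided and count_nonzero are illustrative, and the block and grid sizes are arbitrary choices, not tuned values for the TK1. Note that the second kernel uses a grid-stride loop rather than the one-pointer-per-thread loop from the pseudocode, so that neighbouring threads touch neighbouring addresses:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Option 1: one thread per pixel. Threads that fall past the image edge do nothing.
__global__ void clear_per_pixel(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 0;
}

// Option 2: fewer threads, each handling several pixels (grid-stride loop).
__global__ void clear_strided(unsigned char *img, int n)
{
    int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        img[i] = 0;
}

// Host-side check: copy the image back and count pixels that were not cleared.
static int count_nonzero(const unsigned char *d_img, int n)
{
    std::vector<unsigned char> h(n);
    cudaMemcpy(h.data(), d_img, n, cudaMemcpyDeviceToHost);
    int count = 0;
    for (int i = 0; i < n; i++)
        if (h[i] != 0) count++;
    return count;
}

int main()
{
    const int width = 1280, height = 1024, n = width * height;
    unsigned char *d_img = NULL;
    if (cudaMalloc((void **)&d_img, n) != cudaSuccess) return 1;

    // Option 1: the grid is rounded up so it covers the whole image.
    dim3 threads(16, 16);
    dim3 blocks((width  + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);
    cudaMemset(d_img, 1, n);
    clear_per_pixel<<<blocks, threads>>>(d_img, width, height);
    cudaDeviceSynchronize();
    printf("per-pixel: %s, pixels left: %d\n",
           cudaGetErrorString(cudaGetLastError()), count_nonzero(d_img, n));

    // Option 2: a fixed, much smaller grid; every thread loops over many pixels.
    cudaMemset(d_img, 1, n);
    clear_strided<<<64, 256>>>(d_img, n);
    cudaDeviceSynchronize();
    printf("strided:   %s, pixels left: %d\n",
           cudaGetErrorString(cudaGetLastError()), count_nonzero(d_img, n));

    cudaFree(d_img);
    return 0;
}
```

Both kernels clear the whole image regardless of how many blocks are launched for the second one, so the grid size of the strided version becomes a tuning knob rather than something that must match the image dimensions. Which one is faster on a given device is something to measure, not assume.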