GPU Blocks and Threads tuning on Jetson TK1

For the Jetson TK1, I have a basic question: how many blocks and threads can I launch with CUDA 6.5?

I am working on a 1280*1024 image. I set threadsperblock to (32,32) and blocks to (40,32), calculated as image_width/32 = 1280/32 = 40 and image_height/32 = 1024/32 = 32. Now if I launch the kernel as

Kernel<<<blocks, threadsperblock>>>(arguments to be passed to the kernel)

My program hangs and doesn’t give any output…

Now if I change blocks to (5,5), it runs successfully but processes only part of the image. So with blocks set to (40,32), am I exceeding the limit on the number of blocks?

Or what would be the proper tuning of blocks and threads?

Any input is highly appreciated…
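One way to check the limits directly is to ask the runtime. The sketch below queries the device with cudaGetDeviceProperties and tests whether a launch shape fits. launchShapeFits and reportLaunchLimits are hypothetical helper names, not part of CUDA. As far as I know, a (32,32) block is 1024 threads, which is exactly the per-block maximum on compute capability 3.x devices such as the TK1, and a (40,32) grid is far below the grid limits, so the shape alone should be legal. A kernel that uses many registers per thread can still fail to launch at that block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the launch shape fits the limits in `prop`.
// It checks only the shape, not per-kernel register or shared memory use.
bool launchShapeFits(const cudaDeviceProp& prop, dim3 grid, dim3 block)
{
    unsigned long long threadsPerBlock =
        (unsigned long long)block.x * block.y * block.z;
    return threadsPerBlock <= (unsigned long long)prop.maxThreadsPerBlock
        && block.x <= (unsigned)prop.maxThreadsDim[0]
        && block.y <= (unsigned)prop.maxThreadsDim[1]
        && block.z <= (unsigned)prop.maxThreadsDim[2]
        && grid.x <= (unsigned)prop.maxGridSize[0]
        && grid.y <= (unsigned)prop.maxGridSize[1]
        && grid.z <= (unsigned)prop.maxGridSize[2];
}

// Hypothetical helper: prints the limits of `device` and whether the
// (40,32) grid of (32,32) blocks from the question fits them.
cudaError_t reportLaunchLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return err;
    }
    printf("compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("(40,32) blocks of (32,32) threads fit: %s\n",
           launchShapeFits(prop, dim3(40, 32), dim3(32, 32)) ? "yes" : "no");
    return cudaSuccess;
}
```

Calling reportLaunchLimits(0) from main prints the limits for device 0.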

Try it with a trivial kernel that does nothing or fills just one array element.
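A minimal sketch of such a test, assuming an 8-bit single-channel image stored row by row. fillKernel, ceilDiv and runTrivialKernel are made-up names. The bounds check keeps threads outside the image from writing, and the two error checks after the launch show whether the launch itself was rejected or the kernel failed while running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes one pixel, guarded by a bounds check.
__global__ void fillKernel(unsigned char* img, int width, int height,
                           unsigned char value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = value;
}

// Ceiling division, so a partial tile at the image edge still gets a block.
inline int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Launches fillKernel over a width x height image and reports any error.
cudaError_t runTrivialKernel(int width, int height)
{
    unsigned char* d_img = NULL;
    cudaError_t err = cudaMalloc((void**)&d_img, (size_t)width * height);
    if (err != cudaSuccess) return err;

    dim3 threadsPerBlock(32, 32);
    dim3 blocks(ceilDiv(width, 32), ceilDiv(height, 32));
    fillKernel<<<blocks, threadsPerBlock>>>(d_img, width, height, 0);

    err = cudaGetLastError();            // errors in the launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_img);
    return err;
}
```

If runTrivialKernel(1280, 1024) returns cudaSuccess, the (40,32) by (32,32) launch shape is probably not the cause of the hang, and the real kernel body is the next thing to look at.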

Correct…

But what is the optimal way of launching threads?

Is it better to create as many threads as there are pixels (1280 * 1024), or should we write the kernel so that each thread handles more than one pixel, which reduces the total number of threads launched?

AFAIK, launching a thread is essentially free, but every thread has to set up some variables. For example, each thread may need to compute

ptr = array + x*1024 + y;
*ptr = 0;

whereas when you process multiple pixels in a single thread, the code will be

ptr = array + x*1024;
for (i = 0; i < max_y; i++)
  *ptr++ = 0;

So the usual advice is to start with 1 thread = 1 pixel to keep the code simple, and switch to loops if you need a bit more speed.
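The two approaches could look roughly like this in CUDA. This is only a sketch under assumptions: an 8-bit image stored row by row, and a grid-stride loop for the multi-pixel variant instead of the per-column loop in the pseudocode, so that neighbouring threads touch neighbouring addresses on each iteration. clearPerPixel, clearStrided and clearGridStride are made-up names.

```cuda
#include <cuda_runtime.h>

// Variant 1: one thread per pixel, launched on a 2D grid.
__global__ void clearPerPixel(unsigned char* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 0;
}

// Shared loop body: clears elements first, first + stride, first + 2*stride,
// and so on, stopping before index n. Usable from host and device code.
__host__ __device__ inline void clearStrided(unsigned char* img, int first,
                                             int stride, int n)
{
    for (int i = first; i < n; i += stride)
        img[i] = 0;
}

// Variant 2: grid-stride loop on a 1D grid. Each thread starts at its global
// index and steps by the total number of threads, so a grid with fewer
// threads than pixels still covers the whole image.
__global__ void clearGridStride(unsigned char* img, int numPixels)
{
    int first  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    clearStrided(img, first, stride, numPixels);
}
```

For the 1280*1024 image, variant 1 would be launched as clearPerPixel<<<dim3(40, 32), dim3(32, 32)>>>(d_img, 1280, 1024), and variant 2 as, for example, clearGridStride<<<64, 256>>>(d_img, 1280 * 1024), where d_img is a device pointer and 64 and 256 are arbitrary values to tune by measurement.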