CUDA 4.1 broke my kernel (won't be executed)

Hi,

My kernel stopped working after I updated CUDA from 4.0 to 4.1.

The original block size is 1024 threads. It works correctly under 4.0 but fails under 4.1 (the kernel is never executed by the GPU).
I checked the register usage, and each thread needs 46 registers.
I’m using a GTX 580, which has 32768 registers per multiprocessor, and a single block cannot use more than that.

When I use 512 threads per block instead, the kernel works correctly under 4.1.
So the failure is evidently caused by a register shortage.
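
The arithmetic behind this can be sketched in a few lines of host code. The helper names are made up for illustration. The sketch assumes the simple model in which a block needs registers-per-thread times threads-per-block registers and must fit within the 32768 registers of one GTX 580 multiprocessor. It ignores any allocation granularity the hardware may apply.

```cuda
// Host-only sketch (no kernel involved): does a block's register demand fit?
// Assumption: 32768 32-bit registers per multiprocessor (compute capability 2.0).
const int kRegsPerSM = 32768;

// Registers needed by one block under the simple model regs * threads.
int registers_per_block(int regs_per_thread, int threads_per_block)
{
    return regs_per_thread * threads_per_block;
}

// True if one block fits in the register file of a single multiprocessor.
bool block_fits(int regs_per_thread, int threads_per_block)
{
    return registers_per_block(regs_per_thread, threads_per_block) <= kRegsPerSM;
}

// Largest per-thread register count that still lets a block of this size launch.
int max_regs_per_thread(int threads_per_block)
{
    return kRegsPerSM / threads_per_block;
}
```

With 46 registers per thread, 1024 threads need 47104 registers, which exceeds 32768, while 512 threads need 23552 and fit. For a 1024-thread block to launch, the kernel has to stay at or below 32 registers per thread.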

My questions are:
1- Why doesn’t the register shortage show up under 4.0?
2- As far as I know, the GPU will use global memory instead of registers if the register usage is too large. Am I wrong?

Thanks in advance~


I’m using:
GTX 580
driver: ver. 285.05.33 (64bit)
Toolkit: 4.1
OS: linux ubuntu 10.10 (Kernel: 2.6.35-30-generic)

<>

Use a __launch_bounds__ directive to tell the compiler about the intended execution configuration (so it can reduce register usage as appropriate). Check appendix B.18 of the Programming Guide.
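
A minimal sketch of what that could look like, assuming a placeholder kernel named scale_kernel and a host helper named launch_scale (both hypothetical, not the original poster's code). With the bound in place, the compiler is expected to keep register use low enough for a 1024-thread block to fit, spilling to local memory where it has to. That spilling is decided at compile time, so a kernel already compiled to 46 registers per thread is not adjusted at launch. The launch simply fails, and checking cudaGetLastError() right after it makes the failure visible (typically as "too many resources requested for launch").

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Intended block size (assumption: the same value is used for the launch below).
#define MAX_THREADS_PER_BLOCK 1024

// Placeholder kernel (hypothetical). The bound tells the compiler that this
// kernel may be launched with up to 1024 threads per block.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK)
scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Launches the kernel with 1024-thread blocks and returns the first error seen.
cudaError_t launch_scale(float *d_data, float factor, int n)
{
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    scale_kernel<<<blocks, MAX_THREADS_PER_BLOCK>>>(d_data, factor, n);

    // Errors in the launch configuration (e.g. too many registers) show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs show up after synchronizing.
    return cudaDeviceSynchronize();
}
```

Calling launch_scale(d_data, 2.0f, n) on a device buffer of n floats doubles each element and returns cudaSuccess if the launch went through.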

As tera points out, programmers can control register usage via the __launch_bounds__ attribute, so give that a try. In any event, running a single thread block of 1024 threads per SM may not give the best performance, in particular if the code uses __syncthreads().

How significant is the increase in register use between CUDA 4.0 and CUDA 4.1? If it is a fairly large factor, it would probably make sense to file a bug so the compiler team can investigate. I have not personally encountered any issues with a significant increase in register pressure between CUDA 4.0 and CUDA 4.1 (in fact, I have seen a reduction in many cases), so I don’t have any insights as to what the trigger for this may be.
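
To put a number on the change, one option is to build the same source with both toolkits and compare the reported resource usage. Compiling with nvcc -Xptxas -v prints it at build time. The sketch below reads similar figures at run time through cudaFuncGetAttributes. The names probe_kernel and report_kernel_resources are placeholders, so substitute the kernel under investigation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); stands in for the kernel being examined.
__global__ void probe_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Fills *attr with the compiled kernel's resource usage and prints the
// fields that matter for this problem.
cudaError_t report_kernel_resources(cudaFuncAttributes *attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, probe_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return err;
    }

    // Registers per thread chosen by the compiler.
    printf("registers per thread : %d\n", attr->numRegs);
    // Per-thread local memory; nonzero values usually indicate spilling.
    printf("local memory (bytes) : %lu\n", (unsigned long)attr->localSizeBytes);
    // Largest block size with which this kernel can be launched on this device.
    printf("max threads per block: %d\n", attr->maxThreadsPerBlock);
    return cudaSuccess;
}
```

If maxThreadsPerBlock comes back below 1024 under 4.1 but not under 4.0, that confirms the higher register count is what prevents the 1024-thread launch, and the two numRegs values are the figures worth including in a bug report.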