Kernel 2x slower when compiling with sm_12 and above

I recently ported my kernels to OpenCL and, while comparing performance, I discovered that my code was noticeably slower (OK, not quite 2 times) when compiled with
compute_12,sm_12 or compute_13,sm_13 (~85 ms)
than with
compute_10,sm_10 or compute_11,sm_11 (~55 ms)

I’m running a GTX 285 (compute capability 1.3) with driver v296.17
and compiling with CUDA toolkit 4.1 (4.1.28.0) and VC2010.
I also tried toolkit 4.2.

I compared the 1.1 and 1.2 PTX, and they are identical.
GPU load is ~95% in both cases.
I don’t use double precision, atomics, or anything exotic.
It is just a simple kernel dealing with a 2D texture.

Any idea where this could come from? Driver version? Toolkit?
Thanks a lot

You might want to play with the --maxrregcount option in the sm_12 case…

But hmm, when you say the PTX is identical, the register count should (in principle) be identical too.

You are right. Same number of registers:

1>ptxas info    : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_12'

1>ptxas info    : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1]

1>ptxas info    : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_10'

1>ptxas info    : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1]

Here’s a bug I filed. It might be relevant to your problem.

The bug is that when using __launch_bounds__, sm_12 is assigned registers as if it’s an sm_11 target.

ptxas -arch sm_13 works as expected.

A possible workaround might be to use -maxrregcount=XX for sm_12 targets?

Perhaps your problem might be solved by just targeting sm_13 if you’re actually running on a CC 1.3 device.

===

Subject: ptxas does not handle __launch_bounds__ properly with sm_12 targets

Description:

If you provide __launch_bounds__ in a kernel and build for the sm_12 architecture with the OpenCC compiler, then ptxas will only assign as many registers as there are in sm_11 (8192) instead of sm_12/sm_13 (16384).

This occurs with both ptxas 4.1 and 5.0.

Example:

A PTX file with:

.maxntid 256,1,1

This should enable as many as 64 registers to be allocated per thread for sm_12 and sm_13 architectures since there are 16384 registers per SM.

Instead this is what happens for sm_12 (incorrect):

>ptxas -arch sm_12 -m 32 -v foo.ptx

ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_12'

ptxas : info : Used 32 registers, 940+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1]

But for sm_13 we see the correct allocation:

>ptxas -arch sm_13 -m 32 -v foo.ptx

ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_13'

ptxas : info : Used 64 registers, 192+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1]

It appears sm_12 is being treated like sm_11. Running ptxas with “-arch sm_11” duplicates the sm_12 results.

This was verified on both ptxas 4.1 and 5.0.
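
For anyone wanting to check the numbers, the register arithmetic in the report can be sketched in a few lines of host-side C++. This is only a rough illustration: the helper names (maxRegsPerThread, printRegisterBudget) are made up for this post, the register-file sizes are the ones quoted in the report, and the calculation ignores allocation granularity and any per-thread hardware cap.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not part of any NVIDIA API): upper bound on registers
// per thread if one block of threadsPerBlock threads must fit into a register
// file of regsPerSM registers. Ignores allocation granularity.
inline int maxRegsPerThread(int regsPerSM, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return regsPerSM / threadsPerBlock;
}

// Register-file sizes per SM as quoted in the bug report.
const int kRegsSm11 = 8192;   // sm_10 / sm_11
const int kRegsSm13 = 16384;  // sm_12 / sm_13

// Prints the per-thread budget implied by ".maxntid 256,1,1".
inline void printRegisterBudget() {
    const int threadsPerBlock = 256 * 1 * 1;
    std::printf("sm_11 budget: %d regs/thread\n",
                maxRegsPerThread(kRegsSm11, threadsPerBlock));
    std::printf("sm_12/sm_13 budget: %d regs/thread\n",
                maxRegsPerThread(kRegsSm13, threadsPerBlock));
}
```

Calling printRegisterBudget() from main gives a budget of 32 for the sm_11 register file and 64 for sm_12/sm_13, which lines up with the "Used 32 registers" and "Used 64 registers" lines in the two ptxas runs above.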