[font=“arial, sans-serif”]Here’s a bug I filed – it might be relevant to your problem.[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]The bug is that when a kernel uses __launch_bounds__, ptxas assigns registers for an sm_12 target as if it were an sm_11 target.[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]ptxas -arch sm_13 works as expected. [/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]A possible workaround is to pass -maxrregcount=XX when building for sm_12 targets.[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]If you’re actually running on a CC 1.3 device, simply targeting sm_13 may solve your problem.[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]===[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]Subject: ptxas does not handle __launch_bounds__ properly with sm_12 targets[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]Description:[/font]
[font=“Arial”]If a kernel specifies __launch_bounds__ and is built for the sm_12 architecture with the OpenCC compiler, ptxas limits register allocation to what the sm_11 register file allows (8192 registers per SM) instead of the sm_12/sm_13 register file (16384 registers per SM). [/font]
[font=“Arial”]This occurs with both ptxas 4.1 and 5.0. [/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]Example:[/font]
[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]A PTX file with:[/font]
[font=“arial, sans-serif”] [/font]
[font=“courier new, monospace”].maxntid 256,1,1[/font]
[font=“arial, sans-serif”] [/font]
[font=“Arial”]With at most 256 threads per block, this should allow as many as 64 registers per thread on sm_12 and sm_13, since there are 16384 registers per SM (16384 / 256 = 64). An SM with only 8192 registers, such as sm_11, would allow just 32 (8192 / 256 = 32). [/font]
[font=“Arial”]Instead, this is what happens for sm_12 (incorrect):[/font]
[font=“arial, sans-serif”] [/font]
[font=“Courier New”]>ptxas -arch sm_12 -m 32 -v foo.ptx
ptxas -arch sm_12 -m 32 -v foo.ptx
ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_12'
ptxas : info : Used 32 registers, 940+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1][/font]
[font=“arial, sans-serif”] [/font]
[font=“Arial”]But for sm_13 we see the correct allocation:[/font]
[font=“Arial”] [/font]
[font=“Courier New”]>ptxas -arch sm_13 -m 32 -v foo.ptx
ptxas -arch sm_13 -m 32 -v foo.ptx
ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_13'
ptxas : info : Used 64 registers, 192+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1][/font]
[font=“Arial”] [/font]
[font=“Arial”]It appears that sm_12 is being treated like sm_11: running ptxas with "-arch sm_11" reproduces the sm_12 results. [/font]
[font=“Arial”]This was verified on both ptxas 4.1 and 5.0.[/font][font=“arial, sans-serif”] [/font]
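[font=“arial, sans-serif”] [/font]
[font=“arial, sans-serif”]For anyone checking the numbers, here is a rough host-side sketch of the arithmetic behind the 32 vs. 64 register figures. The helper name regs_per_thread_budget and the constants are my own, for illustration only; this is not how ptxas computes its limit internally, and it ignores register allocation granularity and any per-thread architectural cap.[/font]

```cpp
#include <cstdint>

// Register file sizes per SM, as quoted in the bug report.
constexpr std::uint32_t kRegsPerSmSm11 = 8192;   // sm_11
constexpr std::uint32_t kRegsPerSmSm13 = 16384;  // sm_12 and sm_13

// Hypothetical helper (not a ptxas or CUDA API): the largest per-thread
// register count that still lets one block of max_threads_per_block
// threads fit in the SM's register file. Returns 0 for a zero-sized block.
constexpr std::uint32_t regs_per_thread_budget(std::uint32_t regs_per_sm,
                                               std::uint32_t max_threads_per_block) {
    return max_threads_per_block == 0 ? 0 : regs_per_sm / max_threads_per_block;
}

// .maxntid 256,1,1 means at most 256 threads per block.
static_assert(regs_per_thread_budget(kRegsPerSmSm13, 256) == 64,
              "expected budget for sm_12/sm_13");
static_assert(regs_per_thread_budget(kRegsPerSmSm11, 256) == 32,
              "budget if the SM is treated as sm_11");
```

[font=“arial, sans-serif”]The 32-register result matches the sm_12 output in the report, which is consistent with ptxas using the sm_11 register file size for sm_12.[/font]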