Kernel 2x slower when compiling with sm12 and above
I recently ported my kernels to openCL and while comparing performances I discovered that my code was really slower (ok not 2 times) when compiled with
compute_12,sm_12 or compute_13,sm_13 (~85msec)
than with
compute_10,sm_10 or compute_11,sm_11 (~55msec)

I'm running a [b]GTX 285[/b] (compute capabilites 1.3) with drivers [b]v296.17[/b]
and compiling with [b]Cuda toolkit 4.1[/b] (4.1.28.0) and VC2010.
Also tried with toolkit 4.2

I compared 1.1 and 1.2 ptx, and they are identical.
GPU load is ~95% in both cases.
I don't perform double operation, nor atomic or any exotic stuff.
Just a simple kernel dealing with a 2D texture.

Any idea where could this come from? Driver version? Toolkit?
Thanks a lot
I recently ported my kernels to openCL and while comparing performances I discovered that my code was really slower (ok not 2 times) when compiled with

compute_12,sm_12 or compute_13,sm_13 (~85msec)

than with

compute_10,sm_10 or compute_11,sm_11 (~55msec)



I'm running a GTX 285 (compute capabilites 1.3) with drivers v296.17

and compiling with Cuda toolkit 4.1 (4.1.28.0) and VC2010.

Also tried with toolkit 4.2



I compared 1.1 and 1.2 ptx, and they are identical.

GPU load is ~95% in both cases.

I don't perform double operation, nor atomic or any exotic stuff.

Just a simple kernel dealing with a 2D texture.



Any idea where could this come from? Driver version? Toolkit?

Thanks a lot

#1
Posted 04/28/2012 09:36 PM   
You might want to play with the --maxregcount option in the sm_12 case...

But hmm when you say the PTX is identical, the register count should (in priciple) be identical too.
You might want to play with the --maxregcount option in the sm_12 case...



But hmm when you say the PTX is identical, the register count should (in priciple) be identical too.

#2
Posted 04/30/2012 11:04 AM   
[quote name='cbuchner1' date='30 April 2012 - 12:04 PM' timestamp='1335783844' post='1402554']
You might want to play with the --maxregcount option in the sm_12 case...

But hmm when you say the PTX is identical, the register count should (in priciple) be identical too.
[/quote]
You are right. Same number of registers:
[code]1>ptxas info : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_12'
1>ptxas info : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_10'
1>ptxas info : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1][/code]
[quote name='cbuchner1' date='30 April 2012 - 12:04 PM' timestamp='1335783844' post='1402554']

You might want to play with the --maxregcount option in the sm_12 case...



But hmm when you say the PTX is identical, the register count should (in priciple) be identical too.



You are right. Same number of registers:

1>ptxas info    : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_12'

1>ptxas info : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1]

1>ptxas info : Compiling entry function '_Z15computeILb0ELb1EEvi' for 'sm_10'

1>ptxas info : Used 16 registers, 128+0 bytes lmem, 4+16 bytes smem, 68 bytes cmem[0], 12 bytes cmem[1]

#3
Posted 04/30/2012 11:47 AM   
[color="#222222"][font="arial, sans-serif"][size="2"]Here's a bug I filed -- it might be relevant to your problem.[/size][/font][/color]
[color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"]The bug is that when using __launch_bounds__, sm_12 is assigned registers as if it's an sm_11 target.[/size][/font][/color]
[color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"]ptxas -arch sm_13 works as expected. [/size][/font][/color]
[font="arial, sans-serif"] [/font][color="#222222"][font="arial, sans-serif"] [/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"]A possible workaround might be to use -maxrregcount=XX for sm_12 targets?[/size][/font][/color]
[color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"]Perhaps your problem might be solved by just targeting sm_13 if you're actually running on a CC 1.3 device.[/size][/font][/color]
[font="arial, sans-serif"] [/font][font="arial, sans-serif"] [/font][font="arial, sans-serif"] [/font][font="arial, sans-serif"] [/font][color="#222222"][font="arial, sans-serif"][size="2"]===[/size][/font][/color][color="#222222"][font="arial, sans-serif"][size="2"][b]
[/b][/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"][b]Subject: [/b]ptxas does not handle __launch_bounds__ properly with sm_12 targets[/size][/font][/color][color="#222222"][font="arial, sans-serif"][size="2"][b]
[/b][/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"][b]Description:[/b][/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]

[font="Arial"][color="#222222"][size="2"]If you provide __launch_bounds__ in a kernel and build for the sm_12 architecture with the OpenCC compiler then ptxas will only assign as many registers as there are in sm_11 (8192) instead of sm_12/sm_13 (16384).[/size][/color] [/font]

[font="Arial"][color="#222222"][size="2"]This occurs with both ptxas 4.1 and 5.0.[/size][/color] [/font][color="#222222"][font="arial, sans-serif"][size="2"][b]
[/b][/size][/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"][b]Example[/b]:[/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"] [/font][/color]
[color="#222222"][font="arial, sans-serif"][size="2"]A PTX file with:[/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color][font="courier new, monospace"] [/font]
[font="courier new, monospace"] [/font]
[font="courier new, monospace"][color="#3333ff"][size="2"].maxntid 256,1,1[/size][/color][/font][color="#222222"][font="arial, sans-serif"] [/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]

[font="Arial"][color="#222222"][size="2"]This should enable as many as 64 registers to be allocated per thread for sm_12 and sm_13 architectures since there are 16384 registers per SM.[/size][/color] [/font]

[color="#222222"][font="Arial"][size="2"]Instead this is what happens for sm_12 (incorrect):[/size][/font][/color][font="arial, sans-serif"] [/font]
[font="arial, sans-serif"] [/font][size="2"][color="#222222"][font="Courier New"][color="#3333ff"]>ptxas -arch sm_12 -m 32 -v foo.ptx[/color][/font][/color][font="Courier New"]
[color="#3333ff"]ptxas -arch sm_12 -m 32 -v foo.ptx[/color]
[color="#3333ff"]ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_12'[/color]
[color="#3333ff"]ptxas : info : Used 32 registers, 940+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1][/color][/font][color="#222222"][font="arial, sans-serif"] [/font][/color][/size][color="#222222"][font="arial, sans-serif"] [/font][/color]

[font="Arial"][color="#222222"][size="2"]But for sm_13 we see the correct allocation:[/size][/color][/font]
[font="Arial"][size="2"] [/size][/font] [size="2"][color="#222222"][font="Courier New"][color="#3333ff"]>ptxas -arch sm_13 -m 32 -v foo.ptx[/color][/font][/color][font="Courier New"]
[color="#3333ff"]ptxas -arch sm_13 -m 32 -v foo.ptx[/color]
[color="#3333ff"]ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_13'[/color]
[color="#3333ff"]ptxas : info : Used 64 registers, 192+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1][/color][/font][/size]

[font="Arial"][color="#222222"][size="2"]It appears sm_12 is being treated like sm_11. Running ptxas with "-arch sm_11" duplicates the sm_12 results.[/size][/color] [/font]

[color="#222222"][font="Arial"][size="2"]This was verified on both ptxas 4.1 and 5.0.[/size][/font][/color][color="#222222"][font="arial, sans-serif"] [/font][/color]
Here's a bug I filed -- it might be relevant to your problem.



The bug is that when using __launch_bounds__, sm_12 is assigned registers as if it's an sm_11 target.



ptxas -arch sm_13 works as expected.



A possible workaround might be to use -maxrregcount=XX for sm_12 targets?



Perhaps your problem might be solved by just targeting sm_13 if you're actually running on a CC 1.3 device.

===



Subject: ptxas does not handle __launch_bounds__ properly with sm_12 targets



Description:



If you provide __launch_bounds__ in a kernel and build for the sm_12 architecture with the OpenCC compiler then ptxas will only assign as many registers as there are in sm_11 (8192) instead of sm_12/sm_13 (16384).



This occurs with both ptxas 4.1 and 5.0.



Example:



A PTX file with:



.maxntid 256,1,1



This should enable as many as 64 registers to be allocated per thread for sm_12 and sm_13 architectures since there are 16384 registers per SM.



Instead this is what happens for sm_12 (incorrect):

>ptxas -arch sm_12 -m 32 -v foo.ptx

ptxas -arch sm_12 -m 32 -v foo.ptx

ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_12'

ptxas : info : Used 32 registers, 940+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1]



But for sm_13 we see the correct allocation:

>ptxas -arch sm_13 -m 32 -v foo.ptx

ptxas -arch sm_13 -m 32 -v foo.ptx

ptxas : info : Compiling entry function '_Z14bazKernelPj' for 'sm_13'

ptxas : info : Used 64 registers, 192+0 bytes lmem, 16+16 bytes smem, 128 bytes cmem[0], 36 bytes cmem[1]



It appears sm_12 is being treated like sm_11. Running ptxas with "-arch sm_11" duplicates the sm_12 results.



This was verified on both ptxas 4.1 and 5.0.

#4
Posted 05/24/2012 12:10 AM   
Scroll To Top