Ptxas compiler speed.
Hello there.

I have a very large kernel, about 25000 - 30000 lines, and it compiles very slowly (about 20 minutes), with ptxas taking most of that time even at optimization level O0. Why is that? I can understand slow compilation at the O1 - O3 optimization levels, but at O0 there should be nothing left for it to do, in my opinion.

Also, I rewrote the code as inline PTX using only asm instructions, still at optimization level O0, but it nevertheless takes about 10 minutes and occupies 800 MB of RAM. Why?

#1
Posted 02/29/2012 07:43 AM   
Are you calling nvcc with --ptxas-options=-O0? Just giving -O0 sets the optimization level for the host compiler, not for ptxas.

You could also try --ptxas-options=--allow-expensive-optimizations=false.
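To make the distinction concrete, compare these two invocations (kernel.cu is just a placeholder file name):

nvcc -O0 kernel.cu
nvcc --ptxas-options=-O0 kernel.cu

The first line only tells the host compiler not to optimize; the second is the one that actually lowers the ptxas optimization level.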

#2
Posted 02/29/2012 11:16 AM   
Yes, of course, I am using --ptxas-options=-O0.

I tried --ptxas-options=--allow-expensive-optimizations=false, but the compiler reports that it is an unknown option.

Is there any decompiler for cubin files? Maybe it would be possible to understand what ptxas is doing after decompiling the cubin?

#3
Posted 02/29/2012 11:38 AM   
You can disassemble cubin files with cuobjdump -sass. I'd guess it'll take you a bit longer than 20 min to comprehend what ptxas has done to the 25000 lines of code.

How large is the PTX file? I'd assume your problem is just due to sheer code size.
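For example, assuming you have a cubin file named mykernel.cubin (the name is illustrative; you can also point cuobjdump at the executable or object file containing the embedded device code):

cuobjdump -sass mykernel.cubin > mykernel.sass

The redirected .sass file can then be searched and compared against the PTX.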

#4
Posted 02/29/2012 12:00 PM   
The PTX file has about 75000 lines, 2.5 MB.

As I understand it, without optimization (level O0) the compiler only has to translate the PTX code line by line into binary code, so I cannot understand why it takes so long.

#5
Posted 02/29/2012 02:08 PM   
If this happens with CUDA 4.1 on a reasonably fast, modern machine, I would suggest filing a bug, because a compile time of 20 minutes seems too long. Please attach a self-contained repro case to your bug report. Since this is a PTXAS issue, this should be simple to do: simply retain the intermediate PTX file by passing -keep to nvcc. In the bug report, please also state the exact PTXAS invocation used to compile the file (nvcc -v will show how PTXAS is invoked).
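For example, something along these lines (kernel.cu is just a placeholder name) keeps all intermediate files and prints every compilation step, including the exact ptxas command line:

nvcc -v -keep --ptxas-options=-O0 kernel.cu

The intermediate .ptx file left in the working directory is what you would attach to the bug report.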

#6
Posted 02/29/2012 06:20 PM   
Hi,

I have been experiencing a similar problem for a long time. I have smaller kernels, hundreds of lines, and compilation takes a few minutes, which is also very long. It has been happening since CUDA 3, maybe even CUDA 2. In one version of CUDA, commenting out five particular lines helped, so I had the impression that the compiler is a bit buggy :-(.

Tomas.

#7
Posted 03/01/2012 08:10 AM   
Sorry, I made a mistake. I measured more carefully and found that the PTX compiler alone (just ptxas, without OpenCC) runs for 7m 50s. That is not 20 minutes, but it is still very slow. I disassembled the cubin and saw that all ptxas does is convert the PTX commands line by line into native Fermi instructions, which is why I do not understand what the compiler is doing for 7-8 minutes.
I will attach an example PTX file tomorrow if the problem is not solved by then.

My configuration:
DualCore E5400 @2700MHz
3Gb RAM
GTX580, 3Gb
Windows 7 Ultimate
CUDA 4.0

#8
Posted 03/01/2012 10:45 AM   
The CUDA compiler inlines much more aggressively than host compilers. Prior to sm_2x it did not have a choice as there was insufficient hardware support for an ABI with function calls. Even with sm_2x it inlines most functions as the size threshold for not inlining is set high. Inlined code includes user functions, standard math library functions, emulated device operations (e.g. integer division). Template expansion and the building of multi-device fat binaries also causes significant code expansion.

So a relatively small amount of source code can still balloon into pretty hefty machine code in a hurry. From what I understand, many problems in the field of compilers have exponential complexity, so lengthy code can drive up compilation times quickly. Heuristics are used to prevent compile time explosion, but there may still be particular combinations that lead to excessive compile times. In addition, in some areas like register allocation the GPU compiler has to work extra hard since register spilling is relatively more expensive on the GPU than on host processors.

Personally, I consider an end-to-end compilation time of over 10 minutes per file excessive (since it is an obstacle to programmer productivity) and recommend filing a bug against the compiler when that happens. Your individual threshold may differ. Please note that filing a bug report is the appropriate channel for having the problem looked into by the compiler team. Posting here could possibly result in some recommendations (e.g. to try __noinline__ for user code functions on sm_2x to reduce code size), but it will not improve the compiler. Improvements to the compiler are the "rising tide that lifts all boats", that is, they benefit all CUDA developers. Being a developer myself, I realize that filing bug reports involves additional effort, so thank you for your help.
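As a sketch of the __noinline__ suggestion (the function, types and math here are made up for illustration), marking a sizeable helper function keeps it from being expanded into every call site on sm_2x:

// Compiled once as a real function call on sm_2x instead of being
// inlined into each of its (possibly many) call sites.
__device__ __noinline__ float4 transform_point(float4 p, const float *m)
{
    float4 r;
    r.x = m[0]*p.x + m[1]*p.y + m[ 2]*p.z + m[ 3]*p.w;
    r.y = m[4]*p.x + m[5]*p.y + m[ 6]*p.z + m[ 7]*p.w;
    r.z = m[8]*p.x + m[9]*p.y + m[10]*p.z + m[11]*p.w;
    r.w = p.w;
    return r;
}

Whether this actually helps compile time or performance depends on the code, so it is worth measuring.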

#9
Posted 03/01/2012 05:35 PM   
I have explored the slow compilation problem and prepared a bug report. I am not allowed to post the full source code because it is commercially confidential, but I can describe the kernel code. It contains about 400 lines of code, which compile to about 70000 PTX asm commands.
At the end of the kernel I placed this code:

FloatPoint center;

//All measures have accuracy +/- 3s.

////Case 1. Compilation time: 2m. 2s.
//center.x = 0;
//center.y = 0;

////Case 2. Compilation time: 2m. 12s.
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = 0;

////Case 3. Compilation time: 2m. 13s.
//center.x = 0;
//center.y = 0;
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = p0.y + p1.y + p2.y + p3.y;

////Case 4. Compilation time: 4m. 45s.
//center.x = 0;
//center.y = 0;
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = (p0.y + p1.y + p2.y + p3.y) / 4.0f;

//Case 5. Compilation time: 4m. 56s.
center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
center.y = (p0.y + p1.y + p2.y + p3.y) / 4.0f;


The comments show the different code variants and their compilation times. It looks like magic: for example, adding the division by four costs an extra 2m 32s of compilation time.

#10
Posted 03/06/2012 10:44 AM   
70,000 lines of PTX for 400 lines of kernel code seems a lot. How much of that is due to loop unrolling, and how much due to inlining? Note that partial loop unrolling works well with the current compiler (even for variable number of loop iterations), there is no need to fully unroll loops with high iteration counts.
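A minimal sketch of partial unrolling, in case that is useful (the kernel and names are invented for illustration); the pragma asks the compiler to unroll by a fixed factor instead of fully expanding the loop:

__global__ void accumulate(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    #pragma unroll 4   // partial unroll by a factor of 4, not full unrolling
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        sum += in[i];
    out[tid] = sum;
}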

#11
Posted 03/06/2012 11:42 AM   
I am not a compiler engineer, and it is impossible for me to deduce anything in particular from the snippets posted here. In general, compilable repro code is required to diagnose compiler issues. It is like having trouble with a car: the car mechanic needs to have a look at the car, just describing the symptoms is usually not enough to pinpoint the problem. Please note that bug reports and their attached files are visible only to the filer and appropriate NVIDIA personnel. As I stated previously, since the issue here is with PTXAS, only the intermediate PTX file is required in conjunction with the PTXAS commandline. The PTX file is a human readable text file that you can inspect; I think you will find that it sufficiently obfuscates any proprietary techniques used in the source code.

I wonder what happens if you replace all divisions by 4.0f with multiplies by 0.25f?
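That is, something along the lines of (reusing the variable names from the snippet above):

center.x = (p0.x + p1.x + p2.x + p3.x) * 0.25f;
center.y = (p0.y + p1.y + p2.y + p3.y) * 0.25f;

Since 0.25f is exactly representable, multiplying by it produces the same result as dividing by 4.0f, while avoiding the inlined division sequence.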

#12
Posted 03/06/2012 04:38 PM   
2terra:
There are no loops, and the 70k PTX lines do not contain unrolled loops.

2njuffa:
With multiplies by 0.25f the result is identical.

I solved the problem by changing maxregcount from 20 to 32. It is strange, because I use optimization level O0 for both opencc and ptxas, but as I understand it, the compiler optimizes the code all the same.
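For reference, the register limit is set on the nvcc command line (note the flag is spelled with the double 'r'); something like the following, with kernel.cu as a placeholder for the actual source file:

nvcc --ptxas-options=-O0 -maxrregcount=32 kernel.cu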

#13
Posted 03/07/2012 10:03 AM   
All instruction scheduling and register allocation is handled inside PTXAS. These compiler stages take longer the more code there is and the longer the basic blocks are. Since register spilling is relatively expensive, PTXAS will try hard to avoid it and keep all data in registers, for example by selectively recomputing expressions instead of storing them in temp registers (as part of something called re-materialization).

The lower the register limit imposed on PTXAS, the harder it has to work on register allocation. In general the compiler picks sensible register limits for most code, and to a first order programmers should not interfere with that (either through -maxrregcount or __launch_bounds__) unless it proves necessary. A strategy I use myself, when I think lowering the register count limit may be beneficial for performance, is to first let the compiler pick the limit, then force lower limits in steps of four registers (e.g. 32 -> 28 -> 24, ...), measuring app performance at each step. I cannot recall a case in recent times where "squeezing" the register limit by more than four registers proved to be beneficial: the benefits of higher occupancy are counteracted by increased dynamic instruction count and/or spilling of registers. This also demonstrates that the compiler typically chooses reasonably tight register limits. Since the compiler uses a number of heuristics to make these decisions, there may be occasional cases where the limit chosen for some code is significantly sub-optimal, but in my experience that is rare these days.

Note that a flag -O0 on the nvcc command line is passed to the host compiler only; it does not affect the CUDA (device) compiler. When -Xptxas -O0 is specified, PTXAS does not optimize, but obviously it still has to do basic register allocation and scheduling. With CUDA 4.1 the compiler frontend for targets sm_20 and up is NVVM instead of Open64, so any -Xopencc -O0 flags will be ignored (since they are specific to the Open64 component, which no longer comes into play), and nvcc will give an advisory message about this.
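As a sketch of the per-kernel alternative to a global -maxrregcount (the kernel name and numbers are made up for illustration), __launch_bounds__ lets the compiler derive a register budget from the intended launch configuration:

// At most 256 threads per block, and at least 2 resident blocks desired
// per multiprocessor; the compiler derives a per-thread register limit
// from these hints instead of a file-wide -maxrregcount.
__global__ void __launch_bounds__(256, 2)
scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}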

#14
Posted 03/07/2012 06:38 PM   
Yes, of course I use the --opencc-options -O0 and --ptxas-options=-O0 options to set the optimization levels.

#15
Posted 03/09/2012 08:47 AM   