Ptxas compiler speed.

Hello there.
I have a very large kernel, about 25000 - 30000 lines, and it compiles very slowly (about 20 minutes), with ptxas taking most of that time even at optimization level O0. Why is that? I can understand slow compilation at the O1 - O3 optimization levels, but at O0 it should have little to do, in my opinion.
I also rewrote the code as inline PTX, using only asm instructions with optimization level O0, but it still takes about 10 minutes and occupies 800 MB of RAM. Why?

Are you calling nvcc with --ptxas-options=-O0? Just giving -O0 sets the optimization level for the host compiler, not for ptxas.
You could also try --ptxas-options=--allow-expensive-optimizations=false.
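
For illustration (kernel.cu and sm_20 are placeholders), the first command below sets -O0 only for the host compiler, the second passes it to ptxas, and the third adds the switch mentioned above, which is not recognized by every toolkit version:

nvcc -O0 -arch=sm_20 -c kernel.cu
nvcc --ptxas-options=-O0 -arch=sm_20 -c kernel.cu
nvcc --ptxas-options=--allow-expensive-optimizations=false -arch=sm_20 -c kernel.cu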

Yes, of course, I am using --ptxas-options=-O0.
I tried --ptxas-options=--allow-expensive-optimizations=false, but the compiler returns that it is an unknown option.
Is there any decompiler for cubin files? Maybe it would be possible to understand what ptxas is doing by decompiling the cubin?

You can disassemble cubin files with cuobjdump -sass. I’d guess it’ll take you a bit longer than 20 min to comprehend what ptxas has done to the 25000 lines of code.
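
For example, assuming the cubin is named kernel.cubin (the output file name is arbitrary):

cuobjdump -sass kernel.cubin > kernel.sass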

How large is the PTX file? I’d assume your problem is just due to sheer code size.

The PTX file has about 75000 lines, 2.5 MB.
As I understand it, without optimization (level O0) the compiler only has to convert the PTX code to binary code line by line. So I cannot understand why it takes so long.

If this happens with CUDA 4.1 on a reasonably fast, modern machine, I would suggest filing a bug, because a compile time of 20 minutes seems too long. Please attach a self-contained repro case to your bug report. Since this is a PTXAS issue, this should be simple to do: simply retain the intermediate PTX file by passing -keep to nvcc. In the bug report, please also state the exact PTXAS invocation used to compile the file (nvcc -v will show how PTXAS is invoked).
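
For illustration, with kernel.cu standing in for the actual source file: the first command below retains kernel.ptx among the intermediate files, and the second prints every compilation step, including the exact ptxas command line:

nvcc -keep -arch=sm_20 -c kernel.cu
nvcc -v -arch=sm_20 -c kernel.cu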

Hi,

I have been experiencing a similar problem for a long time. I have smaller kernels, hundreds of lines, and compilation takes a few minutes, which is also very long. It has been happening to me since CUDA 3, maybe even 2. In one version of CUDA, commenting out a certain 5 lines helped, so I got the impression that the compiler is a bit buggy :-(.

Tomas.

Sorry, I made a mistake. I measured more carefully and found that the PTX compiler alone (only ptxas, without OpenCC) takes 7m 50s. It is not 20 minutes, but it is still very slow. I disassembled the cubin and saw that all ptxas does is convert the PTX commands to Fermi machine (SASS) instructions more or less line by line, so I do not understand what the compiler is doing for 7-8 minutes.
I will attach an example PTX file tomorrow, if the problem is not solved by then.

My configuration:
Dual-Core E5400 @ 2.7 GHz
3 GB RAM
GTX 580, 3 GB
Windows 7 Ultimate
CUDA 4.0

The CUDA compiler inlines much more aggressively than host compilers. Prior to sm_2x it did not have a choice as there was insufficient hardware support for an ABI with function calls. Even with sm_2x it inlines most functions as the size threshold for not inlining is set high. Inlined code includes user functions, standard math library functions, emulated device operations (e.g. integer division). Template expansion and the building of multi-device fat binaries also causes significant code expansion.

So a relatively small amount of source code can still balloon into pretty hefty machine code in a hurry. From what I understand, many problems in the field of compilers have exponential complexity, so lengthy code can drive up compilation times quickly. Heuristics are used to prevent compile time explosion, but there may still be particular combinations that lead to excessive compile times. In addition, in some areas like register allocation the GPU compiler has to work extra hard since register spilling is relatively more expensive on the GPU than on host processors.

Personally, I consider an end-to-end compilation time of over 10 minutes per file excessive (since it is an obstacle to programmer productivity) and recommend filing a bug against the compiler when that happens. Your individual threshold may differ. Please note that filing a bug report is the appropriate channel for having the problem looked into by the compiler team. Posting here may result in some recommendations (e.g. to try __noinline__ for user code functions on sm_2x to reduce code size), but it will not improve the compiler. Improvements to the compiler are the “rising tide that lifts all boats”, that is, they benefit all CUDA developers. Being a developer myself, I realize that filing bug reports involves additional effort, so thank you for your help.

I have explored the problem of slow compilation time and prepared a bug report. I am not allowed to post all the source code because it is commercially confidential, but I can describe the kernel code. It contains about 400 lines of code, which produce about 70000 PTX asm commands.
At the end of the kernel I placed this code:

FloatPoint center;

//All measures have accuracy +/- 3s.

////Case 1. Compilation time: 2m. 2s.
//center.x = 0;
//center.y = 0;

////Case 2. Compilation time: 2m. 12s.
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = 0;

////Case 3. Compilation time: 2m. 13s.
//center.x = 0;
//center.y = 0;
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = p0.y + p1.y + p2.y + p3.y;

////Case 4. Compilation time: 4m. 45s.
//center.x = 0;
//center.y = 0;
//center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
//center.y = (p0.y + p1.y + p2.y + p3.y) / 4.0f;

//Case 5. Compilation time: 4m. 56s.
center.x = (p0.x + p1.x + p2.x + p3.x) / 4.0f;
center.y = (p0.y + p1.y + p2.y + p3.y) / 4.0f;

In the comments you can find the different variants of the code and their compilation times. It looks like magic. For example, if you add a single division by four, you get +2m 32s of compilation time.

70,000 lines of PTX for 400 lines of kernel code seems a lot. How much of that is due to loop unrolling, and how much due to inlining? Note that partial loop unrolling works well with the current compiler (even for a variable number of loop iterations); there is no need to fully unroll loops with high iteration counts.

I am not a compiler engineer, and it is impossible for me to deduce anything in particular from the snippets posted here. In general, compilable repro code is required to diagnose compiler issues. It is like having trouble with a car: the car mechanic needs to have a look at the car, just describing the symptoms is usually not enough to pinpoint the problem. Please note that bug reports and their attached files are visible only to the filer and appropriate NVIDIA personnel. As I stated previously, since the issue here is with PTXAS, only the intermediate PTX file is required in conjunction with the PTXAS commandline. The PTX file is a human readable text file that you can inspect; I think you will find that it sufficiently obfuscates any proprietary techniques used in the source code.

I wonder what happens if you replace all divisions by 4.0f with multiplies by 0.25f?

2terra:
There are no loops, so the 70k PTX lines do not contain unrolled loops.

2njuffa:
With multiplies by 0.25f the result is identical.

I solved the problem by changing maxrregcount from 20 to 32. It is strange, because I use optimization level O0 for both opencc and ptxas, but as I understand it, the compiler still optimizes the code all the same.
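
For anyone hitting the same issue, the change amounts to something like the following command line (file name and architecture are placeholders):

nvcc -maxrregcount=32 --ptxas-options=-O0 -arch=sm_20 -c kernel.cu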

All instruction scheduling and register allocation is handled inside PTXAS. These compiler stages take longer the more code there is and the longer the basic blocks are. Since register spilling is relatively expensive, PTXAS will try hard to avoid it and keep all data in registers, for example by selectively recomputing expressions instead of storing them in temp registers (as part of something called re-materialization).

The lower the register limit imposed on PTXAS, the harder it has to work on register allocation. In general the compiler picks sensible register limits for most code, and to first order programmers should not interfere with that (either through -maxrregcount or __launch_bounds__) unless it proves necessary. A strategy I use myself when I think lowering the register count limit may be beneficial for performance is to first let the compiler pick the limit, then force lower limits in steps of four registers (e.g. 32 → 28 → 24, …), measuring app performance at each step. I cannot recall a case in recent times where “squeezing” the register limit by more than four registers proved to be beneficial. The benefits of higher occupancy are counteracted by increased dynamic instruction count and / or spilling of registers. This also demonstrates that the compiler typically chooses reasonably tight register limits. Since the compiler uses a number of heuristics to make these decisions, there may be occasional cases where the limit chosen for some code is significantly sub-optimal, but in my experience that is rare these days.
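
As a per-kernel alternative to the global -maxrregcount switch, __launch_bounds__ lets PTXAS derive a register budget from the intended launch configuration. A minimal sketch with a made-up kernel and example values:

// Tells ptxas this kernel is launched with at most 256 threads per block and that
// at least 2 resident blocks per multiprocessor are desired, which bounds the
// number of registers available to each thread.
__global__ void __launch_bounds__(256, 2) scaleKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}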

Note that a flag -O0 on the nvcc command line is passed to the host compiler only; it does not affect the CUDA (device) compiler. When -Xptxas -O0 is specified, PTXAS does not optimize, but obviously it still has to do basic register allocation and scheduling. With CUDA 4.1 the compiler frontend for targets sm_20 and up is NVVM instead of Open64, so any -Xopencc -O0 flags will be ignored (since they are specific to the Open64 component, which no longer comes into play), and nvcc will give an advisory message about this.

Yes, of course I use the --opencc-options -O0 and --ptxas-options=-O0 options to set the optimization levels.


Hi everyone,

I think it is time to complain about the NVIDIA support team's work related to this issue. We all know that ptxas is way too slow under some circumstances. Besides my case, where ptxas runs for 20-35 minutes or more, and even eats all memory and stops working on one not-so-big portion of code, I know another person with the same issue on quite different code who also submitted a bug. I guess there are more people here suffering from extremely poor ptxas performance, though I don’t know whether they filed bug reports or not.

The answer I received from support today is bewildering. The question was whether there is any hint at how serious the problem is and whether or not we can expect a solution in a short time. The answer reads: “Sorry for no updating since then! We are still investigating this issue, the developers are trying to find solutions. If there are any news we have, we will inform you as soon as possible. Sorry for any inconvenience!”. It’s just a boilerplate phrase that says nothing.

It’s quite strange that the NVIDIA compiler team still needs time even to “investigate” the problem, as it has been a known issue for many months. I personally regard the support team's work on this issue as most unsatisfactory and call on other developers suffering from the same problem to add their voices here.

My bug report is filed as #1158670.

There are occasional reports of excessive compile times with the CUDA toolchain, caused by PTXAS or other compiler components. Best I can tell from looking at various bugs of that nature the causes for lengthy compile times are varied and there is no single underlying mechanism. One thing that is common to most of these reports of lengthy compile times is a large code size.

Two mechanisms in particular can lead to drastic increases in code size: loop unrolling and function inlining. This suggests some generic workarounds (a small combined sketch follows the list):

(1) Loop unrolling can be suppressed with #pragma unroll 1, inserted directly before a loop.

(2) Inlining of user functions can be suppressed with the __noinline__ attribute.

(3) CUDA’s standard math functions are inlined. A large number of invocations of the more complicated functions can cause code size to explode. The largest functions are probably tgamma(), lgamma(), and pow(). In the case of pow(), alternatives with smaller code size (and faster execution) can sometimes be used: exp(), exp2(), exp10(), sqrt(), rsqrt(), cbrt(), rcbrt(), or even simple multiplication [i.e. use x*x instead of pow(x,2.0)].
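
A minimal sketch of workarounds (1) and (2) combined; the helper function and loop are made up, only the annotations matter:

// (2) __noinline__ keeps this helper as a real call instead of expanding it at every call site.
__noinline__ __device__ float polyEval(const float *c, int degree, float x)
{
    float r = c[degree];
    #pragma unroll 1   // (1) suppress unrolling of the following loop
    for (int i = degree - 1; i >= 0; --i) {
        r = r * x + c[i];
    }
    return r;
}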

I would like to encourage all CUDA users to keep filing bugs when they encounter excessive compile times. This is the best way to resolve the underlying issues and improve the product. As each bug tends to have a different cause, significant time may be needed for root cause analysis and code re-design. Keep in mind that normally multiple issues are being worked on concurrently to maximize the throughput.

New compiler versions are generally available with the next CUDA release, and the earliest access is typically for registered developers with access to early release candidates. For some, but not all, bugs the compiler team is able to suggest specific workarounds beyond the sort of generic pointers I gave above. In this case the bug filer will be notified after the workaround has been identified.

njuffa, (sorry, I don’t know your name)

It looks like you are an NVIDIA employee, so I’d like to address several questions to you.

First, why is your answer the first comprehensive answer to my bug report since it was submitted to the NVIDIA support site more than a month ago? Why does your support team just ignore my direct questions and refuse to give any concrete recommendations or explanations? Why does this happen only here, on a forum?

Then several questions on the topic of ptxas performance:

One thing that is common to most of these reports of lengthy compile times is a large code size.

That is rather obvious, even though I think one case in my bug report could be regarded as a case where a not-so-large function caused a compiler malfunction. I mean the function called ‘ZN6KERNEL6ShellBIddNS_14BasicTypes_GPUEEEvPKNS_3InsIT_T0_T1_EEiPNS_6MatrixIS4_Li6ELi18EEES4_S4_S4’ in the bug report history. I haven’t isolated this case, since I have very little time now as a result of all these sudden problems caused by ptxas, but if you need proof, I can try to isolate it. I sent the function body to your support team at their request. It has 70 lines and contains 369 pow() calls. Is that a large function? Or is it large enough to hang up ptxas?

But if we agree that most cases of slow ptxas operation are caused by large code size – let’s settle the terminology – what do you mean by ‘large’? Is 6 million lines in a PTX file large? 1 million? Half a million? Do you think such a number of lines would be a problem for ANY assembler on ANY enterprise CPU specialized for HPC?

I find that quite a strange point of view on the performance problem. One MUST consider the compiler's performance even at the point when a smallish program takes seconds to compile, because it is quite obvious that on enterprise code this may turn into MINUTES of compile time. So the right point of view, in my opinion, is that ptxas has had performance problems since it was created. And the problem is complex; it is quite odd to say that it is an individual problem of each piece of code, and that it is reasonable to investigate each bug report concurrently “to maximize the throughput”.

I attached a PTX file to my bug report and sent the source code to you as a test case; it seems to me a quite normal piece of code, using the CUSP library to solve a system of linear equations with Conjugate Gradients and several preconditioners. Yes, it produces a rather big PTX file, as CUSP is a template library. It takes ptxas 10 minutes to process it. But it is only a linear-system solver! Is it reasonable to consider such basic code large? How then can we talk about enterprise HPC computations?

I tried all the generic recommendations you suggested, with no effect.
Except for ‘#pragma unroll 1’ – it is difficult to place that pragma everywhere.

Thanks for explaining the process by which the compiler team releases fixes and workarounds. Finally, we have enough information to understand that we have only one choice – to postpone the planned release of our product. We are all happy.

Sorry for the inconvenience of slow compiles. I am a CUDA user myself and understand that this can be frustrating. Thank you for filing a bug, this will help drive changes that allow PTXAS to deal with lengthy code more efficiently in the future.

Largely due to historical constraints of the underlying hardware, inlining is used extensively in CUDA, and it can lead to much larger code sizes than one would get on the host for the same source. The larger code size contributes to lengthy compile times, along with other features of the GPU, such as the variable number of registers, which introduces additional performance trade-offs that traditional compilers do not need to consider.

I would consider 369 calls to pow() in a single function a lot. Is this single or double precision code? I would have to check, but I think the double-precision pow() expands to something like 200 instructions, so with all calls inlined, the small function you mention would expand to around 70K instructions. Not so small anymore.

In terms of a workaround that allows you to make forward progress on your project, is there any way to reduce the number of pow() calls, or simplify particular instances? For example, there are more efficient functions available for special cases of exponentiation, enumerated in my previous post. Also, if the exponents happen to be integers, I suggest invoking pow(double,int) instead of pow(double,double), as the former function results in much smaller code. For squares and cubes it is best to use multiplies directly, i.e. x*x and x*x*x. I believe this is also recommended in the Best Practices Guide.
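
To illustrate those substitutions (hypothetical code, double precision assumed; note that sqrt(x) matches pow(x, 0.5) only for non-negative finite x):

__device__ double cheaper_exponentiation(double x)
{
    double square = x * x;       // instead of pow(x, 2.0)
    double cube   = x * x * x;   // instead of pow(x, 3.0)
    double root   = sqrt(x);     // instead of pow(x, 0.5)
    double pwr2   = exp2(x);     // instead of pow(2.0, x)
    double fifth  = pow(x, 5);   // pow(double, int) overload: much smaller code than pow(x, 5.0)
    return square + cube + root + pwr2 + fifth;
}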

Another workaround you could try for the kernel containing the 369 calls to pow() is to reduce the PTXAS optimization level. It defaults to -O3, and thus includes fairly sophisticated optimizations that may require a lot of time when the code is large. You can gradually decrease the optimization level for PTXAS (-O2, -O1, -O0) by passing the following on the NVCC command line:

-Xptxas -O{0|1|2|3}

I cannot guarantee that this will reduce compilation time to something more reasonable but I think it is worth a try.