Extreme performance degradation (slowness) with NVCC in CUDA 8 compared to 7?

I noticed some extreme slowness when compiling our mixed C++/CUDA codebase with CUDA 8 as opposed to CUDA 7, so I ran a simulation:
On a 20-core server (two physical Xeon E5 processors), it’s almost twice as slow:
12m30s to compile cleanly with CUDA 7, compared to 23m50s to compile cleanly with CUDA 8.
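
Both timings were taken the same way, a clean rebuild timed end to end (the exact make invocation here is just illustrative of the method, not our real build command):

    make clean && time make -j20    # full rebuild, wall-clock timed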

A lot of time appears to be spent in cc1plus, and then again during linking.

Has anyone else experienced this?

Your question is confusing to me: it is not clear whether you are talking about build (compilation) time or run time of the resulting application. The phrase “to compile … cleanly” suggests the former; “I ran a simulation” suggests the latter.

Assuming your data comes from an apples-to-apples comparison, meaning the same physical platform with the same disk subsystem, the same system load, and the same compilation switches, this is definitely worth reporting to NVIDIA as a bug (a bug reporting form is linked from the CUDA registered developer website).

The build time suggests a largish project, so it will be essential to narrow the problem down as much as possible before filing the bug report. For example, can you find out which code in particular is responsible for the majority of the slowdown? You would want to submit the smallest possible self-contained set of code with which NVIDIA can reproduce the problem.

Just compilation time. I’ve not bothered with run-time performance tests yet. Just trying to figure out if I have a big problem on my hands.

It’s a VERY big project, which limits my ability to narrow it down without weeks of work.

I asked here to see if anyone has had a similar experience. If it’s just me, I’ll have to dive deeper into the compilation of specific files and the libraries we work with…

If it would take someone familiar with the code weeks to narrow it down, just think how long it would take someone not familiar with it. This is why it is standard practice in the compiler field to require the smallest possible self-contained reproducer when filing a bug.

I think the first thing you would want to do is due diligence: double-check that you are really looking at a slowdown in an apples-to-apples comparison.

- Did the compilation flags change at all, e.g. debug vs. release, optimization levels, GPU architecture targets added?
- Did the host compiler version change?
- Any changes to the system: amount of system memory, a different mass storage partition (e.g. an SSD vs. an HDD), different NUMA control settings on this dual-socket machine?
- Is the compilation the only job running on this large server (unlikely)? If not, is the system load roughly the same during both runs?
- You mention longer link times: in my experience, link times are often limited by transfers to/from disk, so some other process, e.g. a virus scanner, may be hammering the disk.
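
A quick way to record the relevant environment for both configurations before timing anything (all standard Linux commands; adapt as needed):

    nvcc --version   # CUDA toolkit actually in use
    g++ --version    # host compiler version
    free -h          # amount of system memory
    df -h .          # which partition/disk the build tree lives on
    uptime           # rough system load at the time of the build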

If you instrument your build to log the time for individual compilations (a minimal wrapper is sketched below), is it actually the device compilation or the host compilation that takes more time now? I do not recognize cc1plus as a CUDA toolchain component; it appears to be a component of g++. That would point the finger at the host compiler rather than the CUDA device-code compiler. There is a small but non-zero chance that the frontend code splitting performed by the CUDA toolchain sends different source code to g++ with CUDA 8.0, causing g++ to slow down (the splitting typically involves a bit more than sending a verbatim copy of the host portion of the CUDA code to the host compiler).
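
One low-effort way to get per-file timings is to route compilations through a small wrapper script and point the build system’s compiler variable at it (the script name and log path here are hypothetical):

    #!/bin/sh
    # nvcc-timed: log wall-clock seconds for each nvcc invocation.
    start=$(date +%s)
    nvcc "$@"
    rc=$?
    end=$(date +%s)
    echo "$((end - start))s  nvcc $*" >> /tmp/compile_times.log
    exit $rc

The same wrapper idea works for g++, which would separate host-compiler time from device-compiler time.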

To follow up on this line of inquiry, you could identify a representative file for which the compilation time difference is very pronounced, and dump the intermediate files produced by the CUDA toolchain. I don’t remember off-hand which of the many files produced is sent to the host compiler, but it should not be difficult to find out by looking at their contents. Do you see substantial differences in the files produced by CUDA 7.5 vs. CUDA 8.0? For example, are additional header files included when processing the code with the CUDA 8.0 toolchain?
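
nvcc’s -keep and --dryrun options make this straightforward (the file name below is hypothetical; the exact names of the intermediates vary by toolkit version):

    # Retain all intermediate files and print every command nvcc runs;
    # the g++ invocation in the output shows exactly which generated
    # file is handed to the host compiler.
    nvcc -keep -v -c slow_file.cu -o slow_file.o
    # (nvcc --dryrun prints the command sequence without executing it.)

Doing this once per toolkit and diffing the host-compiler inputs should reveal whether CUDA 8.0 is feeding g++ substantially different code.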

Considerable effort was put into CUDA 8 to make compile times shorter than in CUDA 7.5; I have personally witnessed this across a range of smallish codes.

I don’t know what the comparison would be against CUDA 7.

Targeted efforts to reduce CUDA compilation times go back to about CUDA 6, if I recall correctly, so I would be very surprised to find code whose compilation times doubled in newer CUDA versions. But statistical outliers are always possible.

Since the question specifically mentions additional time spent in cc1plus and the linker, it seems the increase seen by the OP is due to host code compilation, not device code compilation.

Hi all,

It turned out to be a weird linking issue in the end. (We had three versions installed at the same time: CUDA 7, CUDA 8 RC1, and CUDA 8 final.)

Removing the previous versions’ SDKs solved the linking slowness, and compilation time is now close to what it was before (12m40s).
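
For anyone hitting something similar, checking for coexisting toolkits is quick (assuming the default Linux install locations under /usr/local):

    ls -d /usr/local/cuda*        # every toolkit installed in the default location
    which nvcc && nvcc --version  # the toolkit actually picked up from PATH
    echo "$LD_LIBRARY_PATH"       # any stale paths into an older toolkit?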

Thanks for your help!