Can nvcc generate code for multiple architectures in parallel

CUDA on Linux. I have a CUDA file with a very large function. Compiling that CUDA code dominates my build time, even though I run “make -j 8” so make can run eight g++ compiles at a time for the rest of my program. I set up nvcc with the standard flags

-gencode=arch=compute_20,code=sm_20
-gencode=arch=compute_30,code=sm_30
-gencode=arch=compute_35,code=sm_35
-gencode=arch=compute_35,code=compute_35

to support multiple GPU architectures. But this results in a single nvcc invocation, which then generates the code for these four architectures one at a time. Is there any way to tell nvcc to run these four code generations in parallel, given that I have enough CPU cores available?

As far as I know, there is no way to parallelize the building of fat binaries within a single nvcc invocation, i.e., to have nvcc assign the build for each target architecture to a different thread. It strikes me as an excellent suggestion.

I would suggest filing an enhancement request via the bug reporting form linked from the registered developer website. Please prefix the synopsis with “RFE:” so it is readily recognizable as a request for enhancement rather than a true bug. Thanks!

In the meantime, you should be able to parallelize the build manually by looking at the output of nvcc --verbose or nvcc --dryrun and invoking those sub-commands directly instead of going through nvcc.
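As a rough illustration of that idea, here is a hypothetical Makefile sketch that builds one cubin per architecture as a separate rule, so “make -j” can run the per-architecture device compiles concurrently. The target names, the ARCHS list, and the final combining step are assumptions for illustration; the exact fatbinary/host-link commands are toolkit-version specific and should be copied from the nvcc --dryrun output rather than from this sketch.

```makefile
# Illustrative only: one independent rule per GPU architecture, so
# `make -j` can run the device compiles in parallel.
ARCHS  := 20 30 35
CUBINS := $(foreach a,$(ARCHS),kernel.sm_$(a).cubin)

# Compile the device code for a single architecture into a cubin.
kernel.sm_%.cubin: kernel.cu
	nvcc -cubin -gencode=arch=compute_$*,code=sm_$* -o $@ $<

# Combining the per-architecture cubins back into a single fat object
# is the toolkit-specific part; take the exact fatbinary and host
# compile/link commands from `nvcc --dryrun` output.
kernel.o: $(CUBINS)
	@echo "run the fatbinary/host-link steps shown by nvcc --dryrun"
```

This keeps each architecture's code generation in its own make job, which is essentially what a built-in parallel mode in nvcc would do internally.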

Njuffa: Bug 1504822 submitted as an enhancement request.

Tera: Thanks for the idea, but too ugly for my production environment.