CUDA 9 RC Cooperative Groups Compile Error

I am trying the new CUDA 9 release candidate and am getting some new errors that I did not encounter with previous versions. I am trying to do a grid synchronization using cooperative groups on a Pascal TITAN Xp GPU.

I am implementing the grid sync like so:

cooperative_groups::grid_group grid = cooperative_groups::this_grid();
cooperative_groups::sync(grid);

Compiling gives me the error

ptxas fatal   : Unresolved extern function 'cudaCGGetIntrinsicHandle'

I tried adding the -dc flag to my compile line which instead gives me the error

lib/myLib.so: undefined reference to `__cudaRegisterLinkedBinary_51_tmpxft_00006f11_00000000_6_myKernel_cpp1_ii_6f9e48ae'

If I comment out the sync lines and don’t use the -dc flag, everything compiles fine. Has anyone seen this issue, or does anyone know how to fix it?

You must compile with relocatable device code and linking. This is mentioned in the CUDA 9 RC programming guide.

Compiling without it at all is what produces the ptxas unresolved extern error.

When you compile with -dc you are picking up the necessary compile step but not the necessary device code linking, which is why you get the undefined reference.

For starters, I suggest doing the entire process (compile and link) in a single nvcc invocation and specifying -rdc=true.

Once you get that working, if you want to create libs or otherwise split your compile and link steps, then you need to learn how to do proper rdc compile and linking, and there are only about 10^6 questions on the internet that cover that (not to mention the nvcc manual).
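For reference, here is a minimal sketch of what a complete grid-sync program and its build line could look like. The kernel name, buffer names, and launch sizes are made up for illustration, and sm_61 is assumed because the question mentions a Pascal TITAN Xp. It uses cudaLaunchCooperativeKernel, which the programming guide requires for kernels that synchronize across the grid.

```cuda
// Single-step build with relocatable device code (adjust -arch for your GPU):
//   nvcc -rdc=true -arch=sm_61 grid_sync.cu -o grid_sync
// Split build (what -dc alone was missing: nvcc must also do the device link):
//   nvcc -dc -arch=sm_61 grid_sync.cu -o grid_sync.o
//   nvcc -arch=sm_61 grid_sync.o -o grid_sync
#include <cooperative_groups.h>
#include <cstdio>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: every block writes one value, the whole grid
// synchronizes, then a single thread sums what all blocks wrote.
__global__ void gridSyncKernel(int *blockValues, int *total)
{
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0)
        blockValues[blockIdx.x] = blockIdx.x + 1;

    grid.sync();  // same effect as cooperative_groups::sync(grid)

    if (grid.thread_rank() == 0) {
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            sum += blockValues[b];
        *total = sum;
    }
}

int main()
{
    int device = 0;
    int supportsCoop = 0;
    cudaDeviceGetAttribute(&supportsCoop, cudaDevAttrCooperativeLaunch, device);
    if (!supportsCoop) {
        printf("device %d does not support cooperative launch\n", device);
        return 1;
    }

    // Small launch so that all blocks can be resident at once, which a
    // cooperative launch requires.
    const int numBlocks = 4;
    const int threadsPerBlock = 64;

    int *blockValues = nullptr;
    int *total = nullptr;
    cudaMalloc(&blockValues, numBlocks * sizeof(int));
    cudaMalloc(&total, sizeof(int));

    // Kernel arguments are passed as an array of pointers to each argument.
    void *args[] = { &blockValues, &total };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)gridSyncKernel, dim3(numBlocks), dim3(threadsPerBlock),
        args, 0, 0);
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int hostTotal = 0;
    cudaMemcpy(&hostTotal, total, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum of block values: %d\n", hostTotal);

    cudaFree(blockValues);
    cudaFree(total);
    return 0;
}
```

If the single-step build works, the split build in the top comment is the usual next step when the kernel has to go into a library: the final link has to go through nvcc (or an explicit nvcc -dlink step) so that the device code is linked as well.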

I’m getting the same error using Nsight Eclipse Edition with a new project.

Is there a setting inside Nsight that I can use to handle the linking?

For anyone that needs it … I found the answer regarding Nsight on Stack Overflow:

https://stackoverflow.com/questions/38260577/generating-relocatable-device-code-using-nvidia-nsight

The defined way to enable this capability for an Nsight EE project is to do so at project creation time. After selecting File…New…CUDA C/C++ Project, you will be presented with the project creation wizard/dialog. Enter a project name and click “Next”. You will then be taken to the “Basic settings” dialog page. Here you will see an option “Device linker mode:” and the choices will be “Whole program compilation” (default) or “Separate compilation”. If you select “Separate compilation”, then your project will be set up for relocatable device code generation. – Robert Crovella Jul 8 '16 at 10:42

After a project is created, you can also make this change by going to Project…Properties…Build…Settings. Here you will see a page similar to the one mentioned above in the “Basic settings” dialog page. You can similarly change “Device linker mode:” on this page from “Whole program compilation” to “Separate compilation” in order to turn on generation of relocatable device code, after the project has already been created. – Robert Crovella Jul 8 '16 at 10:47

I am trying to get the exact same code as above working to synchronize the grid. After adding -rdc=true the project compiles without error, but when run it returns

device kernel image is invalid

The error is thrown at the first CUDA function called, a cudaMalloc.

During the compilation there is a warning

nvlink warning : SM Arch ('sm_30') not found in 'CMakeFiles/generated.o'

The code works correctly with no warning at compilation if the rdc flag and cooperative group code is removed. How can this be fixed?

Turns out it was the linking that was causing the issue. Added the line

string(APPEND CMAKE_CUDA_FLAGS " -gencode arch=compute_70,code=sm_70")

to the CMakeLists.txt file as suggested here to resolve the error.
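The -gencode values have to match the GPU that will actually run the code (compute_70/sm_70 above is Volta; the TITAN Xp from the original question is compute capability 6.1). If you are not sure which values apply to your card, a small host-only program along these lines can print them. This is just a sketch: the output format is my own, and it only queries the runtime, so it needs no -rdc flag.

```cuda
// Build: nvcc query_arch.cu -o query_arch
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, d);
        if (err != cudaSuccess) {
            printf("device %d: %s\n", d, cudaGetErrorString(err));
            continue;
        }

        // Cooperative launch support is needed for grid-wide sync.
        int supportsCoop = 0;
        cudaDeviceGetAttribute(&supportsCoop, cudaDevAttrCooperativeLaunch, d);

        // major/minor form the compute capability, e.g. 6 and 1 -> sm_61.
        printf("device %d: %s\n", d, prop.name);
        printf("  suggested flag: -gencode arch=compute_%d%d,code=sm_%d%d\n",
               prop.major, prop.minor, prop.major, prop.minor);
        printf("  cooperative launch supported: %s\n",
               supportsCoop ? "yes" : "no");
    }
    return 0;
}
```

The printed flag can then be substituted into the CMAKE_CUDA_FLAGS line above.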