pxtas can't find symbol, but it's definitely in the object file?

jeremy123 · February 5, 2016, 4:18am

Hi,

I’m trying to fix a much more complicated problem in a real project related to separating device functions into multiple files, and I’m trying to understand how separable compilation works right now. Unfortunately, I can’t even get my trivial example to link properly.

kernel1.h:

#ifndef _KERNEL1_H_
#define _KERNEL1_H_

__device__ __host__ void adder(int *a, int *b, int *x);

#endif

kernel2.h:

#ifndef _KERNEL2_H_
#define _KERNEL2_H_

__device__ __host__ void doubler(int *a, int *x);

#endif

kernel1.cu:

#include "kernel1.h"

__device__ __host__ void adder(int *a, int *b, int *x) {
    *x = *a + *b;
}

kernel2.cu:

#include "kernel1.h"
#include "kernel2.h"

__device__ __host__ void doubler(int *a, int *x) {
    adder(a, a, x);
}

main.cu:

#include <cuda.h>
#include <cstdio>

#include "kernel2.h"

__global__ void launch_doubler(int *a, int *x) {
    doubler(a, x);
}

int main(void) {
    int *a;
    int *x;

    int a_host = 5;

    cudaMalloc((void **)&a, sizeof(int));
    cudaMalloc((void **)&x, sizeof(int));
    cudaMemcpy(a, &a_host, sizeof(int), cudaMemcpyHostToDevice);

    launch_doubler<<<1, 1>>>(a, x);

    cudaMemcpy(&a_host, a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%i\n", a_host);

    cudaFree(a);
    cudaFree(x);
}

build.sh:

#!/bin/sh

set -o verbose

nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel1.cu -o kernel1.o
nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel2.cu -o kernel2.o
nvcc -gencode arch=compute_35,code=sm_35 -dlink kernel1.o kernel2.o -o kernels.o

nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 main.cu kernels.o

Unfortunately, when I run build.sh, I get the following error:

ptxas fatal   : Unresolved extern function '_Z7doublerPiS_'

and yet, the symbol does exist in the object file:

$ strings kernels.o | grep _Z7doublerPiS_
.text._Z7doublerPiS_
.nv.info._Z7doublerPiS_
.rel.text._Z7doublerPiS_
_Z7doublerPiS_
.text._Z7doublerPiS_
.nv.info._Z7doublerPiS_
.rel.text._Z7doublerPiS_

I also get the same pxtas error if I run

$ nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 main.cu kernel1.cu kernel2.cu
ptxas fatal   : Unresolved extern function '_Z7doublerPiS_'

Can someone please help me understand what is happening here? If it helps, my version info is

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

Robert_Crovella · February 5, 2016, 5:01am

Your second method:

$ nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 main.cu kernel1.cu kernel2.cu

is broken because you are not invoking separate compilation/device linking, and whenever a cuda device code in one file (compilation unit) calls a cuda device code in a another file, separate compilation/device-linking is required.

Regarding your first method, if you look carefully at the examples given in the nvcc documentation for separate compilation:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#examples

You will find 2 ways to fix this.

Your existing method is separating the device-compile, device-link, and final-link steps into 3 operations. The intermediate device-link step should only be needed if the final-link is being performed by the host linker. If you are doing final link with nvcc, just pass the device-compiled objects (along with main.cu, in this case) and specify -rdc=true:

#!/bin/sh

set -o verbose

nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel1.cu -o kernel1.o
nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel2.cu -o kernel2.o

nvcc -std c++11 -rdc=true -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 main.cu kernel1.o kernel2.o

If you want an intermediate device-link step, then you should device-compile all your modules (.cu files) and then perform the device link step (on all the previously created objects), and then do the final link with the host compiler, while passing all objects (both the device-linked and the non-device linked objects) to the host compiler:

#!/bin/sh

set -o verbose

nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel1.cu -o kernel1.o
nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc kernel2.cu -o kernel2.o
nvcc -std c++11 -Xcompiler -Wall -gencode arch=compute_35,code=sm_35 -dc main.cu -o main.o
nvcc -gencode arch=compute_35,code=sm_35 -dlink main.o kernel1.o kernel2.o -o kernels.o

g++ main.o kernel1.o kernel2.o kernels.o -o test -L/usr/local/cuda/lib64 -lcudart

jeremy123 · February 5, 2016, 6:58am

Hi txbob,

Thanks so much! That makes sense now, I missed the -rdc flag.