My application uses CUDA kernels for the bulk of its computations. For good reasons (out of scope for this question) I use a shared object/linking model to dynamically load the object files, each of which contains one host function and one CUDA kernel. Since kernels cannot be extern, the basic structure of such a "kernel" is:
__global__ void kernel() { /* ...code... */ }

extern "C" void call_kernel() {
    kernel<<<GRID, BLOCK, SHMEM>>>();
}
So each object exports one host function whose sole purpose is to launch the kernel. To build the shared object I use:
nvcc -arch=sm_20 -m64 --compiler-options -fPIC,-shared -link -o kernel0.o kernel0.cu
The whole app uses lots of these kernels, and they are loaded with dlopen(). The whole thing works fine as long as everything (building, loading, executing) stays on one machine, A.
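The loading side looks roughly like this (a trimmed sketch of the pattern; load_and_run and the error handling are illustrative, only call_kernel and the kernel0.o name come from the actual code):

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: dlopen() one kernel object and invoke its extern "C"
 * entry point, which in turn launches the embedded CUDA kernel. */
int load_and_run(const char *path)   /* e.g. "./kernel0.o" */
{
    void *handle = dlopen(path, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    void (*call_kernel)(void) = (void (*)(void))dlsym(handle, "call_kernel");
    if (!call_kernel) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    call_kernel();   /* host wrapper launches the kernel */
    dlclose(handle);
    return 0;
}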
But when I compile/build the shared objects on, say, machine B (CUDA 4.1, NVIDIA C2050) and dlopen() them later on machine A (CUDA 4.0, GTX 480), the computation does not yield the same result as when the shared objects are also built on machine A.
That sounds odd to me. Isn’t there a CUBIN object embedded in the .o file which contains instructions that are independent of the particular GPU architecture?
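(As a sanity check, what is actually embedded in such an object can be inspected with cuobjdump, assuming the toolkit's cuobjdump accepts these flags; I haven't yet compared the two machines this way:

cuobjdump -ptx kernel0.o
cuobjdump -sass kernel0.o

The first dumps any embedded PTX, the second the architecture-specific machine code.)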
I know that it is advisable to use the same compiler version for building and linking. Again, I have good reasons not to build the shared objects on the same machine on which they are executed.