Can a CUDA kernel call CUBLAS functions, and how do I call CUBLAS functions from Python?

Hello,

The code below gave me this error:

"ptxas fatal : Unresolved extern function 'cublasCreate_v2'"

If I cannot call CUBLAS functions from kernels, how would I normally call them from Python? When I try, I get name-mangling errors even though I put extern "C" before the function declarations:

Code snippet:

__global__ void cublas_gemm(float* A, float* B, float* C, int m, int n, int k){

	cublasHandle_t handle;
	cublasStatus_t status = cublasCreate(&handle);

	float alpha = 1.0, beta = 0.0;

	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, B, m, A, n, &beta, C, m);

}

(1) Yes, CUBLAS also comes in a version callable from device code; see the CUBLAS documentation. There is probably a relevant example among the many sample apps provided by NVIDIA as part of the CUDA distribution. [Later:] I think the app you want to look at is simpleDevLibCUBLAS.

(2) I am pretty sure that CUBLAS bindings for Python exist. Have you checked what PyCUDA offers? [Later:] The PyCUDA FAQ has an entry on this: https://wiki.tiker.net/PyCuda/FrequentlyAskedQuestions#Are_the_CUBLAS_APIs_available_via_PyCUDA.3F

Hello,

I looked at the simpleDevLibCUBLAS example. It calls the CUBLAS APIs from a kernel, and that kernel is launched from an ordinary C-style host function. simpleDevLibCUBLAS.cpp then calls this host function through a forward declaration.

Similarly, how can I call this C-style function from Python, the way the .cpp file does?
I tried using PyCUDA from a Python file, as in the code snippet below, but got the following errors:

pycuda.driver.CompileError: nvcc compilation of /tmp/tmpONE9yY/kernel.cu failed
[command: nvcc --cubin -I/usr/local/cuda-8.0/samples/common/inc/ -arch sm_61 -I/usr/lib64/python2.7/site-packages/pycuda-2017.1.1-py2.7-linux-x86_64.egg/pycuda/cuda kernel.cu]
[stderr:
/usr/include/c++/4.8/bits/memoryfwd.h(63): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/memoryfwd.h(66): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/memoryfwd.h(70): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stringfwd.h(52): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stringfwd.h(55): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stringfwd.h(59): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stringfwd.h(65): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/postypes.h(111): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/postypes.h(214): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/postypes.h(219): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(76): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(79): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(82): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(85): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(88): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(91): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(95): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(99): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(103): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(107): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(110): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(113): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(116): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(119): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/iosfwd(122): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(54): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(67): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(70): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(73): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(76): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(99): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(111): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(114): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(117): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(120): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/ext/numeric_traits.h(123): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/move.h(45): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/move.h(164): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/move.h(183): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(95): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(212): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(218): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(225): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(231): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(237): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(243): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_pair.h(284): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(116): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(162): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(174): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(185): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(199): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(208): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_types.h(216): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(71): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(88): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(112): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(121): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(132): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(148): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator_base_funcs.h(171): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator.h(95): error: this declaration may not have extern “C” linkage

/usr/include/c++/4.8/bits/stl_iterator.h(289): error: this declaration may not have extern “C” linkage

File Python.py:

mod = SourceModule(open("kernels.cu").read())
cublasGemm = mod.get_function("gemm")
cublasGemm(A, B, C[0], np.intc(m), np.intc(n), np.intc(k))

File kernels.cu:

__global__ void cublas_gemm(float* A, float* B, float* C, int m, int n, int k){

    cublasHandle_t handle;
    //cublasCreate(&handle);

    cublasStatus_t status = cublasCreate(&handle);

    float alpha = 1.0, beta = 0.0;
    // cuBLAS uses column-major storage
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, B, m, A, n, &beta, C, m);

}
#endif
//}

extern "C" void gemm(float* A, float* B, float* C, int m, int n, int k) {

    cublas_gemm<<<1,1>>>(A,B,C, m, n, k);

}

You don't need PyCUDA for that. You can use the ctypes package in Python to call C-callable routines in a shared object (DLL). Here is a fully worked example:

$ cat t421.cu
#include <cublas_v2.h>

typedef float ftype;
extern "C"
void run(ftype *i1, ftype  *i2, ftype *o1, int d);

__global__ void p(ftype *i1, ftype *i2, ftype *o1, int d){

  cublasHandle_t h;
  cublasCreate(&h);
  ftype alpha = 1.0;
  ftype beta = 0.0;
  cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, d, d, d, &alpha, i1, d, i2, d, &beta, o1, d);
}

void run(ftype *i1, ftype  *i2, ftype *o1, int d){

  ftype *d_i1, *d_i2, *d_o1;
  int ds = d*d*sizeof(ftype);
  cudaMalloc(&d_i1, ds);
  cudaMalloc(&d_i2, ds);
  cudaMalloc(&d_o1, ds);
  cudaMemcpy(d_i1, i1, ds, cudaMemcpyHostToDevice);
  cudaMemcpy(d_i2, i2, ds, cudaMemcpyHostToDevice);
  p<<<1,1>>>(d_i1, d_i2, d_o1, d);
  cudaMemcpy(o1, d_o1, ds, cudaMemcpyDeviceToHost);
  cudaFree(d_i1);
  cudaFree(d_i2);
  cudaFree(d_o1);
}

$ nvcc t421.cu -arch=sm_61 -o t421.so -shared -Xcompiler -fPIC -rdc=true -lcublas -lcublas_device -lcudadevrt
ptxas info    : 'device-function-maxrregcount' is a BETA feature
<...repeats...>
$ cat t421.py
import numpy
import ctypes

dim = 4

N = dim * dim

# initialize arrays
i1 = numpy.ones((N), dtype=numpy.float32)
i2 = numpy.ones((N), dtype=numpy.float32)
o1 = numpy.zeros((N), dtype=numpy.float32)

# import DLL
E = ctypes.cdll.LoadLibrary("./t421.so")

# run test
E.run( ctypes.c_void_p(i1.ctypes.data),
        ctypes.c_void_p(i2.ctypes.data),
        ctypes.c_void_p(o1.ctypes.data),
        ctypes.c_int(dim))

# print output array
for i in range(0, len(o1)):
    print(o1[i])
$ cuda-memcheck python t421.py
========= CUDA-MEMCHECK
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
4.0
========= ERROR SUMMARY: 0 errors
$

Since we are multiplying two 4x4 matrices that have been initialized to all 1, we expect the output to be all 4, as indicated.
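As a quick sanity check on that expectation, here is a minimal pure-Python sketch of the same product using the column-major layout that CUBLAS assumes. The helper name gemm_column_major is made up for illustration and is not part of any library; it is only a slow reference, not a replacement for the GPU call.

```python
def gemm_column_major(a, b, m, n, k):
    """Reference C = A * B on flat column-major lists (A is m x k, B is k x n)."""
    c = [0.0] * (m * n)
    for col in range(n):
        for row in range(m):
            acc = 0.0
            for p in range(k):
                # Column-major: element (r, c) of a matrix with ld rows sits at r + c * ld.
                acc += a[row + p * m] * b[p + col * k]
            c[row + col * m] = acc
    return c


dim = 4
ones = [1.0] * (dim * dim)
result = gemm_column_major(ones, ones, dim, dim, dim)
print(result)  # sixteen values, each the sum of four products 1.0 * 1.0
```

Each output element is a dot product of a row of ones with a column of ones, so for dim = 4 every entry is 4.0, matching the GPU output above.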

This particular example demonstrates CUDA dynamic parallelism in the kernel call, since we are calling a device CUBLAS routine from the kernel. However, if you simply want to call a CUBLAS routine from Python, you do not need the CUDA kernel at all. You could write a wrapper function similar to "run" that calls the CUBLAS host library routines directly. The overall methodology would be similar.
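Whichever variant of "run" ends up in the shared object (the device-side version above or a host-only wrapper), the Python side can stay the same. The sketch below shows one way to declare the C signature to ctypes explicitly, so that mismatched arguments are rejected instead of being passed as raw void pointers. It assumes the exported function has the signature void run(float*, float*, float*, int), as in t421.cu; the helper names bind_run and multiply are hypothetical, and the library is only loaded if ./t421.so actually exists.

```python
import ctypes
import os

FLOAT_P = ctypes.POINTER(ctypes.c_float)


def bind_run(lib, name="run"):
    """Attach the assumed signature void run(float*, float*, float*, int) to lib.<name>."""
    fn = getattr(lib, name)
    fn.argtypes = [FLOAT_P, FLOAT_P, FLOAT_P, ctypes.c_int]
    fn.restype = None  # the C function returns void
    return fn


def multiply(fn, a, b, dim):
    """Call fn on two dim x dim matrices given as flat lists; return the output as a list."""
    count = dim * dim
    FloatArray = ctypes.c_float * count
    i1 = FloatArray(*a)
    i2 = FloatArray(*b)
    o1 = FloatArray()  # ctypes zero-initializes the output buffer
    fn(i1, i2, o1, dim)
    return list(o1)


if __name__ == "__main__":
    path = "./t421.so"  # shared object built by the nvcc command in this thread
    if os.path.exists(path):
        run = bind_run(ctypes.CDLL(path))
        print(multiply(run, [1.0] * 16, [1.0] * 16, 4))
```

Using ctypes arrays here avoids the numpy dependency; with numpy arrays, the i1.ctypes.data approach shown in t421.py works just as well.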

I get a link error when I run this, but I think that is because I am using CUDA 10; I read that CUBLAS functions cannot be called from device code in CUDA 10.

Correct, the functionality described here (CUBLAS calls from device code) was deprecated prior to CUDA 10, and the functionality was deleted from CUDA 10.

Release Notes :: CUDA Toolkit Documentation

Are there any alternative ways now (CUDA 10+) to call CUBLAS from device code?

None provided by NVIDIA.

You can build some kinds of linear algebra routines in your own CUDA kernels using CUTLASS:

https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/