How can I use a __device__ function pointer in CUDA?

Dear all,

How can I use a device function pointer in CUDA?

I don’t want to use a batch of “switch-case” statements to handle the device function selection in my program, since the number of functions is large.

But it seems a device function has no pointer in CUDA. Is there any feasible solution?

Thanks in advance!

No solution…

But you could hide the “switch-case” itself inside a __device__ function and call it in the appropriate place :-)
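For example, a minimal sketch of that idea (the op codes and function bodies here are just placeholders):

// One __device__ dispatcher holds the switch, so callers just pass an
// integer op code instead of selecting a function themselves.
__device__ float dispatch(int op, float x, float y)
{
    switch (op) {
    case 0:  return x + y;
    case 1:  return x * y;
    case 2:  return x / y;
    default: return 0.0f;
    }
}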

You can’t. At the moment, device functions are expanded inline by the compiler. They don’t actually exist as separate code objects and have no call stack or physical address.

Thanks! You guys are correct about the inline…

It seems I have to use “switch-case” to deal with it…

Hi

I have quite a similar problem: I would like to “link” a device function from the host code into the kernel, but I guess it won’t work. Any ideas for a workaround? What I’m trying to achieve is a C++ framework where you can easily parameterize and run a complicated kernel (a parallel genetic algorithm, actually), but there is one function which is problem specific (the fitness function), and I would like to be able to link it from the same file where I set the parameters. Multiple kernels wouldn’t be nice…

Is there any possible solution for this?

Well, to put it simply, I would like to actually inject some code into the kernel via the host function. Any possibilities for that?

If you know the arguments at compile time, you could use template instantiation with functors to define your kernel, parameterized by the template args.

I’ve done this many times to make multiple variants of kernels without having to cut and paste code together. Stuff like a base algorithm which takes an optional preprocess and postprocess functor. The postprocess may be a no-op, or maybe it does a test in the kernel so that only “high scoring” values are returned. The preprocess may be a no-op, or maybe it filters out possible candidates, etc.

It’s not as versatile as function pointers of course, but for a lot of hacks it’s really cool.
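Something like the following minimal sketch shows the shape of the technique (the NoOp/Clamp01 functors and scale_kernel are illustrative names, not anyone’s actual code):

// Functors resolved at compile time; each instantiation of the template
// produces a separate kernel with the calls inlined.
struct NoOp {
    __device__ float operator()(float x) const { return x; }
};
struct Clamp01 {
    __device__ float operator()(float x) const {
        return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
    }
};

template <typename Pre, typename Post>
__global__ void scale_kernel(const float *in, float *out, float s, int n,
                             Pre pre, Post post)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = post(pre(in[i]) * s);
    }
}

// Variants picked at compile time:
// scale_kernel<<<grid, block>>>(in, out, 2.0f, n, NoOp(), Clamp01());
// scale_kernel<<<grid, block>>>(in, out, 2.0f, n, NoOp(), NoOp());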

I’m pretty sure for my next project, I’ll be dynamically generating and compiling CUDA code on the fly to solve this problem. PyCUDA makes it very easy (and “developers” are the end-users for me, so nvcc is a prerequisite) to do things like this, and it seems to be the best approach in the absence of device-side linking. OpenCL definitely has an advantage here, as the compiler is built into the driver, so everyone already has it.

(In one of my previous projects, I used the template trick to generate all the possible combinations of kernel implementations I could possibly call at runtime. Was ugly but effective.)

[Clarified the wording of the initial paragraph in a later edit – nj]

Function pointers for device functions are supported in CUDA 3.2 on sm_2x platforms, based on the ABI that was introduced with CUDA 3.1. Here is a very simple app that shows that everything works exactly the same as function pointers in host code.

#include <stdio.h>
#include <stdlib.h>

#define N 5

__device__ float add_func (float x, float y)
{
    return x + y;
}

__device__ float mul_func (float x, float y)
{
    return x * y;
}

__device__ float div_func (float x, float y)
{
    return x / y;
}

typedef float (*op_func) (float, float);

__device__ op_func func[3] = { add_func, mul_func, div_func };

__device__ const char* op_name[3] = { "add", "mul", "div" };

__device__ void op_array (const float *a, const float *b, float *res, int op, int n)
{
    for (int i = 0; i < n; i++) {
        res[i] = func[op](a[i], b[i]);
    }
}

__global__ void kernel (void)
{
    float x[N];
    float y[N];
    float res[N];

    for (int i = 0; i < N; i++) {
        x[i] = (float)(10 + i);
    }
    for (int i = 0; i < N; i++) {
        y[i] = (float)(100 + i);
    }
    for (int op = 0; op < 3; op++) {
        printf ("\nop=%s\n", op_name[op]);
        op_array (x, y, res, op, N);
        for (int i = 0; i < N; i++) {
            printf ("res = % 16.9e\n", res[i]);
        }
    }
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaThreadSynchronize();
    return EXIT_SUCCESS;
}

The build and run log (Linux64, C2050) looks as follows:

~ $ nvcc -arch=sm_20 -o funcptr funcptr.cu
~ $ funcptr

op=add
res =  1.100000000e+02
res =  1.120000000e+02
res =  1.140000000e+02
res =  1.160000000e+02
res =  1.180000000e+02

op=mul
res =  1.000000000e+03
res =  1.111000000e+03
res =  1.224000000e+03
res =  1.339000000e+03
res =  1.456000000e+03

op=div
res =  1.000000015e-01
res =  1.089108884e-01
res =  1.176470593e-01
res =  1.262135953e-01
res =  1.346153915e-01

It has been pointed out to me that device function pointers are discussed in section B.1.4.4 of the CUDA C Programming Guide, and that there is an example application FunctionPointers in the SDK.

Well, templating is not an option for me (at least based on the little I know about templates), because the functions don’t have to be alike. Those preprocess/postprocess functors built with templates do sound very interesting, though; could you show a quick example? Thanks.

What I am trying to create is a closed, easy-to-use system: you just give it some arguments, but one of these “arguments” actually has to be a complete function that tells the algorithm what to do with the data provided, so it can calculate how good a given individual is. This function is totally problem specific; it could be a simple sqrt() or some insane differential equation. So I have to link it somehow, preferably through C++ host code, not just within the kernel: I have to pass the pointer to the kernel from the host function. I’ve seen the example of a device function pointer here, but I have to admit I’m pretty new to device pointers and don’t know if that example applies to my case; and for testing I would have to get a brand-new card…

So I would be really grateful if someone could help me clarify this :)

Thanks in advance!

If your function set is unknown at compile time (for example, if a function is defined by user input in an edit box), then the only option I can see for you is to generate the kernel source and compile your kernel dynamically. The 3.2 API lets you do that with cuModuleLoadDataEx.
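For what it’s worth, a minimal sketch of that route (the file names, the user_expr string, and gen_kernel are all made up for illustration, and error checking is omitted): generate the source, shell out to nvcc for PTX, then load the image with cuModuleLoadDataEx.

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // 1. Emit a kernel whose body comes from user input.
    const char *user_expr = "x * x + y";   // the problem-specific part
    FILE *src = fopen("gen.cu", "w");
    fprintf(src,
            "extern \"C\" __global__ void gen_kernel(float *r, float x, float y)\n"
            "{ *r = %s; }\n", user_expr);
    fclose(src);

    // 2. Compile it to PTX with the external toolchain.
    if (system("nvcc -arch=sm_20 -ptx gen.cu -o gen.ptx") != 0) return 1;

    // 3. Read the PTX image into memory (null-terminated).
    FILE *f = fopen("gen.ptx", "rb");
    fseek(f, 0, SEEK_END);
    long sz = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *ptx = (char *)malloc(sz + 1);
    fread(ptx, 1, sz, f);
    ptx[sz] = '\0';
    fclose(f);

    // 4. Load the module and fetch the kernel handle.
    CUcontext ctx; CUmodule mod; CUfunction fun;
    cuInit(0);
    cuCtxCreate(&ctx, 0, 0);
    cuModuleLoadDataEx(&mod, ptx, 0, NULL, NULL);
    cuModuleGetFunction(&fun, mod, "gen_kernel");
    // ... set up arguments and launch via the driver API ...
    cuCtxDestroy(ctx);
    free(ptx);
    return 0;
}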

Is it possible to set up a function pointer table dynamically from the host? For example, if I later want to implement other op_func functions, how would you register those functions dynamically from the host:

__device__ op_func func[MAX_NUM_OPS];

How should add_func, mul_func, and div_func be properly registered from the host?

I don’t think so (or at least I have not been able to get it to work). Using the driver API, it doesn’t seem possible to call cuModuleGetFunction with a device function name and retrieve a function handle. Looking inside a cubin file, only global functions seem to get written into the ELF index, even though device functions are visible in the symbol table. I am guessing that unless an object has an index entry in the cubin, the driver API does not expose any way of getting a handle to it.

Maybe not automatically, but can’t you manually keep a pointer to device functions in a device variable and then manipulate it from the host as desired?

Occam’s Razor in action. That does work just fine. Thanks for the suggestion.

Could you, please, provide an example? I believe I tried this to no avail.

I’m thinking something like this (although I haven’t actually tested it):

__device__ int func1(int par1)
{
    ...
}

__device__ int func2(int par1)
{
    ...
}

__device__ int func3(int par1)
{
    ...
}

typedef int (*func_t)(int);

__constant__ func_t func_table_d[3] = { func1, func2, func3 };

func_t func_table[3];

__global__ void testkernel(func_t func)
{
   ...
}

int main(void)
{
    int n = 1;
    ...
    cudaMemcpyFromSymbol(func_table, "func_table_d", sizeof(func_table), 0, cudaMemcpyDeviceToHost);
    testkernel<<<12, 34>>>(func_table[n]);
    ...
}

Caveat: Completely untested!

I was able to prove it to myself using this hacked version of Norbert’s example:

#include <stdio.h>
#include <stdlib.h>

#define N 5

__device__ float add_func (float x, float y)
{
    return x + y;
}

__device__ float mul_func (float x, float y)
{
    return x * y;
}

__device__ float div_func (float x, float y)
{
    return x / y;
}

typedef float (*op_func) (float, float);

__device__ op_func action;

__device__ op_func funcs[3] = { add_func, mul_func, div_func };

__device__ void op_array (const float *a, const float *b, float *res, op_func f, int n)
{
    for (int i = 0; i < n; i++) {
        res[i] = f(a[i], b[i]);
    }
}

__global__ void kernel (void)
{
    float x[N];
    float y[N];
    float res[N];

    for (int i = 0; i < N; i++) {
        x[i] = (float)(10 + i);
    }
    for (int i = 0; i < N; i++) {
        y[i] = (float)(100 + i);
    }

    op_array (x, y, res, action, N);
    for (int i = 0; i < N; i++) {
        printf ("res = % 16.9e\n", res[i]);
    }
}
and this host-side program (funcpointmain.c), which drives it through the driver API:

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUmodule   mhandle;
    CUcontext  chandle;
    CUfunction khandle;
    CUdeviceptr funcshandle, actionhandle;
    size_t op_funcsz;

    cuInit(0);
    cuCtxCreate( &chandle, 0, 0 );
    cuModuleLoad( &mhandle, "funcpointer.cubin");
    cuModuleGetFunction( &khandle, mhandle, "_Z6kernelv");
    cuModuleGetGlobal( &funcshandle, NULL, mhandle, "funcs");
    cuModuleGetGlobal( &actionhandle, &op_funcsz, mhandle, "action");

    for (int i = 0; i < 3; i++) {
        size_t offset = size_t(i) * op_funcsz;
        cuMemcpyDtoD(actionhandle, funcshandle + offset, op_funcsz);
        cuFuncSetBlockShape( khandle, 1, 1, 1 );
        cuLaunchGrid( khandle, 1, 1);
        cuCtxSynchronize();
    }

    cuCtxDestroy(chandle);
    return 0;
}
avidday@cuda:~$ nvcc -cubin -arch=sm_20 -Xptxas="-v" funcpointer.cu
ptxas info    : Compiling entry function '_Z6kernelv' for 'sm_20'
ptxas info    : Used 22 registers, 32 bytes cmem[0], 28 bytes cmem[14]
avidday@cuda:~$ g++ funcpointmain.c -I $CUDA_INSTALL_PATH/include -lcuda -o funcpointer.exe
avidday@cuda:~$ ./funcpointer.exe

res =  1.100000000e+02
res =  1.120000000e+02
res =  1.140000000e+02
res =  1.160000000e+02
res =  1.180000000e+02
res =  1.000000000e+03
res =  1.111000000e+03
res =  1.224000000e+03
res =  1.339000000e+03
res =  1.456000000e+03
res =  1.000000015e-01
res =  1.089108884e-01
res =  1.176470593e-01
res =  1.262135953e-01
res =  1.346153915e-01

This works, yes, but it’s slightly different from the problem I had in mind.

Some pseudocode is below.

On device:

typedef float (*op_func) (float, float);

struct foo
{
   float a, b;
   op_func op;
};

__device__ foo* foo_array;

__global__ void kernel( ... )
{
    float val = ( *foo_array[i].op )( .... );
}

On host:

// Allocate a host staging array for foo_array and fill it in
foo* foo_array_h = (foo*) malloc( sizeof( foo ) * N );

for( i = 0; i < N; i++ ) {
     foo_array_h[i].a = ..
     foo_array_h[i].b = ..
     foo_array_h[i].op = ??????
}

// cudaMalloc a device buffer and cudaMemcpy foo_array_h
// into it to be used in the kernel

Since you cannot take the address of a device function on the host, I can’t set the op function pointer. It seems that this cannot be done at all given current CUDA restrictions.

One way around it is to create a table of function pointers on the device and store an index into this table instead of the pointer itself:

struct foo
{
   float a, b;
   int op;
};

__device__ foo* foo_array;

__device__ op_func op_array[] = { add_op, mul_op, ... };

__global__ void kernel( ... )
{
    float val = ( *op_array[foo_array[i].op] )( .... );
}

This works, but it’s not very flexible, and it requires maintaining a table of all ops, which becomes problematic if other users (who don’t necessarily have access to the op table) are allowed to write their own op_func and extend the table.

Suggestions?

But you can, and the code I posted does exactly that:

cuModuleGetGlobal( &funcshandle, NULL, mhandle, "funcs");

This gets a pointer to the array of function pointers on the device side. You should be able to do the assignment you want using the same idea: read the function pointer value from a device symbol, assign it to a value in host memory, and copy it back into device memory. My code uses a device-to-device copy to do the same thing, but the idea is still the same.
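To make that concrete, here is a minimal, untested sketch with the runtime API, reusing the foo/op_func names from the pseudocode above (build_foo_array and add_func_ptr are made up for illustration):

typedef float (*op_func)(float, float);

__device__ float add_func(float x, float y) { return x + y; }

// Device-side symbol initialized with the function's device address.
__device__ op_func add_func_ptr = add_func;

struct foo { float a, b; op_func op; };

void build_foo_array(foo **foo_array_d, int n)
{
    // Read the function pointer *value* back to the host. We never
    // dereference it on the host; we only shuttle the bits around.
    op_func add_ptr_h;
    cudaMemcpyFromSymbol(&add_ptr_h, "add_func_ptr", sizeof(op_func), 0,
                         cudaMemcpyDeviceToHost);

    foo *foo_array_h = (foo *)malloc(sizeof(foo) * n);
    for (int i = 0; i < n; i++) {
        foo_array_h[i].a  = 1.0f;
        foo_array_h[i].b  = 2.0f;
        foo_array_h[i].op = add_ptr_h;   // device address, valid on device
    }
    cudaMalloc((void **)foo_array_d, sizeof(foo) * n);
    cudaMemcpy(*foo_array_d, foo_array_h, sizeof(foo) * n,
               cudaMemcpyHostToDevice);
    free(foo_array_h);
}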