Passing variables by reference to __device__ function

Hi, I need help with passing arguments to device functions. Simply, what is giving me trouble is the following case:

I have 2 device functions:
device void foo1( float* a )
{

foo2( a ); //Notice I have not used the “&” this time as “a” is already a pointer.

}
device void foo2( float* b )
{

*b = *b + 2;

}

In the kernel I have the following:
global void kernel ( … )
{
float x;

foo1( &x);

}

Well, the code doesn’t work. While debugging I noticed that when calling foo1(), the parameter (address of x) is passed ok, but when foo1() calls foo2() there is a problem. What I see is that b is placed in a register. I can’t see its content.

Any idea? Is it recommended not to pass arguments by reference?

Can you post a complete example that reproduces the issue? Also, what version of the CUDA Toolkit are you using?

Thanks,

Cliff

Can you post a complete example that reproduces the issue? Also, what version of the CUDA Toolkit are you using?

Thanks,

Cliff

Thanks Cliff for your reply. The code is too long to put it here. I’m using CUDA Toolkit 3.1. I just want to know if passing arguments this way should work. When debugging, I see that some arguments (pointers actually) are in a register, and I can’t see their content, so I don’t know whether they are being passed ok, or not. I’ll try to put a portion of the code here.

Thanks Cliff for your reply. The code is too long to put it here. I’m using CUDA Toolkit 3.1. I just want to know if passing arguments this way should work. When debugging, I see that some arguments (pointers actually) are in a register, and I can’t see their content, so I don’t know whether they are being passed ok, or not. I’ll try to put a portion of the code here.

Hi,

The following code works afaik, so is there really a problem?

EDIT: replaced the pointers in the device routines with references, which is more easily readable,

since you can use the passed variables without derefencing.

This is not available for global functions, the compiler will issue an error.

[codebox]include <stdio.h>

include <cuda.h>

include “cutil_inline.h”

device void foo2( float &b )

{

//...

b = b + 2;

//...

}

device void foo1( float &a )

{

//.....

foo2( a ); //Notice you don't need an & as the compiler understands the reference.

//.....

}

global void kernel ( float *result, float kp )

{

//...

foo1( kp);

*result=kp;                      // result cannot be passed by reference (__global__ function)

//..

}

int main()

{

float a=3.f,*d_result,h_result;

cudaMalloc(&d_result,sizeof(*d_result));

kernel <<<1,1>>>(d_result,a);

cutilCheckMsg( "Kernel execution failed" );

cudaThreadSynchronize();

cudaMemcpy(&h_result,d_result,sizeof(h_result),cudaMemcpyDeviceToHos

t);

printf("3+2=%.0f\n",h_result);

return 0;

}

[/codebox]

Hi,

The following code works afaik, so is there really a problem?

EDIT: replaced the pointers in the device routines with references, which is more easily readable,

since you can use the passed variables without derefencing.

This is not available for global functions, the compiler will issue an error.

[codebox]include <stdio.h>

include <cuda.h>

include “cutil_inline.h”

device void foo2( float &b )

{

//...

b = b + 2;

//...

}

device void foo1( float &a )

{

//.....

foo2( a ); //Notice you don't need an & as the compiler understands the reference.

//.....

}

global void kernel ( float *result, float kp )

{

//...

foo1( kp);

*result=kp;                      // result cannot be passed by reference (__global__ function)

//..

}

int main()

{

float a=3.f,*d_result,h_result;

cudaMalloc(&d_result,sizeof(*d_result));

kernel <<<1,1>>>(d_result,a);

cutilCheckMsg( "Kernel execution failed" );

cudaThreadSynchronize();

cudaMemcpy(&h_result,d_result,sizeof(h_result),cudaMemcpyDeviceToHos

t);

printf("3+2=%.0f\n",h_result);

return 0;

}

[/codebox]

What happens if you move foo2 in front of foo1? Inlined functions are supposed to precede the calling functions, and all functions are inlined by default under CUDA.

Apart from the inlining issue, you might need a function prototype for foo2 to indicate it takes a float* as argument, not an int.

What happens if you move foo2 in front of foo1? Inlined functions are supposed to precede the calling functions, and all functions are inlined by default under CUDA.

Apart from the inlining issue, you might need a function prototype for foo2 to indicate it takes a float* as argument, not an int.

Ah, yes, I agree this is probably what was going on: foo2()'s prototype was missing.

Ah, yes, I agree this is probably what was going on: foo2()'s prototype was missing.

Sorry for not posting the complete code. Actually, the prototypes are defined before first function call for all the functions. I had once that problem and learnt the lesson.

I think I solved the problem. It was just some constant values that had wrong values.

And I say “I THINK”, because I have another problem. I cannot debug the hole code. If I compile the hole code with de -g and -G options, everything goes ok, but when I try to debug the code, it seams symbols are missing, because when breaking in the kernel, it says “Single stepping until exit from function Z3LORP3argPfS1_S1_S1”, and I cannot trace the code. So, what I have to do is remove parts of the code, calls to some “device functions”, and that way I can single step. The problem is that I need all the code to check that everything works ok.

Any idea why the compiler may not be including debug symbols?

Sorry for not posting the complete code. Actually, the prototypes are defined before first function call for all the functions. I had once that problem and learnt the lesson.

I think I solved the problem. It was just some constant values that had wrong values.

And I say “I THINK”, because I have another problem. I cannot debug the hole code. If I compile the hole code with de -g and -G options, everything goes ok, but when I try to debug the code, it seams symbols are missing, because when breaking in the kernel, it says “Single stepping until exit from function Z3LORP3argPfS1_S1_S1”, and I cannot trace the code. So, what I have to do is remove parts of the code, calls to some “device functions”, and that way I can single step. The problem is that I need all the code to check that everything works ok.

Any idea why the compiler may not be including debug symbols?