__device__ function clarifications
Hi I have some questions regarding __device__ functions, mostly about the style of argument/result passing between the caller and callee.

0. Is it safe to pass a pointer of local variable as a function argument? Local variables in the kernel are mapped to registers, so the idea of taking a pointer
of register bothers me.

1. The __device__ functions are always inlined. So if I am trying to pass a struct as an input argument, I don't have to pass the pointer of it to reduce the overhead, right?

2. Likewise, due to inlining, Passing a result pointer to return a complex return data doesn't really improve anything, right? For example

[codebox]
struct bigstruct {
...
};

__device__ bigstruct foo1 () {
bigstruct ret;
... populate ret ...
return ret;
}

__device__ void foo2(bigstruct* ret_p) {
... populate *ret_p ...
}
[/codebox]

foo2 and foo1 are not really different in the overhead-wise, right?

3. Is reference(&) allowed for the kernel code? like this..

[codebox]
__device__ void foo3(bigstruct &ret) {
... populate ret ...
}
[/codebox]

It's confusing what is supported and what isn't supported in the device code. Kind of hard to draw a hard line between C and C++...

3. I consulted the "programming guide" and "reference manual". Are there any other documentation that I can read? Very little information is in the programming guide.
Hi I have some questions regarding __device__ functions, mostly about the style of argument/result passing between the caller and callee.



0. Is it safe to pass a pointer of local variable as a function argument? Local variables in the kernel are mapped to registers, so the idea of taking a pointer

of register bothers me.



1. The __device__ functions are always inlined. So if I am trying to pass a struct as an input argument, I don't have to pass the pointer of it to reduce the overhead, right?



2. Likewise, due to inlining, Passing a result pointer to return a complex return data doesn't really improve anything, right? For example



[codebox]

struct bigstruct {

...

};



__device__ bigstruct foo1 () {

bigstruct ret;

... populate ret ...

return ret;

}



__device__ void foo2(bigstruct* ret_p) {

... populate *ret_p ...

}

[/codebox]



foo2 and foo1 are not really different in the overhead-wise, right?



3. Is reference(&) allowed for the kernel code? like this..



[codebox]

__device__ void foo3(bigstruct &ret) {

... populate ret ...

}

[/codebox]



It's confusing what is supported and what isn't supported in the device code. Kind of hard to draw a hard line between C and C++...



3. I consulted the "programming guide" and "reference manual". Are there any other documentation that I can read? Very little information is in the programming guide.

#1
Posted 12/04/2008 12:49 AM   
[quote name='mjeong' post='472135' date='Dec 3 2008, 06:49 PM']0. Is it safe to pass a pointer of local variable as a function argument? Local variables in the kernel are mapped to registers, so the idea of taking a pointer
of register bothers me.[/quote]
Taking the address of a register will lead to erroneous behavior. At least the last forum post on the topic had a program that compiled fine but ran incorrectly, so presumably the compiler doesn't catch this problem..

[quote]1. The __device__ functions are always inlined. So if I am trying to pass a struct as an input argument, I don't have to pass the pointer of it to reduce the overhead, right?[/quote]
Right.

[quote]2. Likewise, due to inlining, Passing a result pointer to return a complex return data doesn't really improve anything, right? For example
foo2 and foo1 are not really different in the overhead-wise, right?[/quote]
Right.

[quote]3. Is reference(&) allowed for the kernel code? like this..[/quote]
Yes.

[quote]3. I consulted the "programming guide" and "reference manual". Are there any other documentation that I can read? Very little information is in the programming guide.[/quote]
Well, there is a wealth of information in the programming guide. 90% of all questions on this form could be solved simply if people read it before posting. Your question isn't one of those, though: for whatever reason, NVIDIA has chosen not to document what C/C++ features are allowed in kernels and what are not.

FYI, here are a few more undocumented features:
templated kernel code works very well, though it is technically unsupported as far as I know.

simple classes with __device__ member functions also work if you are very careful in how you write them (i.e. only simple data members, all members inlined, no requirement for dynamic memory, no polymorphism and a few other gotchas I can't think of at the moment)
[quote name='mjeong' post='472135' date='Dec 3 2008, 06:49 PM']0. Is it safe to pass a pointer of local variable as a function argument? Local variables in the kernel are mapped to registers, so the idea of taking a pointer

of register bothers me.

Taking the address of a register will lead to erroneous behavior. At least the last forum post on the topic had a program that compiled fine but ran incorrectly, so presumably the compiler doesn't catch this problem..



1. The __device__ functions are always inlined. So if I am trying to pass a struct as an input argument, I don't have to pass the pointer of it to reduce the overhead, right?


Right.



2. Likewise, due to inlining, Passing a result pointer to return a complex return data doesn't really improve anything, right? For example

foo2 and foo1 are not really different in the overhead-wise, right?


Right.



3. Is reference(&) allowed for the kernel code? like this..


Yes.



3. I consulted the "programming guide" and "reference manual". Are there any other documentation that I can read? Very little information is in the programming guide.


Well, there is a wealth of information in the programming guide. 90% of all questions on this form could be solved simply if people read it before posting. Your question isn't one of those, though: for whatever reason, NVIDIA has chosen not to document what C/C++ features are allowed in kernels and what are not.



FYI, here are a few more undocumented features:

templated kernel code works very well, though it is technically unsupported as far as I know.



simple classes with __device__ member functions also work if you are very careful in how you write them (i.e. only simple data members, all members inlined, no requirement for dynamic memory, no polymorphism and a few other gotchas I can't think of at the moment)

#2
Posted 12/04/2008 01:00 PM   
Thanks for the answer.

[quote name='MisterAnderson42' post='472344' date='Dec 4 2008, 07:00 AM']Taking the address of a register will lead to erroneous behavior. At least the last forum post on the topic had a program that compiled fine but ran incorrectly, so presumably the compiler doesn't catch this problem..[/quote]
Could you provide the link to the post? It's hard to find anything in this forum. Turn off that buggy flood-control NVIDIA!

Taking a pointer and dereferencing it is a fundamental C feature that I should not worry about... it's disappointing. what should I expect to work. :">

I don't know how a compiler inlines a function, but as I don't do any pointer arithmetic, it should be able to figure out and eliminates & and * s and substitute them with regular variable..

[quote]Right.


Right.


Yes.


Well, there is a wealth of information in the programming guide. 90% of all questions on this form could be solved simply if people read it before posting. Your question isn't one of those, though: for whatever reason, NVIDIA has chosen not to document what C/C++ features are allowed in kernels and what are not.[/quote]

Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.

[quote]FYI, here are a few more undocumented features:
templated kernel code works very well, though it is technically unsupported as far as I know.

simple classes with __device__ member functions also work if you are very careful in how you write them (i.e. only simple data members, all members inlined, no requirement for dynamic memory, no polymorphism and a few other gotchas I can't think of at the moment)[/quote]

That's interesting. I'd love to try them but kinda hesitating because it would be just one more suspect when things do not work as i want..
Thanks for the answer.



[quote name='MisterAnderson42' post='472344' date='Dec 4 2008, 07:00 AM']Taking the address of a register will lead to erroneous behavior. At least the last forum post on the topic had a program that compiled fine but ran incorrectly, so presumably the compiler doesn't catch this problem..

Could you provide the link to the post? It's hard to find anything in this forum. Turn off that buggy flood-control NVIDIA!



Taking a pointer and dereferencing it is a fundamental C feature that I should not worry about... it's disappointing. what should I expect to work. :">



I don't know how a compiler inlines a function, but as I don't do any pointer arithmetic, it should be able to figure out and eliminates & and * s and substitute them with regular variable..



Right.





Right.





Yes.





Well, there is a wealth of information in the programming guide. 90% of all questions on this form could be solved simply if people read it before posting. Your question isn't one of those, though: for whatever reason, NVIDIA has chosen not to document what C/C++ features are allowed in kernels and what are not.




Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.



FYI, here are a few more undocumented features:

templated kernel code works very well, though it is technically unsupported as far as I know.



simple classes with __device__ member functions also work if you are very careful in how you write them (i.e. only simple data members, all members inlined, no requirement for dynamic memory, no polymorphism and a few other gotchas I can't think of at the moment)




That's interesting. I'd love to try them but kinda hesitating because it would be just one more suspect when things do not work as i want..

#3
Posted 12/05/2008 07:43 PM   
At NVISION slides were shown that said NVIDIA is working to C++ and Fortran support fro CUDA. Given the fact that templates have been working for a long time (SDK examples use them), and the fact that more C++ features are apparently working in 2.1 beta, I think NVIDIA is taking the gradual approach to the C++ support. So I would expect that you will see more and more C++ functionality working in CUDA with each release.
At NVISION slides were shown that said NVIDIA is working to C++ and Fortran support fro CUDA. Given the fact that templates have been working for a long time (SDK examples use them), and the fact that more C++ features are apparently working in 2.1 beta, I think NVIDIA is taking the gradual approach to the C++ support. So I would expect that you will see more and more C++ functionality working in CUDA with each release.

greets,
Denis

#4
Posted 12/05/2008 08:37 PM   
[quote name='mjeong' post='472906' date='Dec 5 2008, 01:43 PM']Could you provide the link to the post? It's hard to find anything in this forum. Turn off that buggy flood-control NVIDIA![/quote]
I guess I was remembering incorrectly. This is the post I was thinking about, which results in a compiler crash:
[url="http://forums.nvidia.com/index.php?showtopic=78442&hl=address+of+register"]http://forums.nvidia.com/index.php?showtop...ess+of+register[/url]

[quote]Taking a pointer and dereferencing it is a fundamental C feature that I should not worry about... it's disappointing. what should I expect to work. :">[/quote]
The issue is that CUDA has many memory models (registers, global, local, shared) and if you get too creative with pointers the compiler cannot determine which one the pointer points to.

[quote]I don't know how a compiler inlines a function, but as I don't do any pointer arithmetic, it should be able to figure out and eliminates & and * s and substitute them with regular variable..[/quote]
I didn't think of that. It probably will work, as long as you don't do pointer arithmetic or array indexing on it. Why don't you try it out?

[quote]Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.[/quote]
Code I've debugged that ran in emulation and not device is that either:
1) Had out of bounds or uninitialized memory accesses (debugged with valgrind on emulation mode)
1a) Forgot to allocate dynamic shared memory in the kernel launch
or 2) Reading a host pointer on the device (only way to find these is to carefully comb through your code and use good naming to differentiate between host and device pointers).
[quote name='mjeong' post='472906' date='Dec 5 2008, 01:43 PM']Could you provide the link to the post? It's hard to find anything in this forum. Turn off that buggy flood-control NVIDIA!

I guess I was remembering incorrectly. This is the post I was thinking about, which results in a compiler crash:

http://forums.nvidia.com/index.php?showtop...ess+of+register



Taking a pointer and dereferencing it is a fundamental C feature that I should not worry about... it's disappointing. what should I expect to work. :">


The issue is that CUDA has many memory models (registers, global, local, shared) and if you get too creative with pointers the compiler cannot determine which one the pointer points to.



I don't know how a compiler inlines a function, but as I don't do any pointer arithmetic, it should be able to figure out and eliminates & and * s and substitute them with regular variable..


I didn't think of that. It probably will work, as long as you don't do pointer arithmetic or array indexing on it. Why don't you try it out?



Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.


Code I've debugged that ran in emulation and not device is that either:

1) Had out of bounds or uninitialized memory accesses (debugged with valgrind on emulation mode)

1a) Forgot to allocate dynamic shared memory in the kernel launch

or 2) Reading a host pointer on the device (only way to find these is to carefully comb through your code and use good naming to differentiate between host and device pointers).

#5
Posted 12/05/2008 09:19 PM   
[quote name='mjeong' post='472906' date='Dec 5 2008, 03:43 PM']Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.[/quote]
By commenting chunks out.
[quote name='mjeong' post='472906' date='Dec 5 2008, 03:43 PM']Again, I don't know how ppl debug a kernel when it runs in emulation mode and doesn't on the device. I cannot gdb (well there's gdb 2.1 now) or printf to see what's going on it, and ... i can use any feature with a peace of mind.

By commenting chunks out.

#6
Posted 12/06/2008 03:44 AM   
C++ support would really save some porting effort, but would they be able to run it fast with all the abstractions? /shifty.gif' class='bbc_emoticon' alt=':shifty:' />

The debugging device code part, well, i finally found out the problem. I posted it to the relavent thread. Array indexing in the kernel code just didn't work.

[url="http://forums.nvidia.com/index.php?showtopic=73806"]http://forums.nvidia.com/index.php?showtopic=73806[/url]

I would add one more debugging methodology...

explicitly cudamemcopy the value you're interested in to the host and examine it.

well.. i put this off till I exhausted all the other options because it's sooooo TEDIOUS... but it did catch my problem, i i guess it's worth trying
C++ support would really save some porting effort, but would they be able to run it fast with all the abstractions? /shifty.gif' class='bbc_emoticon' alt=':shifty:' />



The debugging device code part, well, i finally found out the problem. I posted it to the relavent thread. Array indexing in the kernel code just didn't work.



http://forums.nvidia.com/index.php?showtopic=73806



I would add one more debugging methodology...



explicitly cudamemcopy the value you're interested in to the host and examine it.



well.. i put this off till I exhausted all the other options because it's sooooo TEDIOUS... but it did catch my problem, i i guess it's worth trying

#7
Posted 12/10/2008 07:12 AM   
Scroll To Top