questions about CUDA 3.1
I heard that CUDA supports C++ and recursion now. I am wondering if I can use STL::list or STL::vector in the __global__ and __device__ functions.

#1
Posted 07/13/2010 10:17 PM   
[quote name='laoen' post='1087569' date='Jul 13 2010, 06:17 PM']I heard that CUDA supports C++ and recursion now. I am wondering if I can use STL::list or STL::vector in the __global__ and __device__ functions.[/quote]

1) Recursion works in 3.1 + Fermi card:

#include <iostream>
#include <cuda.h>

__device__ int fact(int f)
{
    if (f == 0)
        return 1;
    else
        return f * fact(f - 1);   // device-side recursion: requires sm_20
}

__global__ void gfact(int * result, int f)
{
    *result = fact(f);
}

int main()
{
    dim3 Dg(1, 1);
    dim3 Db(1);
    size_t Ns = 0;

    // Mapped (zero-copy) host memory must be enabled before the CUDA
    // context is created, or cudaHostGetDevicePointer will fail.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int * result;
    cudaError_t errMalloc = cudaHostAlloc((void**)&result, sizeof(int), cudaHostAllocMapped);
    int * dresult;
    cudaError_t err = cudaHostGetDevicePointer((void**)&dresult, (void*)result, 0);
    if (errMalloc != cudaSuccess || err != cudaSuccess)
        return 1;

    gfact<<<Dg, Db, Ns>>>(dresult, 5);
    cudaThreadSynchronize();
    cudaError_t ek = cudaGetLastError();
    if (ek != cudaSuccess)
    {
        std::cout << "Error in kernel call " << ek << std::endl;
        return 1;
    }
    std::cout << "Value = " << *result << std::endl;
    cudaFreeHost(result);
    return 0;
}

However, I have to compile with sm_20 and run it on a Fermi card. The kernel call fails with error code 8 (cudaErrorInvalidDeviceFunction) on a GeForce 8900.
It won't compile under sm_11:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.1\bin\nvcc.exe" ...
produces: "Error: Recursive function call is not supported yet: fact(int)"


2) Ever since I've used CUDA (2.x), it has supported C++ features like classes, templates, etc. You can use classes in your kernel code, but depending on what you want, it can be cumbersome. In classes, you have to put __device__ (and sometimes __global__, too) qualifiers on your member functions, but you cannot put them on constructors or destructors. You can pass a class object between host and device, either by value (e.g., class foo {...}; __global__ bar(foo v) {...} int main() { foo xxx; ...; bar<<<...>>>(xxx); ...}) or by native pointer (__global__ bar(foo * v)). But reference parameters "& var" are not possible.
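To make the by-value case concrete, here is a minimal sketch (class and kernel names are illustrative, not from the post). The member function used on the device carries the __device__ qualifier, while the constructor-free class is bit-copied into the kernel's parameter space:

```cuda
#include <cstdio>
#include <cuda.h>

class Foo {
public:
    int a, b;
    __device__ int sum() const { return a + b; }  // device-callable member
};

__global__ void bar(Foo v, int * out)   // by value: v is a bit-copy of the host object
{
    *out = v.sum();
}

int main()
{
    Foo x;
    x.a = 2; x.b = 3;

    int * dout;
    cudaMalloc((void**)&dout, sizeof(int));
    bar<<<1, 1>>>(x, dout);             // object copied host -> device automatically

    int result = 0;
    cudaMemcpy(&result, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", result);
    cudaFree(dout);
    return 0;
}
```

Note that this only works cleanly for plain-old-data classes; anything holding a host pointer would smuggle an invalid address onto the device.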

However, the bigger problem is representing complex data structures. For example, consider a graph with nodes and edges. You could represent a node as a class containing a list of native pointers to all its edge objects, and edges with two native pointers for the start and end of the edge, but you'd have to copy those objects to device space and patch the native host pointers into device memory addresses. Yuk. Instead, people usually represent pointers as integer offsets into a big block, and allocate the objects out of that big block. You could hide a lot by overloading the ->, *, and [] operators, and you would have to write your own memory space malloc and new operator. I've been hacking at a dlmalloc for this purpose, so I can maybe allocate objects in device code, but mostly just so I can have common code between the host and device to access the complex data structure.

Lots of other issues, of course. Just play around and you'll see.

I can't get std::vector declarations and calls to compile, but I didn't think they would anyway. Functions called from the device have to have the __device__ qualifier, and you still cannot call a __global__ function from a __device__ function. Calls to printf are supposed to work in 3.1 + Fermi card, but I can't seem to get it to work. Kind of disappointing after it was promised, and with Nsight still in beta, unless I'm doing something wrong.
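For what it's worth, a minimal device printf check looks like the sketch below. Two requirements that are easy to miss: the file must include stdio, the build must target sm_20, and the device printf buffer is only flushed at synchronization points, so output silently vanishes without a sync (one possible explanation for the behavior above, though I can't confirm that's the cause here):

```cuda
// Build with: nvcc -arch=sm_20 hello.cu
// On pre-Fermi architectures the printf call will not compile.
#include <cstdio>

__global__ void hello()
{
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();
    // The device printf buffer is flushed at synchronization points;
    // without this call the output may never appear.
    cudaThreadSynchronize();   // cudaDeviceSynchronize() in later toolkits
    return 0;
}
```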

#2
Posted 07/14/2010 02:11 AM   
If you need STL::vector, you should look into the [url="http://thrust.googlecode.com/"]thrust library[/url].
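A minimal sketch of thrust's vector-style containers: host_vector lives in host memory, device_vector in device memory, and assignment between them does the copy for you (the sorted-output comment assumes thrust's default ascending sort):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main()
{
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 2;

    thrust::device_vector<int> d = h;   // host -> device copy
    thrust::sort(d.begin(), d.end());   // sort runs on the GPU

    h = d;                              // device -> host copy
    for (size_t i = 0; i < h.size(); ++i)
        std::cout << h[i] << " ";       // 1 2 3 4
    std::cout << std::endl;
    return 0;
}
```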

#3
Posted 07/14/2010 02:09 PM   