cudaGetDevice does not work on device

I am trying to get cudaGetDevice to work on the device. Here is my program:


#include <stdio.h>

__global__ void hello (void)
{
    int adevice;
    cudaError_t code = cudaGetDevice (&adevice);
    if (code == cudaSuccess)
        printf ("device: cudaGetDevice succeeded: code %08x\n", code);
    else
        printf ("device: cudaGetDevice failed: code %08x\n", code);

    printf ("Hello from device %d\n", adevice);

    // const char *astring = cudaGetErrorString (code);
    // printf ("device: cudaGetDevice returned with code \"%s\"\n", astring);
}

int main (void)
{
    hello<<<1,1>>> ();
    cudaError_t code = cudaPeekAtLastError ();

    printf ("host: kernel returned with code \"%s\"\n",
            cudaGetErrorString (code));

    code = cudaDeviceSynchronize ();
    printf ("host: cudaDeviceSynchronize returned with code \"%s\"\n",
            cudaGetErrorString (code));

    printf ("Hello from host\n");
    return 0;
}


Here is the output:

host: kernel returned with code "no error"
device: cudaGetDevice failed: code 0000001e
Hello from device 0
host: cudaDeviceSynchronize returned with code "no error"
Hello from host


Note that cudaGetDevice did not return cudaSuccess (00000000)
but rather returned a code of 1e, which is not a valid cudaError_t
(according to the cuda.h file). Despite this, the kernel returns
properly, and the hello routine thinks that it is on device number 0
(irrespective of whether that is actually the case).

If I uncomment the last two lines of the hello() kernel, the output is


host: kernel returned with code "unknown error"
host: cudaDeviceSynchronize returned with code "unknown error"
Hello from host


In other words, cudaGetErrorString (code) is failing, and terminating
the kernel, because code has an invalid value (i.e. 0000001e).
Why is cudaGetDevice failing and returning an invalid error code?

To give a reasonably confident answer here, it would be important to know your compile command line as well as what GPU you are running it on.

Here is the compile sequence


echo "CUDADIR = $CUDADIR"
alias nvcc="$CUDADIR/bin/nvcc"
nvcc what_gives2.cu -arch=sm_61 --device-c -o what_gives2.o
nvcc -o what_gives2 -std=c++11 what_gives2.o -L/usr/local/cuda-8.0/lib64 -lcudart


Here is a printout of some of the fields of cudaDeviceProp:
Number of devices found: 2


   Device 0

name = "GeForce GTX 1080 Ti"
major = 6
minor = 1


   Device 1

name = "GeForce GTX 1080 Ti"
major = 6
minor = 1

Here is the output of the compile:

CUDADIR = /usr/local/cuda-8.0
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvlink warning : SM Arch ('sm_20') not found in 'what_gives2.o'


I should also say that the device-side call to cudaGetDevice supplies a value of
device=0, even when I precede the kernel launch with calls on the host:

cudaSetDevice (1);
cudaGetDevice (&device);

which give a value of 1 for device (as expected). In other words, the host says the
device is 1, while the device says device=0 (and the device-side call to cudaGetDevice is
still returning the spurious error code, as discussed).
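For reference, a minimal host-only sketch of that sequence (assuming two visible devices, as on this machine; the variable name is illustrative):

#include <stdio.h>

int main (void)
{
    int device = -1;

    // Select device 1 on the host, then read the selection back.
    cudaError_t code = cudaSetDevice (1);
    printf ("host: cudaSetDevice: %s\n", cudaGetErrorString (code));

    code = cudaGetDevice (&device);
    printf ("host: cudaGetDevice: %s, device = %d\n",
            cudaGetErrorString (code), device);

    return 0;
}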

Thanks for your attention.

So, first of all, using the device runtime (the CUDA runtime API used in device code) requires appropriate compile and link steps.

In particular, it requires relocatable device code as well as linking against the device runtime:

-rdc=true -lcudadevrt

Any of the CUDA sample codes that use the device runtime demonstrate the proper steps in their project settings. It's generally good advice to always build projects targeting the device you intend to use; using an sm_20 target (the default target for CUDA 8) on an sm_61 device isn't really a great idea.
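For example, a single-step compile and link reflecting both points might look like the following (a sketch based on the file name and architecture used in this thread, not a tested command line):

nvcc -arch=sm_61 -rdc=true what_gives2.cu -o what_gives2 -lcudadevrt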

In fact, with your current compile settings, if your kernel code contains more than a single CUDA device runtime call, I think you'll get an error.

You might want to try those changes and see if you get different results.

I see I was getting the warnings about sm_20 because I missed the -arch=sm_61 argument on my link
step. However, it makes no difference. I include the flags that you suggest and, for simplicity, do the
compile and link in one step with the command:

nvcc what_gives2.cu -arch=sm_61 -rdc=true -o what_gives2 [-lcudadevrt ]

(I use or omit the flag -lcudadevrt; it makes no difference, and I presume it is ignored.)
In either case it compiles and links without warning or error. However,
the results are unchanged from before: cudaGetDevice does not work on the device.

Since this is about the simplest example one could think of for calling cudaGetDevice, I
am coming to the conclusion that it is buggy and simply does not work on the device, despite
the documentation, at least using CUDA 8.0. Does anyone have an example where it actually works?

Thanks txbob.

I have worked out that it must be some sort of curious bug, but I have found a fix.

Here is some code:


#include <stdio.h>

__global__ void hello (void)
{
    cudaError_t code;

    // Should be unnecessary, but without this, the following
    // call to cudaGetDevice fails with an unknown error.
    int devcount;
    code = cudaGetDeviceCount (&devcount);

    // Now, get the device
    int adevice;
    code = cudaGetDevice (&adevice);
    if (code == cudaSuccess)
        printf ("device: cudaGetDevice succeeded: code %08x, adevice = %d\n", code, adevice);
    else
        printf ("device: cudaGetDevice failed: code %08x\n", code);

    // Identify the device
    printf ("Hello from device %d\n", adevice);
}

int main (void)
{
    cudaSetDevice (0);
    hello<<<1,1>>> ();

    cudaError_t code;
    code = cudaPeekAtLastError ();
    printf ("host: kernel returned with code \"%s\"\n",
            cudaGetErrorString (code));

    cudaSetDevice (1);
    hello<<<1,1>>> ();

    code = cudaPeekAtLastError ();
    printf ("host: kernel returned with code \"%s\"\n",
            cudaGetErrorString (code));

    code = cudaDeviceSynchronize ();
    printf ("Hello from host\n");
    return 0;
}


Here are the compile commands:

which nvcc
nvcc what_gives3.cu -arch=sm_61 -rdc=true -o what_gives3


Here is the output of the which nvcc command:

/usr/local/cuda-8.0/bin/nvcc


Here is the output from running the program:

host: kernel returned with code "no error"
device: cudaGetDevice succeeded: code 00000000, adevice = 0
Hello from device 0
host: kernel returned with code "no error"
device: cudaGetDevice succeeded: code 00000000, adevice = 1
Hello from device 1
Hello from host

This is what is expected.


If I comment out the lines:

int devcount;
code = cudaGetDeviceCount (&devcount);

here is what the output is:

host: kernel returned with code "no error"
device: cudaGetDevice failed: code 0000001e
Hello from device 0
host: kernel returned with code "no error"
device: cudaGetDevice failed: code 0000001e
Hello from device 0
Hello from host

Note that cudaGetDevice returns with an impossible error
(0000001e is not a known error).

Other device runtime calls, such as cudaDeviceGetAttribute, also
seem to work in place of cudaGetDeviceCount in this context
to prevent cudaGetDevice from failing. The other device runtime
commands that I have tried seem to work; it is just cudaGetDevice that fails.
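For example, here is a minimal sketch of such a variant, using cudaDeviceGetAttribute as the preceding call (the attribute queried, cudaDevAttrWarpSize, is an arbitrary choice, and the rest is a stripped-down version of what_gives3.cu; compile with the same -arch=sm_61 -rdc=true flags):

#include <stdio.h>

__global__ void hello (void)
{
    // Any preceding device runtime call appears to suffice; here
    // cudaDeviceGetAttribute stands in for cudaGetDeviceCount.
    int warpsize;
    cudaDeviceGetAttribute (&warpsize, cudaDevAttrWarpSize, 0);

    // With a preceding call in place, cudaGetDevice is expected to
    // succeed, per the observations above.
    int adevice;
    cudaError_t code = cudaGetDevice (&adevice);
    printf ("device: cudaGetDevice: code %08x, adevice = %d\n", code, adevice);
}

int main (void)
{
    hello<<<1,1>>> ();
    cudaDeviceSynchronize ();
    return 0;
}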

Yes, there may be some strange bug if you use a single CUDA device runtime API call in your kernel code. If you use two or more calls, and the proper device compilation settings, then according to my testing things seem to work correctly. I filed a bug internally some time ago regarding the issue of using only a single device runtime API call in kernel code; you're welcome to file your own bug at http://developer.nvidia.com if you wish.