__CUDA_ARCH__ undefined?!

What’s wrong here? I’m using this line to compile:

nvcc -gencode=arch=compute_20,code="sm_20,compute_20" test.cu -o test

and I get “__CUDA_ARCH__ is undefined.”

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void no_op()
{
    printf("%d\n", __CUDA_ARCH__);
}

int main()
{
    no_op<<<1, 1>>>();
    cudaThreadSynchronize();
    return 0;
}

Very strange. I use __CUDA_ARCH__ without any problems in a large project.
But your minimal example fails! (I tested it to make sure.)

Gee, really?!! Maybe I should be updating my toolkit? I’m getting:

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Sun_Mar_20_16:47:09_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221

EDIT: it’s just unreal, because everything else seems to work with this compiler! For example, inline assembly.

It is sometimes defined, but not in device code! If I add these three compiler directives, I get the warning message!

#if __CUDA_ARCH__ == 200
#warning "hello"
#endif

__global__ void no_op()
{
    printf("%d\n", __CUDA_ARCH__);
}

The message “__CUDA_ARCH__ is undefined.” is emitted by cudafe++ (the program that splits host and device code), not by the host compiler. To figure out where kernels and device functions end, it has to parse the device routines completely even when it is extracting the host code. And __CUDA_ARCH__ is indeed undefined in host code.

The solution is simple, even though it may look silly at first: just protect your printf statement with an #ifdef __CUDA_ARCH__. It won’t change the generated device code, but it allows parsing of the host code (where this statement will be dropped at a later stage anyway) to proceed.
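For reference, here is a minimal sketch of the test program from the question with that guard applied. It assumes a toolkit that provides cudaDeviceSynchronize(), which is used here in place of the deprecated cudaThreadSynchronize(); everything else follows the original example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void no_op()
{
#ifdef __CUDA_ARCH__
    // Only the device compilation pass sees this statement; during the
    // host pass __CUDA_ARCH__ is undefined and the line is skipped.
    printf("%d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    no_op<<<1, 1>>>();
    // Wait for the kernel so its printf output is flushed before exit.
    cudaDeviceSynchronize();
    return 0;
}
```

Built with the command line from the question, this should print the value for the architecture the device code was compiled for (200 for compute_20).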

As far as I understand the compilation process, tera’s explanation is right on the money. As an addendum, one reason __CUDA_ARCH__ is undefined in host code is that, for fatbinary compilation targeting multiple device architectures, host code is only compiled once, so it can’t be associated with any particular CUDA architecture.

The recommended way to check for the CUDA architecture in device code is something like this:

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)

In general, CUDA architecture versions follow an onion-layer model: each newer architecture supports the features of the older ones and adds its own. The use of architectural features is therefore usually best guarded by >= comparisons against __CUDA_ARCH__.
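As an illustration of such a >= guard, here is a hedged sketch built around double-precision atomicAdd, which is native on compute capability 6.0 and newer and is commonly emulated with an atomicCAS loop on older devices. The names atomic_add_double and accumulate are hypothetical, chosen only for this example, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: atomically adds val to *address, returns the old value.
__device__ double atomic_add_double(double* address, double val)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 600)
    // Compute capability 6.0 and anything newer: native instruction.
    return atomicAdd(address, val);
#else
    // Older architectures, and the host pass where __CUDA_ARCH__ is
    // undefined: emulate the addition with a compare-and-swap loop.
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed the value
    return __longlong_as_double(old);
#endif
}

__global__ void accumulate(double* sum)
{
    // Every thread adds 1.0 to the same location.
    atomic_add_double(sum, 1.0);
}

int main()
{
    double* d_sum = nullptr;
    double h_sum = 0.0;
    cudaMalloc(&d_sum, sizeof(double));
    cudaMemcpy(d_sum, &h_sum, sizeof(double), cudaMemcpyHostToDevice);
    accumulate<<<1, 256>>>(d_sum);
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    printf("sum = %f\n", h_sum);
    return 0;
}
```

Because the test is >= rather than ==, the native path is also selected automatically when the same source is later compiled for architectures newer than 6.0.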

Yes, tera, njuffa, it works exactly as you say.

I also downloaded the latest version of the driver and toolkit, and it still works only that way. It makes me curious what DrAnderson42 meant by having “tested it,” but in any case, it works now.

Many thanks. :)

Thanks njuffa, RezaRob3. No money involved though. I wonder if Nvidia would offer me a job…

tera, I don’t know you at all personally, but have seen your name pop up here often. I learn from you even when you’re not responding to me directly.

Thank you. :)

EDIT: I hope you get your wish if you love Nvidia that much!

PS: My apology for being OFFTOPIC this once.

Equally off topic, to avoid misunderstandings: “right on the money” is an idiom meaning “exactly right”. I realize it might be best to avoid potentially confusing idioms when writing in a forum with an international audience.

Yes, well in my large code, all of the checks take the form of:

#if (__CUDA_ARCH__ >= 200)

And they are all in __global__ functions (which I thought was the only requirement). Which is why I was confused when running your test.