Problem Launching Simple Kernel

I’m getting back into CUDA, and I’ve made a simple testbed for a summing kernel (it just sums two buffers into a third). The kernel seems to be returning an “invalid argument” error, even though the arguments, on visual inspection, appear to match up. From what I understand, the grid and block sizes I’ve passed are valid. After 1-2 hours of troubleshooting I’ve decided to just ask for help, as it’s probably something fundamental that I’m overlooking.
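For reference, the kernel itself is about as simple as it gets; a minimal sketch of that kind of element-wise sum (the real one is in summer.cu, so the exact indexing may differ) looks like this:

// Illustrative only: sum two buffers into a third, one element per thread.
__global__ void Summer(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}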

You can pull all my code from here:

compile with
$ make clean program

and launch with
$ ./program

You should see something like this:

Device Number: 0
Device name: GeForce GTX 780
Compute Capability: 3.5
asyncEngineCount: 1
Total Global Memory (kB): 6442254
Max Device Memory Pitch (kB): 2147483
Max Grid Size (2147483647, 65535, 65535)
Max Block Size (1024, 1024, 64)

Device Number: 1
Device name: GeForce GTX 780
Compute Capability: 3.5
asyncEngineCount: 1
Total Global Memory (kB): 6438977
Max Device Memory Pitch (kB): 2147483
Max Grid Size (2147483647, 65535, 65535)
Max Block Size (1024, 1024, 64)

Total Input Size: 20.000 (MB), GPU Size: 20.000 (MB),     Compute Chunk: 10.000 (MB), Total Array Size: 5000000, GPU Array Size: 5000000
Generating random number sets...

Number sets complete.

N Chunks: 2, Chunk Buffer Size: 10000000 (B)
Error (122): invalid argument
Error (122): invalid argument
100.00% complete
Testing 20 random entries for correctness...
Entry 1710655 -> 0.1365 + 0.8665 = 0.0000 ? 1.0030
Entry 2741888 -> 0.9434 + 0.0271 = 0.0000 ? 0.9705
Entry 4635666 -> 0.5155 + 0.5276 = 0.0000 ? 1.0431
Entry 1542590 -> 0.9620 + 0.3312 = 0.0000 ? 1.2932
Entry 1334456 -> 0.2502 + 0.6925 = 0.0000 ? 0.9428
Entry 2268829 -> 0.9961 + 0.7717 = 0.0000 ? 1.7678
Entry 3117315 -> 0.8368 + 0.9625 = 0.0000 ? 1.7992
Entry 2122969 -> 0.1726 + 0.2433 = 0.0000 ? 0.4159
Entry 4495006 -> 0.1719 + 0.6689 = 0.0000 ? 0.8408
Entry 201033 -> 0.5413 + 0.3923 = -202167757261742536184553579808544522240.0000 ? 0.9336
Entry 4862600 -> 0.8823 + 0.2407 = 0.0000 ? 1.1230
Entry 1214772 -> 0.7891 + 0.8005 = 0.0000 ? 1.5896
Entry 3072111 -> 0.5847 + 0.6357 = 0.0000 ? 1.2204
Entry 3261796 -> 0.6412 + 0.0355 = 0.0000 ? 0.6766
Entry 2621388 -> 0.8425 + 0.7116 = -nan ? 1.5541
Entry 1716700 -> 0.3450 + 0.3062 = 0.0000 ? 0.6513
Entry 4228020 -> 0.8023 + 0.3822 = 0.0000 ? 1.1845
Entry 3159745 -> 0.3746 + 0.1364 = 0.0000 ? 0.5110
Entry 4424746 -> 0.0623 + 0.6677 = 0.0000 ? 0.7299
Entry 714638 -> 0.4983 + 0.0809 = 0.0000 ? 0.5793

To my understanding, this should work no problem. My kernel is contained in a separate file, but I seem to have linked everything properly (C++ wrapper for the kernel, etc.), and everything compiles without warning or error. Here is the compiler output you should see:

rm -f program lib/main.o lib/summer.o lib/program.so
rm -f -rf lib
g++ -Wall -ansi -pedantic -fPIC -std=c++11 -c main.cc -o lib/main.o -pthread -std=c++11
nvcc -c summer.cu -o lib/summer.o
g++ -Wall -ansi -pedantic -fPIC -std=c++11 -o program lib/main.o lib/summer.o -L /usr/local/cuda/lib64 -lcudart -pthread -std=c++11
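Roughly, the split is a declaration that main.cc can see plus the kernel and its C++ wrapper in summer.cu; simplified (the exact signatures live in the repo, so treat this as the shape rather than the real code), it looks like:

// summer.h -- declaration visible to main.cc; dim3 comes from <cuda_runtime.h>
void runSummer(dim3 grid, dim3 blocks,
               const float *input_one_d, const float *input_two_d,
               float *output_d);

// summer.cu -- compiled by nvcc; defines the kernel and the wrapper
__global__ void Summer(const float *a, const float *b, float *c);

void runSummer(dim3 grid, dim3 blocks,
               const float *input_one_d, const float *input_two_d,
               float *output_d)
{
    Summer<<<grid, blocks>>>(input_one_d, input_two_d, output_d);
}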

I appreciate any help. Thanks.

Assuming I am looking at the correct code, I see cudaMalloc() calls whose return status is not being checked. As a first step, I would suggest adding return status checks to all CUDA API calls and to all kernel launches. You can also try running your app under control of cuda-memcheck, recent versions of which include an API checker, from what I gather.
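For example, a small wrapper macro along these lines (CHECK_CUDA is just an illustrative name, not something in the posted code) catches a failing call at the call site, e.g. CHECK_CUDA(cudaMalloc(&input_one_d, nbytes)):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call and report the failing file/line.
#define CHECK_CUDA(call)                                              \
do {                                                                  \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error %s:%d: %s\n",                     \
                __FILE__, __LINE__, cudaGetErrorString(err));         \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)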

@steveStevens, the blocks in your grid are just 1 thread each:

main.cc:line 126 --> dim3 blocks(1);

The ‘nchunk’ and ‘nchunks’ variables are also suspicious.

I’m basically porting some fully functional OpenCL code.

I understand that 1 thread/block is a waste of resources, but I was under the impression that there was no minimum block size. Is that incorrect? If there is no minimum, it shouldn’t be a problem.

Relating to OpenCL, it seems to me that the “grid” in CUDA is the “number of work groups” in OpenCL and that “block” is the “work group dimension / local size” in OpenCL.
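In other words, a typical OpenCL enqueue would translate to something like this on the CUDA side (the numbers and the bounds-check assumption are just illustrative, not what my code currently does):

// OpenCL: global size = N work items, local size = 256.
// CUDA: the grid counts blocks (work-groups), not total threads,
// so the kernel is assumed to guard against indices >= N.
dim3 blocks(256);                          // work-group / local size
dim3 grid((N + blocks.x - 1) / blocks.x);  // number of work-groups
Summer<<<grid, blocks>>>(input_one_d, input_two_d, output_d);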

I don’t see a problem w/ nchunk/nchunks. The “execution blocks” scheme used here works in an OpenCL implementation.

I definitely understand that this program is crude, not optimal, etc.; that is actually the point. It’s going to be part of some tutorials where I slowly morph it into something “decent”.

Done. They execute without error, but it’s the same result (invalid args for any kernel calls).

So the “invalid argument” errors are gone now? If they are still there, and they are not delayed reports from API calls prior to the kernel launch [since you now check every API call prior to the kernel launch], they must be from the kernel launch itself. Check for bad grid/block configurations, host pointers passed as device pointers or vice versa, out-of-resources conditions (e.g. registers, shared memory).

Are the kernel launch error checks performed in the correct order? You need to check for pre-launch and post-launch errors separately, e.g. by calling this macro after each kernel launch.

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err));        \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
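Usage is just a call to the macro immediately after the launch, e.g. for the launch in summer.cu:

Summer<<<grid, blocks>>>(input_one_d, input_two_d, output_d);
CHECK_LAUNCH_ERROR();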

Sorry that was not clear, the invalid argument errors are still there. I’ll look at it with a fresh head tomorrow.

I’ve added the macro to the C++ wrapper file that launches the kernel (see summer.cu on the GitHub).
The output is essentially the same, but now it just terminates instead of continuing:

Device Number: 0
Device name: GeForce GTX 780
Compute Capability: 3.5
asyncEngineCount: 1
Total Global Memory (kB): 6442254
Max Device Memory Pitch (kB): 2147483
Max Grid Size (2147483647, 65535, 65535)
Max Block Size (1024, 1024, 64)

Device Number: 1
Device name: GeForce GTX 780
Compute Capability: 3.5
asyncEngineCount: 1
Total Global Memory (kB): 6438977
Max Device Memory Pitch (kB): 2147483
Max Grid Size (2147483647, 65535, 65535)
Max Block Size (1024, 1024, 64)

Total Input Size: 100.000 (MB), GPU Size: 100.000 (MB),     Compute Chunk: 10.000 (MB), Total Array Size: 25000000, GPU Array Size: 25000000
Generating random number sets...

Number sets complete.

N Chunks: 10, Chunk Buffer Size: 10000000 (B)
Cuda error in file 'summer.cu' in line 65 : invalid argument.

Line 65 of the code is the actual kernel launch:

Summer<<<grid, blocks>>>(input_one_d, input_two_d, output_d);

which doesn’t make any sense, as it matches the definition of the kernel (see right above it). Does this maybe have something to do with my grid/block size? Here is their definition:

dim3 grid(n_chunk);
dim3 blocks(1);

where n_chunk is something like 5,000,000, give or take an order of magnitude. As far as I understand the spec, this should be OK (if wasteful, given the block size of 1).

Check whether the launch configuration is violating any of the maximum settings in force for your target architecture. These are documented in an appendix of the programming guide.

n_chunk of 5,000,000 will only work for

dim3 grid(n_chunk)

if you have a cc3.0+ device and are compiling for a cc3.0+ architecture.

Yep, pretty sure all of the above is satisfied, see the relevant system info from my program:

Compute Capability: 3.5
Max Grid Size (2147483647, 65535, 65535)

For sure I don’t have a grid size of 2 billion; I wouldn’t be able to fit 3 × 2 billion = 6 billion floats (24 GB) in my DRAM, let alone on a single video card.

EDIT: the “compiling for architecture” comment is interesting; sorry I glossed over it initially. How can I tell which architecture I’m compiling for? Is that some sort of argument for nvcc?

Recent versions of nvcc default to building code for sm_20. Use the -arch flag to specify the architecture you are building for; in your case, -arch=sm_30.
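That is, the nvcc line in your makefile would become something like:

nvcc -arch=sm_30 -c summer.cu -o lib/summer.o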

That should fix your error. I pulled your program code, reduced the sizes to make it work on sm_21, added the missing header includes, and it seems to work fine:

Device Number: 0
Device name: Quadro 2000
Compute Capability: 2.1
asyncEngineCount: 1
Total Global Memory (kB): 1073741
Max Device Memory Pitch (kB): 2147483
Max Grid Size (65535, 65535, 65535)
Max Block Size (1024, 1024, 64)

Total Input Size: 1.000 (MB), GPU Size: 1.000 (MB)
Generating random number sets...

Number sets complete.

N Chunks: 4, Chunk Buffer Size: 250000 (B)
runSummer::grid=62500 x 1 x 1
runSummer::block=1 x 1 x 1
runSummer::grid=62500 x 1 x 1
runSummer::block=1 x 1 x 1
runSummer::grid=62500 x 1 x 1
runSummer::block=1 x 1 x 1
runSummer::grid=62500 x 1 x 1
runSummer::block=1 x 1 x 1
100.00% complete
Testing 20 random entries for correctness...
Entry 192054 -> 0.9278 + 0.2517 = 1.1795 ? 1.1795
Entry 119128 -> 0.3781 + 0.5501 = 0.9281 ? 0.9281
Entry 166073 -> 0.4864 + 0.4082 = 0.8946 ? 0.8946
Entry 175510 -> 0.7432 + 0.0036 = 0.7468 ? 0.7468
Entry 122007 -> 0.9454 + 0.3346 = 1.2799 ? 1.2799
Entry 203302 -> 0.5624 + 0.9980 = 1.5604 ? 1.5604
Entry 183613 -> 0.4170 + 0.7181 = 1.1351 ? 1.1351
Entry 121812 -> 0.0588 + 0.5763 = 0.6351 ? 0.6351
Entry 230593 -> 0.4287 + 0.2361 = 0.6648 ? 0.6648
Entry 26890 -> 0.0997 + 0.3194 = 0.4191 ? 0.4191
Entry 15176 -> 0.7963 + 0.6500 = 1.4463 ? 1.4463
Entry 140588 -> 0.1486 + 0.8387 = 0.9873 ? 0.9873
Entry 178940 -> 0.3894 + 0.9112 = 1.3006 ? 1.3006
Entry 227056 -> 0.1626 + 0.6090 = 0.7716 ? 0.7716
Entry 102187 -> 0.1270 + 0.7193 = 0.8463 ? 0.8463
Entry 214186 -> 0.0211 + 0.9344 = 0.9555 ? 0.9555
Entry 32708 -> 0.2255 + 0.3087 = 0.5343 ? 0.5343
Entry 180679 -> 0.4879 + 0.3965 = 0.8843 ? 0.8843
Entry 27957 -> 0.8918 + 0.1961 = 1.0879 ? 1.0879
Entry 93559 -> 0.4254 + 0.2029 = 0.6283 ? 0.6283

Yes! Amazing, thank you. This is something I was completely unaware of…

To conclude: my grid size was unsupported by the compute architecture I was compiling for (2.0). I needed to specify compilation for the compute architecture of my card. Perhaps I will change the makefile to run a script first to figure out the compute capability of the target devices, or just add a note telling people to modify it for their own purposes.
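If I go the script route, a tiny helper along these lines (just a sketch, not in the repo; the file name detect_arch.cu is made up) could print the flag for the makefile:

// detect_arch.cu -- print an -arch flag for device 0 (illustrative only)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("-arch=sm_%d%d\n", prop.major, prop.minor);
    return 0;
}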

I would suggest building a fat binary, which is standard practice for code that needs to run on GPUs of various architectures. A fat binary embeds machine code (SASS) for every architecture you plan to support, along with PTX for the latest shipping architecture, which can be JIT-compiled on future GPU architectures. Look at the description of nvcc’s -gencode switch for how to do that.
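For this makefile, a sketch of what such a line could look like (the targets here are just an example; add SASS for each architecture you want to support and PTX for the newest one):

nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     -c summer.cu -o lib/summer.o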

In general, when code targeted at older architectures is JIT-compiled on newer ones, most restrictions of the original target architecture remain in force. In other words, the code behaves much as if it were running on a GPU of the originally specified target architecture. In this case the code was compiled for sm_20 (by default), so even on your cc 3.5 devices the sm_20 restriction on grid size applied.