nvrtc execution problems

Hey folks,

I've started implementing some CUDA processing with the runtime compilation library (NVRTC) for live compiling. My workload is image processing where large parts of the kernel depend on user input, so I figured a short compilation pass before processing a batch of images would be the best approach.
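
For reference, the flow in my demo follows the standard NVRTC pattern from the docs. Here is a condensed sketch (not the exact pasted code; error checking is omitted and the kernel name is a placeholder):

#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// Condensed sketch: compile CUDA source at runtime and load the result.
void compileAndLoad(const char *kernelSource, CUmodule *module, CUfunction *kernel)
{
	nvrtcProgram prog;
	nvrtcCreateProgram(&prog, kernelSource, "kernel.cu", 0, NULL, NULL);

	const char *opts[] = { "--gpu-architecture=compute_20" };
	nvrtcCompileProgram(prog, 1, opts);

	// Always dump the log -- it is the first place to look when compilation fails.
	size_t logSize;
	nvrtcGetProgramLogSize(prog, &logSize);
	std::vector<char> log(logSize);
	nvrtcGetProgramLog(prog, log.data());
	printf("nvrtcProgramLog: %s\n", log.data());

	// Fetch the generated PTX and hand it to the driver API.
	size_t ptxSize;
	nvrtcGetPTXSize(prog, &ptxSize);
	std::vector<char> ptx(ptxSize);
	nvrtcGetPTX(prog, ptx.data());
	nvrtcDestroyProgram(&prog);

	cuModuleLoadDataEx(module, ptx.data(), 0, NULL, NULL);
	cuModuleGetFunction(kernel, *module, "processImage"); // placeholder kernel name
}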

But now I have a problem executing the code as cuLaunchKernel results in a CUDA_ERROR_INVALID_VALUE.

I have packed my experiments into a sample single file demo code pasted to https://bpaste.net/show/a90b0436770d
It compiles on my OS X Yosemite 10.10.5 system (CUDA 7.5, Apple clang/LLVM 7.0.2) with

clang++ nvrtc_test_single.cpp -o cudartctest-single -I $CUDA_PATH/include -L $CUDA_PATH/lib -lnvrtc -lcuda -lcudart -F/Library/Frameworks -framework CUDA -Wl,-rpath,$CUDA_PATH/lib

Result output is

Using CUDA device [0]: GeForce GTX 750
CUDA init - time: 87.081001 ms
Fileinfo:   Width=1920, Height=1080
CUDA
    CUDA - Memory-Prep - time: 9.111000 ms
    nvrtcProgramLog: 
    CUDA - Kernel RTC - time: 872.619019 ms
  Grid dimensions: 15 x 1080

error: cuLaunchKernel( kernel, CUDA_X_DIM, 1, 1, grid_dim_x, rgb->height, 1, 0, NULL, args, NULL) failed with error CUDA_ERROR_INVALID_VALUE

So I'd expect the problem to be in line 192 or 198, but I don't really see what the problem is, since the code should match the sample from http://docs.nvidia.com/cuda/nvrtc/index.html with only slight modifications.

(The saxpy sample from the NVRTC docs works fine as is, with the same compiler options.)

Does anyone have an idea what the problem is? I've searched for over a day but can't find helpful information on how to debug this.

First, I would suggest that you change this:

--gpu-architecture=compute_20

to this:

--gpu-architecture=compute_50

This is recommended since your GPU is a cc5.0 GPU (confirm cc5.0 with deviceQuery before making this change!), and useful if you are trying to take advantage of a grid X dimension which exceeds the cc2.0 limit of 65535 (although you’re not, in this case).

The issue is that your cuLaunchKernel arguments are not in the correct order:

This is what I see:

unsigned int grid_dim_x = (rgb->width+CUDA_X_DIM-1) / CUDA_X_DIM;
printf("  Grid dimensions: %d x %d\n", grid_dim_x, rgb->height);
CUDA_SAFE_CALL( cuLaunchKernel( kernel,
                                CUDA_X_DIM, 1, 1,           // grid dim
                                grid_dim_x, rgb->height, 1, // block dim
                                0, NULL,                    // shared mem and stream
                                args,                       // arguments
                                NULL) );

It seems fairly clear to me that you intend CUDA_X_DIM to be the x block dimension; then the calculation for grid_dim_x would be sensible. However, in the cuLaunchKernel call, the grid dimensions come first (you even have a comment to this effect in your code, but seem to be ignoring it). Since your code actually prints out the grid dimensions as grid_dim_x and rgb->height, I'll assume that is your intent. In that case, your cuLaunchKernel call should be:

CUDA_SAFE_CALL( cuLaunchKernel( kernel,
                                grid_dim_x, rgb->height, 1, // grid dim
                                CUDA_X_DIM, 1, 1,           // block dim
                                0, NULL,                    // shared mem and stream
                                args,                       // arguments
                                NULL) );

Attempting to launch with block dimensions of 15x1080 violates two block dimension limits (the maximum block y-dimension of 1024, and the maximum of 1024 total threads per block), which is why you were getting the CUDA_ERROR_INVALID_VALUE error.
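
If you want to sanity-check launch dimensions at runtime rather than rely on the error code, you can query the relevant limits through the driver API. A minimal sketch (device selection and error handling simplified):

#include <cuda.h>
#include <cstdio>

int main()
{
	CUdevice dev;
	cuInit(0);
	cuDeviceGet(&dev, 0);

	int maxBlockY, maxThreadsPerBlock;
	cuDeviceGetAttribute(&maxBlockY, CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y, dev);
	cuDeviceGetAttribute(&maxThreadsPerBlock, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);

	// A 15x1080 block fails both checks on this device: 1080 > 1024 (max block y-dim)
	// and 15*1080 = 16200 > 1024 (max threads per block).
	printf("max block y-dim: %d, max threads per block: %d\n", maxBlockY, maxThreadsPerBlock);
	return 0;
}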

The documentation for cuLaunchKernel is here:

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15

With the above change, your code runs without any runtime errors for me.

Note that the gpu architecture compile change is not necessary for this fix. It’s just a suggestion.

I'll have to look at the guide to see what the differences are. Compiling for cc5.0 would exclude many devices in later use; my main machine, for example, has a GTX 780, which isn't cc5.0 compatible. But from the compute capability matrix I could at least go to compute_30 for my intended use. Or setting it dynamically based on the current hardware may be the best bet.
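
Roughly what I have in mind for the dynamic option (an untested sketch; buffer handling simplified):

#include <cuda.h>
#include <cstdio>

// Sketch: build the --gpu-architecture option string from the compute
// capability of the current device, queried via the driver API.
void buildArchOption(CUdevice dev, char *buf, size_t bufSize)
{
	int major, minor;
	cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
	cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
	snprintf(buf, bufSize, "--gpu-architecture=compute_%d%d", major, minor);
	// e.g. "--gpu-architecture=compute_50" on the GTX 750,
	//      "--gpu-architecture=compute_35" on the GTX 780
}

The resulting string could then go straight into the options array passed to nvrtcCompileProgram.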

Didn't see the wood for the trees here. Thanks for your help, that was it. I had looked at the docs multiple times, but as sometimes happens, I missed the point.

Thank you very much. Helped a lot! As expected, it works with both compute architectures now.

I think that is a great idea.