Max Used Register compile setting affecting kernel launch?

So this is a weird problem:

A kernel in an application launches many 1024-thread blocks, and each thread block uses 32,032 bytes of shared memory.
Usually I do not launch kernels with that many threads per block or with that much shared memory, but this time it seemed necessary.
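
To make those numbers concrete, the shape of the kernel is roughly like this (just a sketch, not the actual code; the shared array name and layout are simplified):

// Sketch only: 32,032 bytes of static shared memory is, for example,
// an 8008-element float array (8008 * 4 bytes = 32,032 bytes),
// combined with 1024 threads per block at launch.
__global__ void accumulate_to_detector_large_number(/* arguments omitted */)
{
	__shared__ float block_scratch[8008];   // hypothetical name; 32,032 bytes
	// ... body omitted ...
}
// launched as: accumulate_to_detector_large_number<<<grid, 1024>>>(...);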

When I set the Max Used Register value to any number other than zero, that kernel seems not to launch at all, but all the other kernels in the application launch successfully.
No error messages or warnings appear at all (I check every device operation for errors in the typical manner), and the timer I set around that kernel reports an elapsed time of 0.0.

When I set the Max Used Register value to 0 and compile again, that kernel does launch and returns what seems to be a reasonable answer. In that case it takes about 2.8 seconds to finish (it is a very large, memory-bound kernel).

Not sure what this is all about, and I probably will re-design the kernel, but curious about what may be going on.

Could this be a resource issue? The compilation output (with max register set to 0) for that specific kernel looks like this:

1>      176 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 62 registers, 32032 bytes smem, 372 bytes cmem[0]

GTX Titan X, Win7 x64, Visual Studio 2012 compiler, CUDA 6.5

I don’t believe it. Either the kernel launch is not failing, or you are not properly checking for kernel launch errors; a failed launch is always reported through the error-checking API. While it’s remotely possible that you have stumbled on a bug in CUDA that violates this principle, I personally doubt it, and would not bet on it.

My guess is you don’t check for kernel launch errors properly.

Have you run your code (in the failing case) with cuda-memcheck? By cuda-memcheck I am referring to the standalone utility, not the “CUDA memory checker” built into Nsight VSE.

This is the call; every CUDA operation is checked for errors.

extern "C" void accumulate_to_detector_large_number_wrap(
			const float *mua_list,
			const float *ppath_data,
			const int2 *photon_detID_and_start_medium,
			float *det_readings,
			const int num_det,
			const int num_to_do,
			const int num_photons,
			const int num_media,
			const int num_mua){

	const int mua_div_8=num_mua/8;
	// grid x covers num_photons/4 items at 1024 per block; grid y is num_mua/8 blocks
	dim3 grid( ( (num_photons/4)+1024-1)/1024 ,mua_div_8,1);
	accumulate_to_detector_large_number<<<grid,1024>>>(mua_list,ppath_data,photon_detID_and_start_medium,det_readings,num_det,num_to_do,num_photons,num_media,num_mua);
	// cudaThreadSynchronize is the deprecated name for cudaDeviceSynchronize
	cudaError_t err= cudaThreadSynchronize();
	if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}
}

This is a MATLAB DLL, so it is slightly more complicated to profile than a common application, but I will run it through cuda-memcheck.

Still, usually when there is a memory problem like that it lets you know, and in any case changing that one compilation flag fixed the behavior without any other modifications.

Your CUDA error checking after the kernel call is incorrect.

google “proper cuda error checking” and take the first hit. Then modify your code to match that error-checking pattern after every kernel call, and see what is reported.
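
For reference, the pattern from that answer looks roughly like this (a sketch of the commonly cited macro, not code taken from this thread):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every CUDA runtime call (and the post-kernel checks) in this macro
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}

// After a kernel launch, check for launch errors first, then for execution errors:
//   kernel<<<grid, block>>>(...);
//   gpuErrchk( cudaPeekAtLastError() );
//   gpuErrchk( cudaDeviceSynchronize() );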

OK, I changed everything to your exact reference error-checking implementation for every cudaMalloc, cudaMemcpy, cudaMemset, kernel invocation, cudaFree…, recompiled, and no errors were reported.

I was already checking not only after kernel calls, but also looking at the returned cudaError_t value after every other operation.

extern "C" void accumulate_to_detector_large_number_wrap(
			const float *mua_list,
			const float *ppath_data,
			const int2 *photon_detID_and_start_medium,
			float *det_readings,
			const int num_det,
			const int num_to_do,
			const int num_photons,
			const int num_media,
			const int num_mua){

	const int mua_div_8=num_mua/8;
	dim3 grid( ( (num_photons/4)+1024-1)/1024 ,mua_div_8,1);
	accumulate_to_detector_large_number<<<grid,1024>>>(mua_list,ppath_data,photon_detID_and_start_medium,det_readings,num_det,num_to_do,num_photons,num_media,num_mua);
	gpuErrchk( cudaPeekAtLastError() );   // catches launch/configuration errors immediately
	gpuErrchk( cudaDeviceSynchronize() ); // catches errors that occur during kernel execution
}

Thanks for the reference, and I am sure I made a mistake somewhere, but that was not it.

The problem was not a mistake on my part, but rather the resource issue I suspected, which cuda-memcheck found:

========= Program hit cudaErrorLaunchOutOfResources (error 7) due to "too many resources requested for launch" on CUDA API call to cudaLaunch.

And when I set Max Used Register to 0, this is the result of cuda-memcheck:

========= CUDA-MEMCHECK
Capable!

Using single GPU GeForce GTX TITAN X

========= ERROR SUMMARY: 0 errors

Man, those cuda-memcheck runs take some time…

The first run was compiled with Max Used Register=64, which caused the launch to be aborted.

The second run was the exact same code compiled with Max Used Register=0, and there were no problems.

When you set the register limit to zero, you’re getting 62 registers per thread. For your 1024-thread blocks that is 62 * 1024 = 63,488 registers per block, which fits within the aggregate register limit on your device. When you set the register limit to 64, in all probability the compiler is using 64 registers per thread, and 64 * 1024 registers per block does not fit. As a test, set it to something like 20, 30, or 40, and it will probably work.
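
If you want to see the fit (or non-fit) at run time rather than inferring it from the ptxas output, you can query the kernel’s attributes and compare them against the device’s per-block limits. A rough sketch (the kernel here is a placeholder, not your actual kernel):

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real one
__global__ void my_kernel() { }

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	cudaFuncAttributes attr;
	cudaFuncGetAttributes(&attr, my_kernel);

	const int threadsPerBlock = 1024;

	// registers and static shared memory the kernel needs per block vs. the device limits
	printf("regs/thread: %d -> regs/block: %d (limit %d)\n",
	       attr.numRegs, attr.numRegs * threadsPerBlock, prop.regsPerBlock);
	printf("static smem/block: %d bytes (limit %d)\n",
	       (int)attr.sharedSizeBytes, (int)prop.sharedMemPerBlock);

	// maxThreadsPerBlock already folds the register constraint in:
	// if it comes back smaller than 1024, a 1024-thread launch of this kernel
	// will fail with "too many resources requested for launch"
	printf("max threads/block for this kernel: %d\n", attr.maxThreadsPerBlock);
	return 0;
}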

The fact that cuda-memcheck finds the error while your error checking code does not means that your error checking code is still defective. This type of launch error (“too many resources requested for launch”) will only be caught by this type of check immediately after the kernel launch:

gpuErrchk( cudaPeekAtLastError() );

which you didn’t have in your original report. Somewhere you have a kernel without that, or else your error checking code is defective in some other way.

Note that the -maxrregcount switch (which is what the Max Used Register project setting controls) affects all kernels compiled with that switch, not just this one.
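
If the goal is to cap registers for just this one kernel rather than for everything built with that switch, __launch_bounds__ is the per-kernel alternative; a rough sketch (parameter list abbreviated):

// Telling ptxas that this kernel will be launched with 1024-thread blocks lets it
// limit register use for this kernel only, instead of capping every kernel in the
// compilation unit the way -maxrregcount does.
__global__ void
__launch_bounds__(1024 /* maxThreadsPerBlock */)
accumulate_to_detector_large_number(const float *mua_list /* , ...other arguments... */)
{
	// kernel body unchanged
}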

I appreciate your help.

There was not an out-of-bounds memory error on my part, and my suspicion that there was a resource problem was correct.

As far as error checking goes, I have changed everything to that approach, but it does not catch that error.

Proper error checking will catch the resource problem too (even if you run without cuda-memcheck). Something is still deficient in your code in this respect.