cutilSafeCall() Runtime API error: Invalid Argument

I’ve been pulling my hair out over the cause of this error in a GMRES CUDA code:

float *reduce_d, *reduce;
cutilSafeCall( cudaHostAlloc( (void**)&reduce, 512 * sizeof(float), cudaHostAllocMapped ) );
for( int i = 0; i < 512; ++i ) reduce[i] = 0.0f;
cutilSafeCall( cudaHostGetDevicePointer( (void**)&reduce_d, (void*)reduce, 0 ) );

float *r0_d; // r0 = b - Ax
cutilSafeCall( cudaMalloc( (void**)&r0_d, vecSize * sizeof(float) ) );
cutilSafeCall( cudaMemcpyAsync( r0_d + vecSize - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );

float *w_d;
cutilSafeCall( cudaMalloc( (void**)&w_d, vecSize * sizeof(float) ) );
cutilSafeCall( cudaMemcpyAsync( w_d + vecSize - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );

float *v_d;
cutilSafeCall( cudaMalloc( (void**)&v_d, ( m + 1 ) * vecSize * sizeof(float) ) );
for( int i = 0; i < m + 1; ++i ){
	cutilSafeCall( cudaMemcpyAsync( v_d + vecSize*i - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );
}

The error arises from the for loop in the last lines of the above code… Could anyone please tell me where I have gone wrong?

It would error if the destination pointer lands outside its allocation, i.e. if:

vecSize*i - 2048 < 0

Or if the source pointer does, i.e. if:

nzSize - 2048 < 0

Also, maybe try to simplify the problem by removing the cudaMemcpyAsync and using a cudaMemcpy, as you can then guarantee it stays in sync with the host.
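
For example, a sketch of one of the copies made synchronous, with an explicit error check (purely illustrative, reusing the variables from your post):

	// Synchronous copy: the call returns only after the transfer completes,
	// so an "invalid argument" error is reported at this exact line.
	cudaError_t err = cudaMemcpy( r0_d + vecSize - 2048,
	                              val_d + nzSize - 2048,
	                              2048 * sizeof(float),
	                              cudaMemcpyDeviceToDevice );
	if( err != cudaSuccess )
		printf( "cudaMemcpy failed: %s\n", cudaGetErrorString( err ) );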

Actually, it will error if:

vecSize*i - 2048 < 0

This means that for i = 0, you will get an error.

If v_d is a properly allocated device pointer, v_d - 2048 almost certainly is not.
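
A minimal sketch of the loop with the offset fixed, assuming the intent is to fill the last 2048 elements of each of the m + 1 vectors, mirroring the r0_d and w_d copies above:

	// Offset by one full vector length so that even for i = 0 the
	// destination points inside the v_d allocation, not before it.
	for( int i = 0; i < m + 1; ++i ){
		cutilSafeCall( cudaMemcpyAsync( v_d + vecSize*(i+1) - 2048,
		                                val_d + nzSize - 2048,
		                                2048 * sizeof(float),
		                                cudaMemcpyDeviceToDevice, 0 ) );
	}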

Thanks, @cheinger, @txbob. Will try and update.

Hi @cheinger and @txbob: Thank you for the tip; your suggestion solved the problem.

On a separate issue, is it normal that my speed-up is only 3 times faster than using the CPU alone?

Thanks!

As long as your GPU code is correct, meaning that your CPU and GPU code produce the same results, a 3x speedup can be perfectly normal. Comparing the two results is also a good sanity-check strategy to make sure the GPU code works well.

There are many factors determining the speedup between GPU and CPU. The speedup is defined as CPU_timing / GPU_timing. These factors include the GPU and CPU models, the kind of application you are working on (memory-bound or compute-bound), and so on.
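
For the GPU side, a minimal timing sketch using CUDA events (the launch in the middle is a placeholder for whatever operation you are timing):

	// Measure elapsed GPU time with CUDA events.
	cudaEvent_t start, stop;
	cudaEventCreate( &start );
	cudaEventCreate( &stop );
	cudaEventRecord( start, 0 );
	// ... launch the kernel(s) / memcpys being timed here ...
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );                  // wait for the stop event
	float gpu_ms = 0.0f;
	cudaEventElapsedTime( &gpu_ms, start, stop );  // elapsed time in milliseconds
	// speedup = cpu_ms / gpu_ms
	cudaEventDestroy( start );
	cudaEventDestroy( stop );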

This thread has some discussion on the topic of CPU and GPU speed comparison.
[url]https://devtalk.nvidia.com/default/topic/953975/sequential-code-is-faster-than-parallel-how-is-it-possible-/[/url]