Sparse Matrix-Vector Multiplication not working when matrix size exceeds 3 lakh (300,000) + algorithm given by Nathan Bell a

Hello All,

I have been working with a CUDA Tesla card for the last 3 months. The journey has been really challenging and, at times, fun.

I have tried to replicate the Efficient Sparse Matrix-Vector Multiplication Algorithm as given by Nathan Bell and Michael Garland in the white paper dated Dec 11, 2008.

I have used the CSR format for my matrix, and I have slightly modified the algorithm to handle complex numbers.

struct complex {
    double x;  // real part
    double y;  // imaginary part
};

// grid and thread size
int threadPerBlock = 512;
int numBlock=(VCTSIZE/threadPerBlock)+1; // VCTSIZE is the size of the vector to be multiplied = square matrix size

csr_spmv_kernel<<<numBlock, threadPerBlock>>>(row_size_ptr_kr, csrRowPtr, xInd, xVal, y, Finalans);

__global__ void csr_spmv_kernel(const int *row_size_kr,            // matrix row size
                                const int *csrRowPtr_kr,           // row pointer
                                const int *xInd_kr,                // column index
                                const struct complex *xVal_kr,     // values of the matrix
                                const struct complex *y_kr,        // the vector
                                struct complex *Finalans_kr)       // final result
{
    int row, row_start, row_end, jj, row_size;
    double dot_real, dot_img;
    row_size = *row_size_kr;

    // One thread per matrix row.
    row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < row_size) {
        dot_real = 0;
        dot_img = 0;
        row_start = csrRowPtr_kr[row];
        row_end = csrRowPtr_kr[row + 1];
        // Complex multiply-accumulate over the nonzeros of this row.
        for (jj = row_start; jj < row_end; jj++) {
            dot_real += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].x) - (xVal_kr[jj].y * y_kr[xInd_kr[jj]].y);
            dot_img  += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].y) + (xVal_kr[jj].y * y_kr[xInd_kr[jj]].x);
        }
        Finalans_kr[row].x += dot_real;
        Finalans_kr[row].y += dot_img;
    }
}
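For cross-checking on small inputs, the same per-row complex dot product can be computed on the host. The following is only a sketch: `csrSpmvHost` is a hypothetical name that does not appear in the post, and it assumes zero-based CSR indices and a row pointer array with one more entry than there are rows.

```cuda
struct complex {
    double x;  // real part
    double y;  // imaginary part
};

// Hypothetical host reference (not from the original post): computes
// result = A * vec for a CSR matrix with zero-based indices.
// Unlike the kernel above, it overwrites result instead of accumulating.
static void csrSpmvHost(int numRows, const int *rowPtr, const int *colInd,
                        const complex *val, const complex *vec, complex *result)
{
    for (int row = 0; row < numRows; ++row) {
        double dotReal = 0.0;
        double dotImg = 0.0;
        for (int jj = rowPtr[row]; jj < rowPtr[row + 1]; ++jj) {
            const complex a = val[jj];
            const complex b = vec[colInd[jj]];
            dotReal += a.x * b.x - a.y * b.y;  // real part of a * b
            dotImg  += a.x * b.y + a.y * b.x;  // imaginary part of a * b
        }
        result[row].x = dotReal;
        result[row].y = dotImg;
    }
}
```

Because the kernel accumulates into Finalans with +=, the device output buffer would need to be zeroed before the launch for the two results to match.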

The algorithm works fine up to a matrix size of 300000x300000 with a vector of size 300000. But if I change the matrix and vector size to, say, 350000, the kernel fails.

Calling cudaDeviceSynchronize() right after the kernel invocation reports an error from the kernel.

Can anyone give some insight into why the algorithm works for small matrix sizes and fails as the matrix size is increased?

Can you pass the error value returned by errVal = cudaDeviceSynchronize() to cudaGetErrorString(errVal) to see what the error is?
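The check suggested above could look roughly like this. It is a minimal sketch rather than code from the thread: `checkCuda` is a hypothetical helper name, and the commented usage assumes the launch shown in the original post. `cudaGetLastError()` reports problems with the launch itself, while `cudaDeviceSynchronize()` surfaces faults that happen while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints a readable
// message for a CUDA runtime status and reports whether it was a success.
static bool checkCuda(cudaError_t errVal, const char *what)
{
    if (errVal != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(errVal));
        return false;
    }
    return true;
}

// Usage right after the launch from the post:
//   csr_spmv_kernel<<<numBlock, threadPerBlock>>>(row_size_ptr_kr, csrRowPtr, xInd, xVal, y, Finalans);
//   checkCuda(cudaGetLastError(), "kernel launch");          // invalid launch configuration
//   checkCuda(cudaDeviceSynchronize(), "kernel execution");  // faults while the kernel runs
```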

Thanks a lot for your reply.

The error string returned by cudaGetErrorString() is “unspecified launch failure”. From what I have read, this error is related to a segmentation fault on the CUDA device.

How should I debug this error? I have been stuck with it for days.
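One possible first step, assuming the failure is an out-of-bounds read (which nothing in this thread confirms), is to validate the CSR arrays on the host before copying them to the device. The sketch below uses a hypothetical function, `validateCsr`, and assumes zero-based indices, a square matrix, and host copies of csrRowPtr and xInd. Running the program under cuda-memcheck (compute-sanitizer in newer toolkits) can also help locate a faulting access.

```cuda
#include <cstdio>

// Hypothetical host-side check (not from the original post). Returns true when
// the CSR arrays can be indexed safely by a kernel that reads rowPtr[row + 1]
// and vec[colInd[jj]] for a square matrix with numRows rows.
//   rowPtr : numRows + 1 entries
//   colInd : nnz entries
static bool validateCsr(const int *rowPtr, const int *colInd, int numRows, int nnz)
{
    if (rowPtr[0] != 0 || rowPtr[numRows] != nnz) {
        fprintf(stderr, "row pointer must start at 0 and end at nnz\n");
        return false;
    }
    for (int row = 0; row < numRows; ++row) {
        if (rowPtr[row] > rowPtr[row + 1]) {
            fprintf(stderr, "row pointer decreases at row %d\n", row);
            return false;
        }
    }
    for (int jj = 0; jj < nnz; ++jj) {
        if (colInd[jj] < 0 || colInd[jj] >= numRows) {
            fprintf(stderr, "column index %d out of range at entry %d\n", colInd[jj], jj);
            return false;
        }
    }
    return true;
}
```

If this check passes for the 350000 case, the allocation sizes on the device (in particular numRows + 1 entries for the row pointer) and the return values of the cudaMalloc and cudaMemcpy calls would be the next things worth checking.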

Thanks.