Sparse matrix-vector multiplication fails for matrix sizes above 300,000 (3 lakh) with the algorithm given by Nathan Bell and Michael Garland

Hello All,

I have been working with a CUDA Tesla card for the last 3 months. The journey has been really challenging and fun at times.

I have tried to replicate the efficient sparse matrix-vector multiplication (SpMV) algorithm given by Nathan Bell and Michael Garland in their white paper dated Dec 11, 2008:

http://www.nvidia.com/object/nvidia_research_pub_001.html
I have used the CSR format for my matrix, and I have slightly modified the algorithm to handle complex numbers:

struct complex {
    double x; // real part
    double y; // imaginary part
};

// grid and thread sizes
int threadPerBlock = 512;
// VCTSIZE is the length of the vector to be multiplied = the square matrix dimension.
// Note: this launches one extra (empty) block when VCTSIZE is an exact multiple of
// threadPerBlock; that is harmless given the row < row_size guard in the kernel.
int numBlock = (VCTSIZE / threadPerBlock) + 1;

csr_spmv_kernel<<<numBlock, threadPerBlock>>>(row_size_ptr_kr, csrRowPtr, xInd, xVal, y, Finalans);

__global__ void csr_spmv_kernel(const int *row_size_kr,        // number of matrix rows
                                const int *csrRowPtr_kr,       // CSR row pointer
                                const int *xInd_kr,            // column indices
                                const struct complex *xVal_kr, // nonzero values of the matrix
                                const struct complex *y_kr,    // the vector
                                struct complex *Finalans_kr)   // final result
{
    int row, row_start, row_end, jj, row_size;
    double dot_real, dot_img;

    row_size = *row_size_kr;
    row = blockDim.x * blockIdx.x + threadIdx.x;

    if (row < row_size) {
        dot_real = 0;
        dot_img = 0;
        row_start = csrRowPtr_kr[row];
        row_end = csrRowPtr_kr[row + 1];

        for (jj = row_start; jj < row_end; jj++) {
            // complex multiply-accumulate: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
            dot_real += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].x) - (xVal_kr[jj].y * y_kr[xInd_kr[jj]].y);
            dot_img  += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].y) + (xVal_kr[jj].y * y_kr[xInd_kr[jj]].x);
        }

        Finalans_kr[row].x += dot_real;
        Finalans_kr[row].y += dot_img;
    }
}

The algorithm works fine up to a matrix/vector size of 300,000×300,000 / 300,000. But if I increase the size to, say, 350,000, the kernel fails.

A cudaDeviceSynchronize() call placed just after the kernel invocation reports an error from the kernel execution.

Can anyone give an insight into why the algorithm works for smaller matrices and fails as the matrix size is increased?


The error string returned by cudaGetErrorString() is "unspecified launch failure"; from what I have read, this error is said to be the device-side equivalent of a segmentation fault.

How should I debug this error? I have been stuck with it for days.


Thanks.
