Hello All,
I have been working with a CUDA Tesla card for the last three months. The journey has been challenging and fun at times.
I have tried to replicate the efficient sparse matrix-vector multiplication algorithm given by Nathan Bell and Michael Garland in their white paper dated Dec 11, 2008.
I am using the CSR format for my matrix, and I have slightly modified the algorithm to handle complex numbers:
struct complex {
    double x;
    double y;
};
// grid and thread size
int threadPerBlock = 512;
int numBlock = (VCTSIZE / threadPerBlock) + 1; // VCTSIZE = length of the vector = square matrix dimension
csr_spmv_kernel<<<numBlock, threadPerBlock>>>(row_size_ptr_kr, csrRowPtr, xInd, xVal, y, Finalans);
__global__ void csr_spmv_kernel(const int *row_size_kr,         // number of matrix rows
                                const int *csrRowPtr_kr,        // CSR row pointer
                                const int *xInd_kr,             // column indices
                                const struct complex *xVal_kr,  // nonzero values of the matrix
                                const struct complex *y_kr,     // the vector
                                struct complex *Finalans_kr)    // final result
{
    int row, row_start, row_end, jj, row_size;
    double dot_real, dot_img;

    row_size = *row_size_kr;
    row = blockDim.x * blockIdx.x + threadIdx.x;

    if (row < row_size) {
        dot_real = 0;
        dot_img  = 0;
        row_start = csrRowPtr_kr[row];
        row_end   = csrRowPtr_kr[row + 1];
        for (jj = row_start; jj < row_end; jj++) {
            // complex multiply-accumulate: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
            dot_real += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].x) - (xVal_kr[jj].y * y_kr[xInd_kr[jj]].y);
            dot_img  += (xVal_kr[jj].x * y_kr[xInd_kr[jj]].y) + (xVal_kr[jj].y * y_kr[xInd_kr[jj]].x);
        }
        Finalans_kr[row].x += dot_real;
        Finalans_kr[row].y += dot_img;
    }
}
The algorithm works fine up to a matrix/vector size of 300000x300000 and 300000, but if I increase the size to, say, 350000, the kernel fails.
A cudaDeviceSynchronize() placed just after the kernel invocation reports an error from the kernel execution.
Can anyone give insight into why the algorithm works for smaller matrix sizes but fails as the matrix size is increased?