How to improve this matrix multiplication code in CUDA?

Hi, I’m implementing a matrix multiplication code from a book.
But I ran into trouble because the matrix size cannot be made large…

I assume the matrices are square, so A x B = C → [N*N] x [N*N] = [N*N],
and I also use a tile for the block index.

How can I set the matrix size to 1000 x 1000 or larger?
(I use a Titan X, and this GPU supports 1024 threads per block.)

#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "book.h"   // HANDLE_ERROR() comes from the book's sample code

#define N 128
#define TILE_WIDTH 16

// each thread computes one element of C = A x B (square, width x width)
__global__ void matMulkernel(int *a, int *b, int *c, int width){

int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

int Pvalue = 0;

// dot product of row 'Row' of A with column 'Col' of B
for(int k = 0; k < width; ++k)
	Pvalue += a[Row*width+k] * b[k*width+Col];

c[Row*width+Col] = Pvalue;

}

int main(){

int a[N*N], b[N*N], c[N*N];
int *dev_a, *dev_b, *dev_c;
int _size = N*N*sizeof(int);

for(int i = 0; i < N*N; i++){
	a[i] = 2;
	b[i] = 3;
	c[i] = 0;
}

HANDLE_ERROR(cudaMalloc((void**)&dev_a, _size));
HANDLE_ERROR(cudaMalloc((void**)&dev_b, _size));
HANDLE_ERROR(cudaMalloc((void**)&dev_c, _size));

HANDLE_ERROR(cudaMemcpy(dev_a, a, _size, cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b, b, _size, cudaMemcpyHostToDevice));

dim3 dimGrids(N/TILE_WIDTH, N/TILE_WIDTH);
dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

HANDLE_ERROR(cudaMemcpy(c, dev_c, _size, cudaMemcpyDeviceToHost));

for(int i = 0; i< N*N; i++)
	printf("%d \n", c[i]);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

return 0;

}

“How can I set the matrix size to 1000 x 1000 or larger?”

well, what is the biggest size you can currently achieve?
and what is constraining it?
i suspect device global memory, as your kernel mostly depends on the tile size, and not on local/shared memory or SM characteristics
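
as a quick check of the global-memory idea, something like this (just a rough sketch i’m adding, not code from your post) prints how much device memory is free versus what the three N x N buffers would need:

#include <stdio.h>
#include "cuda_runtime.h"

int main(){

size_t freeBytes = 0, totalBytes = 0;
cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
if (err != cudaSuccess){
	printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
	return 1;
}

size_t n = 1000;                          // hypothetical target size
size_t needed = 3 * n * n * sizeof(int);  // dev_a + dev_b + dev_c

printf("device memory: %zu bytes free of %zu total\n", freeBytes, totalBytes);
printf("needed for N = %zu: %zu bytes\n", n, needed);
printf("fits: %s\n", needed <= freeBytes ? "yes" : "no");

return 0;

}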

little_jimmy
thanks for your reply!

the biggest size I got working is N = 280 with a tile of 10 or 20 (the tile can be other values).
when I tried N = 290 or 300, it didn’t work.

Also, I couldn’t understand what you meant by "I suspect~".

If you don’t mind, could you explain what I am doing wrong?
Thank you!

i actually expected N >> 280

“when I tried N = 290 or 300, it didn’t work”

in what way? what happened? an output error message perhaps?
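
one way to get an actual error message (a sketch of standard cuda error checking, not something from your code) is to check the launch and then synchronize, right after the kernel call in your main():

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

// reports launch/configuration problems (e.g. invalid block or grid dimensions)
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
	printf("launch error: %s\n", cudaGetErrorString(launchErr));

// reports errors that occur while the kernel is actually running
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
	printf("kernel error: %s\n", cudaGetErrorString(syncErr));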

This is going to be a problem for large N:

int a[N*N], b[N*N], c[N*N];

stack-based variables have size limits that are much lower than those of dynamically allocated, heap-based variables.

If you replace the above line of code with the following:

int *a, *b, *c;
a = (int *)malloc(N*N*sizeof(int));
b = (int *)malloc(N*N*sizeof(int));
c = (int *)malloc(N*N*sizeof(int));

I think you’ll have better results.
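
To get to 1000 x 1000 specifically, there is one more detail: 1000 is not a multiple of TILE_WIDTH = 16, so the grid also needs to be rounded up and the kernel needs a bounds check, or threads in the partial edge tiles will read and write out of range. Below is a sketch of how the whole program might look with the heap allocation plus those two changes (untested here, plain cudaMalloc/cudaMemcpy without HANDLE_ERROR so it stands alone, and the same variable names as your code):

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define N 1000
#define TILE_WIDTH 16

__global__ void matMulkernel(int *a, int *b, int *c, int width){

int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

// guard the partial tiles at the right and bottom edges when width % TILE_WIDTH != 0
if (Row < width && Col < width){
	int Pvalue = 0;
	for(int k = 0; k < width; ++k)
		Pvalue += a[Row*width+k] * b[k*width+Col];
	c[Row*width+Col] = Pvalue;
}

}

int main(){

size_t _size = (size_t)N*N*sizeof(int);

// heap allocation instead of int a[N*N], so large N does not overflow the stack
int *a = (int *)malloc(_size);
int *b = (int *)malloc(_size);
int *c = (int *)malloc(_size);

for(int i = 0; i < N*N; i++){
	a[i] = 2;
	b[i] = 3;
	c[i] = 0;
}

int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, _size);
cudaMalloc((void**)&dev_b, _size);
cudaMalloc((void**)&dev_c, _size);

cudaMemcpy(dev_a, a, _size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, _size, cudaMemcpyHostToDevice);

// round the grid up so every row/column is covered even when N % TILE_WIDTH != 0
dim3 dimGrids((N + TILE_WIDTH - 1)/TILE_WIDTH, (N + TILE_WIDTH - 1)/TILE_WIDTH);
dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

cudaMemcpy(c, dev_c, _size, cudaMemcpyDeviceToHost);

// every element of C should be 2 * 3 * N = 6000
printf("c[0] = %d, c[N*N-1] = %d\n", c[0], c[N*N-1]);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
free(a);
free(b);
free(c);

return 0;

}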

And as little_jimmy said, “it didn’t work” is not very helpful in a forum like this. Be specific.
Probably best just to paste the actual error output you are getting into your question.

little_jimmy
Thank you for your reply; I see that I need to be more specific to get more information!
Your answer is very helpful to me!

txbob
thanks txbob!
I’ve fixed my problem thanks to your advice!
And as you mentioned above, I need to be specific to get good feedback.
I will do it next time, thank you!