How to improve this matrix multiplication code in CUDA?

Hi, I’m implementing a matrix multiplication code from a book.
But I ran into trouble because the matrix size cannot be made large…

I assume the matrices are square, so A x B = C → [N*N] x [N*N] = [N*N],
and I also use a tile for the block index.

How can I set the matrix size to 1000 x 1000 or larger?
(I use a Titan X, and this GPU supports 1024 threads per block.)

#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "book.h"   // HANDLE_ERROR() comes from the book's sample code

#define N 128
#define TILE_WIDTH 16

// each thread computes one element of C = A x B (square, width x width)
__global__ void matMulkernel(int *a, int *b, int *c, int width){

int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

int Pvalue = 0;

// dot product of row 'Row' of A with column 'Col' of B
for(int k = 0; k < width; ++k)
	Pvalue += a[Row*width+k] * b[k*width+Col];

c[Row*width+Col] = Pvalue;

}

int main(){

int a[N*N], b[N*N], c[N*N];
int *dev_a, *dev_b, *dev_c;
int _size = N*N*sizeof(int);

for(int i = 0; i < N*N; i++){
	a[i] = 2;
	b[i] = 3;
	c[i] = 0;
}

HANDLE_ERROR(cudaMalloc((void**)&dev_a, _size));
HANDLE_ERROR(cudaMalloc((void**)&dev_b, _size));
HANDLE_ERROR(cudaMalloc((void**)&dev_c, _size));

HANDLE_ERROR(cudaMemcpy(dev_a, a, _size, cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b, b, _size, cudaMemcpyHostToDevice));

dim3 dimGrids(N/TILE_WIDTH, N/TILE_WIDTH);
dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

HANDLE_ERROR(cudaMemcpy(c, dev_c, _size, cudaMemcpyDeviceToHost));

for(int i = 0; i< N*N; i++)
	printf("%d \n", c[i]);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

return 0;

}

“How can I set the matrix size to 1000 x 1000 or larger?”

well, what is the biggest size you can currently achieve?
and what is constraining it?
i suspect device global memory, as your kernel mostly depends on the tile size, and not on local/shared memory or SM characteristics
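
as a quick check of the global-memory idea, something like this (just a rough sketch i’m adding, not code from your post) prints how much device memory is free versus what the three N x N buffers would need:

#include <stdio.h>
#include "cuda_runtime.h"

int main(){

size_t freeBytes = 0, totalBytes = 0;
cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
if (err != cudaSuccess){
	printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
	return 1;
}

size_t n = 1000;                          // hypothetical target size
size_t needed = 3 * n * n * sizeof(int);  // dev_a + dev_b + dev_c

printf("device memory: %zu bytes free of %zu total\n", freeBytes, totalBytes);
printf("needed for N = %zu: %zu bytes\n", n, needed);
printf("fits: %s\n", needed <= freeBytes ? "yes" : "no");

return 0;

}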

little_jimmy
thanks for your reply!

the biggest size I got working is N = 280 with a tile of 10 or 20 (the tile can be other values).
when I tried N = 290 or 300, it didn’t work.

Also, I couldn’t understand what you meant by "I suspect~".

If you don’t mind, could you explain what I am doing wrong?
Thank you!

i actually expected N >> 280

“when I tried N = 290 or 300, it didn’t work”

in what way? what happened? an output error message perhaps?
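
one way to get an actual error message (a sketch of standard cuda error checking, not something from your code) is to check the launch and then synchronize, right after the kernel call in your main():

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

// reports launch/configuration problems (e.g. invalid block or grid dimensions)
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
	printf("launch error: %s\n", cudaGetErrorString(launchErr));

// reports errors that occur while the kernel is actually running
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
	printf("kernel error: %s\n", cudaGetErrorString(syncErr));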

This is going to be a problem for large N:

int a[N*N], b[N*N], c[N*N];

stack-based variables have size limits that are much lower than those of dynamically allocated, heap-based variables.

If you replace the above line of code with the following:

int *a, *b, *c;
a = (int *)malloc(N*N*sizeof(int));
b = (int *)malloc(N*N*sizeof(int));
c = (int *)malloc(N*N*sizeof(int));

I think you’ll have better results.
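
To get to 1000 x 1000 specifically, there is one more detail: 1000 is not a multiple of TILE_WIDTH = 16, so the grid also needs to be rounded up and the kernel needs a bounds check, or threads in the partial edge tiles will read and write out of range. Below is a sketch of how the whole program might look with the heap allocation plus those two changes (untested here, plain cudaMalloc/cudaMemcpy without HANDLE_ERROR so it stands alone, and the same variable names as your code):

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define N 1000
#define TILE_WIDTH 16

__global__ void matMulkernel(int *a, int *b, int *c, int width){

int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

// guard the partial tiles at the right and bottom edges when width % TILE_WIDTH != 0
if (Row < width && Col < width){
	int Pvalue = 0;
	for(int k = 0; k < width; ++k)
		Pvalue += a[Row*width+k] * b[k*width+Col];
	c[Row*width+Col] = Pvalue;
}

}

int main(){

size_t _size = (size_t)N*N*sizeof(int);

// heap allocation instead of int a[N*N], so large N does not overflow the stack
int *a = (int *)malloc(_size);
int *b = (int *)malloc(_size);
int *c = (int *)malloc(_size);

for(int i = 0; i < N*N; i++){
	a[i] = 2;
	b[i] = 3;
	c[i] = 0;
}

int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, _size);
cudaMalloc((void**)&dev_b, _size);
cudaMalloc((void**)&dev_c, _size);

cudaMemcpy(dev_a, a, _size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, _size, cudaMemcpyHostToDevice);

// round the grid up so every row/column is covered even when N % TILE_WIDTH != 0
dim3 dimGrids((N + TILE_WIDTH - 1)/TILE_WIDTH, (N + TILE_WIDTH - 1)/TILE_WIDTH);
dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);

matMulkernel<<<dimGrids, dimBlocks>>>(dev_a, dev_b, dev_c, N);

cudaMemcpy(c, dev_c, _size, cudaMemcpyDeviceToHost);

// every element of C should be 2 * 3 * N = 6000
printf("c[0] = %d, c[N*N-1] = %d\n", c[0], c[N*N-1]);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
free(a);
free(b);
free(c);

return 0;

}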

And as little_jimmy said, “it didn’t work” is not very helpful in a forum like this. Be specific.
Probably best just to paste the actual error output you are getting into your question.

little_jimmy
Thank you for your reply; I see that I need to be more specific to get more information!
Your answer is very helpful to me!

txbob
thanks txbob!
I’ve fixed my problem thanks to your advice!
And as you mentioned above, I need to be specific to get good feedback.
I will do it next time, thank you!