Example of matrix multiplication (max. block_size)
Hi all!

The more I learn, the more questions I have… :rolleyes:
I have studied the official matrix multiplication example. To test the performance of different data tile sizes, I changed BLOCK_SIZE through 1, 2, 4, 8, 16 and 32. The computer crashes when the tile size is 32 (32*32 = 1024 threads in a block). I have a Quadro FX 1700 graphics card*. My questions are (a sketch of the kernel I mean follows the list):

1. Where can I find out the maximum tile size per block that my graphics card supports? (How much data can I copy from device memory to shared memory at once?)
2. Is this right: parallel computation in CUDA happens only within a block (thread parallelism), while the blocks themselves run sequentially, i.e. a block runs on the GPU only after another block has completed its work?
3. What is the relation between blocks and the GPU architecture? A figure in the programming guide shows blocks running in parallel. Doesn't that contradict statement 2 (if it is right)?
[attachment=15503:blocks.jpg]
4. If I have a larger matrix than the one in the example, how should I adapt the example?
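
For reference, the kernel I am experimenting with looks roughly like this (a minimal sketch along the lines of the SDK's matrixMul example, not the exact SDK source; it assumes square n x n matrices with n a multiple of BLOCK_SIZE):

[code]
#define BLOCK_SIZE 16   // the tile size I varied: 1, 2, 4, 8, 16, 32

// C = A * B for square n x n matrices, n a multiple of BLOCK_SIZE
__global__ void matrixMul(float *C, const float *A, const float *B, int n)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];   // one tile of A
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];   // one tile of B

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // walk over all tile pairs that contribute to C[row][col]
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                           // wait until both tiles are loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                           // wait before overwriting the tiles
    }
    C[row * n + col] = sum;
}
[/code]

launched with dim3 threads(BLOCK_SIZE, BLOCK_SIZE) and dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE).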

*Here are the specifications of the FX 1700:

Quadro FX 1700
Memory Size 512MB
Memory Interface 128-bit
Graphic Memory Bandwidth 12.8 GB/sec.
Graphics Bus PCI Express 2.0
CUDA Parallel Processor Cores 32



Thx a lot!!

#1
Posted 01/26/2010 09:12 PM   
[quote]Where can I find out the maximum tile size per block that my graphics card supports?
(How much data can I copy from device memory to shared memory at once?)[/quote]

Shared memory is 16 kB per SM. If the tile size is 32x32, the two shared memory tiles As[32][32] and Bs[32][32] need 8 kB (2 x 32 x 32 x 4 bytes), so only one such thread block fits in an SM. (Not two thread blocks per SM, because the parameters of the kernel function also occupy shared memory, so somewhat less than 16 kB is usable.)

However, the maximum number of threads per thread block on your card is 512, and 32x32 = 1024 exceeds it; that is why your program crashes.
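
You can query these limits at runtime (this is also what the SDK's deviceQuery sample prints). A minimal sketch using the standard runtime API:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("device                 : %s\n", prop.name);
    printf("max threads per block  : %d\n", prop.maxThreadsPerBlock);   // 512 on your card
    printf("shared memory per block: %lu bytes\n",
           (unsigned long)prop.sharedMemPerBlock);                      // 16 kB
    printf("multiprocessors (SMs)  : %d\n", prop.multiProcessorCount);
    printf("warp size              : %d\n", prop.warpSize);
    return 0;
}
[/code]

The largest square tile is then bounded by sqrt(maxThreadsPerBlock), i.e. 22x22 in theory and 16x16 in practice for this kernel.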

[quote]Is this right: parallel computation in CUDA happens only within a block (thread parallelism),
while the blocks themselves run sequentially, i.e. a block runs on the GPU
only after another block has completed its work?[/quote]

The basic unit of thread parallelism is a warp (32 threads). A thread block is
divided into several warps, and the warp scheduler of the SM selects one warp
at a time to execute on the SM's 8 SPs in round-robin fashion.
You cannot say "the calculation of blocks is still sequential": different blocks
run in parallel on different SMs (your FX 1700 has 32 cores = 4 SMs x 8 SPs), and
once a thread block is dispatched to an SM, it does not leave before its work is complete.
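
A worked example for your kernel (the 768-thread and 8-block per-SM limits are the compute capability 1.0/1.1 values):

[code]
// BLOCK_SIZE = 16:
//   threads per block = 16 * 16                  = 256
//   shared memory     = 2 * 16 * 16 * 4 B        = 2 kB
//   blocks per SM     = min(768/256, 16k/2k, 8)  = 3   -> blocks do run concurrently
// BLOCK_SIZE = 32:
//   threads per block = 32 * 32 = 1024 > 512     -> kernel launch fails
[/code]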

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#2
Posted 01/27/2010 05:30 PM   
[quote name='LSChien' post='990129' date='Jan 27 2010, 06:30 PM']Shared memory is 16 kB per SM. If the tile size is 32x32, the two shared memory tiles As[32][32] and Bs[32][32] need 8 kB (2 x 32 x 32 x 4 bytes), so only one such thread block fits in an SM. (Not two thread blocks per SM, because the parameters of the kernel function also occupy shared memory, so somewhat less than 16 kB is usable.)

However, the maximum number of threads per thread block on your card is 512, and 32x32 = 1024 exceeds it; that is why your program crashes.

The basic unit of thread parallelism is a warp (32 threads). A thread block is divided into several warps, and the warp scheduler of the SM selects one warp at a time to execute on the SM's 8 SPs in round-robin fashion. You cannot say "the calculation of blocks is still sequential": different blocks run in parallel on different SMs, and once a thread block is dispatched to an SM, it does not leave before its work is complete.[/quote]

Thx a lot for your help! :shifty:
There are more conditions for a speed-up than I thought :rolleyes:

Can I say that the important criteria for selecting a GPU for GPGPU are the following (a sketch that derives 1 and 2 from the device properties follows the list)?

1. the bandwidth of the device memory
2. the peak FLOPS
- the number of SMs in the GPU (16 in G80)
- the max. number of threads per SM
3. the size of shared memory per block (16 kB?)
4. the number of clock cycles needed to dispatch an instruction for the threads of a warp
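
For reference, a rough sketch of how 1 and 2 can be derived from the device properties. The memoryClockRate/memoryBusWidth fields only exist in newer CUDA toolkits, and the factor 2 for DDR memory, the 8 SPs per SM (G8x/G9x parts only) and the 2 flops per SP per clock (one MAD) are my assumptions:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // 1. theoretical bandwidth: memory clock (kHz) * bus width (bytes) * 2 (DDR)
    double gbps = 2.0 * (p.memoryClockRate * 1e3) * (p.memoryBusWidth / 8.0) / 1e9;

    // 2. rough peak: SMs * 8 SPs per SM (G8x/G9x) * shader clock (kHz) * 2 flops (MAD)
    double gflops = p.multiProcessorCount * 8.0 * (p.clockRate * 1e3) * 2.0 / 1e9;

    printf("bandwidth ~ %.1f GB/s, peak ~ %.1f GFLOPS\n", gbps, gflops);
    return 0;
}
[/code]

On the FX 1700 the first formula reproduces the 12.8 GB/s from the spec sheet above (2 x 400 MHz x 16 bytes).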

best regards

#3
Posted 01/28/2010 05:43 PM   