Example of matrix multiplication (max. block_size)
Hi all!

The more I learn, the more questions I have… :rolleyes:
I have studied the official matrix multiplication example. To test the performance of different data tile sizes, I changed BLOCK_SIZE through 1, 2, 4, 8, 16 and 32. The computer crashes when the tile size is 32 (32*32 = 1024 threads in a block). I have a Quadro FX 1700 graphics card*. My questions are (a sketch of the kernel I mean follows the list):

1. Where can I find out the maximum tile size per block that my graphics card supports? (How much data can I copy from device memory to shared memory at once?)
2. Is this right: parallel computation in CUDA happens only within a block (thread parallelism), while the blocks themselves run sequentially, i.e. a block runs on the GPU only after another block has completed its work?
3. What is the relation between blocks and the GPU architecture? A figure in the programming guide shows blocks running in parallel. Doesn't that contradict statement 2 (if it is right)?
[attachment=15503:blocks.jpg]
4. If I have a larger matrix than the one in the example, how should I adapt the example?
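
For reference, the kernel I am experimenting with looks roughly like this (a minimal sketch along the lines of the SDK's matrixMul example, not the exact SDK source; it assumes square n x n matrices with n a multiple of BLOCK_SIZE):

[code]
#define BLOCK_SIZE 16   // the tile size I varied: 1, 2, 4, 8, 16, 32

// C = A * B for square n x n matrices, n a multiple of BLOCK_SIZE
__global__ void matrixMul(float *C, const float *A, const float *B, int n)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];   // one tile of A
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];   // one tile of B

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // walk over all tile pairs that contribute to C[row][col]
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                           // wait until both tiles are loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                           // wait before overwriting the tiles
    }
    C[row * n + col] = sum;
}
[/code]

launched with dim3 threads(BLOCK_SIZE, BLOCK_SIZE) and dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE).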

*Here are the specifications of the FX 1700:

Quadro FX 1700
Memory Size 512MB
Memory Interface 128-bit
Graphic Memory Bandwidth 12.8 GB/sec.
Graphics Bus PCI Express 2.0
CUDA Parallel Processor Cores 32



Thx a lot!!

#1
Posted 01/26/2010 09:12 PM   
[quote]Where can I find out the maximum tile size per block that my graphics card supports?
(How much data can I copy from device memory to shared memory at once?)[/quote]

Shared memory is 16 kB per SM. If the tile size is 32x32, the two shared memory tiles As[32][32] and Bs[32][32] need 8 kB (2 x 32 x 32 x 4 bytes), so only one such thread block fits in an SM. (Not two thread blocks per SM, because the parameters of the kernel function also occupy shared memory, so somewhat less than 16 kB is usable.)

However, the maximum number of threads per thread block on your card is 512, and 32x32 = 1024 exceeds it; that is why your program crashes.
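
You can query these limits at runtime (this is also what the SDK's deviceQuery sample prints). A minimal sketch using the standard runtime API:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("device                 : %s\n", prop.name);
    printf("max threads per block  : %d\n", prop.maxThreadsPerBlock);   // 512 on your card
    printf("shared memory per block: %lu bytes\n",
           (unsigned long)prop.sharedMemPerBlock);                      // 16 kB
    printf("multiprocessors (SMs)  : %d\n", prop.multiProcessorCount);
    printf("warp size              : %d\n", prop.warpSize);
    return 0;
}
[/code]

The largest square tile is then bounded by sqrt(maxThreadsPerBlock), i.e. 22x22 in theory and 16x16 in practice for this kernel.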

[quote]Is this right: parallel computation in CUDA happens only within a block (thread parallelism),
while the blocks themselves run sequentially, i.e. a block runs on the GPU
only after another block has completed its work?[/quote]

The basic unit of thread parallelism is a warp (32 threads). A thread block is
divided into several warps, and the warp scheduler of the SM selects one warp
at a time to execute on the SM's 8 SPs in round-robin fashion.
You cannot say "the calculation of blocks is still sequential": different blocks
run in parallel on different SMs (your FX 1700 has 32 cores = 4 SMs x 8 SPs), and
once a thread block is dispatched to an SM, it does not leave before its work is complete.
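
A worked example for your kernel (the 768-thread and 8-block per-SM limits are the compute capability 1.0/1.1 values):

[code]
// BLOCK_SIZE = 16:
//   threads per block = 16 * 16                  = 256
//   shared memory     = 2 * 16 * 16 * 4 B        = 2 kB
//   blocks per SM     = min(768/256, 16k/2k, 8)  = 3   -> blocks do run concurrently
// BLOCK_SIZE = 32:
//   threads per block = 32 * 32 = 1024 > 512     -> kernel launch fails
[/code]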

Department of Mathematics, Tsing Hua University, R.O.C.
Lung Sheng Chien

#2
Posted 01/27/2010 05:30 PM   
[quote name='LSChien' post='990129' date='Jan 27 2010, 06:30 PM']Shared memory is 16 kB per SM. If the tile size is 32x32, the two shared memory tiles As[32][32] and Bs[32][32] need 8 kB (2 x 32 x 32 x 4 bytes), so only one such thread block fits in an SM. (Not two thread blocks per SM, because the parameters of the kernel function also occupy shared memory, so somewhat less than 16 kB is usable.)

However, the maximum number of threads per thread block on your card is 512, and 32x32 = 1024 exceeds it; that is why your program crashes.

The basic unit of thread parallelism is a warp (32 threads). A thread block is divided into several warps, and the warp scheduler of the SM selects one warp at a time to execute on the SM's 8 SPs in round-robin fashion. You cannot say "the calculation of blocks is still sequential": different blocks run in parallel on different SMs, and once a thread block is dispatched to an SM, it does not leave before its work is complete.[/quote]

Thx a lot for your help! :shifty:
There are more conditions for a speed-up than I thought :rolleyes:

Can I say that the important criteria for selecting a GPU for GPGPU are the following (a sketch that derives 1 and 2 from the device properties follows the list)?

1. the bandwidth of the device memory
2. the peak FLOPS
- the number of SMs in the GPU (16 in G80)
- the max. number of threads per SM
3. the size of shared memory per block (16 kB?)
4. the number of clock cycles needed to dispatch an instruction for the threads of a warp
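
For reference, a rough sketch of how 1 and 2 can be derived from the device properties. The memoryClockRate/memoryBusWidth fields only exist in newer CUDA toolkits, and the factor 2 for DDR memory, the 8 SPs per SM (G8x/G9x parts only) and the 2 flops per SP per clock (one MAD) are my assumptions:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // 1. theoretical bandwidth: memory clock (kHz) * bus width (bytes) * 2 (DDR)
    double gbps = 2.0 * (p.memoryClockRate * 1e3) * (p.memoryBusWidth / 8.0) / 1e9;

    // 2. rough peak: SMs * 8 SPs per SM (G8x/G9x) * shader clock (kHz) * 2 flops (MAD)
    double gflops = p.multiProcessorCount * 8.0 * (p.clockRate * 1e3) * 2.0 / 1e9;

    printf("bandwidth ~ %.1f GB/s, peak ~ %.1f GFLOPS\n", gbps, gflops);
    return 0;
}
[/code]

On the FX 1700 the first formula reproduces the 12.8 GB/s from the spec sheet above (2 x 400 MHz x 16 bytes).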

best regards

#3
Posted 01/28/2010 05:43 PM   