Matrix multiplication can't achieve peak performance
Dear All:

1. I am reading the book Programming Massively Parallel Processors by Kirk and Hwu.

I am wondering why only 1/3 of peak performance is achieved for the matrix multiplication problem by the top scientists in the area.

At the end of the book, they achieve 120 GFlops, but the peak performance of the device is 360 GFlops.

2. Does this mean 1/3 is the best we can do for this matrix multiplication problem?
3. For real applications in general, what is the best percentage of peak performance we can expect? I know this depends on the problem; I just want some rule-of-thumb ideas.

Any answer will be highly appreciated.

Thanks

#1
Posted 04/06/2012 12:30 AM   
It largely depends on how much effort you are willing to invest: [url="http://forums.nvidia.com/index.php?showtopic=47689"][1][/url] [url="http://forums.nvidia.com/index.php?showtopic=159033"][2][/url].

Having ptxas in between the compiler output (or your hand-optimized PTX code) and the instructions that actually run on the machine can be quite painful sometimes.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/06/2012 10:50 AM   
2) I'd recommend sticking with CUBLAS if you need matrix multiplication. NVIDIA improves its performance with every new release.

3) It's relatively easy to obtain 80%+ of peak [b]memory[/b] throughput. And in most real-world applications, this is the limiting factor.
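
To see where you stand against that limit, here is a minimal sketch of a bandwidth measurement (my own illustration, with an arbitrary array size and launch configuration): it times a simple device-to-device copy kernel with CUDA events and reports the achieved GB/s, which you can compare against the board's theoretical peak.

[code]
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                      // one read + one write per element
}

int main(void)
{
    const int n = 1 << 24;                   // 16M floats = 64 MB per array
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);   // warm-up

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // bytes moved = n reads + n writes
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("copy kernel: %.3f ms, %.1f GB/s\n", ms, gbps);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
[/code]

Dividing the measured number by the board's advertised peak bandwidth gives the percentage mentioned above.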

#3
Posted 04/06/2012 04:12 PM   
Hi,
I have done some experiments on matrix multiplication with my students, and I obtain the following performance on a GTX 480 with CUDA 4.1:
1. naive implementation without shared memory: 110 GFlops
2. implementation with shared memory (as in the programming guide): 230 GFlops
3. CUBLAS: 840 GFlops

Obviously the matrix should be big enough (> 1024x1024).
An important point is how you compute the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires
n^3 multiplications and n^3 additions, so 2n^3 flops in total.
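
To make the flop counting and timing concrete, here is a minimal sketch of how such GFlop/s numbers can be measured (an illustration, not the code behind the figures above): it times one cublasSgemm call with CUDA events and reports 2n^3 divided by the elapsed time. It assumes the CUBLAS v2 API and square single-precision matrices already in device memory; the matrix size is an arbitrary choice.

[code]
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 2048;                      // big enough to saturate the GPU
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);                // a real benchmark would use real data here
    cudaMemset(dB, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // warm-up call so one-off initialization cost is not timed
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // standard algorithm: n^3 multiplications + n^3 additions = 2*n^3 flops
    double gflops = 2.0 * n * n * n / (ms * 1e-3) / 1e9;
    printf("SGEMM %dx%d: %.3f ms, %.1f GFlop/s\n", n, n, ms, gflops);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
[/code]

The same timing scaffold works for the hand-written kernels; only the launched call changes.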

#4
Posted 04/06/2012 04:19 PM   
[quote name='alexish' date='06 April 2012 - 11:19 AM' timestamp='1333729168' post='1392700']
Hi,
I have done some experiments on matrix multiplication with my students, and I obtain the following performance on a GTX 480 with CUDA 4.1:
1. naive implementation without shared memory: 110 GFlops
2. implementation with shared memory (as in the programming guide): 230 GFlops
3. CUBLAS: 840 GFlops

Obviously the matrix should be big enough (> 1024x1024).
An important point is how you compute the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires
n^3 multiplications and n^3 additions, so 2n^3 flops in total.
[/quote]

Hi:

The peak performance for the GTX 480 on the wiki is 1344 GFlops, but I think it is obtained as 3 flops × number of cores × clock in GHz.
When I think about the performance of matrix multiplication, should we compare against 3 flops × number of cores × clock, or against
2 flops × number of cores × clock?

If you assume 3 flops/core, then 840 GFlops is 62 percent of peak performance.
If we use 2 flops/core, then 840 GFlops is nearly 95 percent of peak performance.

(I saw one paper, "Improving Performance of Matrix Multiplication and FFT on GPU", which claimed the GTX 280 peak performance is 2 flops × 240 × 1.295 GHz = 622 GFlops instead of the 933 GFlops claimed by NVIDIA.)

Can somebody explain more clearly why we should use 2 flops × number of cores × clock for peak performance instead of 3 flops?

Thanks

#5
Posted 04/06/2012 06:52 PM   
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
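
As a worked example with the commonly quoted GTX 480 numbers (480 CUDA cores, 1.401 GHz shader clock): 2 FLOP/cycle × 480 × 1.401 GHz ≈ 1345 GFlop/s, which is where the ~1344 GFlops wiki figure comes from. The 840 GFlop/s CUBLAS result above is therefore roughly 62% of peak.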


#6
Posted 04/06/2012 09:35 PM   
[quote name='tera' date='06 April 2012 - 04:35 PM' timestamp='1333748148' post='1392807']
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
[/quote]

Then for the GTX 580, which frequency should I use? If I use the core frequency of about 0.77 GHz,
then 2 × 512 × 0.774 only gives me about 790 GFlops, but they seem to claim the GTX 580 has 1580 GFlops.
Where am I wrong? Can you please write down how to calculate the GFlops for the GTX 580 using device-specific numbers?

#7
Posted 04/06/2012 10:19 PM   
Yes, it's the shader frequency, not the core frequency.
(another case where Nvidia's renaming of the ALUs/FPUs to "cores" creates unnecessary confusion)
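
To answer the "device-specific numbers" part of the question, here is a minimal sketch that queries the device properties and applies the formula above. It assumes a Fermi-class card, where cudaDeviceProp::clockRate reports the shader clock (in kHz) and the number of CUDA cores per SM is 32 for compute capability 2.0 and 48 for 2.1.

[code]
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // CUDA cores per multiprocessor for compute capability 2.x
    int coresPerSM = (prop.major == 2 && prop.minor == 1) ? 48 : 32;
    int cores = prop.multiProcessorCount * coresPerSM;

    // 2 FLOP per core per cycle (one single-precision FMA)
    double ghz = prop.clockRate * 1e-6;       // clockRate is reported in kHz
    double peakGflops = 2.0 * cores * ghz;
    printf("%s: %d cores x %.3f GHz x 2 = %.0f GFlop/s peak\n",
           prop.name, cores, ghz, peakGflops);
    return 0;
}
[/code]

For a GTX 580 (16 SMs × 32 cores, 1.544 GHz shader clock) this gives 2 × 512 × 1.544 ≈ 1581 GFlop/s, matching the advertised figure.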


#8
Posted 04/06/2012 10:29 PM   
[quote name='tera' date='06 April 2012 - 04:35 PM' timestamp='1333748148' post='1392807']
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
[/quote]
Oh, I made a mistake.

2 × 512 × 1.54 (using the shader clock) will give me the correct result.

#9
Posted 04/06/2012 10:30 PM   
[quote name='alexish' date='06 April 2012 - 09:19 AM' timestamp='1333729168' post='1392700']
I have done some experiments on matrix multiplication with my students, and I obtain the following performance on a GTX 480 with CUDA 4.1:
1. naive implementation without shared memory: 110 GFlops
2. implementation with shared memory (as in the programming guide): 230 GFlops
3. CUBLAS: 840 GFlops
[/quote]

It is very easy to improve these 230 Gflop/s to 480 Gflop/s - see http://www.eecs.berkeley.edu/~volkov/volkov10-GTC.pdf, slide 51 onward.

If you want to get CUBLAS performance, check sgemm_fermi*.cu in magma_1.1.0.tar.gz here: http://icl.cs.utk.edu/magma/software/index.html
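
For anyone who wants to try the idea from those slides before digging into the MAGMA source, here is a minimal sketch of the register-blocking technique: each thread accumulates several elements of C in registers instead of just one. It is a simplified illustration, not Volkov's or CUBLAS's code; it assumes row-major square matrices with n a multiple of 64, and the kernel name and tile sizes (sgemm_regblock, TILE_K = 16, WORK = 4) are arbitrary choices.

[code]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define TILE_K 16   // width of the K-tile and of the C-tile in columns
#define WORK    4   // elements of C computed per thread (register blocking)

// Each block computes a (TILE_K*WORK) x TILE_K = 64 x 16 tile of C = A * B
// (row-major, n x n, with n assumed to be a multiple of TILE_K*WORK).
__global__ void sgemm_regblock(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE_K * WORK][TILE_K];   // 64 x 16 tile of A
    __shared__ float Bs[TILE_K][TILE_K];          // 16 x 16 tile of B

    int tx = threadIdx.x;                         // 0..15: column within the tile
    int ty = threadIdx.y;                         // 0..15
    int col     = blockIdx.x * TILE_K + tx;               // column of C
    int rowBase = blockIdx.y * TILE_K * WORK + ty;         // first of WORK rows

    float acc[WORK] = {0.0f, 0.0f, 0.0f, 0.0f};   // per-thread results in registers

    for (int k0 = 0; k0 < n; k0 += TILE_K) {
        // stage the tiles: each thread loads WORK elements of A and one of B
        for (int w = 0; w < WORK; ++w)
            As[ty + w * TILE_K][tx] = A[(rowBase + w * TILE_K) * n + k0 + tx];
        Bs[ty][tx] = B[(k0 + ty) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE_K; ++k) {
            float b = Bs[k][tx];                  // reused for WORK accumulations
            for (int w = 0; w < WORK; ++w)
                acc[w] += As[ty + w * TILE_K][k] * b;
        }
        __syncthreads();
    }

    for (int w = 0; w < WORK; ++w)
        C[(rowBase + w * TILE_K) * n + col] = acc[w];
}

int main(void)
{
    const int n = 1024;                           // must be a multiple of 64 here
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *hA = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) hA[i] = 1.0f; // all-ones input for an easy check

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_K, TILE_K);                   // 16 x 16 = 256 threads
    dim3 grid(n / TILE_K, n / (TILE_K * WORK));
    sgemm_regblock<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %.1f (expected %d)\n", hC[0], n);   // each element should equal n
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hC);
    return 0;
}
[/code]

With all-ones inputs every element of C should equal n, which gives a quick correctness check. Tuning the tile sizes and how A is fed from memory is what pushes this class of kernel toward the numbers on the slides.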

#10
Posted 04/19/2012 09:58 AM   