1. I am reading the book Programming Massively Parallel Processors by Kirk and Hwu.

I am wondering why the authors, who are top scientists in the area, achieve only 1/3 of peak performance for the matrix multiplication problem.

At the end of the book they reach 120 GFLOPS, but the peak performance of the device is 360 GFLOPS.

2. Does this mean 1/3 of peak is the best we can do for matrix multiplication?

3. For real applications in general, what is the best percentage of peak performance we can expect? I know this depends on the problem; I just want a rule of thumb.

Any answer will be highly appreciated.

Thanks

It largely depends on how much effort you are willing to invest: [url="http://forums.nvidia.com/index.php?showtopic=47689"][1][/url] [url="http://forums.nvidia.com/index.php?showtopic=159033"][2][/url].

Having ptxas in between the compiler output (or your hand-optimized PTX code) and the instructions that actually run on the machine can be quite painful sometimes.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

Hi,
I have done some experiments on matrix multiplication with my students and obtained the following performance on a GTX 480 with CUDA 4.1:
1. Naive implementation without shared memory: 110 GFLOPS
2. Implementation with shared memory (as in the programming guide): 230 GFLOPS
3. CUBLAS: 840 GFLOPS

Obviously the matrices should be big enough (> 1024x1024).
An important point is how you count the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires
n^3 multiplications and n^3 additions, so 2n^3 flop.
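
The 2n^3 count can be turned into a small helper for converting a measured run time into GFLOPS. This is only a sketch: the function names and the 20 ms timing are made up for illustration and are not measurements from this thread.

```python
def gemm_flop_count(n: int) -> int:
    """Flop count of the standard n x n matrix product: n^3 multiplies + n^3 adds."""
    return 2 * n**3


def gemm_gflops(n: int, seconds: float) -> float:
    """Achieved GFLOPS for an n x n product that took `seconds` to run."""
    return gemm_flop_count(n) / seconds / 1e9


def count_flops_naive(n: int) -> int:
    """Count operations by walking the triple loop of the standard algorithm."""
    flops = 0
    for _ in range(n):          # rows of C
        for _ in range(n):      # columns of C
            for _ in range(n):  # terms of one dot product
                flops += 2      # one multiply and one add
    return flops


if __name__ == "__main__":
    # Hypothetical timing: a 2048 x 2048 product measured at 20 ms.
    print(f"{gemm_gflops(2048, 0.020):.0f} GFLOPS")
```

Counting with the same convention for every implementation is what makes the three numbers comparable.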

[quote name='alexish' date='06 April 2012 - 11:19 AM' timestamp='1333729168' post='1392700']
Hi,
I have done some experiments on matrix multiplication with my students and obtained the following performance on a GTX 480 with CUDA 4.1:
1. Naive implementation without shared memory: 110 GFLOPS
2. Implementation with shared memory (as in the programming guide): 230 GFLOPS
3. CUBLAS: 840 GFLOPS

Obviously the matrices should be big enough (> 1024x1024).
An important point is how you count the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires
n^3 multiplications and n^3 additions, so 2n^3 flop.
[/quote]

Hi,

The peak performance for the GTX 480 on the wiki is 1344 GFLOPS, but I think this is obtained as 3 flops * number of cores * clock in GHz.
When I think about the performance of matrix multiplication, should we compare with 3 flops * number of cores * clock or with
2 flops * number of cores * clock?

If you assume 3 flops/core, then 840 GFLOPS is 62 percent of peak performance.
If we use 2 flops/core, then 840 GFLOPS is nearly 95% of peak performance.

(I saw one paper, "Improving Performance of Matrix Multiplication and FFT on GPU", which claimed the GTX 280 peak performance is (2 flops * 240 * 1.295 GHz) = 622 GFLOPS instead of the 933 GFLOPS claimed by NVIDIA.)

Can somebody explain more clearly why to use 2 flops * number of cores * clock for peak performance instead of 3 flops?

Thanks

Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
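
To make the formula concrete, here is a small sketch that applies it. The GTX 280 figures (240 cores, 1.295 GHz) are the ones quoted earlier in the thread; the GTX 480 figures (480 cores, 1.401 GHz shader clock) are my assumption taken from the public spec sheet, not from this thread, and the helper names are made up for illustration.

```python
def peak_gflops(flop_per_cycle: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS: FLOP/cycle x number of cores x clock (GHz)."""
    return flop_per_cycle * cores * clock_ghz


def fraction_of_peak(achieved_gflops: float, peak: float) -> float:
    """Share of the theoretical peak that a measured rate reaches."""
    return achieved_gflops / peak


# GTX 280 (compute capability 1.x), numbers as quoted in the thread.
gtx280_mad = peak_gflops(2, 240, 1.295)      # multiply-add only
gtx280_mad_mul = peak_gflops(3, 240, 1.295)  # also counting the extra multiply

# GTX 480 (compute capability 2.0).
# ASSUMPTION: 480 cores and a 1.401 GHz shader clock, from the public specs.
gtx480 = peak_gflops(2, 480, 1.401)

print(f"GTX 280: {gtx280_mad:.0f} vs {gtx280_mad_mul:.0f} GFLOPS")
print(f"GTX 480: {gtx480:.0f} GFLOPS peak")
print(f"CUBLAS at 840 GFLOPS: {fraction_of_peak(840, gtx480):.0%} of peak")
```

Under these assumptions the GTX 480 result lands next to the 1344 GFLOPS quoted from the wiki, which suggests that figure already uses 2 FLOP/cycle and would put 840 GFLOPS at roughly 62% of peak.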

[quote name='tera' date='06 April 2012 - 04:35 PM' timestamp='1333748148' post='1392807']
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
[/quote]

Then for the GTX 580, which frequency should I use? If I use the core frequency of 0.774 GHz,
I get 2*512*0.774, which is only about 790 GFLOPS, but they seem to claim the GTX 580 has 1580 GFLOPS.
Where am I wrong? Can you please write down how to calculate the GFLOPS for the GTX 580 using device-specific numbers?

Yes, it's the shader frequency, not the core frequency.
(another case where Nvidia's renaming of the ALUs/FPUs to "cores" creates unnecessary confusion)
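
A quick check of how much the choice of clock matters for the GTX 580, as a sketch only. The 512 cores and the 0.774 GHz core clock come from the question above; the 1.54 GHz shader clock is the value the poster settles on elsewhere in this thread. The helper name is made up for illustration.

```python
CORES = 512          # GTX 580 core count used in the thread
FLOP_PER_CYCLE = 2   # compute capability 2.x: one multiply-add per core per cycle


def peak_gflops(clock_ghz: float) -> float:
    """Peak GFLOPS for the fixed core count above at the given clock."""
    return FLOP_PER_CYCLE * CORES * clock_ghz


with_core_clock = peak_gflops(0.774)   # the clock used by mistake
with_shader_clock = peak_gflops(1.54)  # the clock the ALUs actually run at

print(f"core clock:   {with_core_clock:.0f} GFLOPS")
print(f"shader clock: {with_shader_clock:.0f} GFLOPS")
```

Only the shader-clock result comes out close to the advertised 1580 GFLOPS.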

[quote name='alexish' date='06 April 2012 - 09:19 AM' timestamp='1333729168' post='1392700']
I have done some experiments on matrix multiplication with my students and obtained the following performance on a GTX 480 with CUDA 4.1:
1. Naive implementation without shared memory: 110 GFLOPS
2. Implementation with shared memory (as in the programming guide): 230 GFLOPS
3. CUBLAS: 840 GFLOPS
[/quote]

It is very easy to improve these 230 Gflop/s to 480 Gflop/s - see http://www.eecs.berkeley.edu/~volkov/volkov10-GTC.pdf slide 51 and on.

If you want to get CUBLAS performance, check sgemm_fermi*.cu in magma_1.1.0.tar.gz here: http://icl.cs.utk.edu/magma/software/index.html

3) It's relatively easy to obtain 80%+ of peak [b]memory[/b] throughput. And in most real-world applications, this is the limiting factor.
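
One way to see why memory throughput ends up being the limit: a kernel that does few flops per byte moved cannot run faster than bandwidth times flops per byte, whatever the arithmetic peak is. The sketch below is a back-of-the-envelope bound, not a measurement. The example kernel (single-precision y = a*x + y), the 177 GB/s bandwidth and the 1345 GFLOPS peak are all assumptions of mine chosen to be in the range of a GTX 480 class card; none of them come from this thread.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flop_per_byte: float) -> float:
    """Upper bound on GFLOPS: limited by arithmetic peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flop_per_byte)


# ASSUMED hardware numbers (illustrative only).
PEAK_GFLOPS = 1345.0
BANDWIDTH_GB_S = 177.0

# y = a*x + y in single precision: 2 flop per element,
# 12 bytes of traffic per element (read x, read y, write y; 4 bytes each).
saxpy_flop_per_byte = 2 / 12

bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, saxpy_flop_per_byte)
at_80_percent = attainable_gflops(PEAK_GFLOPS, 0.8 * BANDWIDTH_GB_S,
                                  saxpy_flop_per_byte)

print(f"bound at full bandwidth: {bound:.1f} GFLOPS")
print(f"bound at 80% bandwidth:  {at_80_percent:.1f} GFLOPS")
print(f"share of arithmetic peak: {bound / PEAK_GFLOPS:.1%}")
```

For such a kernel, reaching 80% of peak memory throughput is a good result even though it is only a few percent of the arithmetic peak.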

[quote name='tera' date='06 April 2012 - 04:35 PM' timestamp='1333748148' post='1392807']
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × [i]number of cores[/i] × [i]frequency[/i].
[/quote]

Oh, I made a mistake.

2*512*1.54 gives the correct result.
