my speedy SGEMM
I've tried implementing a matrix multiply that's as fast as I could make it. On my 8600GT OC, it gets 38 Gflops vs 23 with CUBLAS. Could somebody who has an 8800GTX get some numbers?

I tried very hard to optimize it, and experimented with many techniques. I'm afraid, however, that further improvement is unlikely without direct access to cubin. Minor changes give rather chaotic swings in performance (up or down by 30-200%), even with ptxas run at -O0. Turning on ptxas optimizations usually hurts performance.

#1
Posted 10/05/2007 11:50 PM   
[quote name='alex_dubinsky' date='Oct 5 2007, 04:50 PM']Could somebody who has an 8800GTX get some numbers?
[right][snapback]261174[/snapback][/right]
[/quote]
137. Nice Alex!

-dh

#2
Posted 10/06/2007 12:28 AM   
I get 137.9 Gflop/s on 8800 GTX. CUBLAS runs at 120.

#3
Posted 10/06/2007 04:14 PM   
[quote]Could somebody who has an 8800GTX get some numbers?[/quote]
I get 171.68 Gflops, using a 8800 GTX ULTRA.
Nice work!

#4
Posted 10/07/2007 07:41 AM   
[quote name='nutti' date='Oct 7 2007, 03:41 AM']I get 171.68 Gflops, using a 8800 GTX ULTRA.
Nice work!
[right][snapback]261572[/snapback][/right]
[/quote]
Wow! How can such a difference be explained?

See, something about the compiling/optimizing is just so chaotic.

#5
Posted 10/07/2007 05:34 PM   
[quote name='alex_dubinsky' date='Oct 7 2007, 10:34 AM']wow! how can such a difference be explained?

see, something about the compiiling/optimizing is just so chaotic.
[right][snapback]261705[/snapback][/right]
[/quote]

I doubt he even recompiled; he probably used your binary.

This is a nice example of how your excellent program scaled, without recompiling, across the 8-series family with different numbers of processors and different clocks.

The Ultra has faster shader and memory clocks and more bandwidth than the GTX, so the 137 to 171 is just straight clock scaling.

#6
Posted 10/07/2007 05:44 PM   
[quote name='dhoff' date='Oct 7 2007, 01:44 PM']I doubt he even recompiled, probably used your binary.

This is a nice example how your excellent program scaled, without recompiling, across the 8-series family with different numbers of processors and different clocks.

The Ultra has faster shader, memory clocks and bandwidth  than the GTX, so the 137 to 171 is just straight clock scaling.
[right][snapback]261711[/snapback][/right]
[/quote]
172/138 = 25%, while the Ultra has 6.5% faster shaders and 20% faster memory. Something else must be going on.

Btw, the code gets recompiled automatically because the executable only has cm_10 ptx and sm_11 cubin embedded (it's an artifact of my optimizing).
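A quick sanity check on the arithmetic above, using only the ratios quoted in this thread (the clock percentages are the poster's figures, not independently verified):

```python
# Ratios quoted in this thread for the Ultra vs. the GTX.
observed = 172 / 138   # measured Gflops ratio, ~1.25
shader   = 1.065       # quoted shader-clock advantage
memory   = 1.20        # quoted memory-clock advantage

print(f"observed speedup: {observed:.3f}")

# The observed speedup exceeds both individual clock ratios,
# which is why pure clock scaling looks like an incomplete explanation.
assert observed > shader
assert observed > memory
```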

#7
Posted 10/07/2007 08:39 PM   
I've also got the Ultra.

When I don't recompile the project (just launch the binary in the Release folder) I get 155 GFlops. Recompiling the Project in Release Mode gives me the same number.

What's going on? Why is my card slower?
I use it not only for the computation; it's also my display adapter, so it renders my desktop as well. Might this impact performance?

Btw, nice work Alex!

#8
Posted 10/07/2007 09:14 PM   
So, this code is 65% faster than CUBLAS when run on 8600GT OC, but only 14% faster on 8800 GTX. I wonder how it compares to CUBLAS on 8800 GTX Ultra. Any info?

Also, I am not sure that the timing used in this code is fair. I'd call cudaThreadSynchronize() after each kernel invocation, inside the loop. If you invoke the kernel again before the previous pass is over, the timing may go wrong. This could be the cause for the 171 vs 155 case.

#9
Posted 10/07/2007 09:39 PM   
[quote name='vvolkov' date='Oct 7 2007, 02:39 PM']...
Also, I am not sure that the timing used in this code is fair. I'd call cudaThreadSynchronize() after each kernel invocation, inside the loop. If you invoke kernel again before previous pass is not over, it may go wrong. This could be the cause for the 171 vs 155 case.
[right][snapback]261801[/snapback][/right]
[/quote]

Calling cudaThreadSynchronize() in CUDA 1.0 after each kernel call isn't necessary. Subsequent kernel calls get serialized by the driver. So, it suffices to call cudaThreadSynchronize() once, after the timing loop.

Paulius

#10
Posted 10/08/2007 05:22 PM   
Alex,
nice code optimization but what you have coded is not a real SGEMM.
SGEMM performs C=alpha*A*B+beta*C.

CUBLAS achieves 120 Gflops in CUDA 1.0 for SGEMM, and it will improve in the upcoming release.
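For reference, this is the operation the full routine computes. A minimal illustrative sketch in pure Python (row-major lists, no transpose modes — this is not the CUDA kernel under discussion, just the math):

```python
def sgemm(alpha, A, B, beta, C):
    """Naive reference for C = alpha*A*B + beta*C.

    A is n-by-k, B is k-by-m, C is n-by-m; C is updated in place.
    """
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
sgemm(2.0, A, B, 0.5, C)
# C is now [[38.5, 44.5], [86.5, 100.5]]
```

A plain A*B kernel is the alpha=1, beta=0 special case, which is why the alpha/beta part is trivial to add; the transpose modes are where the real work is.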

#11
Posted 10/09/2007 05:11 AM   
[quote name='paulius' date='Oct 8 2007, 09:22 AM']Calling cudaThreadSynchronize() in CUDA 1.0 after each kernel call isn't necessary.  Subsequent kernel calls get serialized by the driver.  So, it suffices to call cudaThreadSynchronize() once, after the timing loop.

Paulius
[right][snapback]262084[/snapback][/right]
[/quote]

This contradicts my observations with CUDA 1.0.

To be more specific, I took the bandwidth benchmark posted in another thread (http://forums.nvidia.com/index.php?showtopic=44152&view=findpost&p=246974) and inserted cutGetTimerValue() in the timing loop to measure the time taken by each iteration. The first ~20 iterations take ~13 us each, which is just the kernel invocation overhead; later iterations take ~2.9 ms each, which is the expected time and matches what I get when using cudaThreadSynchronize().

#12
Posted 10/09/2007 11:29 AM   
[quote name='vvolkov' date='Oct 9 2007, 06:29 AM']This contradicts with my observations with CUDA 1.0.

To be more specific, I took the bandwidth benchmark posted in another thread (http://forums.nvidia.com/index.php?showtopic=44152&view=findpost&p=246974), and inserted cutGetTimerValue() in the timing loop to measure the time taken by each iteration. First ~20 iterations take ~13 us each which is just the kernel invocation overhead, later iterations take ~2.9 ms each which is the expected time and is the same as when using cudaThreadSynchronize().[right][snapback]262461[/snapback][/right][/quote]
There seems to be a misunderstanding. When you want to time each kernel run, you have to make sure the kernel has finished. You do that by calling cudaThreadSynchronize().
What I think Paulius is saying is that you can call, say, 10 kernels without cudaThreadSynchronize() and CUDA will queue the calls and execute them one by one in FIFO order (I guess). You don't have to call cudaThreadSynchronize() between kernel invocations to make sure each is executed properly; you just have to call it once, right before you take the time, to make sure everything is finished. That is, if you time multiple kernels per run.

So you are right: before taking the time there should be a call to cudaThreadSynchronize(). I think it would actually make sense to implicitly invoke cudaThreadSynchronize() functionality when calling cutGetTimerValue(), so no one would get confused. A lot of people have measured wrong times because of the obscure asynchronous behavior of CUDA, including myself.
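The pitfall isn't specific to CUDA; any asynchronous API shows it. A host-side Python analogy, using a single-worker thread pool as a stand-in for the driver's kernel queue (an illustration only, not CUDA code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(n):
    # Stand-in for a GPU kernel: just burns CPU time.
    s = 0
    for i in range(n):
        s += i * i
    return s

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    # "Launching" work: submit() returns immediately, like an async
    # kernel call, so stopping the timer here measures only launch
    # overhead, not execution.
    futures = [pool.submit(fake_kernel, 500_000) for _ in range(10)]
    t_no_sync = time.perf_counter() - t0

    # The analogue of cudaThreadSynchronize(): drain the whole queue
    # before reading the timer.
    for f in futures:
        f.result()
    t_synced = time.perf_counter() - t0

print(f"without sync: {t_no_sync*1e3:.2f} ms, with sync: {t_synced*1e3:.2f} ms")
assert t_no_sync < t_synced  # the unsynchronized measurement undercounts
```

The "first ~20 iterations take ~13 us" observation above is exactly the `t_no_sync` regime: the launch queue absorbs the calls until it fills up.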

#13
Posted 10/09/2007 04:06 PM   
OK, I've got it. They are queued. And in my case I get 10 kernel calls in a queue.

I thought Paulius meant that there is an implicit call to cudaThreadSynchronize() before each kernel invocation.

Thanks!

#14
Posted 10/09/2007 04:30 PM   
[quote name='mfatica' date='Oct 9 2007, 01:11 AM']Alex,
nice code optimization but what you have coded is not a real SGEMM.
SGEMM performs C=alpha*A*B+beta*C.

CUBLAS achieves 120Gflops in CUDA 1.0 for SGEMM and it will improve in the upcoming release.
[right][snapback]262369[/snapback][/right]
[/quote]
You're right. A real sgemm includes alpha and beta, and supports various transpose modes. Alphas and betas are trivial, but the transpose modes will require more work. In a proper implementation, however, the extra features will have little effect on performance (especially when they're not used).

I'll work on expanding my function to be a full sgemm.

#15
Posted 10/09/2007 10:08 PM   