LU, QR and Cholesky factorizations using GPU
I'd like to share an implementation of LAPACK's routines [url="http://www.netlib.org/lapack/single/sgetrf.f"]SGETRF[/url], [url="http://www.netlib.org/lapack/single/spotrf.f"]SPOTRF[/url], and [url="http://www.netlib.org/lapack/single/sgeqrf.f"]SGEQRF[/url] that is accelerated using the GPU. This implementation is limited to the factorization of square matrices that reside in host memory (i.e., on the CPU side). The following figure shows the sustained performance on this platform: Intel Core2 Quad 2.83 GHz (Q9550), PCIe 2.0 x16, Intel MKL 10.1, Windows XP 64-bit, NVIDIA driver 181.20, CUDA 2.1:
[center][img]http://www.eecs.berkeley.edu/~volkov/glapack/performance.png[/img][/center]
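For readers who want to see what SGETRF actually computes before worrying about the GPU side: it is an LU factorization with partial pivoting, PA = LU. Below is a minimal, unblocked pure-Python sketch of that algorithm, for illustration only; the real routine is blocked and, in this release, GPU-accelerated:

```python
def lu_factor(A):
    """In-place LU factorization with partial pivoting: P*A = L*U.

    A is a square matrix given as a list of lists. Returns the pivot row
    chosen at each step; L (unit lower triangular) and U (upper triangular)
    end up packed in A itself, as in LAPACK's GETRF convention."""
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest |entry| in column k to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        piv[k] = p
        A[k], A[p] = A[p], A[k]
        # Eliminate below the diagonal, storing the multipliers in place.
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv
```

The GPU implementation discussed here does the same factorization, but with the bulk of the work cast as large matrix-matrix updates that run on the card.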

The implementation follows the description given in the following paper; however, some of the finer tunings described, such as recursive and variable blocking, are not included in this release:[indent]Volkov, V., and Demmel, J. W. 2008. [url="http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf"]Benchmarking GPUs to tune dense linear algebra[/url], [i]SC08[/i].[/indent]Regards,

Vasily

05/02/09 edit: updated the dead URL to the paper.

#1
Posted 02/09/2009 03:53 PM   
Thank you very much!
For the QR decomposition, I wonder whether using Givens rotations, instead of Householder reflectors, would be more efficient for a GPU implementation.
Some people have been using Givens rotations to do QR decomposition on GPUs in the HPEC challenges '07 and '08.
But I have not found anyone who has measured which method is better on the GPU.
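For reference, a Givens rotation zeroes one matrix entry at a time, whereas a Householder reflector zeroes a whole column below the diagonal at once. A minimal sketch of constructing one rotation (plain Python, illustration only):

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b)
    gives (r, 0), i.e. the rotation annihilates the second component."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r
```

A Givens-based QR of a dense n-by-n matrix applies on the order of n^2/2 such rotations; rotations acting on disjoint row pairs are independent, which is what makes the approach attractive for fine-grained parallel hardware.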

#2
Posted 02/09/2009 04:36 PM   
In fact, I am working on a Givens rotation version of QR decomposition.
Maybe we can compare whose solution is faster : )

#3
Posted 02/09/2009 04:40 PM   
[quote name='zhenyu' post='503320' date='Feb 9 2009, 08:36 AM']Thank you very much!
For the QR decomposition, I wonder whether using Givens rotation, instead of Householder reflector, would be more efficient for GPU implementation.
Some people have been using Givens rotation to do QR decomposition on GPUs in the HPEC challenges 07 and 08.
But I did not find anyone have ever measure which method is better on GPU.[/quote]

I use the block Householder update, as done in LAPACK. It is BLAS3, so it runs about as fast as GEMM does. I wonder if you can do better.
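For readers following along, here is an unblocked Householder QR in plain Python, for illustration only; the blocked LAPACK version aggregates several reflectors into one matrix-matrix (BLAS3) update, which is what lets it run at GEMM speed:

```python
import math

def householder_qr(A):
    """Unblocked Householder QR of a square matrix A (list of lists).

    Returns (Q, R) with Q orthogonal and R upper triangular. This applies
    the reflectors one at a time; the blocked version would batch them."""
    n = len(A)
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulates H_k...H_0
    for k in range(n - 1):
        # Build the reflector v that zeroes column k below the diagonal.
        x = [R[i][k] for i in range(k, n)]
        alpha = -math.copysign(math.hypot(*x), x[0])
        v = x[:]
        v[0] -= alpha
        vnorm = math.hypot(*v)
        if vnorm == 0.0:
            continue  # column is already zero below the diagonal
        v = [vi / vnorm for vi in v]
        # Apply H = I - 2 v v^T to rows k..n-1 of R and of the accumulator.
        for M in (R, Q):
            for j in range(n):
                dot = sum(v[i - k] * M[i][j] for i in range(k, n))
                for i in range(k, n):
                    M[i][j] -= 2.0 * v[i - k] * dot
    # The accumulator holds Q^T; transpose to return Q with A = Q R.
    Q = [[Q[j][i] for j in range(n)] for i in range(n)]
    return Q, R
```

In the blocked variant, a panel of such reflectors is collected into a compact representation and the trailing matrix is updated with two matrix multiplies, which is the BLAS3 part.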

Vasily

#4
Posted 02/09/2009 04:58 PM   
Many thanks! Could you also make and post a double-precision version?

#5
Posted 02/11/2009 03:23 PM   
With routines such as these we are ever so close to having a functional "sgetrs", which calls the existing "strsm" and the simple, but not yet existing, "slaswp". The combination of sgetrf and sgetrs solves the equation Ax=b for x, i.e., x=A\b, which is something of a holy grail at the moment.
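A sketch of what that combination amounts to, in plain Python for illustration (LU and piv as sgetrf returns them: L and U packed in one matrix, L with a unit diagonal):

```python
def lu_solve(LU, piv, b):
    """Solve A x = b given the in-place LU factors and pivots from GETRF.

    Mirrors what sgetrs does: apply the row interchanges (the slaswp step),
    then a unit-lower and an upper triangular solve (the strsm steps).
    This is an illustrative scalar version, not the vendor routine."""
    n = len(LU)
    x = b[:]
    for k in range(n):                 # slaswp: apply pivots in order
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):                 # forward solve with unit-lower L
        x[i] -= sum(LU[i][j] * x[j] for j in range(i))
    for i in reversed(range(n)):       # back solve with upper U
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x
```

The triangular solves are exactly what strsm provides in BLAS3 form (for many right-hand sides at once); only the pivot application, slaswp, is missing from the released pieces.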

I have hardware one step below the Q9550/gtx 280: a Q6600 quadcore cpu and a gtx 260. I get the following:

[codebox]
...glapack> ./benchmark

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                  QR
        ---------------     ---------------     ---------------
    N   Gflop/s   error     Gflop/s   error     Gflop/s   error
-----   ---------------     ---------------     ---------------
 1000     14.83    0.80       42.96   34.48       54.31    8.78
 2000    101.17    1.07       97.62   60.93      123.00   12.67
 3000    140.38    1.21      130.77   80.04      150.68   13.79
 4000    111.16    0.94      101.29  106.74      168.95   16.81
 5000    174.11    1.53      154.04  124.38      188.27   17.73
 6000    172.13    1.43      173.10  146.37      196.90   20.60
 7000    180.64    1.68      173.76  159.69      202.71   21.18
 8000    190.27    1.61      180.69  193.38      207.50   22.29
 9000    194.35    1.50      187.24  206.19      212.15   25.96
10000    198.41    1.67      192.23  225.67      215.90   27.75
11000    199.69    1.78      194.05  238.32      220.92   26.88
[/codebox]

I am somewhat stunned that the 260 is only about 2/3 as fast as the 280 for this benchmark. Perhaps it is the CPU/GPU combination that is conspiring to be slower? I have 8 GB of slowish RAM in my system, preferring lots of RAM over fast RAM. Perhaps the code has some special tuning for the 280?

[codebox]
... glapack> ./benchmark -cpu

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                  QR
        ---------------     ---------------     ---------------
    N   Gflop/s   error     Gflop/s   error     Gflop/s   error
-----   ---------------     ---------------     ---------------
 1000     12.95    0.87       32.01   24.60       39.90    6.47
 2000     32.06    0.97       36.39   53.54       51.76    6.71
 3000     38.37    0.90       44.59   81.45       47.21    9.00
 4000     48.96    0.85       45.72   98.10       49.07    7.62
 5000     47.45    1.11       42.56  125.48       50.32   11.28
 6000     46.80    1.21       42.53  155.80       51.31   10.47
 7000     46.76    1.17       51.04  166.25       51.59   13.42
 8000     40.01    1.19       52.32  197.28       52.47   14.37
 9000     48.41    1.18       43.29  223.64       52.66   13.83
10000     48.89    1.21       53.09  244.25       42.80   16.26
11000     51.22    1.18       43.80  265.48       52.91   16.33
12000     50.13    1.23       43.68  300.44       43.16   17.32
13000     40.73    1.20       43.53  300.52       43.38   19.32
14000     40.94    1.22       44.06  335.17       43.21   19.20
15000     41.40    1.32       43.36  346.97       42.79   18.02
[/codebox]

I've toyed with upgrading to a Q9550 but I am not sure it is worth the $300 it would take... I paid $400 for my GTX 260 last June, which brings tears to my eyes now...

#6
Posted 02/15/2009 01:29 PM   
As far as I can see, the GTX 260 has 3/4 the peak arithmetic throughput (= number of cores × clock rate) of the GTX 280, and the Q6600 has 94% of the arithmetic throughput of the Q9550. So indeed, you lose ~10% somewhere.

Could you tell more about your system? Is it PCIe 2.0 x16? Do you use 64-bit operating system?

#7
Posted 02/15/2009 01:50 PM   
vvolkov, could you also post how much time each run takes? I am mainly interested in results for the 8800, but any will be fine :). I am trying to implement a GPU-only QR and it would be nice to have something to compare against.

#8
Posted 02/15/2009 08:24 PM   
[quote name='frea' post='506197' date='Feb 15 2009, 12:24 PM']vvolkov could you also post how much time does every run take, i am interested mainly in results for 8800, but any will be fine :). I am trying to implement a gpu only QR and it would be nice to have something to compare against.[/quote]
Here are the time results for QR on 8800GTX:
[code]n 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000
seconds 0.0194 0.0918 0.256 0.566 1.05 1.74 2.71 3.94 5.56 7.55 9.92 12.8 16.3[/code]I used the formula: Gflop/s rate = 4e-9*n*n*n/3/seconds.
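The conversion, spelled out as a small sanity check against one entry of the table above:

```python
def qr_gflops(n, seconds):
    """QR of an n x n matrix costs about (4/3) n^3 flops; return Gflop/s."""
    return 4e-9 * n ** 3 / 3 / seconds

# e.g. the n = 13000 entry: 16.3 s corresponds to ~179.7 Gflop/s
rate = qr_gflops(13000, 16.3)
```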

#9
Posted 02/15/2009 09:10 PM   
[quote name='vvolkov' post='506014' date='Feb 15 2009, 05:50 AM']As far as I see, GTX260 has 3/4 peak arithmetic throughput (=number of cores*clock rate) of GTX280, and Q6600 has 94% arithmetic throughput of Q9550. So indeed, you lose ~10% somewhere.

Could you tell more about your system? Is it PCIe 2.0 x16? Do you use 64-bit operating system?[/quote]

I am using a stock Suse Linux 10.3, 64-bit version. I have a Gigabyte GA-P35-DS3R motherboard which has one PCIe X16 slot. I mentioned my RAM is slow - I think it is 8 GB of DDR2 800. I run the cpu at normal speed. I ran this benchmark using the latest 180.29 version of the nvidia driver, and I ran the benchmark with X turned off with "init 3". I think that may be all the relevant information...

Thanks for the factorizations!

#10
Posted 02/15/2009 11:03 PM   
[quote name='Boxed Cylon' post='506254' date='Feb 15 2009, 03:03 PM']I am using a stock Suse Linux 10.3, 64-bit version. I have a Gigabyte GA-P35-DS3R motherboard which has one PCIe X16 slot. I mentioned my RAM is slow - I think it is 8 GB of DDR2 800.[/quote]
I guess this is PCIe 1.1, which is 2x slower than the newer PCIe 2.0. This can be checked using bandwidthTest in the CUDA SDK. If it shows only up to ~3 GB/s in pinned mode, then it is PCIe 1.1.
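That rule of thumb lines up with the per-lane numbers (assuming the usual effective rates after 8b/10b encoding: 250 MB/s per lane for PCIe 1.x, 500 MB/s per lane for 2.0):

```python
# Effective per-lane bandwidth in MB/s, after 8b/10b encoding overhead.
PER_LANE_MB_S = {"1.1": 250, "2.0": 500}

def pcie_peak_mb_s(gen, lanes=16):
    """Theoretical peak for a PCIe link; pinned-memory copies in practice
    reach roughly 60-80% of this figure."""
    return PER_LANE_MB_S[gen] * lanes
```

So an x16 link peaks at ~4 GB/s on PCIe 1.1 and ~8 GB/s on 2.0, which is why ~3 GB/s pinned is the PCIe 1.1 signature.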

I wonder why you get only up to 53 Gflop/s on CPU, which is ~70% of peak. I get up to 85% of peak with Intel MKL 10.1 on my system. I don't know if it is due to the processor or the library. Can't tell much about DDR2 speed either. I guess that chipset also matters.

Anyway, thanks for reporting the performance!

#11
Posted 02/15/2009 11:27 PM   
[quote name='vvolkov' post='506261' date='Feb 15 2009, 03:27 PM']I guess this is PCIe 1.1 which is 2x slower than the newer PCIe 2.0. This can be checked using bandwidthTest in CUDA SDK. If it shows only up to ~3 GB/s in pinned mode then it is PCI 1.1.[/quote]

You are correct - some research shows that the P35 chipset is PCIe 1.1. The bandwidthTest results below support that notion. It looks like I am under-bandwidthing my GTX 260... I see a hardware upgrade in my future...

[codebox]
./bandwidthTest --memory=pinned
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2488.1

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1821.5

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94576.6

&&&& Test PASSED
[/codebox]

#12
Posted 02/16/2009 12:04 AM   
It so happens that I just today reconfigured my small cluster and can test out my GTX 260 using a Phenom II 940 and a 790X motherboard. The PCIe on this motherboard is indeed version 2.0. Here are the numbers:

[codebox]
./bandwidthTest --memory=pinned
Running on......
device 0:GT200
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2657.6

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3216.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94604.6

&&&& Test PASSED
[/codebox]


[codebox]
> ./benchmark

Device: GT200, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                  QR
        ---------------     ---------------     ---------------
    N   Gflop/s   error     Gflop/s   error     Gflop/s   error
-----   ---------------     ---------------     ---------------
 1000     10.89    1.19       51.91   33.91       66.81    7.97
 2000     93.23    1.30      103.11   61.33      133.75   11.92
 3000     99.65    1.37      147.28   93.44      163.54   15.11
 4000    144.83    1.37      146.59  110.46      191.76   16.54
 5000    180.16    1.73      185.05  122.42      209.02   19.93
 6000    196.58    1.73      198.55  148.36      222.11   21.75
 7000    204.16    1.73      206.30  164.98      228.16   22.82
 8000    215.18    1.84      214.76  187.30      236.05   24.78
 9000    218.61    1.82      219.08  210.76      240.26   26.26
10000    222.41    1.95      223.29  225.35      243.33   24.91
11000    228.40    1.96      227.84  265.51      247.76   28.26
[/codebox]

[codebox]
./benchmark -cpu

Device: GT200, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                  QR
        ---------------     ---------------     ---------------
    N   Gflop/s   error     Gflop/s   error     Gflop/s   error
-----   ---------------     ---------------     ---------------
 1000     11.24    0.96       44.42   24.25       48.78    5.49
 2000     29.42    1.02       38.59   53.91       47.82    7.53
 3000     42.78    1.15       49.68   84.66       55.93    8.73
 4000     54.96    1.13       54.84  101.21       61.57   10.20
 5000     59.25    1.18       60.71  119.79       65.25   11.36
 6000     60.77    1.28       63.37  138.30       66.97   12.60
 7000     61.40    1.36       64.24  167.32       66.96   13.04
 8000     62.72    1.26       66.55  190.67       67.95   14.13
 9000     63.74    1.29       66.59  219.02       68.66   15.21
10000     64.27    1.29       67.57  241.27       69.10   15.91
11000     63.43    1.34       69.39  258.35       69.52   16.80
[/codebox]

This seems to place the 260 more in the expected place with respect to the 280.

The reviews rather beat up on the Phenoms, but for pure number crunching they seemed to have the edge over comparable Intel offerings. I can't speak to the more recent Intel offerings, but the Q6600 (2.4 GHz) was something of a lightweight when I asked all four cores to compute at once. The Phenom 9600 (2.3 GHz) scaled far better.

#13
Posted 02/17/2009 01:32 PM   
[quote name='Boxed Cylon' post='506924' date='Feb 17 2009, 05:32 AM']It so happens that I just today reconfigured my small cluster and can test out my gtx 260 using a Phenom II 940 and a 790X motherboard. The PCIE on this motherboard is indeed version 2.0. Here are the numbers:[/quote]
I wonder why your PCIe 2.0 is so slow. Here are my numbers for comparison:
[codebox]bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5582.3

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5426.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 114908.7

&&&& Test PASSED

Press ENTER to exit...
[/codebox]
Here are my numbers on PCIe 1.1 system:
[codebox]bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3054.5

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3192.1

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 114682.8

&&&& Test PASSED

Press ENTER to exit...
[/codebox]
You can see that your PCIe 2.0 runs about as fast as my PCIe 1.1 and much slower than my PCIe 2.0. I use Alienware desktops with nForce 790i Ultra SLI and nForce 680i SLI chipsets.

I have also noticed that your device is recognized as GT200. That happened to me when I was using the now-ancient 177.11 drivers. I don't think this is a performance issue, but I'd double-check.

#14
Posted 02/17/2009 02:34 PM   
The short answer as to why I get sub-standard bandwidth is: I don't know. I've tried the 180.22 and 180.29 drivers with the same result - Linux does not have 181.20 yet, as far as I know. Both drivers report the generic "GT200". I've checked the BIOS settings and found nothing. And I know the card is in the 16x slot rather than the 8x slot. I have to suspect the Linux drivers are lagging to some extent. If I sort out the issue, I'll post again.

#15
Posted 02/17/2009 04:00 PM   