LU, QR and Cholesky factorizations using GPU

I’d like to share an implementation of LAPACK’s routines SGETRF, SPOTRF, and SGEQRF that is accelerated using the GPU. This implementation is limited to factorization of square matrices that reside in host memory (i.e. on the CPU side). The following figure shows the sustained performance on this platform: Intel Core2 Quad 2.83 GHz (Q9550), PCIe 2.0 x16, Intel MKL 10.1, Windows XP 64-bit, NVIDIA driver 181.20, CUDA 2.1:

[Figure: sustained performance of the three factorizations on the platform above; image not preserved]

The implementation follows the description given in the following paper; however, some of the finer tuning described there, such as recursive and variable blocking, is not included in this release:

[indent]Volkov, V., and Demmel, J. W. 2008. Benchmarking GPUs to tune dense linear algebra, SC08.[/indent]

Regards,

Vasily

05/02/09 edit: updated dead URL to the paper.

Thank you very much!
For the QR decomposition, I wonder whether using Givens rotations instead of Householder reflectors would be more efficient for a GPU implementation.
Some people have been using Givens rotations to do QR decomposition on GPUs in the HPEC challenges 07 and 08.
But I have not found anyone who has measured which method is better on the GPU.

In fact, I am working on a Givens rotation version of QR decomposition.
Maybe we can compare whose solution is faster : )

I use block Householder update as done in LAPACK. It is BLAS3, so runs as fast as GEMM does. I wonder if you can do better.
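For readers wondering why a blocked Householder update counts as BLAS3: the reflectors of a panel can be accumulated into the compact WY form, H_1 H_2 ... H_k = I - V T V^T, so that the rest of the matrix is updated with matrix-matrix products only. The following is a minimal pure-Python sketch of that idea, not the posted GPU code. The names `matmul`, `householder`, `panel_wy`, and `apply_qt` are illustrative, and the plain-list `matmul` merely stands in for a GEMM call.

```python
import math

def matmul(A, B):
    """Plain matrix product on lists of lists; stands in for a BLAS3 GEMM call."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def householder(x):
    """Return (v, tau) with v[0] == 1 such that (I - tau*v*v^T) x = (beta, 0, ..., 0)."""
    alpha, sigma = x[0], sum(t * t for t in x[1:])
    if sigma == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0
    beta = -math.copysign(math.sqrt(alpha * alpha + sigma), alpha)
    return [1.0] + [t / (alpha - beta) for t in x[1:]], (beta - alpha) / beta

def panel_wy(A, k):
    """Factor the first k columns of A (m x n, m >= k) without modifying A.
    Returns V (m x k) and upper triangular T (k x k) with H_1...H_k = I - V T V^T."""
    m = len(A)
    P = [row[:k] for row in A]              # working copy of the panel
    V = [[0.0] * k for _ in range(m)]
    T = [[0.0] * k for _ in range(k)]
    for j in range(k):
        v, tau = householder([P[i][j] for i in range(j, m)])
        for c in range(j, k):               # apply H_j inside the panel (BLAS2-style work)
            dot = sum(v[i - j] * P[i][c] for i in range(j, m))
            for i in range(j, m):
                P[i][c] -= tau * v[i - j] * dot
        for i in range(j, m):
            V[i][j] = v[i - j]
        # New column of T: T[:j, j] = -tau * T[:j, :j] * (V[:, :j]^T v_j)
        w = [sum(V[i][l] * V[i][j] for i in range(m)) for l in range(j)]
        for i in range(j):
            T[i][j] = -tau * sum(T[i][l] * w[l] for l in range(i, j))
        T[j][j] = tau
    return V, T

def apply_qt(V, T, C):
    """C - V (T^T (V^T C)): applies (I - V T V^T)^T using three matrix products."""
    W = matmul(transpose(T), matmul(transpose(V), C))
    VW = matmul(V, W)
    return [[c - d for c, d in zip(rc, rd)] for rc, rd in zip(C, VW)]

A = [[4.0, 1.0, 2.0], [3.0, 2.0, 1.0], [0.0, 5.0, 3.0], [1.0, 1.0, 7.0]]
V, T = panel_wy(A, 2)
R = apply_qt(V, T, A)   # first two columns are upper triangular up to rounding
```

Only the small panel is factored column by column; everything in `apply_qt` is a matrix-matrix product, which is the part that can run at GEMM-like speed.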

Vasily

Many thanks! Could you make and post the same for double precision too?

With routines such as these we are ever so close to having functional “sgetrs” which calls on the existing “strsm” and the simple, but not yet existing “slaswp”. The combination sgetrf and sgetrs solves the equation Ax=b for x, i.e., x=A\b. This being a holy grail at the moment.
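To make the sgetrf + sgetrs combination concrete, here is a minimal pure-Python sketch for a single right-hand side: factor with partial pivoting, apply the recorded row interchanges (the slaswp step), then do a unit-lower and an upper triangular solve (the strsm steps). The function names only mirror the LAPACK/BLAS routines for readability; this is not the library code, pivot indices are 0-based here, and there is no check for a singular matrix.

```python
def getrf(A):
    """LU with partial pivoting on a copy of A.
    Returns (LU, ipiv): L (unit diagonal) and U packed together; row k was swapped with ipiv[k]."""
    n = len(A)
    LU = [row[:] for row in A]
    ipiv = []
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))   # pivot row
        ipiv.append(p)
        LU[k], LU[p] = LU[p], LU[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                            # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, ipiv

def laswp(b, ipiv):
    """Apply the row interchanges recorded by getrf to a right-hand side."""
    b = b[:]
    for k, p in enumerate(ipiv):
        b[k], b[p] = b[p], b[k]
    return b

def solve_lower_unit(LU, b):
    """Forward substitution with the unit lower triangle of LU."""
    y = b[:]
    for i in range(len(y)):
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    return y

def solve_upper(LU, y):
    """Back substitution with the upper triangle of LU."""
    n = len(y)
    x = y[:]
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

def getrs(LU, ipiv, b):
    """Solve A x = b from the factorization: x = U^-1 L^-1 P b."""
    return solve_upper(LU, solve_lower_unit(LU, laswp(b, ipiv)))

A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [7.0, 19.0, 49.0]            # equals A times (1, 2, 3)
LU, ipiv = getrf(A)
x = getrs(LU, ipiv, b)           # close to [1.0, 2.0, 3.0]
```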

I have hardware one step below the Q9550/gtx 280: a Q6600 quadcore cpu and a gtx 260. I get the following:

[codebox]

…glapack> ./benchmark
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.
Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky            LU               QR
        --------------   --------------   --------------
    N   Gflop/s  error   Gflop/s  error   Gflop/s  error
 1000     14.83   0.80     42.96  34.48     54.31   8.78
 2000    101.17   1.07     97.62  60.93    123.00  12.67
 3000    140.38   1.21    130.77  80.04    150.68  13.79
 4000    111.16   0.94    101.29 106.74    168.95  16.81
 5000    174.11   1.53    154.04 124.38    188.27  17.73
 6000    172.13   1.43    173.10 146.37    196.90  20.60
 7000    180.64   1.68    173.76 159.69    202.71  21.18
 8000    190.27   1.61    180.69 193.38    207.50  22.29
 9000    194.35   1.50    187.24 206.19    212.15  25.96
10000    198.41   1.67    192.23 225.67    215.90  27.75
11000    199.69   1.78    194.05 238.32    220.92  26.88

[/codebox]

I am somewhat stunned that the 260 is only about 2/3 as fast as the 280 for this benchmark. Perhaps it is the cpu/gpu combination that is conspiring to be slower? I have 8 GB of slowish ram in my system, preferring lots of ram over fast ram. Perhaps the code has some special tuning for the 280?

[codebox]

… glapack> ./benchmark -cpu
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.
Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky            LU               QR
        --------------   --------------   --------------
    N   Gflop/s  error   Gflop/s  error   Gflop/s  error
 1000     12.95   0.87     32.01  24.60     39.90   6.47
 2000     32.06   0.97     36.39  53.54     51.76   6.71
 3000     38.37   0.90     44.59  81.45     47.21   9.00
 4000     48.96   0.85     45.72  98.10     49.07   7.62
 5000     47.45   1.11     42.56 125.48     50.32  11.28
 6000     46.80   1.21     42.53 155.80     51.31  10.47
 7000     46.76   1.17     51.04 166.25     51.59  13.42
 8000     40.01   1.19     52.32 197.28     52.47  14.37
 9000     48.41   1.18     43.29 223.64     52.66  13.83
10000     48.89   1.21     53.09 244.25     42.80  16.26
11000     51.22   1.18     43.80 265.48     52.91  16.33
12000     50.13   1.23     43.68 300.44     43.16  17.32
13000     40.73   1.20     43.53 300.52     43.38  19.32
14000     40.94   1.22     44.06 335.17     43.21  19.20
15000     41.40   1.32     43.36 346.97     42.79  18.02

[/codebox]

I’ve toyed with upgrading to a Q9550 but I am not sure it is worth the $300 it would take… I paid $400 for my gtx 260 last June which brings tears to my eyes now…

As far as I can see, the GTX260 has 3/4 of the peak arithmetic throughput (= number of cores * clock rate) of the GTX280, and the Q6600 has 94% of the arithmetic throughput of the Q9550. So indeed, you lose ~10% somewhere.

Could you tell more about your system? Is it PCIe 2.0 x16? Do you use 64-bit operating system?

vvolkov, could you also post how much time each run takes? I am mainly interested in results for the 8800, but any will be fine :). I am trying to implement a GPU-only QR and it would be nice to have something to compare against.

Here are the time results for QR on 8800GTX:

n        1000    2000    3000   4000   5000  6000  7000  8000  9000  10000  11000  12000  13000
seconds  0.0194  0.0918  0.256  0.566  1.05  1.74  2.71  3.94  5.56  7.55   9.92   12.8   16.3

I used formula: Gflop/s rate = 4e-9*n*n*n/3/seconds.
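As a worked example of that formula (4n³/3 flops for the QR factorization of an n-by-n matrix, divided by the run time and scaled to Gflop/s), the rates can be recomputed from the posted 8800GTX timings. `qr_gflops` is a hypothetical helper name used only for this sketch.

```python
def qr_gflops(n, seconds):
    """Gflop/s for QR of an n x n matrix, counting 4*n^3/3 flops."""
    return 4e-9 * n * n * n / 3 / seconds

# Timings posted above for QR on the 8800GTX (seconds per factorization).
timings = {1000: 0.0194, 2000: 0.0918, 3000: 0.256, 4000: 0.566, 5000: 1.05,
           6000: 1.74, 7000: 2.71, 8000: 3.94, 9000: 5.56, 10000: 7.55,
           11000: 9.92, 12000: 12.8, 13000: 16.3}

rates = {n: qr_gflops(n, s) for n, s in timings.items()}
for n, r in rates.items():
    print(f"{n:6d}  {r:7.1f} Gflop/s")
```

With these timings the rate climbs from roughly 69 Gflop/s at n = 1000 to roughly 180 Gflop/s at n = 13000.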

I am using a stock Suse Linux 10.3, 64-bit version. I have a Gigabyte GA-P35-DS3R motherboard which has one PCIe X16 slot. I mentioned my RAM is slow - I think it is 8 GB of DDR2 800. I run the cpu at normal speed. I ran this benchmark using the latest 180.29 version of the nvidia driver, and I ran the benchmark with X turned off with “init 3”. I think that may be all the relevant information…

Thanks for the factorizations!

I guess this is PCIe 1.1, which is 2x slower than the newer PCIe 2.0. This can be checked using bandwidthTest in the CUDA SDK. If it shows only up to ~3 GB/s in pinned mode, then it is PCIe 1.1.
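As a rough illustration of that rule of thumb: a x16 link is nominally 4000 MB/s per direction for PCIe 1.1 and 8000 MB/s for PCIe 2.0, and measured pinned-memory transfers land below those figures. The sketch below classifies a bandwidthTest reading with a cutoff of 3500 MB/s, which is an assumption chosen for illustration, as is the helper name `likely_link`. It only tells how the link behaves, not what the slot is rated for.

```python
# Nominal per-direction x16 link rates in MB/s (250 and 500 MB/s per lane).
NOMINAL_X16_MB_PER_S = {"PCIe 1.1": 4000.0, "PCIe 2.0": 8000.0}

def likely_link(pinned_mb_per_s, cutoff_mb_per_s=3500.0):
    """Guess which link generation a pinned-memory bandwidth reading behaves like.
    The cutoff is a hypothetical value slightly above the ~3 GB/s quoted above."""
    return "PCIe 1.1" if pinned_mb_per_s <= cutoff_mb_per_s else "PCIe 2.0"

# Host-to-device readings reported in this thread (MB/s).
for reading in (2488.1, 3054.5, 5582.3):
    link = likely_link(reading)
    share = reading / NOMINAL_X16_MB_PER_S[link]
    print(f"{reading:7.1f} MB/s behaves like {link} ({share:.0%} of nominal)")
```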

I wonder why you get only up to 53 Gflop/s on CPU, which is ~70% of peak. I get up to 85% of peak with Intel MKL 10.1 on my system. I don’t know if it is due to the processor or the library. Can’t tell much about DDR2 speed either. I guess that chipset also matters.
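The "~70% of peak" figure can be reproduced with a back-of-the-envelope estimate. The sketch below assumes 8 single-precision flops per cycle per core for these Core 2 class CPUs (one 4-wide SSE add plus one 4-wide SSE multiply per cycle), which is an assumption not stated in the thread, and `peak_sp_gflops` is a hypothetical helper name.

```python
def peak_sp_gflops(cores, ghz, flops_per_cycle=8):
    """Peak single-precision Gflop/s = cores * clock (GHz) * flops per cycle per core.
    flops_per_cycle=8 is an assumed value for Core 2 class CPUs."""
    return cores * ghz * flops_per_cycle

peak_q6600 = peak_sp_gflops(4, 2.4)     # Q6600: 4 cores at 2.4 GHz
peak_q9550 = peak_sp_gflops(4, 2.83)    # Q9550: 4 cores at 2.83 GHz

fraction = 53.0 / peak_q6600            # best CPU rate reported above, about 53 Gflop/s
print(f"Q6600 peak {peak_q6600:.1f} Gflop/s, achieved {fraction:.0%} of peak")
print(f"85% of Q9550 peak would be {0.85 * peak_q9550:.1f} Gflop/s")
```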

Anyway, thanks for reporting the performance!

You are correct - some research shows that the P35 chipset is PCIe 1.1. The benchmarks for bandwidthTest below support that notion. It looks like I am underbandwidthing my GTX 260…I see a hardware upgrade in my future…

[codebox]

./bandwidthTest --memory=pinned

Running on…
  device 0:GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                2488.1

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1821.5

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                94576.6

&&&& Test PASSED

[/codebox]

It so happens that I just today reconfigured my small cluster and can test out my gtx 260 using a Phenom II 940 and a 790X motherboard. The PCIE on this motherboard is indeed version 2.0. Here are the numbers:

[codebox]

./bandwidthTest --memory=pinned

Running on…
  device 0:GT200

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                2657.6

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3216.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                94604.6

&&&& Test PASSED

[/codebox]

[codebox]

./benchmark
Device: GT200, 1296 MHz clock, 895 MB memory.
Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky            LU               QR
        --------------   --------------   --------------
    N   Gflop/s  error   Gflop/s  error   Gflop/s  error
 1000     10.89   1.19     51.91  33.91     66.81   7.97
 2000     93.23   1.30    103.11  61.33    133.75  11.92
 3000     99.65   1.37    147.28  93.44    163.54  15.11
 4000    144.83   1.37    146.59 110.46    191.76  16.54
 5000    180.16   1.73    185.05 122.42    209.02  19.93
 6000    196.58   1.73    198.55 148.36    222.11  21.75
 7000    204.16   1.73    206.30 164.98    228.16  22.82
 8000    215.18   1.84    214.76 187.30    236.05  24.78
 9000    218.61   1.82    219.08 210.76    240.26  26.26
10000    222.41   1.95    223.29 225.35    243.33  24.91
11000    228.40   1.96    227.84 265.51    247.76  28.26

[/codebox]

[codebox]

./benchmark -cpu
Device: GT200, 1296 MHz clock, 895 MB memory.
Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky            LU               QR
        --------------   --------------   --------------
    N   Gflop/s  error   Gflop/s  error   Gflop/s  error
 1000     11.24   0.96     44.42  24.25     48.78   5.49
 2000     29.42   1.02     38.59  53.91     47.82   7.53
 3000     42.78   1.15     49.68  84.66     55.93   8.73
 4000     54.96   1.13     54.84 101.21     61.57  10.20
 5000     59.25   1.18     60.71 119.79     65.25  11.36
 6000     60.77   1.28     63.37 138.30     66.97  12.60
 7000     61.40   1.36     64.24 167.32     66.96  13.04
 8000     62.72   1.26     66.55 190.67     67.95  14.13
 9000     63.74   1.29     66.59 219.02     68.66  15.21
10000     64.27   1.29     67.57 241.27     69.10  15.91
11000     63.43   1.34     69.39 258.35     69.52  16.80

[/codebox]

This seems to place the 260 more in the expected place with respect to the 280.

The reviews rather beat up on the Phenoms, but for pure number crunching they seemed to have the edge over comparable Intel offerings. I can’t speak to the more recent Intel offerings, but the Q6600 (2.4 GHz) was something of a lightweight when I asked all four cores to compute at once. The Phenom 9600 (2.3 GHz) scaled far better.

I wonder why your PCIe 2.0 is so slow. Here are my numbers for comparison:

[codebox]bandwidthTest.exe --memory=pinned

Running on…
  device 0:GeForce GTX 280

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                5582.3

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                5426.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                114908.7

&&&& Test PASSED

Press ENTER to exit…

[/codebox]

Here are my numbers on PCIe 1.1 system:

[codebox]bandwidthTest.exe --memory=pinned

Running on…
  device 0:GeForce GTX 280

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3054.5

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3192.1

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                114682.8

&&&& Test PASSED

Press ENTER to exit…

[/codebox]

You can see that your PCIe 2.0 runs about as fast as my PCIe 1.1 and much slower than my PCIe 2.0. I use Alienware desktops with nForce 790i Ultra SLI and nForce 680i SLI chipsets.

I have also noticed that your device is recognized as GT200. That happened to me when I was using the now ancient 177.11 drivers. I don’t think this is a performance issue, but I’d double check.

The short answer as to why I get sub-standard bandwidth is I don’t know. I’ve tried the 180.22 and 180.29 drivers with the same result - linux does not have 181.20 as yet that I know of. Both drivers report the generic “GT200”. I’ve checked the bios settings and found nothing. And I know the card is in the 16X slot rather than the 8X slot. I’d have to suspect the linux drivers are lagging to some extent. If I sort out the issue, I’ll post again.

Ah ha! It turns out that on this Gigabyte motherboard, if I hit Ctrl-F1 when in the BIOS I can get at some additional options for PCIe. They were all set to “disabled” and I set them to “auto” - exactly what the settings are, I could not say. But the effect is to boost the bandwidth up to the expected level:

[codebox]

./bandwidthTest --memory=pinned

Running on…
  device 0:GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                5280.0

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                5290.8

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                94576.6

[/codebox]

The new results for the glapack test are:

[codebox]

./benchmark
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.
Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky            LU               QR
        --------------   --------------   --------------
    N   Gflop/s  error   Gflop/s  error   Gflop/s  error
 1000      2.01   1.19     58.71  35.05     70.78   8.29
 2000    103.33   1.33    117.06  58.29    113.79  12.38
 3000    125.02   1.35    166.93  83.37    182.32  14.91
 4000    158.30   1.47    160.73 112.14    201.49  16.90
 5000    195.72   1.64    203.88 125.93    218.48  19.45
 6000    213.32   1.69    216.52 151.77    230.96  21.74
 7000    220.10   1.86    222.57 177.18    236.10  22.58
 8000    222.84   1.82    230.17 160.59    243.12  24.23
 9000    231.44   1.82    232.00 216.55    247.03  27.03
10000    236.08   1.90    236.43 222.46    249.57  28.09
11000    241.31   2.02    240.32 254.06    252.09  28.49

[/codebox]

It looks like doubling the bandwidth in this case boosted the benchmark numbers by 5% or so. I think my system is tuned up now.

Cool! Thanks for getting better performance numbers with my code! :-D

Hi, great work here!

Do you think there is a way to implement sparse Cholesky factorization in CUDA?

Cholmod is so efficient on the CPU that it makes me dream about having it ported on the GPU.

I think sparse codes need fast communication between thread blocks/multiprocessors, which is currently lacking.

I’m curious to hear what you’d consider fast in this case (or if you want, you could just email me).