Sparse Matrix-Vector Multiplication on CUDA
I'm pleased to announce the release of our tech report "Efficient Sparse Matrix-Vector Multiplication on CUDA". It can be challenging to implement sparse matrix operations efficiently, so I hope this report offers some guidance to those working on iterative solvers and other areas where sparse matrix-vector products arise. Feel free to follow up with comments/questions about the report.


[b]Update: A collection of test matrices is [url="http://www.nvidia.com/object/nvidia_research_pub_001.html"]now available[/url][/b]
[b]Update: Supercomputing '09 SpMV paper is [url="http://www.nvidia.com/object/nvidia_research_pub_013.html"]now available[/url][/b]
[b]Update: New and improved source code is [url="http://code.google.com/p/cusp-library/downloads/list"]now available[/url][/b] (Requires CUDA 2.2 or newer)
[color="#FF0000"][b]Update: Cusp sparse matrix library has been [url="http://code.google.com/p/cusp-library/"]released[/url][/b][/color] (Requires CUDA 3.0 or newer)

[b]Abstract:[/b]
[quote]The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many
high-performance computing applications. While dense linear algebra readily maps to such platforms,
harnessing this potential for sparse matrix computations presents additional challenges. Given its role
in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector
multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented
on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound
nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider
a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular
matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to
exploit several common forms of matrix structure while offering alternatives which accommodate greater
irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and
16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices,
we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision
respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on
conventional multicore processors. Our double precision SpMV performance is generally two and a half
times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel
Clovertown system.[/quote]
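To give a flavor of the kernels discussed in the report, here is a rough sketch of a one-thread-per-row ("scalar") CSR kernel; the array names are placeholders and this is not the exact code from the report:

[code]
// Sketch: scalar CSR SpMV, one thread per matrix row (y = A*x, single precision).
__global__ void spmv_csr_scalar(int num_rows,
                                const int   *row_ptr,  // CSR row offsets, length num_rows+1
                                const int   *col_idx,  // column index of each nonzero
                                const float *vals,     // value of each nonzero
                                const float *x,        // source vector
                                float       *y)        // destination vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
            dot += vals[jj] * x[col_idx[jj]];
        y[row] = dot;
    }
}
[/code]

The catch is that threads within a warp then walk through widely separated parts of vals and col_idx, so the loads are not coalesced; that bandwidth problem is what the other formats and kernels in the report (warp-per-row CSR, ELL, COO, HYB) are designed to address.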
Attachments

nvr_2008_004.pdf

#1
Posted 12/10/2008 04:57 PM   
[quote name='nbell' post='474999' date='Dec 10 2008, 05:57 PM']I'm pleased to announce the release of our tech report "Efficient Sparse Matrix-Vector Multiplication on CUDA". It can be challenging to implement sparse matrix operations efficiently, so I hope this report offers some guidance to those working on iterative solvers and other areas where sparse matrix-vector products arise. I don't have the source code available today, but we'll definitely release that in the near future. Feel free to follow up with comments/questions about the report.

[b]Abstract:[/b][/quote]
Very interesting.
Since I'm interested in it, can you give a rough estimate of when the code will be available (say, some days, a week, a month)? Thanks!

#2
Posted 12/11/2008 03:47 AM   
This is very useful. Thanks a lot guys.

Is it possible to point out where I can get the data for the test matrices?

Also, when reporting the final results, did you read x through the texture cache or without it?

#3
Posted 12/11/2008 10:35 AM   
[quote name='oscarb' post='475254' date='Dec 10 2008, 10:47 PM']Very interesting.
Since I'm interested in it, can you give a rough estimate of when the code will be available (say, some days, a week, a month)? Thanks![/quote]

I would estimate one or two weeks.

#4
Posted 12/11/2008 01:13 PM   
[quote name='maringanti' post='475353' date='Dec 11 2008, 05:35 AM']This is very useful. Thanks a lot guys.

Is it possible to point out where I can get the data for the test matrices?[/quote]

Some of the matrices are online here:
[url="http://www.cis.ufl.edu/research/sparse/matrices/Boeing/pwtk.html"]http://www.cis.ufl.edu/research/sparse/mat...oeing/pwtk.html[/url] (Wind Tunnel)
[url="http://www.cis.ufl.edu/research/sparse/matrices/Mittelmann/rail4284.html"]http://www.cis.ufl.edu/research/sparse/mat...n/rail4284.html[/url] (LP)

The full set of matrices is over 200MB in compressed format. I'll find a way to put them online.

[quote]Also, when reporting the final results, did you read x through the texture cache or without it?[/quote]

Yes, we do use the texture cache in the comparison to the multicore systems.
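Concretely, "with texture cache" means binding the x array to a 1D texture and gathering through tex1Dfetch. A minimal sketch (legacy texture-reference API of that era, placeholder names):

[code]
// Sketch: reading x through the texture cache in the scalar CSR kernel.
texture<float, 1, cudaReadModeElementType> x_tex;

__global__ void spmv_csr_scalar_tex(int num_rows, const int *row_ptr,
                                    const int *col_idx, const float *vals,
                                    float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
            dot += vals[jj] * tex1Dfetch(x_tex, col_idx[jj]);  // cached gather of x
        y[row] = dot;
    }
}

// Host side, before launching:
//   cudaBindTexture(NULL, x_tex, d_x, num_cols * sizeof(float));
[/code]

For double precision, textures of double aren't supported, so the usual trick is to fetch an int2 and reassemble it with __hiloint2double.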

#5
Posted 12/11/2008 01:23 PM   
[quote name='nbell' post='475399' date='Dec 11 2008, 06:53 PM']Some of the matrices are online here:
[url="http://www.cis.ufl.edu/research/sparse/matrices/Boeing/pwtk.html"]http://www.cis.ufl.edu/research/sparse/mat...oeing/pwtk.html[/url] (Wind Tunnel)
[url="http://www.cis.ufl.edu/research/sparse/matrices/Mittelmann/rail4284.html"]http://www.cis.ufl.edu/research/sparse/mat...n/rail4284.html[/url] (LP)[/quote]
Thanks
[quote]The full set of matrices is over 200MB in compressed format. I'll find a way to put them online.[/quote]
If these matrices are already available on the UF Sparse Matrix page or on the Matrix Market page, you can just point them out.

#6
Posted 12/11/2008 02:14 PM   
[quote name='maringanti' post='475417' date='Dec 11 2008, 09:14 AM']Thanks

If these matrices are already available on the UF Sparse Matrix page or on the Matrix Market page, you can just point them out.[/quote]

All the matrices are available here:
[url="http://bebop.cs.berkeley.edu/matrices/multicore-subset/"]http://bebop.cs.berkeley.edu/matrices/multicore-subset/[/url]

Those files use the Harwell-Boeing format [1]. You can use the BeBOP Sparse Matrix Converter [2] to convert them to another format. Our code (which I will post soon!) uses the MatrixMarket file format [3], so I'll put MM versions of the files online when the code has been released.

[1] [url="http://math.nist.gov/MatrixMarket/formats.html#hb"]http://math.nist.gov/MatrixMarket/formats.html#hb[/url]
[2] [url="http://bebop.cs.berkeley.edu/smc/"]http://bebop.cs.berkeley.edu/smc/[/url]
[3] [url="http://math.nist.gov/MatrixMarket/formats.html#MMformat"]http://math.nist.gov/MatrixMarket/formats.html#MMformat[/url]
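If anyone wants to roll their own reader in the meantime, here is a rough sketch for the common "matrix coordinate real general" case (no error handling and no support for symmetric or pattern files, so treat it as a starting point only):

[code]
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Minimal reader for "%%MatrixMarket matrix coordinate real general" files.
// Entries are 1-based in the file; indices are converted to 0-based here.
bool read_mm_coo(const char *filename, int &rows, int &cols,
                 std::vector<int> &I, std::vector<int> &J, std::vector<double> &V)
{
    std::ifstream in(filename);
    if (!in) return false;

    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '%') continue;  // skip banner and comment lines
        break;                                         // first other line is "rows cols nnz"
    }

    int nnz = 0;
    std::istringstream header(line);
    header >> rows >> cols >> nnz;

    I.resize(nnz); J.resize(nnz); V.resize(nnz);
    for (int k = 0; k < nnz; ++k) {
        int i, j; double v;
        in >> i >> j >> v;
        I[k] = i - 1; J[k] = j - 1; V[k] = v;
    }
    return static_cast<bool>(in);
}
[/code]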

#7
Posted 12/12/2008 03:21 PM   
nbell, terrific work! Will this be made available in future versions of the CUDA SDK or in cudpp?

How does your approach compare to block-CSR as suggested by Buatois et al., [url="http://alice.loria.fr/publications/papers/2008/CNC/Buatois_et_al_CNC.pdf"]http://alice.loria.fr/publications/papers/...s_et_al_CNC.pdf[/url] ? I haven't found any comparison in your otherwise brilliant paper.

dom

#8
Posted 12/13/2008 01:03 AM   
@nbell, thanks for the links.

#9
Posted 12/13/2008 09:43 AM   
[quote name='Dominik Göddeke' post='476091' date='Dec 12 2008, 08:03 PM']nbell, terrific work! Will this be made available in future versions of the CUDA SDK or in cudpp?

How does your approach compare to block-CSR as suggested by Buatois et al., [url="http://alice.loria.fr/publications/papers/2008/CNC/Buatois_et_al_CNC.pdf"]http://alice.loria.fr/publications/papers/...s_et_al_CNC.pdf[/url] ? I haven't found any comparison in your otherwise brilliant paper.

dom[/quote]

Hi Dominik,

A (possibly different) version of this code should appear in the CUDA 2.2 SDK. However, we'll release the complete source code in the next week or so here.

Unfortunately, I don't have any direct comparisons to the CNC work. Blocking is often a very useful optimization as it improves locality and reuse of the "source" vector (i.e. x in y = A*x). This is nearly always a win when the matrix has a natural blocksize, such as a vector-valued problem with regular KxK subblocks. The CNC paper uses blocking even though the underlying problem is scalar-valued. In their application (mesh-based matrices) this proves to be advantageous because the subblocks have a reasonable number of nonzeros due to the ordering of rows and columns. However, there are many problems that do not benefit from blocking, so formats like block-CSR aren't always beneficial.

We'd definitely like to explore this area though. I think it's likely that different strategies are needed to handle different block sizes optimally. For example, the best method for 2x2 blocks will probably be different than the best method for 6x6 blocks, since you run into a different set of constraints.
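To make the trade-off concrete, here is what y = A*x looks like for a block-CSR layout with fixed KxK blocks, written as a plain host-side reference loop rather than a kernel (placeholder names, not code from the report):

[code]
// Host reference: block-CSR (BSR) SpMV with fixed KxK blocks.
//   block_ptr : block-row offsets, length num_block_rows + 1
//   block_col : block-column index of each KxK block
//   block_val : block values, stored block by block in row-major order
template <int K>
void spmv_bsr_reference(int num_block_rows,
                        const int *block_ptr, const int *block_col,
                        const float *block_val, const float *x, float *y)
{
    for (int bi = 0; bi < num_block_rows; ++bi) {
        float acc[K] = {0.0f};                        // K output rows of this block row
        for (int bb = block_ptr[bi]; bb < block_ptr[bi + 1]; ++bb) {
            const float *blk = block_val + (size_t)bb * K * K;
            const float *xs  = x + block_col[bb] * K; // K entries of x reused K times
            for (int r = 0; r < K; ++r)
                for (int c = 0; c < K; ++c)
                    acc[r] += blk[r * K + c] * xs[c];
        }
        for (int r = 0; r < K; ++r)
            y[bi * K + r] = acc[r];
    }
}
[/code]

The appeal is the reuse of x within each block and the smaller index storage (one column index per block); the cost is the explicitly stored zeros inside partially filled blocks, which is exactly where the benefit can evaporate for matrices without a natural block structure.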

#10
Posted 12/15/2008 04:42 PM   
Hi nbell,

I agree that blocking is not always beneficial (or rather, that the optimal blocking technique is application-dependent). I've toyed around a bit with renumbering strategies like Cuthill-McKee and the like, and for our FEM-CFD matrices this helps create a reasonable number of nonzero values per block.
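For the curious, a (reverse) Cuthill-McKee pass over a symmetric CSR pattern is roughly the following; this is a simplified sketch, not the exact code we use:

[code]
#include <algorithm>
#include <queue>
#include <vector>

// Simplified reverse Cuthill-McKee ordering for a symmetric CSR pattern.
// Returns perm such that perm[k] is the original index of the k-th row/column
// in the new numbering.
std::vector<int> reverse_cuthill_mckee(int n, const std::vector<int> &row_ptr,
                                       const std::vector<int> &col_idx)
{
    std::vector<int> degree(n), perm;
    std::vector<bool> visited(n, false);
    for (int i = 0; i < n; ++i) degree[i] = row_ptr[i + 1] - row_ptr[i];

    for (int pass = 0; pass < n; ++pass) {             // handles disconnected components
        if (visited[pass]) continue;
        int root = pass;                                // pick an unvisited node of minimum degree
        for (int i = 0; i < n; ++i)
            if (!visited[i] && degree[i] < degree[root]) root = i;

        std::queue<int> q;
        q.push(root); visited[root] = true;
        while (!q.empty()) {
            int u = q.front(); q.pop();
            perm.push_back(u);
            std::vector<int> nbrs;                      // unvisited neighbours of u
            for (int jj = row_ptr[u]; jj < row_ptr[u + 1]; ++jj) {
                int v = col_idx[jj];
                if (!visited[v]) { visited[v] = true; nbrs.push_back(v); }
            }
            std::sort(nbrs.begin(), nbrs.end(),
                      [&](int a, int b) { return degree[a] < degree[b]; });
            for (size_t k = 0; k < nbrs.size(); ++k) q.push(nbrs[k]);
        }
    }
    std::reverse(perm.begin(), perm.end());             // the "reverse" in RCM
    return perm;
}
[/code]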

Anyway, having code available for general SpMV is definitely one of the missing links between CUDA and (some) real-world applications, so congrats again!

dom

#11
Posted 12/15/2008 09:29 PM   
Yeah, the links were useful to me too. Waiting for the CUDA version of it.

#12
Posted 12/16/2008 04:19 AM   
[quote name='dlmeetei' post='477274' date='Dec 15 2008, 11:19 PM']Yeah, the links were useful to me too. Waiting for the CUDA version of it.[/quote]

I've attached a zip file containing the source code to the original post. The included README.txt should provide enough info to get you started. Please let me know if you have any trouble with the code!

Once I get the complete set of matrices online I'll add another link.

#13
Posted 12/19/2008 12:46 AM   
Dear nbell,
Is it best to use your method for sparse matrix-matrix multiplication?

#14
Posted 12/19/2008 06:57 AM   
Hi,

Thank you very much, Nathan, for making your code available. We are excited to see this technology coming out.

About blocking, there is a possible strategy (that we have not yet included in the Concurrent Number Cruncher): one can construct the sparsity pattern for different block sizes and evaluate the best block size to use based on the filling ratios. Since constructing the sparsity pattern costs less than constructing the whole sparse matrix, several block sizes can reasonably be tried. A while ago I read a paper that used this strategy (I do not remember where; I will post the reference if I find it).

Regards,
-- Bruno Levy
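As a rough sketch of that evaluation (my own naming, not CNC code): count how many RxC blocks the CSR pattern would touch and derive the fill ratio from that, without ever building the blocked matrix:

[code]
#include <set>
#include <utility>
#include <vector>

// Estimate the fill ratio (stored entries / true nonzeros) that an RxC blocking
// of a CSR sparsity pattern would incur, using only the pattern.
double blocking_fill_ratio(int num_rows, int R, int C,
                           const std::vector<int> &row_ptr,
                           const std::vector<int> &col_idx)
{
    std::set<std::pair<int, int> > blocks;   // distinct (block_row, block_col) pairs touched
    long long nnz = 0;
    for (int i = 0; i < num_rows; ++i) {
        for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj) {
            blocks.insert(std::make_pair(i / R, col_idx[jj] / C));
            ++nnz;
        }
    }
    // Each touched block stores R*C values, zeros included.
    return nnz ? (double)blocks.size() * R * C / (double)nnz : 1.0;
}

// One would then evaluate a handful of candidate block sizes, e.g.
// (1,1), (2,2), (3,3), (4,4), (6,6), and pick the one whose ratio best
// balances index savings against the extra stored zeros.
[/code]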

#15
Posted 12/19/2008 02:33 PM   