Data Compression on GeForce
Hi,

I know that compression is not a good candidate for parallel programming.

But in my case, I need to compress chunks of 32KB data blocks. Each data block
is unrelated and should be compressed separately.

There are performance limits on doing this on a regular CPU. Is the GPU worth evaluating.??

Secondly, which compression algorithms should be good in this case? LZ77, LZW etc
or maybe even 'burrow wheeler transform' followed by entropy encoding.

If compression is a good candidate, is it a good idea to split work among 32 threads of a warp.
by this i mean, 30 threads fetch data into the shared memory, and 2 threads do the actual compression!!
(Atleast techniques like this could improve memory latency issues??!)

Thanks,
Hi,



I know that compression is not a good candidate for parallel programming.



But in my case, I need to compress chunks of 32KB data blocks. Each data block

is unrelated and should be compressed separately.



There are performance limits on doing this on a regular CPU. Is the GPU worth evaluating.??



Secondly, which compression algorithms should be good in this case? LZ77, LZW etc

or maybe even 'burrow wheeler transform' followed by entropy encoding.



If compression is a good candidate, is it a good idea to split work among 32 threads of a warp.

by this i mean, 30 threads fetch data into the shared memory, and 2 threads do the actual compression!!

(Atleast techniques like this could improve memory latency issues??!)



Thanks,

#1
Posted 07/25/2009 03:52 PM   
[quote name='vikram' post='569858' date='Jul 25 2009, 05:52 PM']But in my case, I need to compress chunks of 32KB data blocks. Each data block
is unrelated and should be compressed separately.[/quote]

You might want to start with a simple algorithm, for example with a static huffman compression (for a known distribution of input symbol probabilities). The difficult part is how to generate a single bit stream from many parallel threads.

A first CUDA kernel could determine all the code words in a parallel fashion (This can be done with a simple table lookup) and a second kernel could generate a compact bit stream from this. Ideally the multiprocessors would each be assigned one 32k block, so the overall speed scales with the amount of multiprocessors processors you have.

Christian
[quote name='vikram' post='569858' date='Jul 25 2009, 05:52 PM']But in my case, I need to compress chunks of 32KB data blocks. Each data block

is unrelated and should be compressed separately.



You might want to start with a simple algorithm, for example with a static huffman compression (for a known distribution of input symbol probabilities). The difficult part is how to generate a single bit stream from many parallel threads.



A first CUDA kernel could determine all the code words in a parallel fashion (This can be done with a simple table lookup) and a second kernel could generate a compact bit stream from this. Ideally the multiprocessors would each be assigned one 32k block, so the overall speed scales with the amount of multiprocessors processors you have.



Christian

#2
Posted 07/25/2009 04:22 PM   
Investigating arithmetic coding on the GPU could be a very interesting topic as well. Taking into account that the nvidia GPUs are faster in floating point arithmetic than in 32 bit integer arithmetic, the probability intervals could be expressed as floats instead of ints.

Again, the problem remains that it will be difficult to generate a bit stream from the input data. Maybe each thread could generate its own bit stream, concatenating them all at the end. But you would also need to write an index for the decoder, describing the offsets where each thread's bit stream is located in the compressed file. Otherwise there would be no way to decode it in a parallel manner.

This seems to be a general problem with bit streams written by a serial encoder. Without an index describing the location of independently decodable blocks, a parallel decoder is almost impossible to implement efficiently. For example decoding a JPEG image on a GPU, exploiting parallelism is hard. There is almost no way to tell where the JPEG's minimum coded units (the 8x8 DCT blocks) are located in the bit stream without having decoded all previous MCUs. One could try to guess, but it wastes computation cycles whenever a wrong guess was made.

So a parallel algorithm designed for a GPU would have to include some hiting, indicating where the independently decodable units start (sync markers would work, or an index at the beginning of the file).

Christian
Investigating arithmetic coding on the GPU could be a very interesting topic as well. Taking into account that the nvidia GPUs are faster in floating point arithmetic than in 32 bit integer arithmetic, the probability intervals could be expressed as floats instead of ints.



Again, the problem remains that it will be difficult to generate a bit stream from the input data. Maybe each thread could generate its own bit stream, concatenating them all at the end. But you would also need to write an index for the decoder, describing the offsets where each thread's bit stream is located in the compressed file. Otherwise there would be no way to decode it in a parallel manner.



This seems to be a general problem with bit streams written by a serial encoder. Without an index describing the location of independently decodable blocks, a parallel decoder is almost impossible to implement efficiently. For example decoding a JPEG image on a GPU, exploiting parallelism is hard. There is almost no way to tell where the JPEG's minimum coded units (the 8x8 DCT blocks) are located in the bit stream without having decoded all previous MCUs. One could try to guess, but it wastes computation cycles whenever a wrong guess was made.



So a parallel algorithm designed for a GPU would have to include some hiting, indicating where the independently decodable units start (sync markers would work, or an index at the beginning of the file).



Christian

#3
Posted 07/25/2009 04:34 PM   
[quote name='cbuchner1' post='569870' date='Jul 25 2009, 04:34 PM']This seems to be a general problem with bit streams written by a serial encoder. Without an index describing the location of independently decodable blocks, a parallel decoder is almost impossible to implement efficiently. For example decoding a JPEG image on a GPU, exploiting parallelism is hard. There is almost no way to tell where the JPEG's minimum coded units (the 8x8 DCT blocks) are located in the bit stream without having decoded all previous MCUs. One could try to guess, but it wastes computation cycles whenever a wrong guess was made.[/quote]

Perhaps someone could propose a compression scheme that was actually amenable to decompressing on a parallel architecture.
[quote name='cbuchner1' post='569870' date='Jul 25 2009, 04:34 PM']This seems to be a general problem with bit streams written by a serial encoder. Without an index describing the location of independently decodable blocks, a parallel decoder is almost impossible to implement efficiently. For example decoding a JPEG image on a GPU, exploiting parallelism is hard. There is almost no way to tell where the JPEG's minimum coded units (the 8x8 DCT blocks) are located in the bit stream without having decoded all previous MCUs. One could try to guess, but it wastes computation cycles whenever a wrong guess was made.



Perhaps someone could propose a compression scheme that was actually amenable to decompressing on a parallel architecture.

#4
Posted 07/26/2009 12:45 AM   
32KB blocks to compress using GPU? Easy work!!!

If you use huffman compression, juste putting the Huffman table on Constant and then sending 256 blocks/ Multiprocessor, INTERLEAVED 256X to enable coalesced read, and you are done. 8MB DATA per Multiprocessor will go up to 240MB for a GTX 285 so seems any implementation will handle that, and with coalesced READS, you will use 100% of your CPU cycles.

Just think about cocatenating your output per 128bits blocks within registers to write them down not too often.
You may use the [url="http://blog.cudachess.org/2009/07/macro-threading/"]macro-threading idea[/url] to have a separated warp (or two) that will write-back the compressed data to Global Memory :-)
32KB blocks to compress using GPU? Easy work!!!



If you use huffman compression, juste putting the Huffman table on Constant and then sending 256 blocks/ Multiprocessor, INTERLEAVED 256X to enable coalesced read, and you are done. 8MB DATA per Multiprocessor will go up to 240MB for a GTX 285 so seems any implementation will handle that, and with coalesced READS, you will use 100% of your CPU cycles.



Just think about cocatenating your output per 128bits blocks within registers to write them down not too often.

You may use the macro-threading idea to have a separated warp (or two) that will write-back the compressed data to Global Memory :-)

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#5
Posted 07/27/2009 03:46 PM   
[quote name='iAPX' post='570552' date='Jul 27 2009, 05:46 PM']32KB blocks to compress using GPU? Easy work!!!

If you use huffman compression, juste putting the Huffman table on Constant and then sending 256 blocks/ Multiprocessor, INTERLEAVED 256X to enable coalesced read, and you are done.[/quote]

Ah, that's a good idea to let each thread generate its very own stream instead of trying to concatenate the thread output with some sort of "stream compaction", as I had intended

I disagree on the use of constant memory though - it will exhibit suboptimal performance when different threads access different locations. Maybe a use of texture memory might be better for the code tables? The most probable codewords should be spatially close together in the texture. Using a 2D texture would allow to use 2nd order huffman coding, for example for text compression where you have certain pairs of letters that occur more frequently than others.
[quote name='iAPX' post='570552' date='Jul 27 2009, 05:46 PM']32KB blocks to compress using GPU? Easy work!!!



If you use huffman compression, juste putting the Huffman table on Constant and then sending 256 blocks/ Multiprocessor, INTERLEAVED 256X to enable coalesced read, and you are done.



Ah, that's a good idea to let each thread generate its very own stream instead of trying to concatenate the thread output with some sort of "stream compaction", as I had intended



I disagree on the use of constant memory though - it will exhibit suboptimal performance when different threads access different locations. Maybe a use of texture memory might be better for the code tables? The most probable codewords should be spatially close together in the texture. Using a 2D texture would allow to use 2nd order huffman coding, for example for text compression where you have certain pairs of letters that occur more frequently than others.

#6
Posted 07/27/2009 06:26 PM   
If the Huffman is constant (ie as Fax for example) and small, you could also put it in Shared Memory, but you will anyway have memory contention with no coalsced accesses between threads :-(
If the Huffman is constant (ie as Fax for example) and small, you could also put it in Shared Memory, but you will anyway have memory contention with no coalsced accesses between threads :-(

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#7
Posted 07/27/2009 07:14 PM   
Hello.
Somewhere you can find examples of Huffman on the GPU?
Or maybe the other compression algorithms without losing data?
Hello.

Somewhere you can find examples of Huffman on the GPU?

Or maybe the other compression algorithms without losing data?

#8
Posted 05/29/2011 07:56 PM   
I think that the right approach for compression on parallel architectures would involve a from-scratch redesign of the algorithms involved.

The core of lossless data compression is identifying redundancy in the data. This is done using a combination of clever schemes for run length encoding and huffman encoding.

For example, for the LZ family: "Many algorithms forward scan the input buffer and match it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary."

For a parallel algorithm, the basic principles would remain the same. You would still want to identify and exploit redundancy in the data. However, you would probably want to make the identification of redundant data decomposable and hierarchical. Maybe start by building up intermediate results (word/pattern/phrase frequencies) in parallel, then barrier, share the generated dictionaries, and identify redundancy in the dictionaries rather than in the input data. Repeat multiple stages of this until you get diminishing returns.

Blocking schemes will not be sufficient here, redundant information will need to be identified and shared among blocks. We really need two subproblems to be solved:

1) Building a dictionary in parallel.
2) Performing entropy encoding (alphabet generation) in parallel.

The quality of the dictionaries and alphabets generated need to be on par with existing sequential algorithms, and the algorithms need to be scalable to millions of threads. This is will be the most significant result, and clever schemes (e.g. never materializing the dictionary) will follow.

There is obviously a lot of work that needs to be done. We need an expert on this topic or someone to put in enough time to become an expert. NVIDIA, Intel, AMD, etc should step up and fund a multi-year thesis on this topic.
I think that the right approach for compression on parallel architectures would involve a from-scratch redesign of the algorithms involved.



The core of lossless data compression is identifying redundancy in the data. This is done using a combination of clever schemes for run length encoding and huffman encoding.



For example, for the LZ family: "Many algorithms forward scan the input buffer and match it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary."



For a parallel algorithm, the basic principles would remain the same. You would still want to identify and exploit redundancy in the data. However, you would probably want to make the identification of redundant data decomposable and hierarchical. Maybe start by building up intermediate results (word/pattern/phrase frequencies) in parallel, then barrier, share the generated dictionaries, and identify redundancy in the dictionaries rather than in the input data. Repeat multiple stages of this until you get diminishing returns.



Blocking schemes will not be sufficient here, redundant information will need to be identified and shared among blocks. We really need two subproblems to be solved:



1) Building a dictionary in parallel.

2) Performing entropy encoding (alphabet generation) in parallel.



The quality of the dictionaries and alphabets generated need to be on par with existing sequential algorithms, and the algorithms need to be scalable to millions of threads. This is will be the most significant result, and clever schemes (e.g. never materializing the dictionary) will follow.



There is obviously a lot of work that needs to be done. We need an expert on this topic or someone to put in enough time to become an expert. NVIDIA, Intel, AMD, etc should step up and fund a multi-year thesis on this topic.

#9
Posted 05/31/2011 06:37 PM   
[quote name='vikram' date='25 July 2009 - 03:52 PM' timestamp='1248537162' post='569858']

Secondly, which compression algorithms should be good in this case? LZ77, LZW etc
or maybe even 'burrow wheeler transform' followed by entropy encoding.

If compression is a good candidate, is it a good idea to split work among 32 threads of a warp.
by this i mean, 30 threads fetch data into the shared memory, and 2 threads do the actual compression!!
(Atleast techniques like this could improve memory latency issues??!)

Thanks,
[/quote]

From what I understand, LZW is not particularly parallelizable. And even on a fast CPU it rarely can achieve more than 500MB/s decompression speed.
I would rather look at GPU implementations of algorithms that successfully have been used in a past with superscalar CPUs such as PFOR, PFOR-DELTA
and PDICT. There is a good paper on these vectorizable algorithms : "Super-Scalar RAM-CPU Cache Compression" by Peter Boncz and others.
I implemented these algorithms in my GPU database engine projects and it seem to work well.
[quote name='vikram' date='25 July 2009 - 03:52 PM' timestamp='1248537162' post='569858']



Secondly, which compression algorithms should be good in this case? LZ77, LZW etc

or maybe even 'burrow wheeler transform' followed by entropy encoding.



If compression is a good candidate, is it a good idea to split work among 32 threads of a warp.

by this i mean, 30 threads fetch data into the shared memory, and 2 threads do the actual compression!!

(Atleast techniques like this could improve memory latency issues??!)



Thanks,





From what I understand, LZW is not particularly parallelizable. And even on a fast CPU it rarely can achieve more than 500MB/s decompression speed.

I would rather look at GPU implementations of algorithms that successfully have been used in a past with superscalar CPUs such as PFOR, PFOR-DELTA

and PDICT. There is a good paper on these vectorizable algorithms : "Super-Scalar RAM-CPU Cache Compression" by Peter Boncz and others.

I implemented these algorithms in my GPU database engine projects and it seem to work well.

#10
Posted 05/05/2012 01:39 PM   
Scroll To Top