Friday, January 14, 2011

Multi-threaded compression

For a few years now, most PC systems sold have used multi-core processors. As a logical consequence, there is increasing pressure from users to create applications capable of using all these available cores, since sleeping silicon seems a waste.

Obviously, this trend has also reached compression utilities. Now the question is: how does all this work?

There are several methods, and we'll start with the easiest one. The obvious idea is to split the source file into smaller blocks, and to process these blocks in parallel. This is easy to understand and to set up, and in practice it works well. Very well indeed, since this generic method scales to any number of cores, up to as many cores as there are blocks (a sketch of this scheme follows below).
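To make this concrete, here is a minimal sketch of the block-splitting scheme in C: the input is cut into fixed-size blocks, and each block is compressed independently on its own thread. This is only an illustration under a few assumptions: it uses the current LZ4 C API (LZ4_compress_default, LZ4_compressBound), which is more recent than this post, it relies on pthreads, and it omits error handling for clarity.

/* Minimal sketch: compress a buffer as independent blocks, one thread per block.
 * Assumes the LZ4 library (lz4.h) is available; LZ4_compress_default() and
 * LZ4_compressBound() belong to the current LZ4 API, which postdates this post. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <lz4.h>

#define BLOCK_SIZE (1024 * 1024)   /* 1 MB blocks; see the discussion below */

typedef struct {
    const char* src;     /* start of this block in the source buffer */
    int         srcSize; /* size of this block                       */
    char*       dst;     /* private output buffer for this block     */
    int         dstSize; /* compressed size, filled in by the worker */
} BlockJob;

/* Each worker compresses its own block; no data is shared between blocks,
 * which is exactly why this scheme parallelizes so easily. */
static void* compressBlock(void* arg)
{
    BlockJob* job = (BlockJob*)arg;
    job->dstSize = LZ4_compress_default(job->src, job->dst, job->srcSize,
                                        LZ4_compressBound(job->srcSize));
    return NULL;
}

int main(void)
{
    /* Toy input: in a real tool this would be the file to compress. */
    int srcSize = 4 * BLOCK_SIZE;
    char* src = malloc(srcSize);
    memset(src, 'A', srcSize);

    int nbBlocks = (srcSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    BlockJob*  jobs    = malloc(nbBlocks * sizeof(BlockJob));
    pthread_t* threads = malloc(nbBlocks * sizeof(pthread_t));

    for (int i = 0; i < nbBlocks; i++) {
        int offset      = i * BLOCK_SIZE;
        jobs[i].src     = src + offset;
        jobs[i].srcSize = (srcSize - offset < BLOCK_SIZE) ? (srcSize - offset) : BLOCK_SIZE;
        jobs[i].dst     = malloc(LZ4_compressBound(jobs[i].srcSize));
        pthread_create(&threads[i], NULL, compressBlock, &jobs[i]);
    }

    int total = 0;
    for (int i = 0; i < nbBlocks; i++) {
        pthread_join(threads[i], NULL);
        total += jobs[i].dstSize;
        free(jobs[i].dst);
    }
    printf("compressed %d bytes into %d bytes using %d blocks\n",
           srcSize, total, nbBlocks);

    free(threads); free(jobs); free(src);
    return 0;
}

Since no block ever reads another block's data, adding more cores simply means launching more of these workers at once.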

But there is a drawback: a block cannot use information from another block to improve compression. And this is a real issue, with a direct impact on compression ratio.

Fortunately, for "low-depth" compressors such as LZ4 or Zhuff, this problem is not too severe. Since these compressors only use the last 64KB of data to predict the next bytes, the inefficiency is limited to the first 64KB of each block. Which means that, for a big enough block, this inefficiency can be made small.
But how big is "big enough"?

I was asked this very question by Con Kolivas, who is working on his own lrzip compressor (only available for Linux for the time being). Since lrzip is a heavily multi-threaded implementation of 7zip on top of rzip, he wants to dispatch as many processes as possible while keeping memory usage in check. Selecting too large a block size has its own drawbacks, such as consuming more memory and adding latency before all cores are loaded. Therefore, we want the block size to be as small as possible, but not so small that the inefficiency grows beyond a manageable level.

Here are some results from LZ4 working on enwik8, with various block sizes expressed as a multiple of the search depth (64KB):

Block Size    Inefficiency
  2X              2.4%
  4X              1.2%
  8X              0.6%
 16X              0.3%
 32X              0.15%


As can be seen, the inefficiency is halved at each step, which is not a random effect: the "inefficient" part of the block is only the first 1X, so its relative weight is divided by 2 each time the block size is multiplied by 2.
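This halving can be reproduced with a toy model: assume all the extra loss sits in the first 1X of each block, calibrate that loss from the 2X measurement above (2.4% over a 2X block, hence roughly 4.8% inside the first window), and the predicted whole-block inefficiency at NX is simply that constant divided by N. The 4.8% figure is inferred from the table, not measured separately.

/* Toy model of the table above: the entire loss is assumed to be
 * concentrated in the first 1X (64KB) of each block. */
#include <stdio.h>

int main(void)
{
    double firstWindowLoss = 2.4 * 2;   /* ~4.8%, calibrated on the 2X data point */
    for (int n = 2; n <= 32; n *= 2)
        printf("%2dX block : predicted inefficiency = %.2f%%\n",
               n, firstWindowLoss / n);
    return 0;
}

Running it reproduces the measured values (2.4%, 1.2%, 0.6%, 0.3%, 0.15%), which supports the "first window only" explanation.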

All in all, enwik8 is probably one of the worst cases (outside of specifically crafted ones), since it relies heavily on finding matches, so it is a good basis for selecting a threshold.

In my opinion, 16X is already quite good, with a low enough inefficiency ratio.
16X of 64KB means a 1MB block size. In a word, manageable.
