Modern file compression

Unknown to most users, file compression silently works behind the scenes. Updates for any operating system, for example, are compressed. That happens automatically, and the user doesn't even need to know about it.

But sometimes we have a choice. In Arch Linux, for example, we can set the compression we'd like to use for packages created by makepkg (such as those installed from the AUR). But how do we choose between gz, bz2, xz, lrz, lzo, and z? Some backup software adds further options: Borg, for example, offers zlib, lzma, lz4, and zstd.

Most surprisingly, some of these algorithms have been developed only very recently: zstd comes from Facebook (2016), and there's brotli from Google (2015) and lzfse from Apple (2015). Why do these multi-billion-dollar companies develop compression algorithms? Because of the multi-billion dollars.

Instead of testing each of these algorithms yourself, you can use lzbench. It benchmarks the open-source compressors of the LZ family in memory, here with the de facto standard test set of the compression business, the Silesia corpus.
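To make the columns of the tables below concrete, here is a minimal sketch of the same kind of measurement in Python. It is not lzbench: it only covers the codecs shipped with the Python standard library (zlib, bz2, lzma), times a single run, and uses a synthetic sample buffer that is purely hypothetical. The function name bench and the assumption that 1 MB equals 10**6 bytes are mine.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs only; lz4, zstd and brotli would need third-party packages.
CODECS = {
    "zlib -6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2 -9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "xz -6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}


def bench(data: bytes) -> list:
    """Time one compress/decompress round trip per codec (single run, single core)."""
    rows = []
    mb = len(data) / 1e6  # assumption: 1 MB = 10**6 bytes
    for name, (compress, decompress) in CODECS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name}: round trip failed")
        rows.append({
            "name": name,
            "compress_mb_s": mb / max(t1 - t0, 1e-9),
            "decompress_mb_s": mb / max(t2 - t1, 1e-9),
            "size": len(packed),
            "ratio": 100.0 * len(packed) / len(data),  # percent of original size
        })
    return rows


if __name__ == "__main__":
    # Hypothetical sample input; read a real file such as silesia.tar instead.
    sample = b"The quick brown fox jumps over the lazy dog. " * 20000
    for r in bench(sample):
        print(f"{r['name']:10s} {r['compress_mb_s']:8.1f} MB/s "
              f"{r['decompress_mb_s']:8.1f} MB/s {r['size']:10d} {r['ratio']:6.2f}")
```

A single timed run on repetitive text says little about real data, so treat the numbers this prints as an illustration of what is being measured rather than as a benchmark.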

Here are three examples geared toward high compression ratio, high speed compression, and high speed decompression:

High compression ratio (<25%)

➜  lzbench -c -ebrotli,11/xz,6,9/zstd,22 silesia.tar
lzbench 1.7.3 (64-bit Linux)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9814 MB/s  9852 MB/s   211947520 100.00
brotli 2017-12-12 -11    0.48 MB/s   385 MB/s    51136654  24.13
xz 5.2.3 -6              2.30 MB/s    74 MB/s    48745306  23.00
zstd 1.3.3 -22           2.30 MB/s   600 MB/s    52845025  24.93

These are single-core values. xz compression (but not decompression) profits from multithreading, as does zstd compression via its -T option; the brotli command-line tool does not.
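The Ratio column in the table above appears to be the compressed size as a percentage of the original size, so smaller is better. A quick check in Python, using the sizes copied from the table (the helper name ratio_percent is mine):

```python
ORIGINAL_SIZE = 211947520  # bytes, the memcpy row (uncompressed silesia.tar)

# Compressed sizes in bytes, copied from the lzbench table above.
COMPRESSED = {
    "brotli -11": 51136654,
    "xz -6": 48745306,
    "zstd -22": 52845025,
}


def ratio_percent(compressed: int, original: int = ORIGINAL_SIZE) -> float:
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed / original


for name, size in COMPRESSED.items():
    print(f"{name:12s} {ratio_percent(size):6.2f}")
```

The three values come out as 24.13, 23.00 and 24.93, matching the table.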

High speed compression (for compression ratios <50%)

➜  lzbench -c -elz4/lzo1x silesia.tar
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9861 MB/s  9768 MB/s   211947520 100.00
lz4 1.8.0                 524 MB/s  2403 MB/s   100880800  47.60
lzo1x 2.09 -12            521 MB/s   738 MB/s   103238859  48.71

High speed decompression (> 2000 MB/s)

➜  lzbench -c -elz4/lizard,10/lzsse8,6 silesia.tar
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9579 MB/s 10185 MB/s   211947520 100.00
lz4 1.8.0                 525 MB/s  2421 MB/s   100880800  47.60
lizard 1.0 -10            421 MB/s  2115 MB/s   103402971  48.79
lzsse8 2016-05-14 -6     8.25 MB/s  3359 MB/s    75469717  35.61

What do we learn from these benchmarks?

  1. If we want high compression reasonably fast, nothing beats xz. It's just perfect for what it's actually used for by some (all?) Linux distributions: distributing updates with acceptable computational resources over a channel with very limited bandwidth.

  2. If the distributor commands virtually unlimited resources, and compression speed is thus not an issue, brotli and zstd are clearly superior to all other choices. That's how we would like to have our updates: small and fast to decompress.

  3. If size is not of primary importance, but compression speed is, lz4 and lzo are the champions.

  4. If decompression speed is essential, lzsse8 wins. This is a lesser-known member of the lz family and not widely available, in contrast to lz4, which thus scores again.
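The trade-off between size and decompression speed in points 1 and 2 can be sketched with a rough model: the time a user waits is the download time plus the decompression time. The sizes and decompression speeds below are copied from the first table; the link speeds are hypothetical, the two steps are assumed to run one after the other, and 1 MB is assumed to be 10**6 bytes. Compression time on the distributor's side is ignored.

```python
ORIGINAL_MB = 211947520 / 1e6  # uncompressed size; assumption: 1 MB = 10**6 bytes

# (compressed size in bytes, decompression speed in MB/s) from the first table.
RESULTS = {
    "brotli -11": (51136654, 385.0),
    "xz -6": (48745306, 74.0),
    "zstd -22": (52845025, 600.0),
}


def delivery_seconds(name: str, bandwidth_mb_s: float) -> float:
    """Download time plus decompression time, run one after the other."""
    size_bytes, decompress_mb_s = RESULTS[name]
    download = size_bytes / 1e6 / bandwidth_mb_s
    unpack = ORIGINAL_MB / decompress_mb_s
    return download + unpack


def fastest(bandwidth_mb_s: float) -> str:
    """Name of the compressor with the shortest total delivery time."""
    return min(RESULTS, key=lambda name: delivery_seconds(name, bandwidth_mb_s))


if __name__ == "__main__":
    for bandwidth in (0.5, 5.0, 50.0):  # hypothetical link speeds in MB/s
        times = ", ".join(
            f"{n} {delivery_seconds(n, bandwidth):6.1f} s" for n in RESULTS
        )
        print(f"{bandwidth:5.1f} MB/s: {times} -> {fastest(bandwidth)}")
```

Under these assumptions xz comes out ahead on the slowest link, where every byte counts, while brotli and zstd take over as the link gets faster and decompression time starts to matter.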