Some numbers for how well various compressors do with our /var/mail backup

April 7, 2018

Recently I discussed how gzip --best wasn't very fast when compressing our Amanda (tar) backup of /var/mail, and mentioned that we were trying out zstd for this. As it happens, as part of our research on this issue I ran one particular night's backup of our /var/mail through all of the various compressors to see how large they'd come out, and I think the numbers are usefully illustrative.

The initial uncompressed tar archive is roughly 538 GB and is probably almost completely ASCII text (since we use traditional mbox format inboxes and most email is encoded to 7-bit ASCII). The compression ratios are relative to the uncompressed file, while the times are relative to the fastest compression algorithm. Byte sizes were counted with 'wc -c', instead of writing the results to disk, and I can be confident that the compression programs were the speed limit on this system, not reading the initial tar archive off SSDs.

Compression ratio Time ratio
uncompressed 1.0 0.47
lz4 1.4 1.0
gzip --fast 1.77 11.9
gzip --best 1.87 17.5
zstd -1 1.92 1.7
zstd -3 1.99 2.4

(The 'uncompressed' time is for 'cat <file> | wc -c'.)

On this very real-world test for us, zstd is clearly a winner over gzip; it achieves better compression with far less time. gzip --fast takes about 32% less time than gzip --best at only a moderate cost in compression ratio, but it's not competitive with zstd in either time or compression. Zstd is not as fast as lz4 but it's fast enough, while providing clearly better compression.

We're currently using the default zstd compression level, which is 'zstd -3' (we're just invoking plain '/usr/bin/zstd'). These numbers suggest that we'd lose very little compression from switching to 'zstd -1' but get a significant speed increase. At the moment we're going to leave things as they are because our backups are now fast enough (backing up /var/mail is now not the limiting factor on their overall speed) and we do get something for that extra time. Also, it's simpler; because of how Amanda works, we'd need to add a script to switch to 'zstd -1'.

(Amanda requires you to specify a program as your compressor, not a program plus arguments, so if you want to invoke the real compressor with some non-default options you need a cover script.)

Since someone is going to ask, pigz -fast got a compression ratio of 1.78 and a time ratio of 1.27. This is extremely unrepresentative of what we could achieve in production on our Amanda backup servers, since my test machine is a 16-coreCPU Xeon Silver 4108. The parallelism speed increase for pigz is not perfect, since it was only about 9.4 times faster than gzip --fast (which is single-core).

(Since I wanted to see the absolute best case for pigz in terms of speed, I ran it on all cores CPUs. I'm not interested in doing more tests to establish how it scales when run with fewer cores CPUs, since we're not going to use it; zstd is better for our case.)

PS: I'm not giving absolute speeds because these speeds vary tremendously across our systems and also depend on what's being compressed, even with just ASCII text.

Written on 07 April 2018.
« Using Go finalizers can be a better option than not using them
A learning experience with iOS's fingerprint recognition »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Apr 7 01:13:23 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.