Today's learning experience is that gzip is not fast

April 4, 2018

For reasons beyond the scope of this entry, we have quite a large /var/mail and we take a full backup of it every night. To save space in our disk-based backup system, for years we've had Amanda compress these backups on the Amanda server; since we're backing up ASCII text (even if it represents encoded and compressed binary things), the backups generally compress very well. We did this in the straightforward way: as part of our special Amanda dump type that forces only full backups for /var/mail, we said 'compress server best'. This worked okay for years, which lulled us into not looking at it too hard until we recently noticed that our backups of /var/mail were taking almost ten hours.

(They should not take ten hours. /var/mail is only about 540 GB and it's on SSDs.)

It turns out that Amanda's default compression uses gzip, and when you tell Amanda to use the best compression it uses 'gzip --best', aka 'gzip -9'. Now, I was vaguely aware that gzip is not the fastest compression method in the world (if only because ZFS uses lz4 compression by default and recommends you avoid gzip), but I also had the vague impression that it was reasonably okay as far as speed went (and I knew that bzip2 and xz were slower, although they compress better). Unfortunately my impression turns out to be very wrong. Gzip is a depressingly slow compression system, especially if you tell it to go wild and try to get the best compression it can. Specifically, on our current Amanda server hardware 'gzip --best' appears to manage a rate of only about 16 MBytes a second. At that rate, 540 GB takes a bit over nine hours to compress, so our backups of /var/mail are almost entirely constrained by how slowly gzip runs.
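As a rough check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 GB = 1000 MB) and that the compressor is the only bottleneck; the function name is made up for illustration and has nothing to do with Amanda itself.

```python
def hours_to_compress(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to push size_gb of data through a compressor
    that consumes input at rate_mb_per_s (decimal units)."""
    seconds = (size_gb * 1000) / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # 540 GB of /var/mail at the observed 'gzip --best' rate of 16 MB/s.
    print(f"{hours_to_compress(540, 16):.1f} hours")  # 9.4 hours
```

That lines up with the 'almost ten hours' we were seeing, once you allow for some overhead elsewhere.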

(See lz4's handy benchmark chart for one source of speed numbers. Gzip is 'zlib deflate', and zlib at the 'compress at all costs' -9 level isn't even on the benchmark chart.)
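If you want your own numbers instead of someone else's chart, a crude measurement is easy to sketch with Python's standard library. This is only an approximation under stated assumptions: Python's zlib module does deflate but is not the gzip binary, the sample data here is synthetic and far more repetitive than real mailboxes, and zstd and lz4 are left out because they aren't in the standard library of most Python versions. The function names are mine.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, data):
    """Compress data once; return (name, input MB/s, compressed/original ratio)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    mb_per_s = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
    return name, mb_per_s, len(out) / len(data)


def run_benchmark(data):
    """Run each candidate compressor over the same in-memory data."""
    candidates = [
        ("zlib -1", lambda d: zlib.compress(d, 1)),
        ("zlib -6", lambda d: zlib.compress(d, 6)),
        ("zlib -9", lambda d: zlib.compress(d, 9)),
        ("bz2 -9", lambda d: bz2.compress(d, 9)),
        ("xz -6", lambda d: lzma.compress(d, preset=6)),
    ]
    return [measure(name, fn, data) for name, fn in candidates]


if __name__ == "__main__":
    # Synthetic, mail-like ASCII; substitute a chunk of real data for real answers.
    sample = b"From: someone@example.org\nSubject: test\n\nHello there.\n" * 50_000
    for name, rate, ratio in run_benchmark(sample):
        print(f"{name:8s} {rate:9.1f} MB/s  ratio {ratio:.4f}")
```

The absolute rates this prints depend heavily on the machine and the data, so treat them as a relative comparison at best.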

The good news is that there are faster compression programs out there, and at least some of them are available pre-packaged for Ubuntu. We're currently trying out zstd, which probably strikes a good balance between running fast enough for us and having a good compression ratio. Compressing with lz4 would be significantly faster, but it also appears that it would get noticeably less compression.

It's worth noting that not even lz4 can keep up with full 10G Ethernet speeds (on most machines). If you have a disk system that can run fast enough (which is not difficult with modern SSDs) and you want to saturate your 10G network during backups, you can't do compression in-stream; you're going to have to capture the backup stream to disk and then compress it later.
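To put a number on 'full 10G Ethernet speeds', here is a small sketch of the arithmetic. It uses decimal units and ignores protocol overhead, so the real usable rate is somewhat lower; the helper names are hypothetical, and the compressor rate is whatever you measure on your own hardware.

```python
def link_rate_mb_per_s(gbit_per_s: float) -> float:
    """Raw link rate in MBytes/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


def keeps_up(compressor_mb_per_s: float, gbit_per_s: float = 10.0) -> bool:
    """True if a compressor consumes input at least as fast as the link delivers it."""
    return compressor_mb_per_s >= link_rate_mb_per_s(gbit_per_s)


if __name__ == "__main__":
    print(link_rate_mb_per_s(10))  # 1250.0
    print(keeps_up(16))            # False; our 'gzip --best' rate is nowhere close
```

In other words, an in-stream compressor has to sustain about 1.25 GBytes a second of input on one stream before the network becomes the limit again.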

PS: There's also parallel gzip, but that has various limitations in practice; you might have multiple backup streams to compress, and you might need that CPU for other things too.
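For illustration, the general idea behind parallel compression can be sketched as splitting the input into chunks, compressing the chunks concurrently, and concatenating the results. This is not how any particular parallel gzip program is implemented; it is a minimal sketch that relies on concatenated gzip members forming a valid gzip stream, and it uses threads on the assumption that CPython's zlib releases the GIL while compressing. The function name and defaults are made up.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4,
                  level: int = 6) -> bytes:
    """Compress chunks concurrently and join them as concatenated gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        # Empty input still needs to produce a valid (empty) gzip member.
        return gzip.compress(b"", compresslevel=level)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so members stay in sequence.
        members = pool.map(lambda c: gzip.compress(c, compresslevel=level), chunks)
        return b"".join(members)


if __name__ == "__main__":
    original = b"Subject: hello\n\nSome message body text.\n" * 100_000
    packed = parallel_gzip(original)
    assert gzip.decompress(packed) == original
    print(len(original), "->", len(packed), "bytes")
```

Compressing each chunk independently costs a little compression ratio, and every worker you add is a CPU that your other backup streams can't use, which is the practical limitation mentioned above.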

Last modified: Wed Apr 4 02:14:06 2018