Today's learning experience is that gzip is not fast
For reasons beyond the scope of this entry, we have a quite large
/var/mail and we take a full backup of it every night. In order
to save space in our disk-based backup system,
for years we've been having Amanda compress these backups on the
Amanda server; since we're backing up ASCII text (even if it
represents encoded and compressed binary things), they generally
compress very well. We did this in the straightforward way; as part
of our special Amanda dump type that forces only full backups for
/var/mail, we said 'compress server best'. This worked okay for
years, which enticed us into not looking at it too hard until we
recently noticed that our backups of /var/mail were taking almost
ten hours.
(They should not take ten hours. /var/mail is only about 540 GB
and it's on SSDs.)
It turns out that Amanda's default compression uses gzip, and when you tell Amanda to use the best compression it uses 'gzip --best', aka 'gzip -9'.
Now, I was vaguely aware that gzip is not the fastest compression
method in the world (if only because ZFS uses lz4 compression by
default and recommends you avoid gzip), but I also had the vague
impression that it was reasonably decently okay as far as speed
went (and I knew that bzip2 and xz were slower, although they
compress better). Unfortunately my impression turns out to be very
wrong. Gzip is a depressingly slow compression system, especially
if you tell it to go wild and try to get the best compression it
can. Specifically, on our current Amanda server hardware 'gzip
--best' appears to manage a rate of only about 16 MBytes a second.
As a result, our backups of /var/mail are almost entirely constrained
by how slowly gzip runs.
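As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. The function name is made up for illustration, and it assumes decimal units (1 GB = 1000 MB), which may not be exactly how our sizes and rates were measured.

```python
def hours_to_compress(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to push size_gb of data through a compressor
    that consumes input at rate_mb_per_s."""
    size_mb = size_gb * 1000  # assumption: decimal GB and MB
    return size_mb / rate_mb_per_s / 3600


# About 540 GB of /var/mail at about 16 MBytes a second.
print(f"{hours_to_compress(540, 16):.1f} hours")  # prints "9.4 hours"
```

That lands close to the backup times we were seeing, which is consistent with gzip being the bottleneck rather than the disks or the network.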
(See lz4's handy benchmark chart for one source of speed numbers. Gzip is 'zlib deflate', and zlib at the 'compress at all costs' -9 level isn't even on the benchmark chart.)
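If you want a feel for how much the compression level matters on your own hardware, a quick sketch along these lines may help. It times Python's zlib module (the same deflate algorithm as gzip, but not the gzip program itself) on base64 text of random bytes, which is only a crude stand-in for mail with encoded attachments. The helper names and the sample size are arbitrary choices, and the numbers it prints will not match the gzip command exactly.

```python
import base64
import os
import time
import zlib


def make_sample(n_bytes: int = 4_000_000) -> bytes:
    """Roughly n_bytes of base64 text made from random bytes."""
    raw = os.urandom(n_bytes * 3 // 4)
    return base64.encodebytes(raw)


def measure(data: bytes, level: int):
    """Return (compression ratio, input MB per second) for zlib at this level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(packed), len(data) / 1e6 / elapsed


if __name__ == "__main__":
    sample = make_sample()
    for level in (1, 6, 9):
        ratio, rate = measure(sample, level)
        print(f"level {level}: ratio {ratio:.2f}, {rate:.1f} MB/s")
```

Real mail is a mix of plain text and encoded data, so for numbers you can act on it is better to run this sort of test against a chunk of an actual backup.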
The good news is that there are faster compression programs out there, and at least some of them are available pre-packaged for Ubuntu. We're currently trying out zstd as probably having a good balance between running fast enough for us and having a good compression ratio. Compressing with lz4 would be significantly faster, but it also appears that it would get noticeably less compression.
It's worth noting that not even lz4 can keep up with full 10G Ethernet speeds (on most machines). If you have a disk system that can run fast enough (which is not difficult with modern SSDs) and you want to saturate your 10G network during backups, you can't do compression in-stream; you're going to have to capture the backup stream to disk and then compress it later.
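To put the 10G figure in the same units as compressor benchmarks, here is a small sketch of the conversion. It ignores Ethernet and protocol overhead, so the usable rate is somewhat lower in practice, and the function names are just illustrative.

```python
def link_rate_mb_per_s(gbit_per_s: float) -> float:
    """Raw link rate in MBytes per second (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


def link_fraction(compressor_mb_per_s: float, gbit_per_s: float = 10.0) -> float:
    """Fraction of the link that a compressor at this input rate can absorb."""
    return compressor_mb_per_s / link_rate_mb_per_s(gbit_per_s)


print(link_rate_mb_per_s(10))      # prints 1250.0
print(f"{link_fraction(16):.1%}")  # prints 1.3%
```

So an in-stream compressor has to consume on the order of 1250 MBytes a second to keep a 10G link full, and the roughly 16 MBytes a second we saw from 'gzip --best' is only about one percent of that.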
PS: There's also parallel gzip, but that has various limitations in practice; you might have multiple backup streams to compress, and you might need that CPU for other things too.
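The general idea behind compressing in parallel can be sketched in Python as below. This is a simplified illustration and not a description of how any particular parallel gzip program works internally: it cuts the input into chunks, compresses each chunk as its own gzip member in a thread pool, and concatenates the results, relying on the fact that concatenated gzip members form a valid gzip stream. The chunk size, worker count, and function names are arbitrary assumptions. CPython's zlib generally releases the global interpreter lock while compressing, so threads can use several cores here.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for illustration


def compress_chunk(chunk: bytes) -> bytes:
    """Compress one chunk into its own complete gzip member."""
    return gzip.compress(chunk, compresslevel=6)


def parallel_gzip(data: bytes, workers: int = 4) -> bytes:
    """Compress chunks concurrently and join the members in input order."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(compress_chunk, chunks))  # map preserves order
    return b"".join(members)


if __name__ == "__main__":
    sample = b"From: someone@example.org\nSubject: hello\n\n" * 100_000
    packed = parallel_gzip(sample)
    assert gzip.decompress(packed) == sample  # multi-member stream round-trips
    print(len(sample), "->", len(packed))
```

Every worker here is a core that is busy compressing, which is exactly the tradeoff above: several backup streams each wanting four cores adds up quickly.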