2018-04-07
Some numbers for how well various compressors do with our /var/mail
backup
Recently I discussed how gzip --best wasn't very fast when
compressing our Amanda (tar) backup of /var/mail, and mentioned
that we were trying out zstd for this. As it happens, as part of
our research on this issue I ran one particular night's backup of
our /var/mail through all of the various compressors to see how
large they'd come out, and I think the numbers are usefully
illustrative.
The initial uncompressed tar archive is roughly 538 GB and is
probably almost completely ASCII text (since we use traditional
mbox format inboxes and most email is encoded to 7-bit ASCII). The
compression ratios are relative to the uncompressed file, while the
times are relative to the fastest compression algorithm. Byte sizes
were counted with 'wc -c', instead of writing the results to disk,
and I can be confident that the compression programs were the speed
limit on this system, not reading the initial tar archive off SSDs.
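A minimal sketch of this kind of measurement, assuming a POSIX shell. The function name 'measure' and the use of 'date +%s' for coarse timing are illustrative assumptions, not the exact commands used for the numbers here.

```shell
# measure FILE COMMAND [ARGS...]
# Feed FILE to a compressor on stdin, count the output bytes with wc -c
# (nothing is written to disk), and print "<bytes> <elapsed seconds>".
# 'measure' is a hypothetical helper name used only for this sketch.
measure() {
    file=$1
    shift
    start=$(date +%s)
    # tr strips the padding that some wc implementations add.
    bytes=$("$@" < "$file" | wc -c | tr -d ' ')
    end=$(date +%s)
    echo "$bytes $((end - start))"
}
```

With a placeholder archive name, 'measure backup.tar zstd -3' or 'measure backup.tar gzip --best' would report one compressor's result, and 'measure backup.tar cat' gives the uncompressed baseline. One-second resolution is only adequate for runs that take minutes or longer.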
Compressor   | Compression ratio | Time ratio |
uncompressed | 1.0               | 0.47       |
lz4          | 1.4               | 1.0        |
gzip --fast  | 1.77              | 11.9       |
gzip --best  | 1.87              | 17.5       |
zstd -1      | 1.92              | 1.7        |
zstd -3      | 1.99              | 2.4        |
(The 'uncompressed' time is for 'cat <file> | wc -c'.)
On this very real-world test for us, zstd is clearly a winner over
gzip; it achieves better compression in far less time. gzip --fast
takes about 32% less time than gzip --best at only a moderate cost
in compression ratio, but it's not competitive with zstd in either
time or compression. Zstd is not as fast as lz4 but it's fast
enough, while providing clearly better compression.
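The percentages here fall straight out of the time ratio column. As a small sketch of the arithmetic (the helper name 'pct_less' is made up for this example, and the inputs are the time ratios from the table):

```shell
# pct_less A B: print how much less time A takes than B, as a whole
# percent. (Hypothetical helper name; inputs are table time ratios.)
pct_less() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (1 - a / b) * 100 }'
}

pct_less 11.9 17.5   # gzip --fast vs gzip --best: prints 32
pct_less 2.4 17.5    # zstd -3 vs gzip --best: prints 86
```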
We're currently using the default zstd compression level, which is
'zstd -3' (we're just invoking plain '/usr/bin/zstd'). These numbers
suggest that we'd lose very little compression from switching to
'zstd -1' but get a significant speed increase. At the moment we're
going to leave things as they are because our backups are now fast
enough (backing up /var/mail is now not the limiting factor on their
overall speed) and we do get something for that extra time. Also,
it's simpler; because of how Amanda works, we'd need to add a script
to switch to 'zstd -1'.
(Amanda requires you to specify a program as your compressor, not a program plus arguments, so if you want to invoke the real compressor with some non-default options you need a cover script.)
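A minimal sketch of what such a cover script could look like, written out from the shell. The file name 'zstd-fast' and its location are hypothetical, and passing "$@" through assumes that Amanda hands the compressor any options it needs (such as a decompression flag) as ordinary arguments.

```shell
# Create a hypothetical cover script named zstd-fast in the current
# directory; a real one would live wherever Amanda is told to find it.
cat > ./zstd-fast <<'EOF'
#!/bin/sh
# Run the real compressor at level 1, passing through any arguments
# the caller supplies.
exec /usr/bin/zstd -1 "$@"
EOF
chmod +x ./zstd-fast
```

Using exec means the cover script is replaced by zstd itself, so no extra shell process sits in the middle of the backup pipeline.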
Since someone is going to ask, pigz --fast got a compression ratio
of 1.78 and a time ratio of 1.27. This is extremely unrepresentative
of what we could achieve in production on our Amanda backup servers,
since my test machine is a 16-CPU Xeon Silver 4108. The parallelism
speed increase for pigz is not perfect, since it was only about 9.4
times faster than gzip --fast (which is single-core).
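As a rough sketch of that arithmetic, using the time ratios quoted above (11.9 for gzip --fast, 1.27 for pigz). The helper names are made up for this example, and the efficiency figure is a derived number that assumes the two programs do comparable work and that ideal scaling would be 16 times on this machine.

```shell
# speedup SLOW FAST: how many times faster FAST is than SLOW.
speedup() {
    awk -v s="$1" -v f="$2" 'BEGIN { printf "%.1f\n", s / f }'
}

# efficiency SLOW FAST CPUS: percent of ideal linear scaling achieved.
efficiency() {
    awk -v s="$1" -v f="$2" -v n="$3" \
        'BEGIN { printf "%.0f\n", s / f / n * 100 }'
}

speedup 11.9 1.27          # prints 9.4
efficiency 11.9 1.27 16    # prints 59
```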
(Since I wanted to see the absolute best case for pigz in terms of
speed, I ran it on all CPUs. I'm not interested in doing more tests
to establish how it scales when run with fewer CPUs, since we're not
going to use it; zstd is better for our case.)
PS: I'm not giving absolute speeds because these speeds vary tremendously across our systems and also depend on what's being compressed, even with just ASCII text.