An interaction of low ZFS recordsize, compression, and advanced format disks

May 2, 2018

Suppose that you have something with a low ZFS recordsize; a classical example is zvols, where people often use an 8 Kb volblocksize. You have compression turned on, and you are using a pool (or vdev) with ashift=12 because it's on 'advanced format' drives or you're preparing for that possibility. This seems especially likely on SSDs, some of which are already claiming to be 4K physical sector drives.

In this situation, you will probably get much lower compression ratios than you expect, even with reasonably compressible data. There are two reasons for this, the obvious one and the inobvious one. The obvious one is that ZFS compresses each logical block separately, and your logical blocks are small. Generally the larger the things you compress at once, the better most compression algorithms do, up to a reasonable size; if you compress in small pieces, each piece gives the algorithm less redundancy to find and you get worse compression.
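A rough way to see the block size effect for yourself is to compress the same data in small and in large independent chunks and compare the totals. This is only a sketch under assumptions: it uses Python's zlib as a stand-in because lz4 is not in the standard library, the sample data is synthetic, and compressed_total is a name made up for this illustration. The absolute numbers will not match what ZFS gets with lz4.

```python
import random
import zlib


def compressed_total(data: bytes, chunk_size: int) -> int:
    """Compress each chunk_size piece of data on its own and return the summed size."""
    total = 0
    for start in range(0, len(data), chunk_size):
        total += len(zlib.compress(data[start:start + chunk_size]))
    return total


# Synthetic, mildly compressible 'text': random picks from a fixed vocabulary.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = [
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(500)
]
data = " ".join(rng.choice(vocab) for _ in range(200_000)).encode()[:1024 * 1024]

# 8 Kb chunks stand in for a small recordsize, 128 Kb chunks for the default one.
small_total = compressed_total(data, 8 * 1024)
large_total = compressed_total(data, 128 * 1024)

print(f"input size:           {len(data)} bytes")
print(f"8 Kb chunks total:    {small_total} bytes")
print(f"128 Kb chunks total:  {large_total} bytes")
```

Every small chunk has to rediscover the vocabulary from scratch, which is why its total comes out larger.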

(The lz4 command line compression program doesn't even have an option to compress in blocks smaller than 64 Kb, which shows you what people think of the idea. The lz4 algorithm can be applied to smaller blocks, and ZFS does, but presumably the results are not as good.)

The inobvious problem is how a small recordsize interacts with a large physical block size (ie, a large ashift). In order to save any space on disk, compression has to shrink the data enough so that it uses fewer disk blocks. With 4 Kb disk blocks (an ashift of 12), this means you need to compress things down by at least 4 Kb; when you're starting with 8 Kb logical blocks because of your 8 Kb recordsize, this means you need at least 50% compression in order to save any space at all. If your data is compressible but not that compressible, you can't save any allocated space.
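The rounding can be sketched in a few lines of Python. This is a simplified model, not ZFS's actual allocation code: allocated_size is a hypothetical helper that only rounds a compressed size up to whole disk blocks, and it ignores metadata, raidz padding, and any other rules ZFS applies when deciding whether to keep a compressed block.

```python
def allocated_size(compressed_bytes: int, ashift: int) -> int:
    """Round a compressed size up to whole disk blocks of 2**ashift bytes."""
    block = 1 << ashift
    return -(-compressed_bytes // block) * block  # ceiling division


RECORDSIZE = 8 * 1024  # the 8 Kb recordsize / volblocksize from the text
ASHIFT = 12            # 4 Kb disk blocks

for saved_pct in (10, 30, 49, 50, 60):
    compressed = RECORDSIZE * (100 - saved_pct) // 100
    on_disk = allocated_size(compressed, ASHIFT)
    print(f"{saved_pct:2d}% smaller: {compressed} bytes compressed, "
          f"{on_disk} bytes allocated")
```

In this model everything short of 50% compression still allocates the full 8 Kb, and everything from 50% up allocates one 4 Kb block.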

A larger recordsize gives you more room to at least save some space. With a 128 Kb recordsize, you need only compress a bit (to 124 Kb, about 3% compression) in order to save one 4 Kb disk block. Further increases in compression can get you more savings, bit by bit, because you have more disk blocks to shave away.

(An ashift=9 pool similarly gives you more room to get wins from compression because you can save space in 512 byte increments, instead of needing to come up with 4 Kb of space savings at a time.)
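To put numbers on how recordsize and ashift trade off, here is a small sketch that computes how much a record has to shrink before the first disk block is freed. It uses the same simplified 'round up to whole disk blocks' model as the reasoning above; first_saving_pct is a name invented for this example, and real ZFS may apply additional thresholds that this ignores.

```python
def first_saving_pct(recordsize: int, ashift: int) -> float:
    """Smallest percentage a record must shrink by to free one disk block."""
    block = 1 << ashift
    return 100.0 * block / recordsize


for recordsize_kb in (8, 16, 128):
    for ashift in (9, 12):
        pct = first_saving_pct(recordsize_kb * 1024, ashift)
        print(f"recordsize {recordsize_kb:3d} Kb, ashift={ashift:2d}: "
              f"first saving needs {pct:.2f}% compression")
```

A result of 100% (recordsize equal to the disk block size) means that ordinary compression can never free a block in this model.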

(Writing this up as an entry was sparked by this ZFS discussion.)

PS: I believe that this implies that if your recordsize (or volblocksize) is the same as the disk physical block size (or ashift size), compression will never do anything for you. I'm not sure if ZFS will even try to run the compression code or if it will silently pretend that you have compression=off set.

Comments on this page:

In the case where logical blocks are the same size as physical blocks, you are correct that compression per se is useless. ZFS still completely drops blocks of all zeros, but only if compression is enabled. So in that case, you can use compression=zle and get all the benefit with the least overhead.

By Paul Arakelyan at 2018-05-19 03:10:50:

If you have those 4KB/sector or even 8KB/sector drives and try to get the best results using compression, you are in double trouble. Each recordsize-sized piece of data (the same applies to volblocksize) can only be compressed down to a multiple of the disk block size, so to compress 8KB into 4KB your data must be compressible at 2:1, and you'll never get a ratio better than 2:1. You can increase volblocksize or recordsize, e.g. to 16KB, so that you can get ratios from 4:3 up to 4:1, and so forth, to bring the achievable ratio closer to the real ratio (unless you have small files, or lots of files that have a small part of them "hanging out"; a 17KB file on a 16KB recordsize filesystem will not give you better than 2 blocks on disk).

On one hand, increasing the recordsize up to 1MB gives you better compression ratios (and slower compression as well); on the other, you get insane overheads when it comes to reading and rewriting small fragments. For example, instead of reading 4 or 8KB, you have to read some blocks, decompress them to 1MB, and pass the needed 4KB to userland. Things get even worse with writing back, and you get wasted bandwidth, wasted CPU cycles and write amplification, all at once.

Once I had to run a 120+GB (and growing) MySQL database (plus a closed-source app that was never intended for such sizes, but that's another story) on an Intel X25M Gen2 80GB SSD. The best solution was the weirdest idea: 2KB blocksize (yes, on a 4KB-sector drive!) and 16KB recordsize. That way the compression ratio was still decent and the overhead was not noticeable.

Written on 02 May 2018.

Last modified: Wed May 2 01:27:30 2018