2013-09-23
ZFS filesystem compression and quotas
ZFS filesystem compression is widely seen as an almost universally good thing (unlike deduplication); turning it on almost always gives you a clear space gain for what is generally a minor cost. Unfortunately it turns out to have an odd drawback in our environment because of how it interacts with ZFS's disk quotas. Put simply, ZFS disk quotas limit the physical space consumed by a filesystem, not the logical space. In other words, they limit how much post-compression disk space a filesystem can use instead of the pre-compression space. This has two drawbacks.
The first drawback is simply the user experience. In some situations writing 10 GB to a filesystem with 10 GB of quota space left will fill it up; in other situations you'll be left with a somewhat unpredictable amount of space free afterwards. Similarly, if you have 10 GB free and rewrite portions of an existing file (perhaps you have a database writing and rewriting records), your free space can go down. Or up. All of this can be explained but generally not predicted, and I think it's going to be at least a bit surprising to people.
(Of course these user experience problems exist even without quotas, because your pool only has so much space and how fast that space gets used up is unpredictable.)
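To make this concrete, here's a rough back-of-the-envelope sketch (in Python, and deliberately not ZFS's actual space accounting): with quotas charged against post-compression space, how much you can write under a fixed quota depends entirely on how well your particular data compresses. The 10 GiB quota and the compression ratios are just illustrative numbers.

    def logical_data_under_quota(quota_bytes, compress_ratio):
        # With quotas charged against post-compression (physical) space,
        # the pre-compression data that fits is the quota times the ratio.
        return quota_bytes * compress_ratio

    GiB = 1024 ** 3
    for ratio in (1.0, 1.5, 3.0):   # hypothetical compression ratios
        fits = logical_data_under_quota(10 * GiB, ratio)
        print("%.1fx compression: ~%.0f GiB of writes fill a 10 GiB quota"
              % (ratio, fits / GiB))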
The more significant problem for us is that we primarily use quotas to limit how much data we have to back up for a single filesystem. Here the space usage we care about and want to limit is actually the raw, pre-compression space usage. We don't care how much space a filesystem takes on disk, we care how much space it will take on backups (and we generally don't want to compress our backups for various reasons). Quotas based on logical space consumed would be much more useful to us than the current ZFS quotas.
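The flip side of the same arithmetic is what bites us for backups. If what you really want to cap is the pre-compression size (what the backup system will see), a quota on physical space has to be scaled down by the compression ratio, and you can't know that ratio in advance. Again this is only an illustrative sketch with made-up numbers, not anything ZFS itself does:

    def physical_quota_for_backup_cap(backup_cap_bytes, compress_ratio):
        # Quota on post-compression (physical) space needed so that the
        # logical data -- what backups see -- stays under backup_cap_bytes.
        return backup_cap_bytes / compress_ratio

    GiB = 1024 ** 3
    for ratio in (1.0, 1.5, 3.0):   # hypothetical compression ratios
        q = physical_quota_for_backup_cap(100 * GiB, ratio)
        print("%.1fx compression: quota must be ~%.1f GiB to cap backups at 100 GiB"
              % (ratio, q / GiB))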
(Since we have to recreate all of our pools anyway, I've been thinking about whether we want to change our standard pool and filesystem configurations. My tentative conclusion is that we don't want to turn compression on, largely because of the backup issue combined with the fact that it probably won't save people significant amounts of space.)
2013-09-02
A little bit more on ZFS RAIDZ read performance
Back in this entry I talked about how all levels of ZFS RAIDZ had an unexpected read performance hit: they can't read less than a full stripe, so instead of the IOPS of N disks you get the IOPS of one disk. Well, it was recently pointed out to me that this is not quite correct. It is true that ZFS reads a data block's entire stripe; however, ZFS does not read the parity chunks (unless the block fails its checksum and needs to be repaired).
In normal RAIDZ pools the difference between 'all disks' and 'all disks except the parity disks' is small. If the parity for the stripes you're reading bits of is evenly spread over all of the disks, you might get somewhat more than one disk's IOPS in aggregate. Where this can matter is in very small RAIDZ pools, for example a four-disk RAIDZ2 pool. Here half your drives are parity drives for any particular data block, and you may get something more like two disks of IOPS.
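A crude way to put numbers on this (a sketch only; real read IOPS depend on block sizes, caching, and seek patterns, and the per-disk figure below is just an assumed round number):

    def raidz_read_iops(total_disks, parity_disks, per_disk_iops):
        # Each full-stripe read busies only the data disks (parity isn't
        # read), so the vdev can sustain roughly total/(total - parity)
        # times one drive's worth of random reads.
        data_disks = total_disks - parity_disks
        return per_disk_iops * total_disks / data_disks

    print(raidz_read_iops(8, 2, 100))   # wide RAIDZ2: ~133, barely over one disk
    print(raidz_read_iops(4, 2, 100))   # 4-disk RAIDZ2: 200, about two disks' worth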
(A four-disk RAIDZ2 vdev is actually an interesting and potentially useful thing; it's basically a more resilient but potentially slower version of two mirror vdevs. You lose half of your disk space, as with mirroring, but you can withstand the failure of any two disks (unlike mirroring).)
To add some more RAIDZ parity trivia: RAIDZ parity is read and verified during scrubs (and thus likely resilvers), which is what you want. Data block checksums are verified as well, of course, which means that scrub reads genuinely keep all drives busy.
Sidebar: small write blocks and read IOPS
Another way that you can theoretically get more than one disk's IOPS from a RAIDZ vdev is if the data was written in sufficiently small blocks. As I mentioned in passing here, ZFS doesn't have a fixed 'stripe size', so a small write will put data (and parity) on fewer than N disks. In turn, reading back this data will need fewer than N-minus-parity disks, meaning that with luck you can read another small block from the other drives at the same time.
Since 'one sector' is the minimum amount of data to put on a single drive, this is probably much more likely now in the days of disks with 4096-byte sectors than it was on 512-byte sector drives. If you have a ten-disk RAIDZ2 on 4k disks, for example, it now takes a 32 KB data block to wind up on all 8 possible data drives.
(On 512-byte sector disks it would only have needed a 4 KB data block.)
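The arithmetic here is simple enough to sketch out (again only a rough model; it ignores RAIDZ padding sectors and assumes the block splits evenly into sectors):

    def data_drives_used(block_bytes, sector_bytes, data_disks):
        # How many data drives a single block spreads over: one sector is
        # the minimum per drive, so it's the sector count capped at the
        # number of data disks.
        sectors = -(-block_bytes // sector_bytes)   # ceiling division
        return min(sectors, data_disks)

    # A ten-disk RAIDZ2 has 8 data drives.
    print(data_drives_used(32 * 1024, 4096, 8))   # 32 KB block, 4K sectors: all 8
    print(data_drives_used(4 * 1024, 4096, 8))    # 4 KB block, 4K sectors: just 1
    print(data_drives_used(4 * 1024, 512, 8))     # 4 KB block, 512-byte sectors: all 8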