Thinking through what can go badly with databases on ZFS

January 24, 2021

Famously, if you're running a database with its storage on ZFS and you care about performance, you need to tune various ZFS parameters for the filesystem (or filesystems) that the database is on. You especially need to tune the ZFS recordsize property; generally people will say that if you change only one thing, you should change this to be either the same size as your database's block size or perhaps twice its size. But this raises a question for a certain sort of person, namely what goes badly when you leave ZFS's recordsize alone and run a database anyway. I can't answer this from experiments and experience (we've never tried to run performance sensitive databases on our ZFS fileservers), but I can work through this based on knowledge of how ZFS works. I'm going to assume SSD or NVMe storage; if you're still running a database on spinning rust and trying for performance, ZFS's recordsize setting is the least of your problems.

(Examples of tuning recommendations include this [PDF] (via) or Let's Encrypt's ZFS datastore for MariaDB (via).)

The default ZFS recordsize is 128 Kb. What this means is that once a file is 128 Kb or larger, it's stored in logical blocks that are 128 Kb in size (this is the size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes must be done to entire (logical) blocks at once, even if at the user level you only want to read or write a small amount of data. This 128 Kb logical block IO forces overheads on both database reads and especially database writes.

For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect there, although the extra data transferred, decompressed, and checksummed may increase your read IO latency a bit.

Things are worse for database writes. To update an 8 Kb or 16 Kb chunk, ZFS must read the 128 Kb block into memory if it's not already there (taking the read overheads, including latency), checksum and likely compress the new version of the 128 Kb block, allocate new disk space for it all, and write it. Importantly, the same read, modify, and write process is required most of the time if you're appending to a file, such as a database's write-ahead log. When the database fsync()s its data (either for its log or for the main data files), ZFS may also write the full data into the ZFS Intent Log. Because a fsync() forces the disk to flush data to durable storage and the time this takes usually depends on how much data there is to flush, I think the increased data written to the ZIL will increase fsync() latency and thus transaction commit latency.

(It's not clear to me if a partial write of a block in a file that has hit the full recordsize writes only the new user-level data to the ZIL or if the ZIL includes the full block, probably out of line but still forced to disk.)

On modern SSDs and NVMe drives, there's a limited internal drive cache of fast storage for buffering writes before they have to be put on the slower main flash. If your database has a high enough write volume, the extra data that has to be written with a 128 Kb recordsize might push the drive out of that fast write storage and slow down all writes. I suspect that most people don't have that much write traffic and that this isn't a real concern; my impression is that people normally hit this drive limit with sustained asynchronous writes.

PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.

PPS: I have some thoughts about the common recommendation for a logbias of throughput on modern storage, but that needs another entry. The short version is that what throughput really does is complicated and it may not be to your benefit today on devices where random IO is free and write bandwidth is high.

(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)

Written on 24 January 2021.
« SMART Threshold numbers turn out to not be useful for us in practice
Your monitoring and alerts remember things for you »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Jan 24 00:47:58 2021
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.