Thinking through what can go badly with databases on ZFS
Famously, if you're running a database with its storage on ZFS and
you care about performance, you need to tune various ZFS parameters
for the filesystem (or filesystems) that the database is on. You
especially need to tune the ZFS recordsize
property; generally
people will say that if you change only one thing, you should change
this to be either the same size as your database's block size or
perhaps twice its size. But this raises a question for a certain
sort of person, namely what goes badly when you leave ZFS's
recordsize
alone and run a database anyway. I can't answer this
from experiments and experience (we've never tried to run
performance-sensitive databases on our ZFS fileservers), but I can
work through this based
on knowledge of how ZFS works. I'm going to assume SSD or NVMe
storage; if you're still running a database on spinning rust and
trying for performance, ZFS's recordsize
setting is the least of
your problems.
(Examples of tuning recommendations include this [PDF] (via) or Let's Encrypt's ZFS datastore for MariaDB (via).)
The default ZFS recordsize
is 128 Kb. What this means is that once a file is 128 Kb or larger,
it's stored in logical blocks that are 128 Kb in size (this is the
size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes
must be done to entire (logical) blocks at once, even if at the
user level you only want to read or write a small amount of data.
This 128 Kb logical block IO forces overheads on both database reads
and especially database writes.
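To make that "whole blocks at once" point concrete, here's a small
Python sketch of the arithmetic involved (the 16 Kb database page size
is an assumed example, and this only illustrates the logical block
math, not how ZFS itself is implemented):

    # Illustration only: how a small user-level IO maps onto whole
    # 128 Kb ZFS logical blocks.  The sizes are assumed examples.
    RECORDSIZE = 128 * 1024       # default ZFS recordsize
    DB_PAGE = 16 * 1024           # assumed database page size

    def blocks_touched(offset, length, recordsize=RECORDSIZE):
        """Which logical blocks a user-level IO falls into, and how
        many bytes ZFS has to read or write as a result."""
        first = offset // recordsize
        last = (offset + length - 1) // recordsize
        return first, last, (last - first + 1) * recordsize

    first, last, handled = blocks_touched(5 * 1024 * 1024, DB_PAGE)
    print(f"user IO of {DB_PAGE} bytes -> ZFS handles {handled} bytes "
          f"(logical blocks {first} through {last})")
    # user IO of 16384 bytes -> ZFS handles 131072 bytes
    # (logical blocks 40 through 40)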
For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in-kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect on throughput, although having to transfer, decompress, and checksum it may increase your read IO latency a bit.
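As a rough illustration of the read side arithmetic (all of the
numbers here are assumptions for the sake of the example, not
measurements from any real system):

    # Back of the envelope read amplification for random database page
    # reads with the default recordsize.  Every figure is an assumption.
    RECORDSIZE = 128 * 1024
    DB_PAGE = 16 * 1024
    reads_per_sec = 20_000        # assumed random page reads per second

    wanted = reads_per_sec * DB_PAGE
    transferred = reads_per_sec * RECORDSIZE
    print(f"amplification: {RECORDSIZE // DB_PAGE}x")
    print(f"database wants {wanted / 2**20:.0f} MiB/s; ZFS transfers, "
          f"checksums, and caches {transferred / 2**20:.0f} MiB/s")
    # Each block cached in the ARC also occupies 128 Kb of memory even
    # though the database only asked for 16 Kb of it.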
Things are worse for database writes. To update an 8 Kb or 16 Kb
chunk, ZFS must read the 128 Kb block into memory if it's not already
there (taking the read overheads, including latency), checksum and
likely compress the new version of the 128 Kb block, allocate new
disk space for it all, and write it. Importantly, the same read,
modify, and write process is required most of the time if you're
appending to a file, such as a database's write-ahead log. When the
database fsync()s its data (either for its log or for the main
data files), ZFS may also write the full data into the ZFS Intent
Log. Because a fsync()
forces the disk to flush
data to durable storage and the time this takes usually depends on
how much data there is to flush, I think the increased data written
to the ZIL will increase fsync()
latency and thus transaction
commit latency.
(It's not clear to me if a partial write of a block in a file that
has hit the full recordsize
writes only the new user-level data
to the ZIL or if the ZIL includes the full block, probably out
of line but still forced to disk.)
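To put rough numbers on the write side (again with assumed figures,
and with the ZIL question above left as an explicit parameter since I
don't know the answer):

    # Rough write-path arithmetic for synchronously updating one 16 Kb
    # database page in a file with the default 128 Kb recordsize.
    # Purely illustrative; every figure is an assumption.
    RECORDSIZE = 128 * 1024
    DB_PAGE = 16 * 1024

    def update_io(block_cached, zil_logs_full_block):
        """Total bytes of IO to update one page and fsync() it."""
        read = 0 if block_cached else RECORDSIZE   # read-modify-write
        write = RECORDSIZE        # new copy of the block, written at txg sync
        zil = RECORDSIZE if zil_logs_full_block else DB_PAGE
        return read + write + zil

    for cached in (True, False):
        for full_zil in (False, True):
            kib = update_io(cached, full_zil) // 1024
            print(f"block cached: {cached!s:5}  ZIL logs full block: "
                  f"{full_zil!s:5}  -> {kib} KiB of IO for a 16 KiB update")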
On modern SSDs and NVMe drives, there's a limited internal drive cache
of fast storage for buffering writes before they have to be put on the
slower main flash. If your database has a high enough write volume, the
extra data that has to be written with a 128 Kb recordsize
might push
the drive out of that fast write storage and slow down all writes. I
suspect that most people don't have that much write traffic and that
this isn't a real concern; my impression is that people normally hit
this drive limit with sustained asynchronous writes.
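If you want to put very rough numbers on this, here's a simple-minded
sketch (the cache size and update rate are invented, and it ignores
the drive draining its fast cache to slower flash in the background):

    # How long an assumed amount of fast write cache lasts under
    # sustained page updates, at two recordsizes.  Illustration only.
    FAST_CACHE = 16 * 2**30          # assume 16 GiB of fast write cache
    updates_per_sec = 5_000          # assumed sustained page updates/sec

    for recordsize in (16 * 1024, 128 * 1024):
        rate = updates_per_sec * recordsize
        print(f"{recordsize // 1024:>3} Kb recordsize: "
              f"{rate / 2**20:.0f} MiB/s written, fast cache absorbs "
              f"about {FAST_CACHE / rate:.0f} seconds of writes")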
PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.
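Here's a tiny sketch of that arithmetic (purely an illustration,
assuming 16 Kb appends to a file that's already well over 128 Kb):

    # Which 16 Kb appends start a brand-new 128 Kb block (and so skip
    # the read-modify-write of the file's current last block)?
    RECORDSIZE = 128 * 1024
    APPEND = 16 * 1024

    offset = 10 * RECORDSIZE        # append position; on a block boundary
    for i in range(1, 17):
        new_block = (offset % RECORDSIZE) == 0
        print(f"append #{i:2}: "
              f"{'starts a new block' if new_block else 'rewrites the last block'}")
        offset += APPEND
    # With 16 Kb appends, only every 8th append starts a new block.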
PPS: I have some thoughts about the common recommendation for a
logbias
of throughput
on modern storage, but that needs another
entry. The short version is that what throughput
really does is
complicated and it may not be to your benefit today on devices where
random IO is free and write bandwidth is high.
(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)