Your monitoring and alerts remember things for you
On Twitter, I said:
Sure, this TLS certificate expires in 2030, and we might well not even be using the service it's for in five years. But that just makes it more important to monitor its expiry time; a decade is easily long enough for us to completely forget about the issue.
Just like scripts remember things for you, so do alerts and monitoring in general. It's vaguely possible that we'd remember to keep an eye on obscure self-signed TLS certificates that expire in a decade, but probably not, and anyway if we automate it that's one less thing to keep in our minds. As system administrators, we have a lot of things to keep track of, and as squishy humans, we can only keep track of so many of them before stuff gets dropped.
(Because someone is going to wonder, this particular self-signed certificate has to be manually registered with and approved by a third party before we can use it. Doing this even every year is not something we want to get into. Since the third party is happy to accept a TLS certificate with a very long lifetime, we'll use that.)
Of course what our monitoring and alerts remember can become obsolete over time. Fortunately, they generally come with built-in reminders about their existence, so eventually we will get prompted to remove or update them. Well, most of the time. If we switched to using another TLS certificate for this particular case but left the old one behind, our monitoring might not notice (for reasons, it has to look at the TLS certificate file instead of making a TLS connection to something). But this is still better than possibly completely forgetting about the TLS certificate.
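As a concrete illustration of checking a certificate file rather than a live connection, here's a minimal sketch (not our actual monitoring; the function names and the 30-day threshold are invented for the example). It assumes you've already extracted the certificate's expiry date as the string printed by 'openssl x509 -enddate -noout', and just does the date arithmetic:

```python
from datetime import datetime

def days_left(not_after: str, now: datetime) -> int:
    """Days until a certificate expires, given openssl's notAfter date.

    `not_after` is the date portion of `openssl x509 -enddate -noout`
    output, e.g. "Jun  1 12:00:00 2030 GMT".
    """
    # Drop the trailing timezone name and parse the rest; openssl
    # always prints these dates in GMT.
    date_part = not_after.rsplit(" ", 1)[0]
    expiry = datetime.strptime(date_part, "%b %d %H:%M:%S %Y")
    return (expiry - now).days

def should_alert(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    # The check runs quietly forever and only speaks up when the
    # expiry is actually getting close.
    return days_left(not_after, now) < warn_days
```

The deliberate design here is that the check says nothing at all for years on end; its entire job is to exist so that no human has to remember the certificate.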
(Alerts that are just there to make sure you remember something should normally be silent. If they're going off more than once in a blue moon, you probably have one of a number of problems.)
Thinking through what can go badly with databases on ZFS
Famously, if you're running a database with its storage on ZFS and
you care about performance, you need to tune various ZFS parameters
for the filesystem (or filesystems) that the database is on. You
especially need to tune the ZFS
recordsize property; generally
people will say that if you change only one thing, you should change
this to be either the same size as your database's block size or
perhaps twice its size. But this raises a question for a certain
sort of person, namely what goes badly when you leave ZFS's
recordsize alone and run a database anyway. I can't answer this
from experiments and experience (we've never tried to run
performance-sensitive databases on our ZFS fileservers), but I can work through this based
on knowledge of how ZFS works. I'm going to assume SSD or NVMe
storage; if you're still running a database on spinning rust and
trying for performance, ZFS's
recordsize setting is the least of your concerns.
The default ZFS
recordsize is 128 Kb. What this means is that once a file is 128 Kb or larger,
it's stored in logical blocks that are 128 Kb in size (this is the
size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes
must be done to entire (logical) blocks at once, even if at the
user level you only want to read or write a small amount of data.
This 128 Kb logical block IO forces overheads on both database reads
and especially database writes.
For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in-kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect there, although the extra data transferred, decompressed, and checksummed may increase your read IO latency a bit.
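To put rough numbers on the read side, here's the arithmetic as a tiny sketch. This is a deliberately simple model that assumes a cold cache and ignores compression and metadata:

```python
def read_amplification(recordsize: int, db_block: int) -> float:
    """Bytes ZFS reads (and checksums) per byte the database asked for.

    Assumes a cold cache and no compression, so one database read
    pulls in one full ZFS logical block.
    """
    return recordsize / db_block

KB = 1024
# With the default 128 Kb recordsize, a 16 Kb database read drags in
# 8x the data the database wanted; matching recordsize to the
# database block size brings this down to 1x.
assert read_amplification(128 * KB, 16 * KB) == 8.0
assert read_amplification(16 * KB, 16 * KB) == 1.0
```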
Things are worse for database writes. To update an 8 Kb or 16 Kb
chunk, ZFS must read the 128 Kb block into memory if it's not already
there (taking the read overheads, including latency), checksum and
likely compress the new version of the 128 Kb block, allocate new
disk space for it all, and write it. Importantly, the same read,
modify, and write process is required most of the time if you're
appending to a file, such as a database's write-ahead log. When the
database fsync()s its data (either for its log or for the main
data files), ZFS may also write the full data into the ZFS Intent
Log. Because a
fsync() forces the disk to flush
data to durable storage and the time this takes usually depends on
how much data there is to flush, I think the increased data written
to the ZIL will increase
fsync() latency and thus transaction commit latency.
(It's not clear to me if a partial write of a block in a file that
has hit the full
recordsize writes only the new user-level data
to the ZIL or if the ZIL includes the full block, probably out
of line but still forced to disk.)
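The write path can be given the same rough-numbers treatment. This sketch is a simplified model of my own devising that only counts data bytes moved for one in-place database block update, ignoring the ZIL, compression, and metadata:

```python
def write_cost(recordsize: int, db_block: int, block_cached: bool) -> dict:
    """Approximate bytes moved to update one database block in place.

    If the containing ZFS block isn't already cached, it must be read
    in whole first; the new version is always written out whole.
    """
    read = 0 if block_cached else recordsize
    return {"read": read, "written": recordsize, "logical": db_block}

KB = 1024
cost = write_cost(128 * KB, 8 * KB, block_cached=False)
# 128 Kb read plus 128 Kb written to change 8 Kb of data:
# 32x amplification on a cold cache.
amplification = (cost["read"] + cost["written"]) / cost["logical"]
```

Even with the block already in the ARC, the model still writes a full 128 Kb block for an 8 Kb logical change, which is the irreducible part of the overhead.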
On modern SSDs and NVMe drives, there's a limited internal drive cache
of fast storage for buffering writes before they have to be put on the
slower main flash. If your database has a high enough write volume, the
extra data that has to be written with a 128 Kb
recordsize might push
the drive out of that fast write storage and slow down all writes. I
suspect that most people don't have that much write traffic and that
this isn't a real concern; my impression is that people normally hit
this drive limit with sustained asynchronous writes.
PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.
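The append arithmetic above can be checked with a small simulation of this model, where an append skips the read-modify-write cycle only when it begins a fresh recordsize-sized block:

```python
def appends_needing_rmw(n_chunks: int, chunk: int, recordsize: int) -> int:
    """Count appends that must rewrite an existing partial tail block.

    Model from the entry: an append avoids the read-modify-write
    cycle only when it starts at a fresh block boundary.
    """
    rmw = 0
    offset = 0
    for _ in range(n_chunks):
        if offset % recordsize != 0:
            rmw += 1
        offset += chunk
    return rmw

KB = 1024
# Appending 16 Kb chunks into 128 Kb blocks: only every 8th append
# starts a new block, so 7 of the first 8 pay the cost.
assert appends_needing_rmw(8, 16 * KB, 128 * KB) == 7
```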
PPS: I have some thoughts about the common recommendation for
'logbias=throughput' on modern storage, but that needs another
entry. The short version is that what 'logbias=throughput' really does is
complicated and it may not be to your benefit today on devices where
random IO is free and write bandwidth is high.
(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)