Wandering Thoughts archives


Your monitoring and alerts remember things for you

On Twitter, I said:

Sure, this TLS certificate expires in 2030, and we might well not even be using the service it's for in five years. But that just makes it more important to monitor its expiry time; a decade is easily long enough for us to completely forget about the issue.

(There's a story behind my action.)

Just like scripts remember things for you, so do alerts and monitoring in general. It's vaguely possible that we'd remember to keep an eye on obscure self-signed TLS certificates that expire in a decade, but probably not, and anyway if we automate it that's one less thing to keep in our minds. As system administrators, we have a lot of things to keep track of, and as squishy humans, we can only keep track of so many of them before stuff gets dropped.

(Because someone is going to wonder, this particular self-signed certificate has to be manually registered with and approved by a third party before we can use it. Doing this even every year is not something we want to get into. Since the third party is happy to accept a TLS certificate with a very long lifetime, we'll use that.)

Of course what our monitoring and alerts remember can become obsolete over time. Fortunately, they generally come with built-in reminders about their existence, so eventually we will get prompted to remove or update them. Well, most of the time. If we switched to using another TLS certificate for this particular case but left the old one behind, our monitoring might not notice (for reasons, it has to look at the TLS certificate file instead of making a TLS connection to something). But this is still better than possibly completely forgetting about the TLS certificate.

(Alerts that are just there to make sure you remember something should normally be silent. If they're going off more than once in a blue moon, you probably have one of a number of problems.)

sysadmin/MonitoringAndAlertsRemember written at 23:43:15; Add Comment

Thinking through what can go badly with databases on ZFS

Famously, if you're running a database with its storage on ZFS and you care about performance, you need to tune various ZFS parameters for the filesystem (or filesystems) that the database is on. You especially need to tune the ZFS recordsize property; generally people will say that if you change only one thing, you should change this to be either the same size as your database's block size or perhaps twice its size. But this raises a question for a certain sort of person, namely what goes badly when you leave ZFS's recordsize alone and run a database anyway. I can't answer this from experiments and experience (we've never tried to run performance sensitive databases on our ZFS fileservers), but I can work through this based on knowledge of how ZFS works. I'm going to assume SSD or NVMe storage; if you're still running a database on spinning rust and trying for performance, ZFS's recordsize setting is the least of your problems.

(Examples of tuning recommendations include this [PDF] (via) or Let's Encrypt's ZFS datastore for MariaDB (via).)

The default ZFS recordsize is 128 Kb. What this means is that once a file is 128 Kb or larger, it's stored in logical blocks that are 128 Kb in size (this is the size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes must be done to entire (logical) blocks at once, even if at the user level you only want to read or write a small amount of data. This 128 Kb logical block IO forces overheads on both database reads and especially database writes.

For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect there, although the extra data transferred, decompressed, and checksummed may increase your read IO latency a bit.

Things are worse for database writes. To update an 8 Kb or 16 Kb chunk, ZFS must read the 128 Kb block into memory if it's not already there (taking the read overheads, including latency), checksum and likely compress the new version of the 128 Kb block, allocate new disk space for it all, and write it. Importantly, the same read, modify, and write process is required most of the time if you're appending to a file, such as a database's write-ahead log. When the database fsync()s its data (either for its log or for the main data files), ZFS may also write the full data into the ZFS Intent Log. Because a fsync() forces the disk to flush data to durable storage and the time this takes usually depends on how much data there is to flush, I think the increased data written to the ZIL will increase fsync() latency and thus transaction commit latency.

(It's not clear to me if a partial write of a block in a file that has hit the full recordsize writes only the new user-level data to the ZIL or if the ZIL includes the full block, probably out of line but still forced to disk.)

On modern SSDs and NVMe drives, there's a limited internal drive cache of fast storage for buffering writes before they have to be put on the slower main flash. If your database has a high enough write volume, the extra data that has to be written with a 128 Kb recordsize might push the drive out of that fast write storage and slow down all writes. I suspect that most people don't have that much write traffic and that this isn't a real concern; my impression is that people normally hit this drive limit with sustained asynchronous writes.

PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.

PPS: I have some thoughts about the common recommendation for a logbias of throughput on modern storage, but that needs another entry. The short version is that what throughput really does is complicated and it may not be to your benefit today on devices where random IO is free and write bandwidth is high.

(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)

solaris/ZFSDatabasesWhatHappens written at 00:47:58; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.