Being reminded that an obvious problem isn't necessarily obvious

July 30, 2018

The other day we had a problem with one of our NFS fileservers, where a ZFS filesystem filled up to its quota limit, people kept writing to the filesystem at high volume, and the fileserver got unhappy. This nice neat description hides the fact that it took me some time to notice that the one filesystem that our DTrace scripts were pointing to as having all of the slow NFS IO was a full filesystem. Then and only then did the penny finally start dropping (which led me to a temporary fix).

(I should note that we had Amanda backups and a ZFS pool scrub happening on the fileserver at the time, so there were a number of ways it could have been overwhelmed.)

In the immediate aftermath, I felt a bit silly for missing such an obvious issue. I'm pretty sure we've seen the 'full filesystem plus ongoing writes leads to problems' issue before, and we've certainly seen similar problems with full pools. In fact, four years ago I wrote an entry about remembering to check for this sort of stuff in a crisis. Then I thought about it more and kicked myself for hindsight bias.

The reality of sysadmin life is that in many situations, there are too many obvious problem causes to keep track of them all. We will remember common 'obvious' things, by which I mean things that keep happening to us. But fallible humans with limited memories simply can't keep track of infrequent things that are merely easy to spot if you remember where to look. These things are 'obvious' in a technical sense, but they are not in a practical sense.

This is one reason why having a pre-written list of things to check is so potentially useful; it effectively remembers all of these obvious problem causes for you. You could just write them all down by themselves, but generally you might as well start by describing what to check and only then say 'if this check is positive ...'. You can also turn these checks (or some of them) into a script that you run and that reports anything it finds, or create a dashboard in your monitoring and alert system. There are lots of options.
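As a minimal sketch of the 'turn the checks into a script' option, here is one way it could look in Python. Everything specific here is an assumption for illustration: the function names, the 99% threshold, and the use of `shutil.disk_usage()`, which reports whatever the OS says about the filesystem (whether that reflects a ZFS quota is something you'd want to verify on your own systems).

```python
import shutil
from typing import Callable, List, Optional

# Hypothetical threshold: call a filesystem "full" at 99% used.
FULL_FRACTION = 0.99


def fullness(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def check_full_filesystem(path: str, threshold: float = FULL_FRACTION) -> Optional[str]:
    """Describe the problem if path's filesystem is at or over threshold, else None."""
    frac = fullness(path)
    if frac >= threshold:
        return f"{path} is {frac:.0%} full; look for ongoing writes to it"
    return None


def run_checklist(checks: List[Callable[[], Optional[str]]]) -> List[str]:
    """Run every check and collect only the ones that found something."""
    findings = []
    for check in checks:
        try:
            result = check()
        except OSError as exc:
            # A check that cannot run is itself worth reporting.
            result = f"check failed: {exc}"
        if result:
            findings.append(result)
    return findings


if __name__ == "__main__":
    # Hypothetical list of mount points; replace with your own.
    paths = ["/"]
    checks = [lambda p=p: check_full_filesystem(p) for p in paths]
    for finding in run_checklist(checks):
        print(finding)
```

The point of the structure is that each infrequent 'obvious' cause becomes one more small function in the list, so the script does the remembering instead of you.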

(Will we try to create such a checklist or diagnosis script? Probably not for our current fileservers, since they're getting replaced with a completely different OS in hopefully not too much time. Instead we'll just hope that we don't have more problems over their remaining lifetime, and probably I'll remember to check for full filesystems if this happens again in the near future.)

Sidebar: Why our (limited) alerting system didn't tell us anything

The simple version is that our system can't alert us only on the combination of a full filesystem, NFS problems with that fileserver, and perhaps an observed high write volume to it. Instead the best it can do is alert us on full filesystems alone, and that happens too often to be useful (especially since it's not something we can do anything about).
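To make the combination concrete, here is a rough sketch of the rule we'd want, written as plain Python and not as anything our alerting system can actually express. The `Observation` fields and the `filesystems_to_alert_on()` name are hypothetical; how you'd gather each piece of information is left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Observation:
    """One snapshot of a fileserver; all fields are illustrative assumptions."""
    nfs_is_slow: bool
    full_filesystems: Set[str] = field(default_factory=set)
    high_write_filesystems: Set[str] = field(default_factory=set)


def filesystems_to_alert_on(obs: Observation, require_high_writes: bool = False) -> List[str]:
    """Return full filesystems worth alerting on, only if NFS is also having problems."""
    if not obs.nfs_is_slow:
        # A full filesystem on a healthy fileserver is routine, so stay quiet.
        return []
    suspects = set(obs.full_filesystems)
    if require_high_writes:
        # Optionally narrow to filesystems that are also seeing heavy writes.
        suspects &= obs.high_write_filesystems
    return sorted(suspects)
```

A full filesystem by itself produces no alert here, which is the difference from alerting on full filesystems alone.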

Comments on this page:

By Perry Lorier at 2018-08-01 04:40:59:

At Google, we have lots of machines, so we tend to rely more on consoles.

If this was a problem there, I'd recommend having an alert for "machine is unhappy" (I guess, unreachable?). When alerting, you want to alert on "symptoms", not "causes", because there are an infinite number of ways a machine can be sad, but there's a finite list of things a machine should be doing that it might have stopped doing.

This alert would then link to a console, which is effectively a checklist of things to check, and the information you need to check it. So, on a fileserver you might have a graph of "min/median/max" of filesystem fullness over all filesystems on the box, a graph of "cpu usage (or load average)", a graph of "network usage (in/out bps)", and maybe a table showing number of disks that are healthy (x/y), number of mounted filesystems (x/y), etc.

So, you get an alert that a fileserver is unhappy (down?), you click on the console link, and you can immediately see the overall state of the fileserver, maybe with links to more detailed drilldown pages. For example, a link on the min/median/max filesystem fullness might link to a table that has a list of filesystems (mountpoint, disks-in-pool, disks-should-be-in-pool, #read/s, #write/s, used, size, %full), and that then links to a page that has even more detail to dig into for each filesystem.

Prometheus + Grafana is an open source implementation of the systems Google uses to implement this. One common Action Item from postmortems is "go add this information to a console, so it's clear to people debugging what's gone wrong".

Debugging becomes "Click on the console, console suggests X as a root cause, click on link beside X, and it gives you more detail, click on link beside Y and it gives you even more detail". 2 or 3 clicks later you could possibly even know precisely which user has managed to fill up which filesystem, and you can contact them directly. You know that it's probably because the filesystem is full and receiving a lot of writes, and you can restart the fileserver (or whatever) without that filesystem as a temporary mitigation to allow other users to continue their work.

Written on 30 July 2018.


Last modified: Mon Jul 30 00:59:57 2018