Being reminded that an obvious problem isn't necessarily obvious
The other day we had a problem with one of our NFS fileservers, where a ZFS filesystem filled up to its quota limit, people kept writing to the filesystem at high volume, and the fileserver got unhappy. This nice neat description hides the fact that it took me some time to notice that the one filesystem that our DTrace scripts were pointing to as having all of the slow NFS IO was a full filesystem. Then and only then did the penny finally start dropping (which led me to a temporary fix).
(I should note that we had Amanda backups and a ZFS pool scrub happening on the fileserver at the time, so there were a number of ways it could have been overwhelmed.)
In the immediate aftermath, I felt a bit silly for missing such an obvious issue. I'm pretty sure we've seen the 'full filesystem plus ongoing writes leads to problems' issue, and we've certainly seen similar problems with full pools. In fact four years ago I wrote an entry about remembering to check for this sort of stuff in a crisis. Then I thought about it more and kicked myself for hindsight bias.
The reality of sysadmin life is that in many situations, there are too many obvious problem causes to keep track of them all. We will remember common 'obvious' things, by which I mean things that keep happening to us. But fallible humans with limited memories simply can't keep track of infrequent things that are merely easy to spot if you remember where to look. These things are 'obvious' in a technical sense, but they are not in a practical sense.
This is one reason why having a pre-written list of things to check is so potentially useful; it effectively remembers all of these obvious problem causes for you. You could just write them all down by themselves, but generally you might as well start by describing what to check and only then say 'if this check is positive ...'. You can also turn these checks (or some of them) into a script that you run and that reports anything it finds, or create a dashboard in your monitoring and alert system. There are lots of options.
(Will we try to create such a checklist or diagnosis script? Probably not for our current fileservers, since they're getting replaced with a completely different OS in hopefully not too much time. Instead we'll just hope that we don't have more problems over their remaining lifetime, and probably I'll remember to check for full filesystems if this happens again in the near future.)
Sidebar: Why our (limited) alerting system didn't tell us anything
The simple version is that our system can't alert us only on the combination of a full filesystem, NFS problems with that fileserver, and perhaps an observed high write volume to it. Instead the best it can do is alert us on full filesystems alone, and that happens too often to be useful (especially since it's not something we can do anything about).
Comments on this page:Written on 30 July 2018.