During your crisis, remember to look for anomalies

October 18, 2014

This is a war story.

Today I had one of those valuable learning experiences for a system administrator. What happened is that one of our old fileservers locked up mysteriously, so we power cycled it. Then it locked up again. And again (and an attempt to get a crash dump failed). We thought it might be hardware related, so we transplanted the system disks into an entirely new chassis (with more memory, because there were some indications that it might be running out of memory somehow). It still locked up. Each lockup took maybe ten or fifteen minutes from the reboot, and things were all the more alarming and mysterious because this particular old fileserver only had a handful of production filesystems still on it; almost all of them had been migrated to one of our new fileservers. After one more lockup we gave up and went with our panic plan: we disabled NFS and set about doing an emergency migration of the remaining filesystems to the appropriate new fileserver.

Only as we started the first filesystem migration did we notice that one of the ZFS pools was completely full (so full it could not make a ZFS snapshot). As we were freeing up some space in the pool, a little light came on in the back of my mind; I remembered reading something about how full ZFS pools on our ancient version of Solaris could be very bad news, and I was pretty sure that earlier I'd seen a bunch of NFS write IO at least being attempted against the pool. Rather than migrate the filesystem, once the pool had some free space we selectively re-enabled NFS fileservice. The fileserver stayed up. We enabled more NFS fileservice. And things stayed happy. At this point we're pretty sure that we found the actual cause of all of our fileserver problems today.

(Afterwards I discovered that we had run into something like this before.)

What this has taught me is that during an inexplicable crisis, I should try to take a bit of time to look for anomalies. Not specific anomalies, but general ones: things about the state of the system that aren't right or don't seem right.

(There is a certain amount of hindsight bias in this advice, but I want to mull that over a bit before I write more about it. The more I think about it, the more complicated real crisis response becomes.)


Comments on this page:

By Ewen McNeill at 2014-10-18 02:50:48:

There's definitely a lot to be said for following those "well that sure won't be making things better" symptoms if nothing else obviously seems to be the cause. Even if they don't seem immediately connected, sometimes they get you near enough to the real cause that you can figure out what's going on. I found another one a few weeks back by following the chain back from unexplained high CPU usage, which eventually turned out to have a file permissions change as the root cause (and a long chain of events in between).

Ewen

By PerryLorier at 2014-10-18 07:15:42:

If you have a trivial-to-measure metric that relates to the health of your server, why not set up some alerting? If "This pool is full" means that servers can lock up, then an alert saying "This pool is nearly full" means you can find and resolve the issue before the server locks up.

If monitoring had been added the first time the issue had been noticed, this event would have become a "WARNING: Pool approaching full. Possible fileserver lock up.".

Then you also have a list of tests you can use as part of your qualification testing for new versions :)
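
(As a rough, hypothetical sketch of the kind of check PerryLorier is describing: the script below shells out to 'zpool list -H -o name,capacity' and complains about any pool at or above an arbitrary 90% threshold. The threshold and the warning wording are made up for illustration, not taken from any actual monitoring setup here.)

    #!/usr/bin/env python
    # Hypothetical sketch: warn when any ZFS pool is nearly full.
    # Assumes 'zpool list -H -o name,capacity' prints tab-separated
    # lines like "tank<TAB>96%"; the 90% threshold is arbitrary.
    import subprocess
    import sys

    THRESHOLD = 90  # percent used before we complain

    def check_pools():
        out = subprocess.check_output(
            ["zpool", "list", "-H", "-o", "name,capacity"],
            universal_newlines=True)
        warnings = []
        for line in out.splitlines():
            name, capacity = line.split()
            used = int(capacity.rstrip("%"))
            if used >= THRESHOLD:
                warnings.append("WARNING: pool %s is %d%% full; "
                                "possible fileserver lockup" % (name, used))
        return warnings

    if __name__ == "__main__":
        problems = check_pools()
        for warning in problems:
            print(warning)
        sys.exit(1 if problems else 0)

Run from cron or from an existing alerting system, exiting non-zero when something is over the threshold makes it easy to plug into most monitoring setups.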
