Sometimes the right thing to do is nothing (at least right then)

July 7, 2013

As system administrators we tend to have a tropism towards heroism. If there is a problem and something that can be done, then by gum we feel that we should do it. No matter what the exact circumstances, sitting on our hands feels very wrong.

We just lost an entire iSCSI backend for one of our fileservers. This has happened before and it took only a brief amount of sysadmin work to deal with, but this time around three things are different. First, this happened at 11pm on a Sunday night. Second, there isn't anyone physically in the office. Third, there are some anomalies in the current state of our hot spare iSCSI backend.

(As a corollary to the second issue, I don't know exactly what went wrong with the iSCSI backend beyond all of its data disks disappearing from the system. There are any number of potential causes.)

I could heroically spring into action anyways: imbibe a bunch of caffeine, either work remotely or race in to the office, ignore the anomalies as unimportant or kludge things together. But as I started to think about this and plan what I'd need, a little voice in the back of my head piped up to ask: are you crazy?

Several rested, alert sysadmins are going to be in the office in approximately ten hours (possibly less). Thus, putting the hot spare backend into production right now will gain us at most a ten-hour head start on resynchronizing several terabytes of disk space. This is not completely insignificant but it's also not particularly huge (I expect the resync to take days, even if we run it flat out and accept the impact on users). Set against that moderate potential gain are the large potential downsides if something goes wrong, which could happen for any number of reasons.

One of the rules of sysadmin crisis response should be do no harm, and one of our jobs is to evaluate our heroic impulses and urges against that standard. Sometimes the right answer is to do nothing because we cannot be confident enough that our actions are sure to improve the situation instead of making it worse.

Am I confident that I'm making the right decision here? No. Not at all. It's almost certain that I could put the hot spare backend into production without problems and then we'd have a ten hour head start. But that 'almost' stays my hand.

(Note that we don't have any requirement to provide crisis response outside of working hours. In many organizations the sysadmins are on the hook for out-of-hours response, and this would be considered a sufficiently important crisis to force people into action. I think that those organizations may be making a mistake, for reasons connected to why my doing things could be a bad idea, but that's another entry.)

Written on 07 July 2013.

Last modified: Sun Jul 7 23:59:53 2013