2013-07-07
Sometimes the right thing to do is nothing (at least right then)
As system administrators we tend to have a tropism towards heroism. If there is a problem and something that can be done, then by gum we feel that we should do it. No matter what the exact circumstances, sitting on our hands feels very wrong.
We just lost an entire iSCSI backend for one of our fileservers. This has happened before and it took only a brief amount of sysadmin work to deal with, but this time around three things are different. First, this happened at 11pm on a Sunday night. Second, there isn't anyone physically in the office. Third, there are some anomalies about the current state of our hot spare iSCSI backend.
(As a corollary to the second issue I don't know exactly what went wrong with the iSCSI backend, beyond all of its data disks disappearing from the system. There are any number of potential causes.)
I could heroically spring into action anyways; imbibe a bunch of caffeine, either work remotely or race in to the office, ignore the anomaly as unimportant or kludge things together. But as I started to think about this and plan what I'd need, a little voice in the back of my head piped up to ask: are you crazy?
Several rested, alert sysadmins are going to be in the office in approximately ten hours (possibly less). Thus, putting the hot spare backend into production right now will gain us at most a ten hour head start on resynchronizing several terabytes of disk space. This is not completely insignificant but it's also not particularly huge (I expect the resync to take days, even if we run it flat out and accept the impact on users). Set against that moderate potential gain are the large potential downsides if something goes wrong, for any number of causes.
One of the rules of sysadmin crisis response should be 'do no harm', and one of our jobs is to evaluate our heroic impulses and urges against that standard. Sometimes the right answer is to do nothing, because we cannot be confident enough that our actions will improve the situation instead of making it worse.
Am I confident that I'm making the right decision here? No. Not at all. It's almost certain that I could put the hot spare backend into production without problems and then we'd have a ten hour head start. But that 'almost' stays my hand.
(Note that we don't have any requirement to provide crisis response outside of working hours. In many organizations the sysadmins are on the hook for out of hours responses and this would be considered a sufficiently important crisis to force people into action. I think that those organizations may be making a mistake for reasons connected to why me doing things could be a bad idea, but that's another entry.)
A mistake to avoid with summer interns
If you're part of a university and you have both some spare money and some work that you'd like to get done but don't have the time and energy for with your existing staff, one traditional solution is to hire a student or two for the summer. We've done this in the past and in retrospect we made a mistake or two in the process. Today I want to write about it, partly so that I can hopefully avoid mistakes in the future.
The big mistake to avoid is abandoning your summer intern in a corner. Even if your interns are perfectly competent (which ours have been) and are working on completely self-contained projects, there are two things that will go wrong here.
The obvious problem is that what you will get at the end of the summer is a black box, because you won't have been involved in developing or doing whatever your intern worked on. Even if your intern has meticulously documented everything about it, you're going to have to read that documentation first, and the odds are very good that the documentation will turn out to be not good enough. This is especially likely if you don't read the documentation until the end of the summer, when the intern is leaving.
The less obvious problem is that there probably will be design issues with how the project is constructed and how it works. Your summer intern is likely quite competent, but they are still an inexperienced student, not an experienced sysadmin or programmer like you are (and in particular they're not familiar with your specific environment and so on).
In retrospect, none of this should be surprising to me. We really need to treat summer interns as (very) junior people that we actively supervise and work with, not as magic black boxes that we insert requests into and get perfect results back out of. The corollary, which I hope I remember in the future, is that it's a mistake to get summer interns if what we really want or need is a black box.
(Abandoning interns in a corner may sound crazy, but trust me, it came about in a very natural way. When you're already busy yourself it takes active work to carve out time to work with an intern, and because you're busy, doing so feels like an imposition that's slowing you down. It's very tempting to think that you don't really need to, or just to let it slip, although this is not a good thing.)
PS: there are other reasons not to do this that have to do with the intern's experience, but that's a topic for another entry.