The elements of a non-event

December 16, 2010

Today we had an entire iSCSI backend fail. It was a heart-stopping non-event, something that took me perhaps 20 minutes to deal with. I'd like to run down some of the reasons why things worked out this way.

  • we have set up monitoring and it works (the former doesn't always imply the latter). Smartd on the iSCSI backend started mailing us about 'cannot open device' errors almost immediately, which are never a good sign, and the ZFS pool health monitoring on the ZFS fileservers raised its own alerts soon afterwards.

  • we had a hot spare backend set up and ready.

  • after feedback from my co-workers, our new custom ZFS spare management system was explicitly designed in part to make handling this situation easy and almost completely automatic.

    (My co-workers rightly pointed out that replacing a whole backend's worth of disks was one of the most tedious, time-consuming, and repetitive things that we need to do with spares. It is also, sadly, one of the more common ones. Apparently we need better power supplies and UPSes.)

  • we have a documented procedure for just this situation. When disaster struck at 5:50pm with only me in the office, I did not have to try to remember everything necessary and where all the files were and what order to do things in; once I got my heart rate under control and calmed down a bit, all I had to do was look it up and follow the steps.

I cannot overstate the importance of the last factor. In honest but embarrassing fact, I started fumbling through the necessary steps from memory and got the order wrong before I calmed down enough to look things up. Well, not so much look things up as stumble over the documentation while looking up the command for what I thought was the next step, at which point I felt rather sheepish.

(This is especially ironic because I wrote the documentation myself.)
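
(As an illustration of the pool health monitoring from the first point, here is a minimal sketch in Python. It is not our actual monitoring system. It assumes that `zpool status -x` prints exactly 'all pools are healthy' when nothing is wrong, and the function names are made up for this example.)

```python
import subprocess
from typing import Optional

# Assumption: this is the single line `zpool status -x` prints when
# every pool is fine.
HEALTHY_MESSAGE = "all pools are healthy"


def pools_need_attention(status_output: str) -> bool:
    """Return True unless the output is exactly the 'healthy' message.

    Anything unexpected, including empty output, counts as a problem,
    so a broken check gets noticed instead of silently passing.
    """
    return status_output.strip() != HEALTHY_MESSAGE


def check_pools() -> Optional[str]:
    """Run `zpool status -x`; return the report if something is wrong.

    Returns None when all pools are healthy.
    """
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,  # a failing command is itself worth reporting
    )
    output = result.stdout + result.stderr
    return output if pools_need_attention(output) else None
```

(A cron job that prints the result of `check_pools()` whenever it is not `None` gets you email from cron for free. Treating unexpected output as a failure is deliberate; it is one small way to narrow the gap between having monitoring set up and having monitoring that works.)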


Last modified: Thu Dec 16 01:09:46 2010