Wandering Thoughts archives

2010-12-16

The elements of a non-event

Today we had an entire iSCSI backend fail. It was a heart-stopping non-event, something that took me perhaps 20 minutes to deal with. I'd like to run down some of the reasons why things worked out this way.

  • we have set up monitoring and it works (the former doesn't always imply the latter). Smartd on the iSCSI backend started mailing us about 'cannot open device' errors almost immediately, which are never a good sign, and the ZFS pool health monitoring on the ZFS fileservers raised its own alerts soon afterwards.

  • we had a hot spare backend set up and ready.

  • after feedback from my co-workers, our new custom ZFS spare management system was explicitly designed in part to make handling this situation easy and almost completely automatic.

    (My co-workers rightfully pointed out that replacing a whole backend worth of disks was one of the most tedious, time-consuming, and repetitive things that we need to do with spares. And also, sadly, one of the more common ones. Apparently we need better power supplies and UPSes.)

  • we have a documented procedure for just this situation. When disaster struck at 5:50pm with only me in the office, I did not have to try to remember everything necessary and where all the files were and what order to do things in; once I got my heart rate under control and calmed down a bit, all I had to do was look it up and follow the steps.

I cannot overstate the importance of the last factor. In honest but embarrassing fact, I started fumbling through the necessary steps from memory and got the order wrong before I calmed down enough to come to my senses and look things up. Well, not so much look things up as stumble over the documentation in the process of looking up the command I needed to run to do what I thought was the next step, at which point I felt rather foolish and sheepish.

(This is especially ironic because I wrote the documentation myself.)

sysadmin/NoneventElements written at 01:09:46

