The elements of a non-event
Today we had an entire iSCSI backend fail. It was a heart-stopping non-event, something that took me perhaps 20 minutes to deal with. I'd like to run down some of the reasons why things worked out this way.
- we have set up monitoring and it works (the former doesn't always imply the latter). Smartd on the iSCSI backend started mailing us about 'cannot open device' errors almost immediately, which are never a good sign, and the ZFS pool health monitoring on the ZFS fileservers raised its own alerts soon afterwards.
- we had a hot spare backend set up and ready.
- after feedback from my co-workers, our new custom ZFS spare management system was explicitly designed in part to make handling this situation easy and almost completely automatic.
(My co-workers rightly pointed out that replacing a whole backend's worth of disks was one of the most tedious, time-consuming, and repetitive things that we need to do with spares. It is also, sadly, one of the more common ones. Apparently we need better power supplies and UPSes.)
- we have a documented procedure for just this situation. When disaster struck at 5:50pm with only me in the office, I did not have to try to remember everything necessary and where all the files were and what order to do things in; once I got my heart rate under control and calmed down a bit, all I had to do was look it up and follow the steps.
I cannot overstate the importance of the last factor. In honest but embarrassing fact, I started fumbling through the necessary steps from memory and got the order wrong before I calmed down enough to come to my senses and look things up. Well, not so much look things up as stumble over the documentation in the process of looking up the command I needed to run to do what I thought was the next step, at which point I felt rather foolish and sheepish.
(This is especially ironic because I wrote the documentation myself.)