Document your mistakes and then try to block them in the future

March 23, 2022

Today we had a somewhat complicated maintenance downtime to move one of our core filesystems around. As is my habit, I wrote a checklist, evolving it to be very detailed, down to the actual commands I was going to carry out, and subjecting it to as much scrutiny as I could (including looking over the documentation from the last time we moved this filesystem). When the time came tonight I faithfully followed the steps I'd written down, and in the process I committed a couple of small mis-steps. They weren't fatal, and one of them only delayed things with no other consequences, but both of them could have been avoided if I had thought about it more.

I'm going to be sending my checklist to our 'worklog' system so that we can easily find it in the future (a lesson I learned the hard way). Right at the top I'm going to put a section on 'things not to do next time', covering both of my mis-steps and possibly other things if I can think of them. Hopefully we'll carry forward some version of the items there into future versions of this checklist (when we migrate the filesystem again), to preserve these lessons learned even if we don't make the same mistakes the next time around.

While I understand why I made these mis-steps, I'm also going to try to improve our environment so that people following my approach to writing up a checklist can't make them in the future. In other words, I'm going to try to block us from making those mistakes at all. Some of this is going to be through improving and changing documentation, and some of this is probably going to involve modifying a script that produces a canned set of steps for filesystem migrations so that it omits some steps for our administrative filesystems. If I manage to pull it off, these changes will make my 'things not to do' section more or less moot, but that's okay. I'd rather have both.
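To make the script idea concrete, here is a minimal sketch of what 'omit some steps for administrative filesystems' could look like. It is not our actual script. Every name in it (Step, CANNED_STEPS, ADMIN_FILESYSTEMS, migration_steps, and the filesystem name) is hypothetical, the step texts are placeholders, and it assumes administrative filesystems can be recognized from an explicit set of names.

```python
# Hypothetical sketch: every name, step, and filesystem label here is invented
# for illustration and is not taken from the real script.
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    text: str
    # False means the step must be left out for administrative filesystems.
    for_admin: bool = True


# Placeholder steps; a real migration checklist would hold actual commands.
CANNED_STEPS = [
    Step("announce the downtime"),
    Step("placeholder step that only applies to ordinary filesystems", for_admin=False),
    Step("copy the data to the new location"),
    Step("verify the copy and switch over"),
]

# Assumption: administrative filesystems are identified by an explicit set.
ADMIN_FILESYSTEMS = {"admin-fs-example"}


def migration_steps(fs_name: str) -> list[str]:
    """Return the canned checklist for fs_name, omitting steps that do not apply."""
    is_admin = fs_name in ADMIN_FILESYSTEMS
    return [s.text for s in CANNED_STEPS if s.for_admin or not is_admin]


if __name__ == "__main__":
    for n, text in enumerate(migration_steps("admin-fs-example"), start=1):
        print(f"{n}. {text}")
```

The point of tagging each step, rather than expecting whoever writes the checklist to delete the inapplicable ones, is that the generated steps are already right for the filesystem in question, so there is nothing to forget.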

Doing this successfully requires me to remember and understand how I came up with the checklist entries that were mis-steps, and more broadly to understand why I made these mistakes. This needs to be done as soon as possible, while things are fresh in my memory, and I'm lucky that I wrote this checklist recently (I started it within the past week) so I can remember where things came from and what I was thinking. If I'd let the checklist sit for a month and then come back to block these mis-steps, it would probably be hopeless.

Written on 23 March 2022.

Last modified: Wed Mar 23 23:26:21 2022