I should always make a checklist for anything complicated

January 23, 2023

Today I did some work on the disk setup of my home desktop and I got shot in the foot, because when you remove disks from Linux software RAID arrays and then reboot, the boot process may reassemble those RAID arrays using the disks you removed (or even just one disk), instead of the actual live disks in the RAID array. There are a number of reasons that this happened to me, but one of them is that I didn't make a checklist for what I was doing and instead did it on the fly.

I had a pair of bad justifications for why I didn't write out a checklist. First, I was doing this to my home desktop, not one of our servers at work, and it felt silly to go through the same process for a less important machine (never mind that it's a very important machine to me, especially since I was working from home at the time). Second, I hadn't planned in advance to make this change; it was an on-the-fly impulse because I was rebooting the machine anyway for a kernel update. I figured I was experienced with software RAID and I could remember everything I needed to. Obviously I was wrong; this is an issue that I've had at least twice before, and the moment it happened I realized what had gone wrong (but by then it was too late to fix it easily).

(The first time this happened was to one of my desktops but I'm not sure which of them. The second time was when I replaced a bad disk on my home desktop in 2019, and I seem to have forgotten the earlier time when I wrote that entry, since it has no pointer to the first one.)

I knew my software RAID changes were a multi-step process and there were uncertainties in the process. But I didn't take the next mental step to 'I should write up even a trivial little checklist', and so I paid for it with some excitement. Although there were positive bits in the end result, I would still have been better off writing out that checklist, even if I was doing everything on impulse.

If it's not trivial, I should make a checklist even on my home desktop, even if it feels weird. Checklists are a great thing and I should use them more often. Even if I don't completely follow the checklist, making it will make me think through everything and that has a much higher chance of jogging my mind about things I've already encountered before.
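
To show how small even a trivial checklist can be, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the step wording, the `RAID_DISK_REMOVAL` list, and the `run_checklist` and `ask` helpers are hypothetical, not the steps or tools actually used for this disk work. A real checklist would be written for the specific machine and change.

```python
from typing import Callable, List

# Hypothetical steps, for illustration only. They are deliberately
# generic and do not name real commands or options.
RAID_DISK_REMOVAL = [
    "Confirm which disks are currently live in each RAID array",
    "Remove the old disk from each array it belongs to",
    "Erase the RAID metadata on the removed disk so boot cannot reassemble from it",
    "Check that the boot-time RAID configuration matches the new arrays",
    "Reboot and verify the arrays came up on the live disks",
]


def run_checklist(steps: List[str], confirm: Callable[[str], bool]) -> List[str]:
    """Walk the steps in order and stop at the first one not confirmed.

    Returns the list of steps that were confirmed as done.
    """
    done = []
    for number, step in enumerate(steps, start=1):
        if not confirm(f"{number}. {step}"):
            break
        done.append(step)
    return done


def ask(prompt: str) -> bool:
    """Interactive confirmation: anything other than 'y' halts the checklist."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```

Calling `run_checklist(RAID_DISK_REMOVAL, ask)` would prompt for each step in turn and stop at the first one that isn't confirmed. Most of the value is in writing the list out in the first place; a text file or a piece of paper would work just as well.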

I have some more disk work to do as a result of the other hardware changes I also did (I'm replacing my remaining spinning rust), and they're not trivial. Since they're not trivial, I'll hopefully write out a checklist this time around instead of winging it. As I've been reminded today, it's too easy to forget things when you're working on the fly, and even the small stuff can benefit from not making mistakes.

(I should probably write more checklists when doing things on my desktops, but in my defense I rarely touch them for anything more intricate than Fedora kernel updates. Normal people have simple kernel updates, apart from needing to reboot; I complicate my life by also updating my ZFS on Linux version at the same time. At least it's simpler than it used to be.)

PS: These days I do make a checklist for my process of upgrading Fedora versions, partly because the post-upgrade steps have gotten complicated enough. Although now that I'm writing this, I have to admit that my current checklist only covers the post-upgrade parts. I should update it to include the pre-upgrade and during-upgrade parts, especially since I have in the past forgotten to do some of them.

Comments on this page:

God bless you for this posting. I usually make checklists for sysadmin work, and I know for a fact how life-saving they can be, but I still often find poor justifications for not doing it "this time". Formal recognition that it's not just me, that it's really best-practice, will help me to do that less in future.

I feel this. I botched some software RAID work a few years ago, which ended up leading to re-installing my desktop/server. On the plus side, I figured "well, I might as well use this as an opportunity to put all of the configuration in Ansible". I'd been putting off using configuration management for a decade or so at that point.

By Nathan Grennan at 2023-04-08 12:45:18:

A counterpoint is that anytime you find yourself thinking about checklists, think about automation instead. Checklists are like HOWTOs. Automation like scripts, configuration management tools, etc. is more repeatable and takes human error out. That's not to say that, like all software, it can't be buggy too, but with time and work you can often mostly factor out the bugs.

By cks at 2023-04-12 12:24:02:

There are significant tradeoffs between automation and checklists. One of them is that in automation, you have to explicitly add checks for things going wrong, while with checklists you can to some extent rely on people for that. Optimistic, dangerous automation can be somewhat easy to write, but good automation is much harder. Checklists can often be reasonably written in advance as a one-off, but generally you don't do automation that way; automation is most usually aimed at being reusable.

Written on 23 January 2023.

Last modified: Mon Jan 23 22:31:35 2023