When we take things out of service, I should write up how they worked

July 21, 2023

Recently I wrote about the persistence of 'san' names over several generations of our fileservers, even though we don't have a SAN any more. In the process of writing that entry, I discovered that I'd forgotten various details of how our first generation and second generation of ZFS fileservers worked and how we talked about them, and a lot of details about our pre-ZFS fileserver environment. At the time we were operating each of these environments, all of this was obvious to us so we didn't write it down in one place anywhere (it was at best scattered across various worklog entries, but our worklog entries assume a lot of context). Then once we stopped running each generation of fileservers, we stopped having any need to write down any 'how it works' documentation, even scattered stuff.

I've come to think that might be a mistake, since there have been more than a one or two times when we've wanted to look back to see how we did something in the past. Now, I think that when we take some significant system out of service, we should strongly consider doing an an 'end of service how it worked' writeup. We should do this right after the system goes out of service, while we still vividly remember all of the details of how it worked, but at the same time it's not going to change (because it's out of service, which is the drawback of writing 'how it works' documentation while something is in service (ask me how I know)).

(Sometimes we do have 'principles of operation/how it works' documentation, but this is relatively uncommon. There's a relatively low payoff to writing this documentation for currently active systems that we all work on and know, because we currently know all of it anyway.)

It's tempting (and labour-saving) to think that we don't really need this writeup if the new replacement system is very close to the old one. I think this is a mistake. At a minimum we should write up what's different and probably do a very sketch outline of it, but an actual writeup would be better. It's more self-contained and doesn't need us to find the next writeup (of the new replacement system when we take that out of service).

(Sometimes when we take something out of service we're very glad to see the last of it, which sort of makes it hard to motivate ourselves to write down how it worked. It would be nice if I could be sure we'd never care later on about how those systems worked, but I doubt I can. Maybe those 'we didn't like it but we had to run it' systems are the most interesting, because after all something important kept us running them.)

Written on 21 July 2023.
« Command hashing in Unix shells is probably no longer worth it
I'm not going to accurately remember our past views and thoughts »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jul 21 22:27:12 2023
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.