Why reproducible machines didn't used to be a priority (I think)

July 2, 2022

When I wrote about why we care about being able to (efficiently) reproduce machines, I mentioned that this didn't used to be a priority in the sufficiently old days. By the sufficiently old days I mean broadly the 1990s, at least in more moderately sized environments like what used to be common at the university; I think that by even the early 00s, people here were starting to care. Today it's time for my rambles about some reasons why I think we didn't used to care as much.

First, back then it was often harder to keep machines running in the first place. Software was more primitive and hardware was sufficiently limited that we were pushing it much closer to the edge than we did later. If much of your time is spent fighting fires, everything else takes second place. This feeds into the other two reasons.

(One area of software that was more primitive was the software for automatically managing machines.)

Second, often we didn't have enough machines to care about it; we might only have one machine that could do a given function (or need all of the machines we had). Up to a certain point, the push for having reproducible machines comes from having more machines to reproduce the first ones onto, whether these new machines are spare physical servers or new virtual ones. Also, to some extent the payoff in time savings for getting to reproducible machines increases as you have more and more machines. If you only have two or three servers, putting together a big automation framework may not be worth it (especially if the state of software means that you have to build a good chunk of it yourself).

Third and related to the second issue, technology and software were often changing fast enough that if we needed to replace a machine, the replacement hardware and software would be different enough that we'd be mostly starting from scratch. Even within the university, this was somewhat variable depending on what hardware you tended to buy; for example, there were places that were Solaris SPARC shops that ran Solaris for over a decade, and I don't think Solaris changed hugely over that time. In that sort of environment you could reuse a fair amount of configuration from (server) generation to generation.

(At the same time, I know another environment that went from DEC MIPS Ultrix to SGI (MIPS) IRIX to x86 Linux over the course of a bit over a decade, with relatively little configuration reuse from each one to the next.)

This didn't mean that we weren't worried about losing a machine to hardware failure (or disasters). Instead, our plans for recovering from that sort of problem generally involved full system restores in some fashion. How well that would have worked is an open question, since usually we didn't have the spare hardware or the other things that would have been required to do a full end to end test.

(We could and did recover individual things from backups (well, mostly); we just almost never did full system restores, because server failures were very uncommon.)

All of this is somewhat of a reconstruction of what I and other people were thinking at the time, or not thinking (and it's a reconstruction from a long way away in time). Part of it may have simply been that the idea of reproducible machines wasn't really out there yet, especially as a realistic possibility for relatively small environments.

PS: One of my feelings on this issue is that as a practical matter, you need to actually test your documentation, procedures, and software for reproducing machines. Doing this sort of testing is much easier in a modern environment with spare servers and virtual machines.

Written on 02 July 2022.