How we failed at making all our servers have SSD system disks
Several years ago I wrote an entry about why we're switching to SSDs for system disks, yet the other day there I was writing about how we recycle old disks to be system disks and maybe switching to fixed size root filesystems to deal with some issues there. A reasonable person might wonder what happened between point A and point B. What happened is not any of the problems that I thought might happen; instead it is a story of good intentions meeting rational but unfortunate decisions.
The first thing that happened was that we didn't commit whole-heartedly to this idea. Instead we decided that even inexpensive SSDs were still costly enough that we wouldn't use them on 'less important' machines; instead we'd reuse older hard drives on some machines. This opened a straightforward wedge in our plans, because now we had to decide if a machine was important enough for SSDs and we could always persuade ourselves that the answer was 'no'.
(It would have been one thing if we'd said 'experimental scratch machines get old HDs', but we opened it up to 'less important production machines'.)
The next thing that happened was that we didn't buy, and didn't keep buying, enough SSDs to always clearly have plenty in stock. The problem here is straightforward; if you want to make something pervasive in the servers that you set up, you need to make it pervasive on your stock shelf, and you need to establish the principle that you're always going to have more. This holds just as true for SSDs for us as it does for RAM; once we had a limited supply, we had an extra reason to ration it, and we'd already created our initial excuse when we decided that some servers could get HDs instead of SSDs.
Then as SSD stocks dwindled below a critical point, we had the obvious reaction of deciding that more and more machines weren't important enough to get SSDs as their system disks. This was never actively planned and decided on (and if it had been planned, we might have ordered more SSDs). Instead it happened bit by bit; if I was setting up a server and we had only (say) four SSDs left, I had to decide on the spot whether my server was that important. It was easy to talk myself into saying 'I guess not, this can live with HDs', because I had to make the decision right then in order to keep moving forward on putting the server together.
(Had we sat down to plan out, say, the next four or five servers that we were going to build and talked about which ones were and weren't important, we might well have ordered more SSDs, because the global situation would have been clearer and we would have been doing this further in advance. On-the-spot decision making tends to focus on the short term and the immediate perspective instead of the long term and the global one.)
At this point we have probably flipped over to a view that HDs are the default on new or replacement servers and a server has to strike us as relatively special to get SSDs. This is pretty much the inverse of where we started out, although arguably it's a rational and possibly even correct response to budget pressures and so on. In other words, maybe our initial plan was always over-ambitious for the realities of our environment. Still, the plan did help, because we got SSDs into some important servers and thus we've probably made them less likely to have disk failures.
A contributing factor is that it turned out to be surprisingly annoying to put SSDs in the 3.5" drive bays in a number of our servers, especially Dell R310s, because they have strict alignment requirements for the SATA and power connectors, and garden-variety 2.5" to 3.5" SSD adaptors don't put the SSDs in the right place for this. Getting SSDs into such machines required special extra hardware; this added extra hassle, extra parts to keep in stock, and extra cost.
(This entry is partly in the spirit of retrospectives.)