Our current approach for significantly upgrading or modifying servers

March 29, 2019

Every so often we need to make some significant upgrade or change to one of our servers, for instance to move from one Ubuntu version to another. When we do this, we do two things. The first is that we reinstall from scratch rather than try to upgrade the machine's current OS and setup in place. There are a whole bunch of reasons for this (for any OS, not just Linux), including that it gets you as close as possible to ensuring that the current state of the machine isn't dependent on its history.

(A machine that has been through major upgrades inevitably and invariably carries at least some traces of its past, traces that will not be there on a new instance that was reinstalled from scratch.)
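
One way to see those traces concretely is to diff the installed package set of the old machine against its from-scratch replacement. Here's a minimal sketch of that in Python, assuming Debian/Ubuntu and that you've captured each machine's list with 'dpkg-query -W -f '${Package}\n''; the script and its file arguments are just an illustration, not part of our actual process:

    #!/usr/bin/python3
    # Hypothetical sketch: diff the installed package sets of the old
    # server and its from-scratch replacement to spot leftover traces.
    # Assumes each list was captured on its machine with:
    #   dpkg-query -W -f '${Package}\n' > hostname-packages.txt
    import sys

    def read_packages(path):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    old = read_packages(sys.argv[1])   # list from the old server
    new = read_packages(sys.argv[2])   # list from the rebuilt server

    for pkg in sorted(old - new):
        print("only on the old server:", pkg)
    for pkg in sorted(new - old):
        print("only on the new server:", pkg)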

The second is that we almost always install the new instance of the server on new hardware and swap it into place, rather than reinstalling on the same hardware that is currently the live server. There are exceptions, usually for our generic compute servers, but for anything important we prefer new hardware (this is somewhat of a change from our past practice). One part of this is that using a new set of hardware makes it easy to refresh the hardware, change the RAM or SSD setup, and so on (and also to put the new server in a different place in your racks). Another part is that when you have two servers, rolling back an upgrade that turns out to have problems is much easier and faster than if you have destroyed the old server in the process of installing the new one. A third reason is more prosaic: there's always less downtime involved in a machine swap than in a reinstall from scratch, and among other things this means less or no time pressure while you're installing the new machine.

One consequence of our approach is that we always have a certain amount of 'not in production' replaced servers that are still in our racks but powered off and disconnected. We don't pull replaced servers immediately, in case we have to roll back to them, so after a while we have to remember to pull the old version of an upgraded server. We don't always remember, so every so often we basically wind up weeding our racks, pulling old servers that no longer need to be there. One trigger for this weeding is when we need room in a specific rack and it happens to be cluttered up with obsolete servers. Another is when we run short on spare server hardware to turn into more new servers.
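
We don't have a formal system for remembering this, but the idea is simple enough to sketch. Something like the following hypothetical Python script, fed a little inventory file of replaced servers and when they were retired, would flag the ones that have sat powered off past a rollback grace period (the file format and the 30-day window are made up for illustration, not what we actually use):

    # Hypothetical sketch: flag retired servers that are past their
    # rollback grace period and can be pulled from the racks.
    # Assumes a 'retired.txt' file with lines like: hostname 2019-02-14
    import datetime

    GRACE_DAYS = 30   # assumed grace period; pick whatever suits you
    today = datetime.date.today()

    with open("retired.txt") as f:
        for line in f:
            host, datestr = line.split()
            retired_on = datetime.date.fromisoformat(datestr)
            age = (today - retired_on).days
            if age > GRACE_DAYS:
                print(host, "has been retired for", age, "days; pull it")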

(Certain sorts of servers are recycled almost immediately in order to reclaim desirable bits of hardware in them. For example, right now anything with a 10G-T card is probably going to be pulled shortly after an upgrade in order to extract the card, because we don't have too many of them. There was a time when SSDs would have prompted recycling, but not any more.)

PS: We basically never throw out (still) working servers, even very old ones, but they do get less and less desirable over time and so sit deeper and deeper in the depths of our spare hardware storage. The current fate of really old servers is mostly to be loaned or passed on to other people here who need them and who don't mind getting decade-old hardware (often with very little RAM by modern standards, which is another reason they get less desirable over time).

PPS: I'm not joking about decade-old servers. We recently passed some Dell 1950s on to someone who needed scratch machines.
