My perspective on why we do in-place reinstalls of machines
Given what I mentioned yesterday, you might wonder why we are doing what I called 'in-place' reinstalls of machines, where we reinstall a machine with the name and IP address that it will use in production. From my perspective there are two or three reasons for this.
The first reason, the big reason, is that we've run out of spare hardware. In previous upgrades we installed the new version of a machine on completely new hardware, got it running, and then switched everything around during the 'upgrade' downtime; the old version of the machine got renamed or powered down (and was later reused for something else) and the OS install on the new hardware got itself renamed and so on. This was kind of a pain but it was worth it for genuinely fast and hassle-free switchovers. But this requires a bunch of spare servers and we've steadily used most of them up.
(Some of the 'used' servers are actually just reserved for certain future uses, but most of them are running in production.)
The other reason or two boils down to 'we're just lazy enough to take the risks'. In theory we could (re)install the new version of a machine under a temporary name and IP address, get it almost fully up, and then switch its name, IP address, and so on to the production one(s); we could do this either on the production hardware or on another identical server and then move the disk(s) over to the production hardware. In practice it's just enough of an extra pain to install machines under temporary names and then rename them (and re-IP them, remember to give them all of the IP aliases, and so on) that we're willing to take the risks of an in-place reinstall. Using another server for the initial install and moving the disks afterwards generally adds an extra layer of pain to the process, mostly because common operating systems are increasingly binding things tightly to the specific hardware they happen to be on at the moment; when you move the disks, you get to find and fix all of these things too.
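To make the extra renaming steps concrete, here is a sketch of what a rename-and-re-IP pass might look like on a modern Linux machine. All of the names, addresses, and the interface here are invented for illustration, and the real list of things to fix is longer:

```sh
# Hypothetical: promote a machine installed under a temporary identity
# to its production one. Every name and address here is made up.
hostnamectl set-hostname aserver.cs.example.org

# Drop the temporary address and add the production one on (say) eno1.
ip addr flush dev eno1
ip addr add 192.0.2.10/24 dev eno1

# The part that's easy to forget: every IP alias the production
# machine is supposed to carry.
ip addr add 192.0.2.11/24 dev eno1
ip addr add 192.0.2.12/24 dev eno1
```

These are only the runtime changes; the persistent network and hostname configuration has to be edited too, along with everything else that memorized the temporary identity (TLS certificates, SSH host keys that other machines have recorded, monitoring, and so on), which is part of why it's enough of a pain to tempt us into in-place reinstalls.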
(It should also be noted that when we don't have spare hardware, a genuine in-place reinstall results in a somewhat shorter downtime and requires fewer manual steps that can blow up in our faces. This is probably a reasonable tradeoff.)
It's worth noting that you should only have this issue in the kind of infrastructure we run, or at least the kind of infrastructure where people and services talk directly to machines by name or IP address instead of indirecting through a load balancer or some other sort of directory service. If you have an indirection step, taking a machine in or out of production should be a trivial operation that's independent of installing it, and you should effectively never have in-place reinstalls of anything except perhaps the load balancers and directory servers themselves.
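One cheap version of that indirection is purely at the DNS level: clients use a stable service name that points at whichever physical machine currently provides the service. A hypothetical zone file fragment (all names invented) might look like:

```
; Stable service names; the hardware behind them can change freely.
imap.example.org.    IN CNAME  srv-gen5-03.example.org.
smtp.example.org.    IN CNAME  srv-gen5-04.example.org.
```

With this arrangement you could install a replacement machine under its own permanent name, bring it fully up, and then repoint the CNAME during a short maintenance window; the replacement never has to impersonate the old machine, so there's no in-place reinstall and no rename-and-re-IP dance.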