The hassles today of having servers with disks that can't be hot-swapped

February 3, 2022

We had a disk failure today in one of our 1U servers with disks that aren't hot-swappable, and the resulting discussion over what to do about it made me start thinking of how disks that can't be hot-swapped are an extra hassle these days. Some of these extra problems weren't obvious to me until now, so it's time for some things learned.

The direct problem when you have a disk failure in a server with disks you can't hot-swap is that you have to take the server down to replace the disk. If it's a server that does anything visible to people, you may well have to schedule this downtime for some out-of-hours time. In addition, swapping out a disk in one of these servers tends to be a time-consuming affair where you generally need to open up the server, extract things, and fiddle around with screws to get the old drive out of the carrier and the new drive into it. The time the swap requires makes it less likely that we'll be able to do it at a flexible time during the day, the way we might if it took only a minute or two.

(Some machines are so visible that they can't be down for even a minute without it being a clear interruption, but other machines are a bit less so. You might get away with this for a DHCP server, for example.)

When we were in the office on a regular and ongoing basis, all of this felt tolerable. It wasn't ideal to come in early or stay late, but it wasn't a big change from normal (and it didn't come with extra risks). Having several people be there at the same time was generally pretty straightforward, which made it easier (and faster) to wrangle hardware. And disks didn't fail all that often.

That's not the case when we're working from home, or at least it doesn't feel like it's the case any more. Scheduling is harder, people aren't naturally all there to start with, scheduled downtimes are no longer just an extension of a normal day at the office, and time in the office is more precious because it's more limited (so the ten or twenty minutes matter even if you can take down a non-visible server in the middle of the day).

A server with hot-swap drive bays doesn't eliminate all issues with a disk failing, because someone still has to go in to the office. But it reduces a lot of them. There's no downtime, no scheduling, and the swap is a lot faster, making it a lot easier to do in passing or as a small thing, perhaps as part of a brief lightning visit.

Another way I've come to think of it is that us working from home magnifies the frictions that were always there. Some of the frictions are present even with hot-swap disks, and now that I'm aware of this I can see it in our (temporary) move to three-way mirrors on a number of machines.

Written on 03 February 2022.
