Our broad reasons for and approach to mirroring disks

September 20, 2024

When I talked about our recent interest in FreeBSD, I mentioned the issue of disk mirroring. One of the questions this raises is what we use disk mirroring for and how we approach it in general. The simple answer is that we mirror disks for extra redundancy, not for performance, but we don't go to great lengths to get that redundancy.

The extremely thorough way to do disk mirroring for redundancy is to mirror with disks of different makes and ages on each side of the mirror, to try to avoid both age-related failures and model- or maker-related issues (either firmware problems or discovering that the maker used some common problematic component). We don't go this far; we generally buy a block of whatever SSD is considered good at the moment, then use them for a while in pairs, either as fresh drives in newly deployed servers or as a re-used pair in a server being re-deployed. One reason we tend to do this is that we generally get 'consumer' drives, and finding decent consumer drives is hard enough at the best of times without having to find two different vendors of them.

(We do have some HDD mirrors, for example on our Prometheus server, but these are also almost always paired disks of the same model, bought at the same time.)

Because we have backups, our redundancy goals are primarily to keep servers operating despite having one disk fail. This means that it's important that the system keep running after a disk failure, that it can still reboot after a disk failure (including a failure of its first, primary disk), and that the failed disk can be replaced and the replacement put into service without downtime (provided that the hardware supports hot-swapping the drive). The less this is true, the less useful a system's disk mirroring is to us; this includes 'hardware' mirroring that makes you take a trip through the BIOS to trigger a rebuild after a disk replacement, which means downtime. It's also vital that the system be able to tell us when a disk has failed. Not being able to reliably tell us this is how you wind up with systems running on a single drive until that single drive fails too.
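
To make the "tell us about it" requirement concrete, here is a minimal sketch of the sort of check that can notice a degraded mirror. This is not our actual monitoring; it assumes Linux software RAID (md), where member and array state is exposed in /proc/mdstat, and the cron-style behaviour of printing problems and exiting non-zero is also just an assumption for illustration.

    #!/usr/bin/env python3
    """Report Linux software RAID (md) arrays that have lost a member.

    A sketch for a cron job: parse /proc/mdstat, flag arrays whose member
    status string (eg '[U_]') shows a missing device or whose members are
    marked failed with '(F)', and exit non-zero so something can mail us.
    """

    import re
    import sys

    MDSTAT = "/proc/mdstat"

    def degraded_arrays(text):
        """Return (array, reason) tuples for md arrays that look unhealthy."""
        problems = []
        current = None
        for line in text.splitlines():
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
                # Failed members show up as eg 'sdb1[1](F)' on this line.
                if "(F)" in line:
                    problems.append((current, "a member is marked failed"))
                    current = None
                continue
            if current:
                # The following status line contains eg '[2/1] [U_]'; an
                # underscore means a member is missing from the array.
                status = re.search(r"\[([U_]+)\]", line)
                if status and "_" in status.group(1):
                    problems.append((current, f"degraded: [{status.group(1)}]"))
                    current = None
        return problems

    def main():
        try:
            with open(MDSTAT) as f:
                text = f.read()
        except OSError as e:
            print(f"cannot read {MDSTAT}: {e}", file=sys.stderr)
            return 2
        problems = degraded_arrays(text)
        for array, reason in problems:
            print(f"{array}: {reason}")
        return 1 if problems else 0

    if __name__ == "__main__":
        sys.exit(main())

The same idea applies to whatever mirroring system a machine actually uses; the important part is that something looks at the mirror state regularly and makes noise when it changes.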

On our ZFS fileservers it would be quite undesirable to have to restore from backups, so we have an elaborate spares system that uses extra disk space on the fileservers (cf) and a monitoring system to rapidly replace failed disks. On our regular servers we don't (currently) bother with this, even on servers where we could add a third disk as a spare to the two system disks.
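
As an illustration of the detection side of this (and not of the actual spares system), here is a minimal sketch of a ZFS health check. It assumes OpenZFS's 'zpool status -x', which prints 'all pools are healthy' (or 'no pools available') when there is nothing to report and details of the troubled pools otherwise; the exact messages and the cron-style exit codes are assumptions.

    #!/usr/bin/env python3
    """Complain if 'zpool status -x' says any ZFS pool is unhealthy.

    A sketch for a cron job: when everything is fine it is silent and
    exits 0; otherwise it passes along zpool's own report and exits 1.
    """

    import subprocess
    import sys

    # What current OpenZFS prints when there is nothing wrong.
    HEALTHY = ("all pools are healthy", "no pools available")

    def main():
        try:
            res = subprocess.run(
                ["zpool", "status", "-x"],
                capture_output=True, text=True, check=True,
            )
        except (OSError, subprocess.CalledProcessError) as e:
            print(f"could not run 'zpool status -x': {e}", file=sys.stderr)
            return 2

        report = res.stdout.strip()
        if report in HEALTHY:
            return 0

        # Something needs attention; print zpool's report so that cron
        # (or a wrapper that mails or pages) can tell us about it.
        print(report)
        return 1

    if __name__ == "__main__":
        sys.exit(main())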

(We temporarily moved to three way mirrors for system disks on some critical servers back in 2020, for relatively obvious reasons. Since we're now in the office regularly, we've moved back to two way mirrors.)

Our experience so far with both HDDs and SSDs is that we don't really seem to have clear age-related or model-related failures that take out multiple disks at once. In particular, we've yet to lose both disks of a mirror before one could be replaced, despite our habit of using SSDs and HDDs in basically identical pairs. We have had a modest number of disk failures over the years, but they've happened by themselves.

(It's possible that at some point we'll run a given set of SSDs for long enough that they start hitting lifetime limits. But we tend to grab new SSDs when re-deploying important servers. We also have a certain amount of server generation turnover for important servers, and when we use the latest hardware it also gets brand new SSDs.)
