Our advantage in reliable backups is that we get restore requests
Recently, GitLab had a data loss incident where part of the problem was, to quote their incident report:
- So in other words, out of five backup/replication techniques deployed none are working reliably or set up in the first place. [...]
Like many people, this set me to thinking about the perennial issue of quietly broken backups and how we're doing on that score (since there's nothing like someone else's catastrophe to make you wonder how you'd do in a similar situation). One of my realizations here is that we have a subtle advantage over many other organizations. That advantage masquerades as what some people would see as a drawback; it is that we get asked to restore accidentally deleted files on a reasonably routine basis.
In a lot of places, backups are only for disaster recovery. If you have users and the users delete something thoroughly enough that it defeats all your mechanisms for letting them change their minds, well, that thing is gone. Sure, you could theoretically spin up the backups, pull out a database backup that should have the deleted thing, restore the database to a scratch system, extract the deleted data, and add it back to the live database, but in practice you're not going to do that short of an extreme case that involves someone high up in the organization making a very special exemption. In these places your backups are completely untested unless you go out of your way to check them.
But we serve a general user population keeping (Unix) files on (Unix) fileservers, and sometimes they delete their files by accident and want them back, and our backups exist in part for that. So we restore them on request, with no exceptional permission needed or anything. The mere process of satisfying routine restore requests of 'I deleted file X sometime Tuesday, please get it back' serves as a basic test of our backup system (and sometimes of our long term archive system as well, depending on how far back they deleted what they want back). We probably get one or two such restore requests a month, so once or twice a month we re-verify that our backups are working as a side effect of handling them.
One consequence of this is that we actively pay attention to the state of our backup system, because we know it's going to be used reasonably regularly. We get daily backup reports and we look for at least outright failures to back things up (and they happen every so often). And in general, tending the backup system is something that's actively on our minds; not as a high priority most of the time, but as a routine ongoing background activity that does need a certain amount of manual attention (and we know it needs that attention, so if it went quiet, eg by no longer demanding new 'tapes', we'd probably start wondering what was wrong).
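As a minimal sketch of the kind of check that scanning a daily backup report amounts to: the following assumes a made-up report format with a FAILED marker on bad dumps, purely for illustration; it is not Amanda's actual report layout or tooling.

```python
# Hypothetical check for outright backup failures in a daily report.
# The report text and the FAILED/ERROR markers are assumptions for
# illustration, not Amanda's real output format.

def find_failures(report_text):
    """Return the lines of a backup report that mention a failure."""
    failures = []
    for line in report_text.splitlines():
        if "FAILED" in line or "ERROR" in line:
            failures.append(line.strip())
    return failures

# A made-up sample report for a fictional set of fileservers.
sample_report = """\
These dumps were to tape DAILY-042.
fs1 /home/a lev 0 OK
fs2 /home/b lev 1 FAILED (dump to tape failed)
fs3 /var/mail lev 0 OK
"""

for line in find_failures(sample_report):
    print(line)
```

In practice you would feed this the actual report (eg from the nightly mail) and alert on any non-empty result, rather than eyeballing the report by hand every day.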
(We don't necessarily look carefully at everything our backup system reports, which can mean some issues linger for a while.)
There are decided limits to this implicit testing we're doing. For a start it's neither systematic nor comprehensive, and on top of that we very rarely get asked to restore more than a single file, so we don't really test full restores of even a directory hierarchy very often (and we've never had to test a full restore of a filesystem). Even when we do restore a directory hierarchy, we haven't tested to see if deleted or renamed files are handled correctly; if the restore leaves kind of a mess, that's left for the user and their Point of Contact to sort out.
Possibly we should try to do better than this. On the other hand you can argue that the likely payoff for much better testing is pretty low, since we have a good understanding of the components of our backup system and it doesn't involve anything that's particularly likely to be fragile in ways that Amanda's existing reports wouldn't notice and cover.
(I'm waving my hands here and justifying it is going to require another entry.)