Wandering Thoughts archives

2023-04-02

You should automate some basic restore testing of your backups

Recently on the Fediverse, Ben Zanin said this important thing:

Yesterday was world backup day!

Please remember that the value of your backups is determined by the cadence of your successful restoration tests.

Please also remember that when backing up your enterprise applications you need to actually follow vendor admonitions about synchronizing your database and filesystem snapshots, keeping your backups across an air gap, and never connecting your system network to your management network. These are not someone else's problems, they are yours.

💖

Everyone says that you don't really have backups until you've tested that you can restore them (and I have some hair-raising stories about that, where I was lucky to avoid a disaster). The corollary to this is that it's a really good idea to automate at least a basic test of reading your backups and perhaps restoring something from them. This doesn't have to be particularly elaborate. For example, you could have a few canary files in various places where you write the current day's date before you start backups, and then afterward you can try to restore a random canary file and see if it has the right date.
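As an illustration only, the canary-file idea could be sketched in Python roughly like this. Everything here is an assumption for the example: the function names are made up, and how a canary actually gets restored depends entirely on your backup system, so that step is left out.

```python
import datetime
import random
from pathlib import Path


def write_canaries(paths, today=None):
    """Write the current day's date into each canary file.

    Meant to run before backups start. Returns the date stamp written.
    """
    stamp = (today or datetime.date.today()).isoformat()
    for p in paths:
        p = Path(p)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(stamp + "\n")
    return stamp


def pick_canary(paths, rng=random):
    """Pick one canary at random to try restoring after backups finish."""
    return rng.choice(list(paths))


def canary_is_current(restored_path, expected_stamp):
    """Check that a restored canary file holds the expected date.

    A missing or unreadable file counts as a failure, not an error.
    """
    try:
        return Path(restored_path).read_text().strip() == expected_stamp
    except OSError:
        return False
```

The restore itself (whatever command your backup system uses to pull one file back) would sit between `pick_canary()` and `canary_is_current()`; the check should alert someone when it returns False.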

How this will work depends a lot on how your backup system works and how much of an end to end test you want to do. Locally, we use Amanda to organize tar based backups of both our fileservers and various other servers; the backups are written to disk instead of tape for reasons beyond the scope of this entry. Our backups get a certain amount of testing through reasonably regular restore requests from people, but we don't see those all that often. So when we started thinking about this, my co-workers built a two stage system.

Our first stage of automated 'restore' testing is that each backup server picks a random backup and attempts to read all the way through it with 'tar -tf'. This only checks for very basic problems, but just doing this provides a certain amount of assurance that things haven't gone terribly wrong. Our second stage of automated restore testing is that we have canary files with known contents that are backed up by each of our backup servers (in a forced full backup every day). Every day, we extract each backup server's canary file from its backups and verify that the contents are right.
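To make the two stages concrete, here is a minimal Python sketch that works on a plain tar archive. This is not our actual implementation and it doesn't involve Amanda at all; it assumes you already have the backup available as an ordinary tar file, and the function names and the canary member name are hypothetical. The first function is only an approximation of what 'tar -tf' exercises (it also reads the file data instead of just walking the headers).

```python
import tarfile


def read_through(archive_path):
    """Stage one: walk the whole archive, reading every regular file's data.

    Returns the number of members seen. Errors that tarfile detects
    are left to propagate so the caller can report them.
    """
    count = 0
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                f = tar.extractfile(member)
                # Read and discard the data in 1 MiB chunks.
                while f.read(1 << 20):
                    pass
            count += 1
    return count


def canary_matches(archive_path, member_name, expected):
    """Stage two: pull one canary file out and compare it to known contents.

    'expected' is the known contents as bytes. A missing member or an
    unreadable archive counts as a failed check.
    """
    try:
        with tarfile.open(archive_path, "r:*") as tar:
            f = tar.extractfile(member_name)
            if f is None:  # member exists but isn't a regular file
                return False
            return f.read() == expected
    except (KeyError, tarfile.TarError, OSError):
        return False
```

A daily job could call `read_through()` on one randomly chosen archive and `canary_matches()` on each server's most recent backup, and send mail if either one fails.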

(I've elided some details about our setup for reasons.)

This is obviously not comprehensive testing. But even basic testing like this is valuable because it will pick up pervasive, large issues. If tar starts writing corrupted archives, for example, we'll find out pretty soon. This matters because one of the classical failure modes of backups is a total failure to be restorable; your backups didn't happen, or they were thrown away, or they were corrupt, or they didn't actually back up anything. Even a very basic test of 'can you pull a single canary file from your backups' or 'are the low level backups formatted correctly' will detect this sort of pervasive failure. And automating this sort of very basic test will ensure that you find out about the issue promptly. Don't let the perfect be the enemy of the good.

(Of course if you have the capacity and especially if it's important enough, automate larger scale end to end restoration tests. But don't let doing those stop you from starting with simple, easy to do things.)

sysadmin/AutomateSomeBackupRestoreTests written at 22:01:04;

