In case yesterday's backup horror story didn't scare you
enough, here's an incomplete list of things that have been known to go
wrong with backups. Are you sure that none of them are happening to
your backups right now?
- the backup program writes corrupted backups.
- the backup program doesn't capture a usable system state because
things keep changing even as it runs (databases are famous for
this).
- the backup program generates incomplete backups, especially when run
in incremental mode. For example, many Unix systems have historically
had problems backing up renamed files or renamed directories.
- the backup program is not noticing or complaining enough about disk
read errors. (This happened to us. We lost some somewhat valuable
historical files.)
- you're not actually backing up everything important on the
machine. (Especially common on Unix systems if you add a filesystem
and forget to tell the backup system about it. And again, this has
happened to us.)
- despite having set them up, you're not actually doing backups;
a cron job has broken, someone is forgetting to run a necessary
command, etc. (Lazy people happen a lot.)
- backup media errors are being ignored.
- things don't properly notice or handle the backups hitting the end
of the media. (Embarrassingly, I once did this too; honestly, I
thought that the tape robot automatically advanced to the next
tape when it hit end-of-tape...)
- your tape drive is failing to properly write the tapes.
- your tape robot and/or backup system is not actually advancing to
the next tape; it's just overwriting the same tape over and over
without it or you realizing it.
- your backup system is accidentally overwriting the backup media
instead of appending new data to the end of it. (This has been so
common that the Amanda backup system refuses to append to tapes at
all, to eliminate the possibility of this happening.)
- your backup tapes are not getting properly rotated. (This is a famous
'lazy people' issue, where the minimum-wage worker you hired hasn't
bothered to actually change tapes.)
- your tape drive has drifted out of proper alignment; while it can
read back tapes that it wrote, nothing else can. Woe strikes if
(or when) you have to replace it or it gets repaired. (Exabyte tape
drives used to be infamous for this.)
- your tape drive isn't being made any more. If it breaks, can you get
another one that can read your backup tapes back?
- your backup system's index files that tell you what backups are on
what tapes are not being backed up.
- your backup system's index files are being backed up, but you don't
know to where without the index files.
- backups can only be restored by a program running on the same
operating system (and architecture) that made them. Don't lose your
last machine of that OS + architecture combination!
- your commercial backup system requires a node-locked license even to
restore files. If you lose the backup server, can you easily run the
software on another machine?
- your restore program has bugs, although the backups themselves are
fine. (This has happened to us. It's at least somewhat fixable.)
- your offsite backups aren't.
- your offsite backups aren't recent enough.
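A few of the items above (backups that have quietly stopped running, or offsite copies that aren't recent enough) can at least be caught by a cheap automated check. Here is a minimal sketch in Python. It assumes your backups land as ordinary files somewhere under a directory you can walk, which is not true of every setup (tapes, for one). The function names and the `max_age_hours` threshold are made up for illustration.

```python
import os
import time


def newest_mtime(backup_dir):
    """Return the newest modification time of any regular file under
    backup_dir, or None if there are no files at all."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(backup_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other oddities
            mtime = os.path.getmtime(full)
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def backup_is_stale(backup_dir, max_age_hours, now=None):
    """True if nothing under backup_dir is newer than max_age_hours."""
    now = time.time() if now is None else now
    newest = newest_mtime(backup_dir)
    if newest is None:
        return True  # an empty backup directory counts as stale
    return (now - newest) > max_age_hours * 3600
```

Run something like this from a machine other than the backup server, and have it complain loudly when `backup_is_stale()` is true. It only tells you that something was written recently, not that what was written is any good.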
While some of these are very hard to check for, the only way in general
to be confident that they aren't quietly happening to you is to test
restoring from your backups periodically. Backups really need an
end-to-end test every so often.
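As a rough sketch of what the checking half of such a test could look like: restore a directory tree to a scratch location with whatever your backup system uses, then compare it against the original by checksum. The Python below assumes both trees are visible as ordinary directories; the function names are mine and not part of any backup system.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # only compare the contents of regular files
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(original_root, restored_root):
    """Return (missing, extra, changed) lists of relative paths."""
    orig = tree_digests(original_root)
    rest = tree_digests(restored_root)
    missing = sorted(set(orig) - set(rest))
    extra = sorted(set(rest) - set(orig))
    changed = sorted(p for p in set(orig) & set(rest) if orig[p] != rest[p])
    return missing, extra, changed
```

On a live filesystem some files will legitimately differ because they changed after the backup ran, so this works best against a snapshot or a tree you know is static. It also only looks at file contents, not ownership, permissions, or symlinks, all of which a real restore needs to get right too.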
(Feel free to add more in comments, of course. Note that I'm pretty much
focusing on things that could be quietly going wrong in your (low-level)
backup system itself, as opposed to all the additional problems that you
can have in a disaster-recovery situation.)