Wandering Thoughts archives

2005-08-16

Things that could happen to your backups

In case yesterday's backup horror story didn't scare you enough, here's an incomplete list of things that have been known to go wrong with backups. Are you sure that none of them are happening to your backups right now?

  • the backup program writes corrupted backups.
  • the backup program doesn't capture a usable system state because things keep changing even as it runs (databases are famous for this).
  • the backup program generates incomplete backups, especially when run in incremental mode. For example, many Unix systems have historically had problems backing up renamed files or renamed directories.
  • the backup program is not noticing or complaining enough about disk read errors. (This happened to us. We lost some somewhat valuable historical files.)
  • you're not actually backing up everything important on the machine. (Especially common on Unix systems if you add a filesystem and forget to tell the backup system about it. And again, this has happened to us.)
  • despite having set them up, you're not actually doing backups: a cron job has broken, someone is forgetting to run a necessary command, etc. (Lazy people happen a lot.)

  • backup media errors are being ignored.
  • things don't properly notice or handle the backups hitting the end of the media. (Embarrassingly, I once did this too; honestly, I thought that the tape robot automatically advanced to the next tape when it hit end-of-tape...)
  • your tape drive is failing to properly write the tapes.
  • your tape robot and/or backup system is not actually advancing to the next tape; it's just overwriting the same tape over and over without it or you realizing it.
  • your backup system is accidentally overwriting the backup media instead of appending new data to the end of it. (This has been so common that the Amanda backup system refuses to append to tapes to eliminate the possibility of this happening.)
  • your backup tapes are not getting properly rotated. (This is a famous 'lazy people' issue, where the minimum-wage worker you hired hasn't bothered to actually change tapes.)
  • your tape drive has drifted out of proper alignment; while it can read back tapes that it wrote, nothing else can. Woe strikes if (or when) you have to replace it or it gets repaired. (Exabyte tape drives used to be infamous for this.)
  • your tape drive isn't being made any more. If it breaks, can you get another one that can read your backup tapes back?

  • your backup system's index files that tell you what backups are on what tapes are not being backed up.
  • your backup system's index files are being backed up, but you can't tell where they went without the index files.
  • backups can only be restored by a program running on the same operating system (and architecture) that made them. Don't lose your last machine of that OS + architecture combination!
  • your commercial backup system requires a node-locked license even to restore files. If you lose the backup server, can you easily run the software on another machine?
  • your restore program has bugs, although the backups themselves are fine. (This has happened to us. It's at least somewhat fixable.)

  • your offsite backups aren't.
  • your offsite backups aren't recent enough.
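
A couple of the items above (a forgotten filesystem, backups that have quietly stopped running) are at least mechanically checkable. The following is a minimal Python sketch, not a real tool: the function names are hypothetical, and it assumes you can already produce the list of mounted filesystems, the list your backup system has been told about, and the time of the last successful backup, all of which depend on your particular setup.

```python
from datetime import datetime, timedelta


def unlisted_filesystems(mounted, backed_up):
    """Return mount points that are mounted but missing from the backup list.

    Both arguments are plain iterables of mount point strings; how you
    obtain them (mount table, backup config) is system-specific and is
    deliberately left out of this sketch.
    """
    return sorted(set(mounted) - set(backed_up))


def is_stale(last_backup, now, max_age=timedelta(days=1)):
    """Return True if the last successful backup is older than max_age.

    The one-day default is an arbitrary assumption, not a recommendation.
    """
    return now - last_backup > max_age


if __name__ == "__main__":
    # Hypothetical example data.
    print(unlisted_filesystems(["/", "/home", "/srv"], ["/", "/home"]))
    print(is_stale(datetime(2005, 8, 14), datetime(2005, 8, 16)))
```

A check like this only catches the "forgot to tell the backup system" and "nothing ran" cases; it says nothing about whether what got written is any good.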

While some of these are very hard to check for, the only way in general to be confident that they aren't quietly happening to you is to test restoring from your backups periodically. Backups really need an end-to-end test every so often.
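
As a rough illustration of what the comparison step of such an end-to-end test might look like, here is a minimal Python sketch. It assumes you have already restored a backup into a scratch directory using your own backup system's restore procedure; the function names `tree_digests` and `compare_restore` are made up for this example. It only looks at the contents of regular files, so permissions, ownership, symlinks and so on are not checked, and files that legitimately changed after the backup was taken will show up as differences.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files don't fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_restore(original_root, restored_root):
    """Return (missing, extra, changed) lists of relative paths.

    missing: in the original tree but not in the restored one
    extra:   in the restored tree but not in the original
    changed: present in both, but with different contents
    """
    orig = tree_digests(original_root)
    rest = tree_digests(restored_root)
    missing = sorted(set(orig) - set(rest))
    extra = sorted(set(rest) - set(orig))
    changed = sorted(p for p in orig.keys() & rest.keys() if orig[p] != rest[p])
    return missing, extra, changed
```

Ideally the restore itself happens on a different machine with a different tape drive than the one that wrote the backup, since several of the problems in the list only show up then.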

(Feel free to add more in comments, of course. Note that I'm pretty much focusing on things that could be quietly going wrong in your (low-level) backup system itself, as opposed to all the additional problems that you can have in a disaster-recovery situation.)

sysadmin/PotentialBackupProblems written at 00:49:20

