Archival storage in the modern world

September 8, 2011

Today, the following got asked on a university-wide mailing list for sysadmins:

I've had a request from a research lab about the availability of long term (10 years) backups. The amount of data will be roughly 10 - 20T by the end of that period growing at an estimated 1T/yr. [...]

(This isn't really backup, this is archiving.)

My view is that the right answer is not to archive the data at all. If you care about long term availability of some data, practically the last thing you want to do is archive it, because reliable archives are hard. Instead you want to keep it on live disks on a live fileserver (using RAID and ideally a filesystem that has data checksums), and just do backups.

(You're doing backups because RAID is not a backup. You're using some sort of checksums on the data so that you can notice corruption before you overwrite the last of your uncorrupt, pre-corruption rolling backups.)
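If your filesystem doesn't checksum data for you, you can approximate this at the application level with a checksum manifest that you verify on a schedule. Here is a minimal sketch (the function names and the manifest-as-dict layout are my own invention, not a particular tool's):

```python
import hashlib
import os

def file_sha256(path, blocksize=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Walk a directory tree, recording a checksum for every file.

    Keys are paths relative to root, so the manifest survives the tree
    being moved or restored elsewhere.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_sha256(path)
    return manifest

def verify(root, manifest):
    """Return the relative paths of files whose checksum has changed."""
    return sorted(rel for rel, digest in manifest.items()
                  if file_sha256(os.path.join(root, rel)) != digest)
```

Run `verify()` from cron more often than your backup rotation period, so a silently corrupted file gets flagged while an uncorrupted copy still exists in the backups.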

Keeping the data live doesn't guarantee that the data will survive ten years, but provided that you pay attention to the fileserver it does mean that you and your successors will think about at least some of the issues, that you will notice if data starts to degrade, and that you have a chance to recover from problems. If you decide to turn off the fileserver and abandon the data, it will at least be a conscious choice instead of simply failing to notice that you've just gotten rid of (or lost) something that was necessary to recover the archives.

(If you abandon the fileserver in a corner and therefore fail to notice plaintive complaints about dying disks, failing backups, and so on, all bets are off. But there are lots of ways to screw up archival storage too.)

This might sound expensive, but even 20 TB of RAID storage space plus backups is not all that much money and it's getting cheaper all of the time. I wouldn't be surprised if it was cheaper than 20T of ten year archival storage, especially once you factor staff time into it (to research and build a ten year, multi-terabyte archival system). And as a bonus the researchers get to keep all of this historical data online, which may turn out to be useful or at least interesting.

(When costing out the archival system, don't forget to include the cost of redundant archival media so that damage to a single piece of media will not lose data. Even if the media is perfectly reliable, things like fires and accidents happen.)

One situation where this might not be good enough is if the research lab wants archives that cannot be altered after they're made, so that they can be sure that the data they've restored now is the data that they used for the research paper seven years ago and no one has accidentally modified it since then. You may still be able to come up with technological solutions like archival filesystems that you make read-only once data has been loaded onto them.
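A filesystem-level read-only flag is the strong version of this (on ZFS, for example, `zfs set readonly=on` on a dataset once it is loaded). A low-tech approximation on a plain filesystem is to strip the write permission bits after loading. A minimal sketch, assuming an unprivileged threat model (`seal` is my own name, and this does nothing against root, who can always chmod the bits back):

```python
import os
import stat

# All three write bits: owner, group, and other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def seal(root):
    """Strip write permission from every file and directory under root.

    Removing write permission from the directories also prevents files
    from being added, renamed, or deleted by unprivileged users.
    """
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            os.chmod(path, os.stat(path).st_mode & ~WRITE_BITS)
        # os.walk yields root itself as a dirpath, so it gets sealed too.
        os.chmod(dirpath, os.stat(dirpath).st_mode & ~WRITE_BITS)
```

Combined with the checksum manifest, this gives you a reasonable answer to "is this still exactly the data from the paper?" without building a dedicated archival system.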

(This entry is adopted from comments I made on the mailing list, so local people may find that it looks familiar. The issues are generic, though. My earlier entry on the same subject was more oriented towards personal data, instead of this sort of larger scale.)
