Our new plan for creating our periodic long term backups
Our ordinary backups are done on the usual straightforward rolling basis, where we aim to have about 60 days' worth of backups. We also try to make an additional set of long term backups every so often, currently roughly three times a year, and keep these for as long as possible. Every so often this makes people very happy because we can restore something they deleted six months ago without noticing.
Our long term backups are done with the same basic system as our regular disk-based backups. We have some additional Amanda servers that are used only for these long term backups, we load them up with disks, and then we have them do full backups of all of our filesystems to the spare disks. Obviously this requires careful scheduling and management, since we don't want to collide with the regular backups (which take priority). This is a simple approach and it works, but unfortunately over time it's become increasingly difficult and time consuming to actually do a long term backup run. The long term backups can only run during the day and require hand attention; sometimes the regular backups of our largest fileserver run into the day and block long term backups for that day entirely; the daytime backups go very slowly in general because our systems are actively in use; and so on. Many of these problems are only going to get worse in the future, as people use more space and are more active on our machines.
Recently, one of my co-workers had a great idea on how to deal with all of these problems: copy filesystem backups out of our existing Amanda servers. Instead of using additional Amanda servers to do additional backups, we can just make copies of the full filesystem backups from our existing regular backup system. When you do Amanda backups to 'tapes' that are actually disks, Amanda just writes each filesystem backup to a regular file. Want an extra copy, say for long term backups? Just copy it somewhere, say to the disks we're using for those long term backups. This copying doesn't bog down our fileservers, can easily be done when the Amanda servers are otherwise idle, and can be done any time we want, even days after the filesystem full backup was actually made. Effectively we've turned building the long term backups from a synchronous process into an asynchronous one.
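As a rough sketch of what the copy step could look like, assuming each full filesystem backup really is one regular file on the Amanda server's disk, the following copies a single backup file to a long term backup disk and checks the copy against the original. The function names and any paths you would pass in are hypothetical, and nothing here talks to Amanda itself.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_backup(src: Path, dest_dir: Path) -> Path:
    """Copy one backup file into dest_dir and verify the copy.

    Raises OSError if the checksum of the copy does not match the source.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 also preserves the modification time
    if sha256_of(src) != sha256_of(dest):
        dest.unlink()  # don't leave a bad copy on the long term disk
        raise OSError(f"checksum mismatch copying {src} to {dest}")
    return dest
```

Because this only reads files that already exist on the backup server, it can be run whenever that server is idle, which is the asynchronous property described above.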
The drawback of abandoning Amanda is that we lose all of the Amanda infrastructure for tracking where filesystems have been saved and for restoring filesystems (and files). It's entirely up to us to keep track of which disk has which filesystem backup (and when it was made) and to save per-filesystem index files. And any restores will have to be done entirely by hand from the raw backup files, which makes them rather less convenient. But we think we can live with all of this in exchange for it being much easier to make the long term backups.
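To illustrate the bookkeeping we'd be taking on, here is a minimal sketch of a manifest that records which disk holds which filesystem backup and when it was made. Everything in it is an assumption for illustration: the JSON-lines file format, the field names, and the idea of identifying disks by a label are hypothetical, not something we've settled on.

```python
import json
from pathlib import Path


def record_backup(manifest: Path, filesystem: str, disk_label: str,
                  backup_date: str, filename: str) -> None:
    """Append one record (as a JSON line) for a copied filesystem backup.

    backup_date is an ISO 8601 date string such as "2024-09-15".
    """
    entry = {
        "filesystem": filesystem,
        "disk_label": disk_label,    # hypothetical label written on the disk
        "backup_date": backup_date,  # when the full backup was originally made
        "filename": filename,        # name of the backup file on that disk
    }
    with manifest.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def find_backups(manifest: Path, filesystem: str) -> list[dict]:
    """Return all records for a filesystem, oldest backup first."""
    with manifest.open(encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    matches = [e for e in entries if e["filesystem"] == filesystem]
    # ISO 8601 date strings sort correctly as plain strings.
    return sorted(matches, key=lambda e: e["backup_date"])
```

An append-only text file like this is easy to keep a copy of on every long term disk, so the record of what is where travels with the backups themselves.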
Right now this is just a plan. We haven't done a long term backup run with it; the next one is likely to happen in September or October. We may find out that there are some unexpected complications or annoyances when we actually try it, although we haven't been able to think of any.
(In retrospect this feels like an obvious idea, but much like the last time, spotting it required being able to look at things kind of sideways. All of my thoughts about the problem were focused on 'how can we speed up dumping filesystems for the long term backups' and 'how can we make them work more automatically' and so on, which were all stuck inside the existing overall approach.)