Larger backup systems often operate in multiple stages

July 8, 2022

At the small scale, backups are usually straightforward (although not always simple). As you get into larger and more complicated environments, like ours, things can get more tangled. One way this shows up is that larger backup systems often operate in stages, or phases if you prefer, where different sorts of things happen in different stages.

As a concrete example, Amanda normally operates in two overlapping stages. First, Amanda makes the actual backups from your various systems and streams them onto one or more 'holding disks' (I called this the staging disk in this entry). Second, Amanda writes completed backups from the holding disk onto whatever you're using as 'tape' (in our case, HDDs), removing each completed backup from the holding disk after it's done.

(A more complicated backup environment than ours might have a third significant stage, where you prepare for the backup by doing things like pausing a database to take a coherent dump of it. Amanda itself actually has an 'estimation and scheduling' stage at the start, which I've ignored here.)
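The overlapping two-stage flow can be sketched in miniature. This is an illustration of the general shape, not Amanda's actual internals; the names and the queue-as-holding-disk are my own stand-ins. Stage one backs up clients in parallel onto the 'holding disk', while stage two's single writer drains completed backups to 'tape' one at a time as they become available:

```python
import queue
import threading

def run_backups(clients):
    # The holding disk: completed backups land here as stage one finishes them.
    holding = queue.Queue()
    # The 'tape': sequential, written by exactly one writer.
    tape = []

    def back_up(client):
        # Stage 1: make the backup. Many of these run in parallel.
        holding.put("backup-of-" + client)

    def tape_writer(expected):
        # Stage 2: write each completed backup to tape, one at a time,
        # removing it from the holding disk as it's done. This overlaps
        # with stage 1; it starts as soon as the first backup lands.
        for _ in range(expected):
            tape.append(holding.get())

    writer = threading.Thread(target=tape_writer, args=(len(clients),))
    writer.start()
    workers = [threading.Thread(target=back_up, args=(c,)) for c in clients]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    writer.join()
    return tape
```

The order of backups on 'tape' is whatever order stage one happened to finish them in, which is part of why the two stages can overlap at all.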

There are various good current and past reasons for backup systems to divide things up this way. For example, if you're using actual tape to store your backups, it's a sequential access medium; only one backup can be writing to tape at a time. You probably want to do the actual backups in parallel to speed things up.

This division of your backup into different stages matters because different stages usually have different effects on your environment. For example, Amanda can put significant load on our fileservers and use a bunch of network bandwidth when it makes backups, but that load and bandwidth usage is over once the backups have been written to the holding disk. After that, only the Amanda servers are affected by writing things out to the HDDs that are our 'tape'.

Because of this, what stage a backup spends time in can matter as much as how much time it takes in total (although obviously you need the total backup to finish before the point where you want to start the next one). Often you would rather have a backup that takes eight hours where only the first hour affects your production systems, instead of a backup that takes five hours but affects your production systems for four of them. When designing, building, and tuning a backup system, you can easily wind up with tradeoffs of where to spend money or optimize that affect not just the total time taken, but the division of time between stages.

(And then sometimes you run into inconvenient limits on what common hardware can do.)


Comments on this page:

For example, if you're using actual tape to store your backups, it's a sequential access medium; only one backup can be writing to tape at a time. You probably want to do the actual backups in parallel to speed things up.

There is some subtlety to the "only one backup" point, specifically multiplexing, where multiple clients can send data to a single tape drive:

So if you have three clients (A, B, C) and no multiplexing, their blocks are AAAAAABBBBBBCCCCCCC. Whereas if multiplexing is used you get interleaving of blocks: AAABBCCCABCCACBACBA. This allows one to use the full bandwidth of the tape drive during writing of backups at the cost of slower reads during restores (or copying/cloning) because there are regions of the tape that need to be skipped.
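The commenter's interleaving can be shown concretely. This is a toy sketch of the idea, not any real backup software's block scheduler; a real multiplexer interleaves whatever blocks happen to arrive, where this one just round-robins for illustration:

```python
from itertools import chain, zip_longest

def no_multiplex(streams):
    # One client at a time on tape: AAAABBBBCCCC.
    return "".join(chain.from_iterable(streams))

def multiplex(streams):
    # Round-robin interleaving of blocks from all clients: ABCABC...
    # Clients that run out of blocks simply drop out of the rotation.
    return "".join(block
                   for blocks in zip_longest(*streams, fillvalue="")
                   for block in blocks)

def restore(tape, client_block):
    # Restoring one client from a multiplexed tape means skipping over
    # every other client's blocks, which is the slower-read cost.
    return "".join(b for b in tape if b == client_block)
```

With streams `["AAA", "BB", "CCCC"]`, `no_multiplex` yields `AAABBCCCC` while `multiplex` yields `ABCABCACC`; restoring client A from the latter has to skip the B and C blocks scattered between A's.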

I think multiplexing was more of a thing when people used to back up straight from the client to the tape without any intermediary stages/phases, and any disk head seeking or network hiccup would slow the transfer. It's been best practice for a number of years to back up to disk first, and then move the bits to tape later, which generally reduces random I/O and head seeks.

Written on 08 July 2022.