Larger backup systems often operate in multiple stages

July 8, 2022

At the small scale, backups are usually straightforward (although not always simple). As you get into larger and more complicated environments, like ours, things can get more tangled. One way this happens is that larger backup systems often operate in stages (or phases, if you prefer), where different sorts of things happen in different stages.

As a concrete example, Amanda normally operates in two overlapping stages. First, Amanda makes the actual backups from your various systems and streams them onto one or more 'holding disks' (I called this the staging disk in an earlier entry). Second, Amanda writes completed backups from the holding disk onto whatever you're using as 'tape' (in our case, HDDs), removing each completed backup from the holding disk after it's done.

(A more complicated backup environment than ours might have a third significant stage, where you prepare for the backup by doing things like pausing a database to take a coherent dump of it. Amanda itself actually has an 'estimation and scheduling' stage at the start, which I've ignored here.)

There are various good reasons, current and past, for backup systems to divide things up this way. For example, if you're using actual tape to store your backups, it's a sequential access medium; only one backup can be writing to tape at a time. But you probably want to do the actual backups in parallel to speed things up, so they need somewhere to go (such as a holding disk) until it's their turn to be written to tape.
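To make the shape of this concrete, here is a minimal Python sketch of the general pattern: several dumps run in parallel and land on a holding area, while a single writer drains that holding area onto a sequential 'tape'. This is not how Amanda is actually implemented; the function name `run_backup`, the `parallel_dumps` parameter, and the `dump-of-...` labels are all made up for illustration, and a real system would be moving data instead of strings.

```python
import queue
import threading


def run_backup(hosts, parallel_dumps=3):
    """Toy model of a two-stage backup (hypothetical, not Amanda's design)."""
    pending = queue.Queue()        # hosts still waiting to be dumped
    holding_disk = queue.Queue()   # completed dumps waiting for 'tape'
    tape = []                      # sequential medium: only the writer appends
    for host in hosts:
        pending.put(host)

    def dumper():
        # Stage one: take hosts off the pending list and 'dump' them.
        # Several dumpers run at once, so this stage is parallel.
        while True:
            try:
                host = pending.get_nowait()
            except queue.Empty:
                return
            holding_disk.put(f"dump-of-{host}")

    def writer():
        # Stage two: a single writer moves completed dumps to 'tape',
        # one at a time, overlapping with the dumps still in progress.
        while True:
            item = holding_disk.get()
            if item is None:       # sentinel: all dumps are finished
                return
            tape.append(item)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    dumpers = [threading.Thread(target=dumper) for _ in range(parallel_dumps)]
    for t in dumpers:
        t.start()
    for t in dumpers:
        t.join()
    holding_disk.put(None)         # tell the writer nothing more is coming
    writer_thread.join()
    return tape
```

The order that dumps reach the 'tape' depends on which dumper finishes first, so it can vary from run to run; what stays fixed is that only one thing is ever writing to the tape.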

This division of your backup into different stages matters because different stages usually have different effects on your environment. For example, Amanda can put significant load on our fileservers and use a bunch of network bandwidth when it makes backups, but that load and bandwidth usage is over once the backups have been written to the holding disk. After that, only the Amanda servers are affected by writing things out to the HDDs that are our 'tape'.

Because of this, which stage a backup spends its time in can matter as much as how much time it takes in total (although obviously the whole backup needs to finish before the point where you want to start the next one). Often you would rather have a backup that takes eight hours where only the first hour affects your production systems than a backup that takes five hours but affects your production systems for four of them. When designing, building, and tuning a backup system, you can easily wind up with tradeoffs about where to spend money or optimize that affect not just the total time taken, but also the division of time between stages.
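The comparison above can be written down as a small sketch. The `BackupPlan` class, its field names, and the `least_disruptive` function are hypothetical names invented for this illustration, and the 24-hour window is an assumption standing in for 'before you want to start the next backup'. The idea is simply to pick, among the plans that finish in time, the one that loads production systems for the shortest time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPlan:
    name: str
    total_hours: float        # start of first stage to end of last stage
    production_hours: float   # time during which production systems are loaded

    def fits_window(self, window_hours):
        # The whole backup has to finish before the next one should start.
        return self.total_hours <= window_hours


def least_disruptive(plans, window_hours):
    """Return the plan that fits the window with the least production impact.

    Ties are broken by shorter total time. Returns None if nothing fits.
    """
    candidates = [p for p in plans if p.fits_window(window_hours)]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (p.production_hours, p.total_hours))


# The two backups from the example in the text.
slow_but_gentle = BackupPlan("eight hours, one on production", 8, 1)
fast_but_heavy = BackupPlan("five hours, four on production", 5, 4)

choice = least_disruptive([slow_but_gentle, fast_but_heavy], window_hours=24)
```

With a daily window the slower plan wins, because it only bothers production for one hour. If the window shrank to six hours, only the faster plan would fit at all, which is the sort of tradeoff that can push you toward spending money on one stage or the other.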

(And then sometimes you run into inconvenient limits on what common hardware can do.)

Written on 08 July 2022.
