2023-11-13
Amanda has clever restores from tar archives (sometimes)
Yesterday I wrote an entry about how the Amanda backup system reads all the way through tar archives on restores. Except, it turns out, this is only partially true, because under some circumstances Amanda will thoroughly optimize restores from tar archives (and possibly other archive formats). When everything is lined up right, what you'll observe is that Amanda reads only the data actually being restored, however little or much that is, and as a result restores of small files out of large backups can be quite fast. The situation where we've seen this happen is uncompressed backups made using the amgtar backup application.
Under normal circumstances, Amanda tries to make an index of every backup that says what files and directories are in it. These indexes are used when you use amrecover to look around in a backed up filesystem and do a restore of a single file, for example. In some cases, Amanda will build indexes of tar archives that list not just each name in the archive but also where it starts in the archive (and implicitly where it ends, based on the start of the next item). When you do a restore and the backup blob is on disk (and so can be seek'd around in), Amanda on your backup server will use this index to send just the needed pieces of the archive to the machine you're running amrecover on, where they get reassembled into what looks like a tar archive and then fed to tar to be extracted. Since tar is only being fed things that it should extract, it doesn't matter that tar itself wants to read all the way through the archive.
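To make the cut-and-glue idea concrete, here is a minimal sketch in Python using the standard tarfile module. This is not Amanda's code or its index format. The function names (make_archive, build_index, cut_out) are made up for this illustration, and it only handles plain files. It records where each member starts and ends, slices out the wanted byte ranges, and appends the two zero blocks that mark the end of a tar archive.

```python
import io
import tarfile

BLOCK = 512  # tar archives are built out of 512-byte blocks


def make_archive(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(archive):
    """Map each member name to its (start, end) byte offsets in the archive.

    Hypothetical index layout for this sketch; assumes regular files only.
    """
    index = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        for member in tar:
            # File data is padded out to a whole number of blocks.
            padded = -(-member.size // BLOCK) * BLOCK
            index[member.name] = (member.offset, member.offset_data + padded)
    return index


def cut_out(archive, index, names):
    """Glue the selected members together into a new, smaller tar archive."""
    pieces = [archive[slice(*index[name])] for name in names]
    # Two all-zero blocks mark the end of a tar archive.
    return b"".join(pieces) + bytes(2 * BLOCK)
```

Feeding the result of cut_out() to tarfile (or to tar itself) should extract only the selected members, even though the original archive may have been far larger.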
Making this work relies on a lot of things, including that the format of tar archives makes it simple to cut them apart and glue them back together again. A more complex backup format would give Amanda much more heartburn (if, for example, it started with an index of the rest of the data). I also don't know if Amanda will do anything to accelerate restores if it's reading from tape (or a non-seekable source in general). In theory it could at least stop after it's read everything necessary, and it wouldn't have to ship everything to the client.
This doesn't work for compressed tar backups for two reasons. The obvious reason is that Amanda doesn't have any index for where files are in the compressed version of the tar archive, so it can't skip to them and clip them out. The broader reason is that most compression formats don't allow you to seek arbitrarily in them, because compression (and decompression) relies on context, which is built and maintained by reading through all of the file or at least large blocks of data.
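As a small illustration of the context problem, here is a sketch using Python's zlib module (chosen only because it's in the standard library; it is not what Amanda or our backups use). The function name read_range_sequential is hypothetical. To get at a range of the uncompressed data, it has to feed the decompressor from the very start of the stream, and it reports how much compressed input that took.

```python
import zlib


def read_range_sequential(compressed, start, length, chunk=4096):
    """Return uncompressed bytes [start, start+length) of a zlib stream,
    plus how many compressed bytes had to be fed in to reach them.

    A sketch only: it keeps all decompressed output in memory, where a
    real implementation would throw away what it doesn't need.
    """
    decomp = zlib.decompressobj()
    out = bytearray()
    fed = 0
    # There is no way to jump ahead; decompression must begin at byte 0.
    while len(out) < start + length and fed < len(compressed):
        out += decomp.decompress(compressed[fed:fed + chunk])
        fed += chunk
    return bytes(out[start:start + length]), min(fed, len(compressed))
```

Asking for a few bytes near the end of the data means feeding in nearly the whole compressed stream, which is the opposite of what happens with an uncompressed tar archive and an offset index.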
(Some compressors do, though; in a comment on the first entry, vasi pointed to his pixz, which can create indexed compressed tar archives. I've also found t2sz, which is an indexed zstd compressor that understands tar archives. We use zstd for our compressed backups, but I don't know if it would be particularly easy to wire this up to Amanda.)
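The general trick behind seekable compression can be sketched as compressing the data in independent blocks and keeping an index of where each compressed block lives. The following is a conceptual sketch only, using zlib. It is not the on-disk format of pixz or t2sz, and the names (BLOCK_SIZE, compress_indexed, read_range_indexed) are invented for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per independent block (arbitrary)


def compress_indexed(data):
    """Compress data as independent blocks; return (blob, index).

    index[i] is the (offset, length) of compressed block i inside blob.
    """
    blob = bytearray()
    index = []
    for pos in range(0, len(data), BLOCK_SIZE):
        block = zlib.compress(data[pos:pos + BLOCK_SIZE])
        index.append((len(blob), len(block)))
        blob += block
    return bytes(blob), index


def read_range_indexed(blob, index, start, length):
    """Return uncompressed bytes [start, start+length), decompressing
    only the blocks that overlap the range (length must be positive)."""
    first = start // BLOCK_SIZE
    last = (start + length - 1) // BLOCK_SIZE
    out = bytearray()
    for offset, size in index[first:last + 1]:
        out += zlib.decompress(blob[offset:offset + size])
    skip = start - first * BLOCK_SIZE
    return bytes(out[skip:skip + length])
```

The cost is somewhat worse compression, since each block starts with no context from the previous ones, in exchange for being able to jump straight to the block you want.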