Indexed archive formats and selective restores

January 13, 2024

We recently discovered that the Amanda backup system sometimes has to read a tar archive all the way to the end when restoring a few files from it, and then found that at other times it can do quick restores from tar archives. What's going on is the general issue of indexed (archive) formats, along with the potential complexities involved in using them in a full system.

To simplify, tar archives are a series of entries for files and directories. Tar archives contain no inherent index of their contents (unlike some archive formats, such as ZIP archives), but you can build an external index of where each file entry starts and what it is. Given such an index and its archive file on a storage medium that supports random access, you can jump to only the directory and file entries you care about and extract only them. Because tar archives have very little overall formatting beyond the entries themselves, you can either do the extraction directly or read the raw data for each entry, concatenate it, and feed the result to 'tar' to let tar do the extraction.
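
As a rough illustration of the external index and the 'clip out the entries' approach, here is a minimal sketch using Python's standard tarfile module. The function names build_index and clip_entries are made up for this example, and this is not a description of how Amanda itself does it. It assumes an uncompressed archive with no sparse files, and that the archive is on something you can seek in.

```python
import io
import tarfile

BLOCK = 512  # tar archives are made of 512-byte blocks


def build_index(archive_path):
    """Map each member name to (start offset, length) of its raw entry.

    Hypothetical helper: the entry covers the header block(s) plus the
    file data rounded up to a whole number of blocks. Sparse files are
    not handled.
    """
    index = {}
    with tarfile.open(archive_path, "r:") as tf:  # "r:" = no compression
        for member in tf:
            # Only regular files are followed by data blocks here.
            data_blocks = (member.size + BLOCK - 1) // BLOCK if member.isreg() else 0
            end = member.offset_data + data_blocks * BLOCK
            index[member.name] = (member.offset, end - member.offset)
    return index


def clip_entries(archive_path, index, names):
    """Read only the named entries and return them as a small fake tar archive."""
    chunks = []
    with open(archive_path, "rb") as f:
        for name in names:
            start, length = index[name]
            f.seek(start)  # this seek is what needs random access
            chunks.append(f.read(length))
    # A tar archive conventionally ends with two all-zero blocks.
    chunks.append(b"\0" * (2 * BLOCK))
    return b"".join(chunks)
```

The bytes that clip_entries returns can be opened with tarfile.open(fileobj=io.BytesIO(data)) or piped to 'tar -xf -'. Either way, tar only ever sees the entries you asked for.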

(The trick with clipping out the bits of a tar archive you cared about and feeding them to tar as a fake tar archive hadn't occurred to me until I saw what Amanda was doing.)

If tar was a more complicated format, this would take more work and more awareness of the tar format. For example, if tar archives had an internal index, either you'd need to operate directly on the raw archive or you would have to create your own version of the index when you extracted all of the pieces from the full archive. Why would you need to extract the pieces if there was an internal index? Well, one reason is if the entire archive file was itself compressed, and your external index told you where in the compressed version you needed to start reading in order to get each file chunk.
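
One way an index can point into the compressed version is if the data is compressed in independent pieces. The sketch below is only an assumption about how such a scheme could work, not something I know any particular backup system to do. It compresses each chunk as its own gzip member and records where each member starts in the compressed output. The names write_chunked and read_chunk are invented for the example.

```python
import gzip
import io


def write_chunked(chunks):
    """Compress each chunk as a separate gzip member.

    Returns (compressed bytes, index), where index[i] is the
    (offset, length) of chunk i within the compressed bytes.
    """
    out = io.BytesIO()
    index = []
    for chunk in chunks:
        start = out.tell()
        out.write(gzip.compress(chunk))
        index.append((start, out.tell() - start))
    return out.getvalue(), index


def read_chunk(stored, index, i):
    """Seek to chunk i in the compressed file object and decompress only it."""
    start, length = index[i]
    stored.seek(start)
    return gzip.decompress(stored.read(length))
```

Concatenated gzip members are still a valid gzip stream, so the whole thing can also be decompressed front to back in the usual way. The cost is somewhat worse compression than compressing everything as a single stream.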

The case of compressed archives shows that an index needs to describe the archive in the form it's eventually stored in. If you have an index of the uncompressed version but you're storing the archive in compressed form, the index is not necessarily of much use. Similarly, the archive has to be stored somewhere that lets you read only selected parts of it when you retrieve it. These days that's not a given, although I believe many remote object stores support HTTP Range requests at least some of the time.
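
For archives that live behind HTTP, turning an index entry into a partial read might look something like the following sketch. The function name range_request and the URL in the usage note are placeholders for illustration. The one detail worth being careful about is that the end position in a Range header is inclusive.

```python
import urllib.request


def range_request(url, start, length):
    """Build (but do not send) a GET request for `length` bytes at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    last = start + length - 1  # Range end positions are inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{last}"})
```

You would pass the result to urllib.request.urlopen(), for example with range_request("https://example.invalid/backup.tar", 1024, 1536). A server that honors the range replies with 206 Partial Content. One that ignores it replies with 200 and the entire object, so it's worth checking the status before assuming you got only the piece you asked for.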

(Another case that may be a problem for backups specifically is encrypted backups. Generally the most secure way to encrypt your backups is to encrypt the entire archive as a single object, so that you have to read it all to decrypt it and can't skip ahead in it.)


Last modified: Sat Jan 13 23:28:30 2024