2023-11-10
Backup systems and how much they do or don't know about storage formats
One of the divides in large-scale systems for handling backups is whether they have their own custom storage format (or formats) for backups, or whether they rely on outside tools to create what I'll call 'backup blobs' that the backup system then manages. This division is fractal, because sometimes what you're backing up is, for example, database snapshots or dumps, and even a backup system with its own custom storage format may well treat the database dump as an opaque blob of a file that it only deals with as a unit. (It's a lot of work to be able to peer inside all of the storage formats you might run into, or even to recognize them.)
The advantage of a backup system that relies on other tools is that it doesn't have to write the tools. This has two benefits. First, standard tools for making backups of filesystems and so on are often much more thoroughly tested and hardened against weird things than a new program. Second, if you allow people to specify what tools to use and provide their own, they can flexibly back up a lot of things in a lot of special ways; for example, you could write a custom tool that took a ZFS snapshot of a filesystem and then generated a backup of the snapshot. More complex tricks are possible if people want to write the tools (imagine a database 'backup' program that treated the database as something like a filesystem, indexing it and allowing selective restores).
(Generally, backup systems insist that tools have certain features and capabilities, for example being able to report a list of contents (an index) of a just-made backup in a standard format. It's up to you to adapt existing programs to fit these requirements, perhaps with cover programs.)
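As an illustration of what such a cover program might look like, here is a minimal sketch in Python. It uses the standard library's tarfile module as the 'outside tool' and then reports an index of the backup it just made. The JSON-lines index format and the function name make_backup_with_index are made up for this example; a real backup system would dictate its own index format and calling convention.

```python
import io
import json
import tarfile


def make_backup_with_index(files):
    """Build a tar 'backup blob' from {name: bytes}; return (blob, index_lines).

    The JSON-lines index is a hypothetical stand-in for whatever
    'standard format' a particular backup system insists on.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    blob = buf.getvalue()

    # Re-read the blob we just wrote, so the index describes what is
    # actually in the backup and not what we intended to put there.
    index_lines = []
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar:
            entry = {
                "path": member.name,
                "size": member.size,
                "type": "file" if member.isfile() else "other",
            }
            index_lines.append(json.dumps(entry))
    return blob, index_lines
```

The backup system itself never looks inside the blob; all it gets is the blob and the index lines. Everything it knows about the contents comes from what the cover program chooses to report.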
The advantage of a backup system that has its own storage format for backups and its own tools for creating them, restoring them, and so on is that the backup system can often offer more features (and better implemented ones). A backup system that relies on other tools for the actual work of creating backups and performing restores is forced to treat those tools more or less as black boxes; a backup system that does this work in-house can tightly integrate things to provide various nice features, like knowing exactly where a specific file you want to restore is within a large backup, or easily performing fine-grained backup scheduling and tracking across a lot of files. And the storage format itself can be specifically designed for the needs of backups (and of this backup system), instead of being at the mercy of the multiple purposes and historical origins of, say, tar format archives.
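To make the 'knowing exactly where a specific file is' point concrete, here is a toy sketch of an integrated format in Python. The format, and the names write_backup and restore_one, are invented for illustration and aren't taken from any real backup system. The writer records each file's byte offset and length as it goes, so a restore can seek straight to the file instead of scanning the backup from the start.

```python
import io


def write_backup(files):
    """Pack {name: bytes} into one blob.

    Returns (blob, index) where index[name] = (offset, length).
    This is a made-up toy format, not any real backup system's.
    """
    out = io.BytesIO()
    index = {}
    for name in sorted(files):
        data = files[name]
        # Because we write the format ourselves, we know exactly
        # where each file's data starts and how long it is.
        index[name] = (out.tell(), len(data))
        out.write(data)
    return out.getvalue(), index


def restore_one(blob, index, name):
    """Restore a single file by seeking directly to its recorded offset."""
    offset, length = index[name]
    stream = io.BytesIO(blob)
    stream.seek(offset)
    return stream.read(length)
```

A system built on top of an outside tool could only get this sort of thing if the tool happened to report offsets, or if the system was willing to parse the tool's output format itself.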
(But then the backup system has to do all of the work itself, and fix all of the bugs, if it manages to find them before they damage someone's backup.)
In practice, backup systems seem to go back and forth on this over time, depending on their goals (including where they're backing up to) and the state of the commonly available tools on the platforms they want to work on. For commercial backup systems, there can also be commercial reasons to use a custom format that only your own tools can deal with. At some points in the past, the general tools were limited and not considered suitable, so even open source people built fully custom systems. At other points, the tools were considered good enough for the goals, so open source backup systems tended to use them and focus on other areas.
(For open source backup systems it is in some sense a waste to have to write your own tools. There are only so many programming resources available, and there are lots of things a good backup system needs to implement that can't be outsourced to other tools.)