2017-05-11
The challenges of recovering when unpacking archives with damage
I wrote recently about how 'zfs receive' makes no attempt to recover
from damaged input, which means that if you save 'zfs send' output
somewhere and your saved file gets damaged, you are up the
proverbial creek. It is
worth mentioning that this is not an easy or simple problem to solve
in general, and that doing a good job of this is likely going to
affect a number of aspects of your archive file format and how it's
processed. So let's talk a bit about what's needed here.
The first and most obvious thing you need is an archive format that makes it possible to detect and then recover from damage. Detection is in some sense easy; you checksum everything, and when a checksum fails, you know damage has started. More broadly, there are several sorts of damage you need to worry about: data that is corrupted in place, data that has been removed, and data that has been inserted. It would be nice if we could assume that data will only get corrupted in place, but my feeling is that this assumption is unwise.
(For instance, you may get 'removed data' if something reading a file off disk hits a corrupt spot and spits out only partial or no data for it when it continues on to read the rest of the file.)
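To make this a bit more concrete, here is a minimal sketch (in Python, with a completely made-up block format; the CRC32 and the field sizes are just for illustration, not anything from a real archive format) of what per-block checksum detection can look like:

    import struct
    import zlib

    # Hypothetical block framing: a little-endian payload length plus a
    # CRC32 of the payload, followed by the payload itself.
    BLOCK_HEADER = struct.Struct("<II")

    def write_block(out, payload):
        out.write(BLOCK_HEADER.pack(len(payload), zlib.crc32(payload)))
        out.write(payload)

    def read_blocks(inp):
        # Yields (ok, payload) pairs; ok is False when a block fails its
        # checksum and the caller has to decide how to recover.
        while True:
            header = inp.read(BLOCK_HEADER.size)
            if len(header) < BLOCK_HEADER.size:
                return
            length, want_crc = BLOCK_HEADER.unpack(header)
            payload = inp.read(length)
            yield (zlib.crc32(payload) == want_crc, payload)

Note that if the length field itself is what gets corrupted, this simple reader immediately loses its place in the stream, which is part of why you need the resynchronization machinery discussed next.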
In-place corruption can be detected and then skipped with checksums; you skip any block that fails its checksum, and you resume processing when the checksums start verifying again. Once data can be added or removed, you also need to be able to re-synchronize the data stream to do things like find the next start of a block; this implies that your data format should have markers, and perhaps some sort of escape or encoding scheme so that the markers can never appear in actual data. You want re-synchronization in your format in general anyway, because one of the things that can get corrupt is the 'start of file' marker; if it gets corrupted, you obviously need to be able to unambiguously find the start of the next file.
(If you prefer, call this a more general 'start of object' marker, or just metadata in general.)
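Continuing the sketch (again with made-up details, in this case a magic marker byte sequence and the assumption that the input is a seekable binary file), resynchronizing after a bad spot amounts to scanning forward for the next marker:

    # Hypothetical marker put at the start of every block; a real format
    # needs an escaping or encoding scheme so it can't occur in data.
    MARKER = b"\xfeARCHBLK\xfe"

    def resync(inp, chunk_size=64 * 1024):
        # Scan forward for the next marker, discarding damaged bytes.
        # Returns how many bytes were skipped, or None at end of file.
        skipped = 0
        buf = b""
        while True:
            data = inp.read(chunk_size)
            if not data:
                return None
            buf += data
            pos = buf.find(MARKER)
            if pos >= 0:
                # Rewind so normal parsing resumes at the marker itself.
                inp.seek(pos - len(buf), 1)
                return skipped + pos
            # Keep a tail in case the marker straddles a read boundary.
            keep = len(MARKER) - 1
            skipped += max(0, len(buf) - keep)
            buf = buf[-keep:]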
So you have an archive file format with internal markers for redundancy, one where you can take damage and still resynchronize with as little data lost and unusable as possible. But this is just the start. Now you need to look at the overall structure of your archive and ask what happens if you lose some chunk of metadata; how much of the archive can no longer be usefully processed? For example, suppose that data in the archive is identified by inode number, you have a table mapping inode numbers to filenames, and this table can only be understood with the aid of a header block. If you lose that header block to corruption, you lose all of the filenames for everything in the archive. The data in the archive may be readable in theory, but it's not useful in practice unless you're desperate (since you'd have to go through a sea of files identified only by inode number to figure out what they are and what directory structure they might go into).
Designing a resilient archive format, one that recovers as much as possible in the face of corruption, often means designing an inconvenient or at least inefficient one. If you want to avoid loss from corruption, both redundancy and distributing crucial information around the archive are your friends. Conversely, clever and efficient formats full of highly compressed and optimized things are generally not good for this.
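As an illustration of what distributing crucial information around the archive can look like, here is a sketch where every member carries its own name inline (which is, in spirit, what tar does), so that a lost or corrupted central index can be rebuilt just by walking the archive. The JSON metadata and the length prefix are made up for the example:

    import json
    import struct

    MLEN = struct.Struct("<I")

    def write_member(out, path, data):
        # Every member carries its own name and size, so no single
        # central table is load-bearing for identifying files.
        meta = json.dumps({"path": path, "size": len(data)}).encode()
        out.write(MLEN.pack(len(meta)))
        out.write(meta)
        out.write(data)

    def rebuild_index(inp):
        # Recover a path -> data offset index purely from the inline
        # metadata, without needing any central table to survive.
        index = {}
        while True:
            raw = inp.read(MLEN.size)
            if len(raw) < MLEN.size:
                return index
            meta = json.loads(inp.read(MLEN.unpack(raw)[0]))
            index[meta["path"]] = inp.tell()
            inp.seek(meta["size"], 1)

The cost is the obvious one: you spend space writing the same information more than once, and anything like a fast central index becomes a convenience that can be regenerated, not the only copy of the information.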
You can certainly create archive formats that are resilient this way. But it's unlikely to happen by accident or happenstance, which means that an archive format created without resilience in mind probably won't be all that resilient even if you try to make the software that processes it do its best to recover and continue in the face of damaged input.