2017-05-07
ZFS's zfs receive has no error recovery and what that implies
Generally, 'zfs send' and 'zfs receive' are a great thing about ZFS.
They give you easy, efficient, and reliable replication of ZFS
filesystems, which is good for both backups and transfers (and
testing). The combination is among my favorite ways of shuffling
filesystems around, right up there with dump and restore and ahead
of rsync. But there is an important caveat about zfs receive that
deserves to be more widely known, and that is that zfs receive makes
no attempt to recover from a partially damaged stream of data.
Formats and programs like tar, restore, and so on generally have
provisions for attempting to skip over damaged sections of the stream
of data they're supposed to restore. Sometimes this is explicitly
designed into the stream format; other times the programs just use
heuristics to try to re-synchronize a damaged stream at the next
recognizable object. All of these have the goal of letting you
recover at least something from a stream that's been stored and
damaged in that storage, whether it's a tarball that's experienced
some problems on disk or a dump image written to a tape that now has
a glitch.
zfs receive does not do any of this. If your stream is damaged in any
way, zfs receive aborts completely. In order to recover anything at
all from the stream, it must be perfect and completely undamaged.
This is generally okay if you're directly feeding a 'zfs send' to
'zfs receive'; an error means something bad is going on, and you can
retry the send immediately. This is not good if you are storing the
'zfs send' output in a file before using 'zfs receive'; any damage
or problem with the file makes it completely unrecoverable and thus
useless. If the 'zfs send' file is your only copy of the data, the
data is completely lost.
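To make the difference concrete, here is a sketch with hypothetical
pool, snapshot, and host names; the commands are ordinary 'zfs send'
and 'zfs receive' usage:

  # Direct pipeline: if the stream is damaged in transit, the
  # receive aborts and you simply rerun the whole command.
  zfs send -R tank/home@backup | ssh backuphost zfs receive -F backuppool/home

  # Stored stream: if backup.zfs picks up any damage while sitting
  # on disk, the later receive aborts and nothing is recoverable.
  zfs send -R tank/home@backup >/backups/backup.zfs
  zfs receive -F backuppool/home </backups/backup.zfs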
There are some situations where this lack of resilience for saved
send streams is merely annoying. If you're writing the 'zfs send'
output to a file on media as a way of transporting it around and the
file gets damaged, you can always redo the whole process (just as
you could redo a 'zfs send | zfs receive' pipeline). But in other
situations this is actively dangerous, for example if the file is
the only form of backup you have or the only copy of the data. Given
this issue, I now strongly discourage people from storing 'zfs send'
output unless they're sure they know what they're doing and they can
recover from restore failures.
(See eg this discussion or this one.)
I doubt that ZFS will ever change this behavior, partly because it
would probably require a change in the format of the send/receive
data stream. My impression is that ZFS people have never considered
it a good idea to store 'zfs send' streams, even if they never said
this very strongly in things like documentation.
(You also wouldn't want continuing past damage to be the default
behavior for 'zfs receive', at least in normal use. If you have a
corrupt stream and you can retry it, you definitely want to.)
A mistake I made when setting up my ZFS SSD pool on my home machine
I recently started actually using a pair of SSDs on my home machine,
and as part of that I set up a ZFS pool for my $HOME and other data
that I want to be fast (as planned). Unfortunately, I've since
realized that when I set that pool up I made a mistake of omission.
Some people who know ZFS can guess my mistake: I didn't force an
ashift setting, unlike what I did with my work ZFS pools.
(I half-fumbled several aspects of my ZFS pool setup, actually; for
example I forgot to turn on relatime and to set compression to on at
the pool level. But those other things I could fix after the fact,
although sometimes with a bit of pain.)
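For illustration, those after-the-fact fixes are just property
changes; 'ssdpool' here is a hypothetical pool name:

  # Both properties only take effect going forward; data that is
  # already written stays uncompressed until it gets rewritten.
  zfs set relatime=on ssdpool
  zfs set compression=on ssdpool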
Unlike spinning rust hard disks, SSDs don't really have a
straightforward physical sector size, and certainly not one that's
very useful for most filesystems (the SSD erase block size is
generally too large). So in practice their reported logical and
physical sector sizes are arbitrary, and some drives are even
switchable. As arbitrary numbers, SSDs report whatever their
manufacturer considers convenient. In my case, Crucial apparently
decided to make their MX300 750 GB SSDs report that they have 512
byte physical sectors. ZFS then followed its defaults and created my
pool with an ashift of 9, which means that I could run into problems
if I ever have to replace the SSDs.
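If you want to check where you stand, you can compare what the drives
claim with what ZFS actually did. On Linux this is something like the
following (the device and pool names are again hypothetical):

  # What the SSDs report as their logical and physical sector sizes:
  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda /dev/sdb

  # What ashift ZFS actually used for the pool's vdevs:
  zdb -C ssdpool | grep ashift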
(I'm actually a bit surprised that Crucial SSDs are set up this way;
I expected them to report as 4K advanced format drives, since HDs
have gone this way and some SSDs switched very abruptly. It's
possible that SSD vendors have decided that reporting 512 byte
sectors is the easiest or most compatible way forward, at least for
consumer SSDs, given that the sizes are arbitrary anyway.)
Unfortunately the only fix for this issue is to destroy the pool and
then recreate it (setting an explicit ashift this time around), which
means copying all the data out of it and then back into it. The
amount of work and hassle involved creates the temptation to not do
anything to the pool and just leave things as they are.
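The recreation itself is the easy part; assuming hypothetical device
names and an otherwise default setup, it is roughly:

  # ashift=12 forces 4 Kbyte sectors no matter what the drives
  # report; it can only be set when a pool (or vdev) is created.
  zpool destroy ssdpool
  zpool create -o ashift=12 ssdpool mirror /dev/sda1 /dev/sdb1

All the work is in getting the data out beforehand and back in
afterward.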
On the one hand, it's not guaranteed that I'll have problems in the
future. My SSDs might never break and need to be replaced, and if a
SSD does need to be replaced it might be that future consumer SSDs
will continue to report 512 byte physical sectors and so be perfectly
compatible with my current pool. On the other hand, this seems like a
risky bet to make, especially since based on my past history this ZFS
pool is likely to live a quite long time. My main LVM setup on my
current machine is now more than ten years old; I set it up in 2006
and have carried it forward ever since, complete with its ext3
filesystems; I see no reason why this ZFS pool won't be equally
durable. In ten years all of the SSDs may well report themselves as
4K physical sector drives simply because that's what all of the
(remaining) HDs will report and so that's what all of the software
expects.
Now is also my last good opportunity to fix this, because I haven't
put much data in my SSD pool yet and I still have the old pair of
500 GB system HDs in my machine. The 500 GB HDs could easily hold
the data from my SSD ZFS pool, so I could repartition them, set up a
temporary ZFS pool on them, reliably and efficiently copy everything
over to the scratch pool with 'zfs send' (which is generally easier
than rsync or the like), and then copy it all back later; a sketch of
this is below. If I delay, well, I should pull the old 500 GB disks
out and put the SSDs in their proper place (partly so they get some
real airflow to keep their temperatures down), and then things get
more difficult and annoying.
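That whole shuffle might look something like the following, with
'scratchpool' as a hypothetical name for the temporary pool on the
HDs. Note that both copies are direct 'zfs send | zfs receive'
pipelines, not stored streams, so the stored-stream caveat doesn't
come into play:

  # Copy everything over to the scratch pool on the old HDs:
  zfs snapshot -r ssdpool@migrate
  zfs send -R ssdpool@migrate | zfs receive -F scratchpool/ssdpool

  # ... destroy and recreate ssdpool with an explicit ashift ...

  # Then copy it all back:
  zfs snapshot -r scratchpool/ssdpool@back
  zfs send -R scratchpool/ssdpool@back | zfs receive -F ssdpool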
(I'm partly writing this entry to motivate myself into actually doing all of this. It's the right thing to do, I just have to get around to it.)