How you migrate ZFS filesystems matters
If you want to move a ZFS filesystem around from one host to another,
you have two general approaches; you can use 'zfs send' and 'zfs
receive', or you can use a user level copying tool such as rsync (or
'tar -cf - | tar -xf -', or any number of similar options). Until
recently, I had considered these two approaches to be more or less
equivalent apart from their convenience and speed (which generally
tilted in favour of '
zfs send'). It turns out that this is not
necessarily the case and there are situations where you will want
one instead of the other.
We have had two generations of ZFS fileservers so far, the Solaris
ones and the OmniOS ones.
When we moved from the first generation to the second generation,
we migrated filesystems across using '
zfs send', including the
filesystem with my home directory in it (we did this for various
reasons). Recently I discovered
that some old things in my filesystem didn't have file type
information in their directory entries. ZFS
has been adding file type information to directories for a long
time, but not quite as long as my home directory has been on ZFS.
This illustrates an important difference between the 'zfs send'
approach and the rsync approach, which is that zfs send doesn't
update or change at least some ZFS on-disk data structures, in
the way that re-writing them from scratch from user level does.
There are both positives and negatives to this, and a certain amount
of rewriting does happen even in the '
zfs send' case (for example,
all of the block pointers get changed, and ZFS
will re-compress your data as applicable).
I knew that in theory you had to copy things at the user level if
you wanted to make sure that your ZFS filesystem and everything in
it was fully up to date with the latest ZFS features. But I didn't
expect to hit a situation where it mattered in practice until, well,
I did. Now I suspect that old files on our old filesystems may be
partially missing a number of things, and I'm wondering how much
of the various changes in '
zfs upgrade -v' apply even to old data.
(I'd run into this sort of general thing before when I looked into ext3 to ext4 conversion on Linux.)
With all that said, I doubt this will change our plans for migrating our ZFS filesystems in the future (to our third generation fileservers). ZFS sending and receiving is just too convenient, too fast and too reliable to give up. Rsync isn't bad, but it's not the same, and so we only use it when we have to (when we're moving only some of the people in a filesystem instead of all of them, for example).
PS: I was going to try to say something about what '
zfs send' did
and didn't update, but having looked briefly at the code I've
concluded that I need to do more research before running my keyboard
off. In the mean time, you can read the OpenZFS wiki page on ZFS
send and receive,
which has plenty of juicy technical details.
PPS: Since eliminating all-zero blocks is a form of compression, you can turn zero-filled files into sparse files through a ZFS send/receive if the destination has compression enabled. As far as I know, genuine sparse files on the source will stay sparse through a ZFS send/receive even if they're sent to a destination with compression off.
ZFS quietly discards all-zero blocks, but only sometimes
On the ZFS on Linux mailing list, a question came up about whether
ZFS discards writes of all-zero blocks (as you'd get from 'dd
if=/dev/zero of=...'), turning them into holes in your files or,
especially, holes in your zvols. This is especially relevant for
zvols, because if ZFS behaves this way it provides you with a way
of returning a zvol to a sparse state from inside a virtual machine
(or other environment using the zvol):
  $ dd if=/dev/zero of=fillfile
  [... wait for the disk to fill up ...]
  $ rm -f fillfile
The answer turns out to be that ZFS does discard all-zero blocks
and turn them into holes, but only if you have some sort of compression
turned on (ie, that you don't have the default 'compression=off').
This isn't implemented as part of ZFS ZLE compression (or other
compression methods); instead, it's an entirely separate check that
looks only for an all-zero block and returns a special marker if
that's what it has. As you'd expect, this check is done before ZFS
tries whatever main compression algorithm you set.
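This ordering can be sketched in Python; this is a simplified model with made-up names, not the actual ZFS write path, and it only illustrates the logic described above:

```python
HOLE = object()  # stands in for ZFS's special 'this block is a hole' marker

def zle_compress(block: bytes) -> bytes:
    # placeholder for a real compression algorithm; the details don't
    # matter for this sketch
    return block

def write_block(block: bytes, compression: str = "off"):
    """Simplified model of how the all-zero check interacts with the
    compression setting."""
    if compression == "off":
        # no compression at all: all-zero blocks are written out in full
        return block
    # any compression setting: the all-zero check runs first, before
    # the configured compression algorithm ever sees the block
    if block.count(0) == len(block):
        return HOLE
    return zle_compress(block)

# With any compression enabled, an all-zero block becomes a hole;
# with compression off, it's written out as-is.
assert write_block(b"\0" * 4096, "zle") is HOLE
assert write_block(b"\0" * 4096, "off") == b"\0" * 4096
```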
Interestingly, there is a special compression level called 'empty'
(ZIO_COMPRESS_EMPTY) that only does this special 'discard
zeros' check. You can't set it from user level with something like
'compression=empty', but it's used internally in the ZFS code for
a few things. For instance, if you turn off metadata compression
with the zfs_mdcomp_disable tunable, metadata is still compressed
with this 'empty' compression. Comments in the current ZFS on Linux
source code suggest that ZFS relies on this to do things like discard
blocks in dnode object sets where all the
dnodes in the block are free (which apparently zeroes out the dnode).
There are two consequences of this. The first is that you should
always set at least ZLE compression on zvols, even if their
volblocksize is the same as your pool's
ashift block size and
so they can't otherwise benefit from compression (this would also apply to filesystems
if you set an equally small recordsize). The second is that it
reinforces how you should basically always turn compression on on
filesystems, even if you think you have mostly incompressible data.
Not only do you save space at the end of files, but you get to drop any all-zero
sections of sparse or pseudo-sparse files.
I took a quick look back through the history of ZFS's code, and as
far as I could see, this zero-block discarding has always been
there, right back to the beginnings of compression (which I believe
came in with ZFS itself).
ZIO_COMPRESS_EMPTY doesn't quite date
back that far; instead, it was introduced along with
zfs_mdcomp_disable, back in 2006.
(All of this is thanks to Gordan Bobic for raising the question in reply to me when I was confidently wrong, which led to me actually looking it up in the code.)
A little bit of the one-time MacOS version still lingers in ZFS
Once upon a time, Apple came very close to releasing ZFS as part of MacOS. Apple did this work in its own copy of the ZFS source base (as far as I know), but the people in Sun knew about it and it turns out that even today there is one little lingering sign of this hoped-for and perhaps prepared-for ZFS port in the ZFS source code. Well, sort of, because it's not quite in code.
Lurking in the function that reads ZFS directories to turn (ZFS) directory entries into the filesystem independent format that the kernel wants is the following comment:
  objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
  /*
   * MacOS X can extract the object type here such as:
   * uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
   */
(Specifically, this is in zfs_readdir in zfs_vnops.c.)
ZFS maintains file type information in directories. This information can't be used on Solaris
(and thus Illumos), where the overall kernel doesn't have this in
its filesystem independent directory entry format, but it could
have been on MacOS ('Darwin'), because MacOS is among the Unixes
that support d_type in their directory entries. The comment
itself dates all the way back to this 2007 commit,
which includes the change 'reserve bits in directory entry for file
type', which created the whole setup for this.
I don't know if this file type support was added specifically to help out Apple's MacOS X port of ZFS, but it's certainly possible, and in 2007 it seems likely that this port was at least on the minds of ZFS developers. It's interesting but understandable that FreeBSD didn't seem to have influenced them in the same way, at least as far as comments in the source code go; this file type support is equally useful for FreeBSD, and the FreeBSD ZFS port dates to 2007 too (per this announcement).
Regardless of the exact reason that ZFS picked up maintaining file type information in directory entries, it's quite useful for people on both FreeBSD and Linux that it does so. File type information is useful for any number of things and ZFS filesystems can (and do) provide this information on those Unixes, which helps make ZFS feel like a truly first class filesystem, one that supports all of the expected general system features.
How ZFS maintains file type information in directories
As an aside in yesterday's history of file type information being available in Unix directories, I mentioned that it was possible for a filesystem to support this even though its Unix didn't. By supporting it, I mean that the filesystem maintains this information in its on disk format for directories, even though the rest of the kernel will never ask for it. This is what ZFS does.
(One reason to do this in a filesystem is future-proofing it against a day when your Unix might decide to support this in general; another is if you ever might want the filesystem to be a first class filesystem in another Unix that does support this stuff. In ZFS's case, I suspect that the first motivation was larger than the second one.)
The easiest way to see that ZFS does this is to use
zdb to dump
a directory. I'm going to do this on an OmniOS machine, to make it
more convincing, and it turns out that this has some interesting
results. Since this is OmniOS, we don't have the convenience of
just naming a directory in
zdb, so let's find the root directory
of a filesystem, starting from dnode 1 (as seen before).
  # zdb -dddd fs3-corestaff-01/h/281 1
  Dataset [....]
  [...]
      microzap: 512 bytes, 4 entries
  [...]
          ROOT = 3

  # zdb -dddd fs3-corestaff-01/h/281 3
      Object  lvl   iblk   dblk  dsize  lsize   %full  type
           3    1    16K     1K     8K     1K  100.00  ZFS directory
  [...]
      microzap: 1024 bytes, 8 entries

          RESTORED = 4396504 (type: Directory)
          ckstst = 12017 (type: not specified)
          ckstst3 = 25069 (type: Directory)
          .demo-file = 5832188 (type: Regular File)
          .peergroup = 12590 (type: not specified)
          cks = 5 (type: not specified)
          cksimap1 = 5247832 (type: Directory)
          .diskuse = 12016 (type: not specified)
          ckstst2 = 12535 (type: not specified)
This is actually an old filesystem (it dates from Solaris 10 and
has been transferred around with '
zfs send | zfs recv' since then),
but various home directories for real and test users have been
created in it over time (you can probably guess which one is the
oldest one). Sufficiently old directories and files have no file
type information, but more recent ones have this information,
including .demo-file, which I made just now so this would have
an entry that was a regular file with type information.
Once I dug into it, this turned out to be a change introduced (or
activated) in ZFS filesystem version 2, which is described in 'zfs
upgrade -v' as 'enhanced directory entries'. As an actual change
in (Open)Solaris, it dates from mid 2007, although I'm not sure
what Solaris release it made it into. The upshot is that if you
made your ZFS filesystem any time in the last decade, you'll have
this file type information in your directories.
How ZFS stores this file type information is interesting and clever,
especially when it comes to backwards compatibility. I'll start by
quoting the relevant comment from the ZFS source code:

  /*
   * The directory entry has the type (currently unused on
   * Solaris) in the top 4 bits, and the object number in
   * the low 48 bits.  The "middle" 12 bits are unused.
   */
In yesterday's entry I said that Unix directory entries need to store at least the filename and the inode number of the file. What ZFS is doing here is reusing the 64 bit field used for the 'inode' (the ZFS dnode number) to also store the file type, because it knows that object numbers have only a limited range. This also makes old directory entries compatible, by making type 0 (all 4 bits 0) mean 'not specified'. Since old directory entries only stored the object number and the object number is 48 bits or less, the higher bits are guaranteed to be all zero.
(It seems common to define
DT_UNKNOWN to be 0; both FreeBSD
and Linux do it.)
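The bit manipulation involved can be sketched in Python, modeled on the ZFS_DIRENT_OBJ and ZFS_DIRENT_TYPE macros seen earlier (the packing helper here is my own invention for illustration):

```python
OBJ_MASK = (1 << 48) - 1        # low 48 bits: the object number
TYPE_SHIFT = 60                 # top 4 bits: the file type

def make_dirent(obj: int, dtype: int = 0) -> int:
    """Pack an object number and a file type into the single 64-bit
    directory entry value (illustrative helper, not a real ZFS name)."""
    assert obj <= OBJ_MASK      # object numbers are 48 bits or less
    return (dtype << TYPE_SHIFT) | obj

def dirent_obj(entry: int) -> int:
    return entry & OBJ_MASK     # like ZFS_DIRENT_OBJ

def dirent_type(entry: int) -> int:
    return entry >> TYPE_SHIFT  # like ZFS_DIRENT_TYPE; 0 = not specified

# A new-style entry round-trips both fields (8 is DT_REG on many systems):
e = make_dirent(5832188, dtype=8)
assert dirent_obj(e) == 5832188 and dirent_type(e) == 8

# An old entry that stored only the object number decodes as type 0,
# ie 'not specified', which is what makes the scheme backward compatible:
assert dirent_type(5832188) == 0
```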
The reason this needed a new ZFS filesystem version is now clear. If you tried to read directory entries with file type information on a version of ZFS that didn't know about them, the old version would likely see crazy (and non-existent) object numbers and nothing would work. In order to even read a 'file type in directory entries' filesystem, you need to know to only look at the low 48 bits of the object number field in directory entries.
(As before, I consider this a neat hack that cleverly uses some properties of ZFS and the filesystem to its advantage.)
Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem
The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.
The symptoms are less severe, in that the fileserver in question
only got fairly unresponsive to NFS (especially to the machine that
the writes were coming from) instead of locking up. This was somewhat
variable and may have primarily affected the particular filesystem
or perhaps the particular pool it's in, instead of all of the
filesystems and pools on the fileserver; I didn't attempt to gather
this data during the recent incident where I re-observed this, but
certainly some machines could still do things like issue df's
against the fileserver.
(This was of course our biggest fileserver.)
During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.
Our DTrace stuff did report (very) long NFS operations and that report eventually led me to the source and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.
How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up to date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.
(We will probably not attempt to investigate this at all on our current fileservers, since our next generation of fileservers will not run any version of Illumos.)
Some things on Illumos NFS export permissions
Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.
I'll start with our ZFS sharenfs settings and then break them down.
The two most ornate ones we have use the AAA and BBB netgroups for
access, plus nfs_root and nfs_oldmail as root= netgroups. The AAA
and BBB netgroups don't overlap with each other, and nfs_root and
nfs_oldmail are both subsets of our general access netgroups.
The first slightly tricky bit is root=. As the manual page explains
in the NOTES section, what root= does is change the interpretation
of UID 0 for clients that are already allowed to read or write to
the NFS share. Per the manual page, 'the access the host gets is
the same as when the root= option is absent' (and this may include
no access). As a corollary, 'root=NG,ro=NG' is basically the same
as 'ro=NG,anon=0'. Since our root= netgroups are a subset of our
general allowed-access netgroups, we're okay here.
(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)
The next tricky bit is the interaction of rw= and ro=. Until
just now I would have told you that rw= took priority over ro=
if you had a host that was included in both (via different netgroups),
but it turns out that whichever one is first takes priority.
We were getting rw-over-ro effects because we always listed rw=
first, but I don't think we necessarily understood that when we
wrote the second sharenfs setting. The manual page is explicit
that if rw= and ro= options are specified in the same sec= clause,
and a client is in both lists, the order of the two options determines
the access the client gets.
(Note that the behavior is different if you use the general ro and
rw options rather than the host-specific rw= and ro= lists. See
the manpage.)
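As a purely illustrative example (these are not our real settings), a host that is in both of the netgroups below gets different access from these two otherwise identical sharenfs values, purely because of option order:

```
sharenfs="sec=sys,rw=nfs_ssh,ro=AAA"   # a host in both netgroups gets read-write
sharenfs="sec=sys,ro=AAA,rw=nfs_ssh"   # the same host now gets only read-only
```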
We would have noticed if we flipped this around for the one filesystem
with both ro= and rw= groups, since the machine that was
supposed to be able to write to the filesystem would have failed
(and the failure would have stalled our mail system). But it's still
sort of a narrow escape.
What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.
Sidebar: Our general narrow miss on this
We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:
  # global share options
  shareopts  nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

  # our SAN filesystems:
  fs3-corestaff-01   /h/281   rw+=AAA
This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.
The actual code is written in Python and turns all of the NFS exports
options into Python dictionary keys and values. Python dictionaries
are unordered, so under normal circumstances reassembling the exports
options would have put them into some random order, so anything
with rw= and ro= could have wound up in the wrong order.
However, conveniently I decided to put the NFS export options into
a canonical order when I converted them back to string form, and
that order put rw= before ro= (and sec=sys before both). There's
no sign in my code comments that I knew this was important; it seems
to have just been what I thought of as the correct canonical ordering.
Possibly I was blindly copying and preserving earlier work where
we always had rw= first.
Understanding ZFS System Attributes
Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.
Like most filesystems, ZFS puts these attributes in dnodes using
some extra space (in what is called the dnode 'bonus buffer').
However, the ZFS trick is that whatever system attributes a dnode
has are simply packed into that space without being organized into
formal structures with a fixed order of attributes. Code that uses
system attributes retrieves them from dnodes indirectly by asking
for, say, the
ZPL_PARENT of a dnode; it never cares exactly how
they're packed into a given dnode. However, obviously something
has to keep track of how the attributes are actually packed into
each dnode.
One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.
(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
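A minimal sketch of this layout-numbering idea in Python (a toy model with invented names, not the real sa.c code):

```python
class SARegistry:
    """Toy model of ZFS system-attribute layouts: each distinct set
    (and order) of attributes gets a small layout number, and a dnode
    then stores just that number plus the packed values."""

    def __init__(self):
        self.layouts = {}    # tuple of attribute names -> layout number
        self.by_number = {}  # layout number -> tuple of attribute names

    def pack(self, attrs: dict):
        layout = tuple(attrs)           # attribute names, in their order
        num = self.layouts.get(layout)
        if num is None:
            # first time this combination is written out:
            # register a new layout on the fly
            num = len(self.layouts)
            self.layouts[layout] = num
            self.by_number[num] = layout
        return num, [attrs[name] for name in layout]

    def unpack(self, num, values):
        return dict(zip(self.by_number[num], values))

reg = SARegistry()
n1, v1 = reg.pack({"ZPL_MODE": 0o644, "ZPL_UID": 1000, "ZPL_PARENT": 3})
n2, v2 = reg.pack({"ZPL_MODE": 0o755, "ZPL_UID": 0, "ZPL_PARENT": 3})
assert n1 == n2                 # the same attribute set reuses one layout
assert reg.unpack(n1, v1)["ZPL_UID"] == 1000
```

The space win is visible here: the per-dnode cost is one small layout number instead of a full set of attribute name/value keys.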
I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.
As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.
By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.
How ZFS makes things like '
zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run 'zfs
diff', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it comes up in '
zpool status' reports about permanent
errors. At one level,
zfs diff and
zpool status do this so rapidly because they ask the ZFS code in
the kernel to do it for them. At another level, the question is how
the kernel's ZFS code can be so fast.
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this with zdb:

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1285414    1   128K    512      0     512    512    0.00  ZFS plain file
  [...]
          parent  1284472
  [...]

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1284472    1   128K    512      0     512    512  100.00  ZFS directory
  [...]
          parent  52906
  [...]
      microzap: 512 bytes, 1 entries
         b = 1285414 (type: Regular File)
The b file has a parent field that points to the a directory it's
in, and the a directory has a parent field that points to cks/tmp,
and so on. When the kernel wants to get the name for a given object
number, it can just fetch the object, look at its parent, and start
going back up the filesystem.
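In toy form, the upward walk might look like this (a sketch only; the real kernel code works on ZAP objects and searches each parent directory by object number, and the dnodes dictionary here is invented):

```python
# Toy model: dnodes as {objnum: {"parent": objnum, "entries": {name: objnum}}}
def name_of(dnodes, objnum, root=1):
    """Reconstruct a path by following 'parent' pointers upward and
    looking our object number up in each parent directory, roughly
    what the kernel does for zfs diff / zpool status (normal case only)."""
    parts = []
    while objnum != root:
        parent = dnodes[objnum]["parent"]
        # find our name among the parent directory's entries
        name = next(n for n, o in dnodes[parent]["entries"].items()
                    if o == objnum)
        parts.append(name)
        objnum = parent
    return "/" + "/".join(reversed(parts))

dnodes = {
    1: {"parent": 1, "entries": {"cks": 2}},
    2: {"parent": 1, "entries": {"tmp": 3}},
    3: {"parent": 2, "entries": {"a": 4}},
    4: {"parent": 3, "entries": {"b": 5}},
    5: {"parent": 4, "entries": {}},          # the file itself
}
assert name_of(dnodes, 5) == "/cks/tmp/a/b"
```

The cost is one object fetch per path component, independent of filesystem size, which is why this is so much faster than searching the whole filesystem.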
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the relevant comment from the ZFS code:

  When a link is removed [the file's] parent pointer is not
  changed and will be invalid. There are two cases where a
  link is removed but the file stays around, when it goes to
  the delete queue and when there are additional links.
Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do
most hardlinks are done in ways that won't break ZFS's parent
tracking. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefits
we get from having it around are more than worth the infrequent
cases where it fails or malfunctions. Both
zfs diff and having
filenames show up in
zpool status permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's
parent to point to the new
directory. Often this will wind up with a correct parent
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case the parent will be correct and you'll get the right name.
The time when you get an incorrect
parent is this sequence:
  ; mkdir a b
  ; touch a/demo
  ; ln a/demo b/
  ; rm b/demo
a/demo is the remaining path, but
demo's dnode will claim
that its parent is
b. I believe that
zfs diff will even report
this as the path, because the kernel doesn't do the extra work
to scan the
b directory to verify that
demo is present in it.
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)
What ZFS block pointers are and what's in them
I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.
The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Just like block numbers but unlike things like ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on disk data format and an in memory structure that shows up in other things. To quote from the (draft and old) ZFS on-disk specification (PDF):
A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.
Block pointers are embedded in any ZFS on disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).
So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:
- various metadata and flags about what the block pointer is for and
what parts of it mean, including what type of object it points to.
- Up to three DVAs that say where to actually
find the data on disk. There can be more than one DVA because you
may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).
- The logical size (size before compression) and 'physical' size (the
nominal size after compression) of the disk block. The physical
size can do odd things and is not
necessarily the asize (allocated size) for the DVA(s).
- The txgs that the block was born in, both logically
and physically (the physical txg apparently applies to the DVAs). The
physical txg was added with ZFS deduplication but apparently also
shows up in vdev removal.
- The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.
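The fields above can be modeled roughly in Python, for orientation only; the real blkptr_t is a packed 128-byte C structure whose exact bit layout lives in spa.h, and the field names here are my own:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DVA:
    """Where one copy of the block lives: vdev, offset, allocated size."""
    vdev: int
    offset: int
    asize: int

@dataclass
class BlockPointer:
    """Rough model of the block pointer fields discussed above."""
    obj_type: int              # what type of object this points to
    dvas: List[DVA]            # one to three on-disk copies of the data
    lsize: int                 # logical (uncompressed) size
    psize: int                 # nominal 'physical' size after compression
    logical_birth_txg: int
    physical_birth_txg: int
    checksum: bytes            # covers the full logical size of the data

bp = BlockPointer(obj_type=19, dvas=[DVA(0, 0x4000, 4096)],
                  lsize=131072, psize=4096,
                  logical_birth_txg=123456, physical_birth_txg=123456,
                  checksum=b"\0" * 32)
assert len(bp.dvas) <= 3 and bp.psize <= bp.lsize
```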
Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.
(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)
There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.
Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).
As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.
However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
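The txg-based pruning can be sketched like this (a toy model of the traversal, with an invented nested-dictionary tree standing in for block pointers):

```python
def changed_objects(tree, since_txg):
    """Walk a toy block-pointer tree depth-first, pruning any subtree
    whose birth txg is <= since_txg; copy on write guarantees that
    nothing below such a pointer has changed since that txg."""
    changed = []
    def walk(bp):
        if bp["birth"] <= since_txg:
            return                      # whole subtree unchanged: skip it
        if "obj" in bp:
            changed.append(bp["obj"])   # a leaf: a changed object (dnode)
        for child in bp.get("children", []):
            walk(child)
    walk(tree)
    return changed

tree = {"birth": 90, "children": [
    {"birth": 50, "children": [{"birth": 50, "obj": 7}]},   # untouched since txg 50
    {"birth": 90, "children": [{"birth": 90, "obj": 12}]},  # modified in txg 90
]}
assert changed_objects(tree, since_txg=80) == [12]
```

Note that what falls out of this walk is object numbers, not paths, which is exactly the limitation described above.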
(ZFS has an interesting hack to make things like 'zfs diff' work far more efficiently than you would expect in light of this, but that's going to take yet another entry to cover.)
A broad overview of how ZFS is structured on disk
When I wrote yesterday's entry, it became clear that I didn't understand as much as I thought about how ZFS is structured on disk (and that this matters, since I thought that ZFS copy on write updates updated a lot more than they do). So today I want to write down my new broad understanding of how this works.
(All of this can be dug out of the old, draft ZFS on-disk format specification, but that spec is written in a very detailed way and things aren't always immediately clear from it.)
Almost everything in ZFS is a DMU object. All objects are defined by a dnode, and object dnodes are almost always grouped together in an object set. Object sets are themselves DMU objects; they store dnodes as basically a giant array in a 'file', which uses data blocks and indirect blocks and so on, just like anything else. Within a single object set, dnodes have an object number, which is the index of their position in the object set's array of dnodes.
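A minimal sketch of this arrangement (everything here is invented for illustration, not real ZFS code) is an array of dnodes where the object number is just an index, which also shows why object numbers are only meaningful relative to a particular object set:

```python
# Toy model of an object set: a growable array of dnodes, where an
# object number is simply the dnode's index in that array. Two
# different object sets will normally both have, say, an object
# number 10, referring to entirely unrelated objects.

class ObjectSet:
    def __init__(self):
        self._dnodes = []             # the 'giant array' of dnodes
    def allocate(self, dnode):
        self._dnodes.append(dnode)
        return len(self._dnodes) - 1  # the new object's object number
    def lookup(self, objnum):
        return self._dnodes[objnum]

fs_a, fs_b = ObjectSet(), ObjectSet()
for i in range(16):
    fs_a.allocate(("fs_a dnode", i))
    fs_b.allocate(("fs_b dnode", i))

# Object number 10 exists in both object sets but names different things.
assert fs_a.lookup(10) != fs_b.lookup(10)
```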
(Because an object number is just the index of the object's dnode in its object set's array of dnodes, object numbers are basically always going to be duplicated between object sets (and they're always relative to an object set). For instance, pretty much every object set is going to have an object number ten, although not all object sets may have enough objects to reach much larger object numbers. One corollary of this is that if you ask zdb to tell you about a given object number, you have to tell zdb what object set you're talking about. Usually you do this by telling zdb which ZFS filesystem or dataset you mean.)
Each ZFS filesystem has its own object set for objects (and thus dnodes) used in the filesystem. As I discovered yesterday, every ZFS filesystem has a directory hierarchy and it may go many levels deep, but all of this directory hierarchy refers to directories and files using their object number.
ZFS organizes and keeps track of filesystems, clones, and snapshots through the DSL (Dataset and Snapshot Layer). The DSL has all sorts of things: DSL directories, DSL datasets, and so on, all of which are objects and many of which refer to object sets (for example, every ZFS filesystem must refer to its current object set somehow). All of these DSL objects are themselves stored as dnodes in another object set, the Meta Object Set, which the uberblock points to. To my surprise, object sets are not stored in the MOS (and as a result do not have 'object numbers'). Object sets are always referred to directly, without indirection, using a block pointer to the object set's dnode.
(I think object sets are referred to directly so that snapshots can freeze their object set very simply.)
The DSL directories and datasets for your pool's set of filesystems form a tree themselves (each filesystem has a DSL directory and at least one DSL dataset). However, just like in ZFS filesystems, all of the objects in this second tree refer to each other indirectly, by their MOS object number. Just as with files in ZFS filesystems, this level of indirection limits the number of copy on write updates that ZFS has to do when something changes.
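The two styles of reference can be sketched side by side. In this toy model (all class names and the MOS layout are invented for illustration), DSL objects point at each other through MOS object numbers, while a dataset points at its object set directly via a block pointer:

```python
# Toy model of the DSL's two reference styles: indirect (MOS object
# numbers between DSL objects) and direct (a block pointer from a
# DSL dataset to its object set). Names are invented for illustration.

mos = {}  # MOS object number -> DSL object

class DSLDataset:
    def __init__(self, objset_blkptr):
        # Direct reference: a block pointer, not an object number,
        # which is part of what lets snapshots freeze an object set.
        self.objset_blkptr = objset_blkptr

class DSLDirectory:
    def __init__(self, dataset_objnum):
        # Indirect reference: just a MOS object number.
        self.dataset_objnum = dataset_objnum

mos[2] = DSLDataset(objset_blkptr="<blkptr to fs object set>")
mos[1] = DSLDirectory(dataset_objnum=2)

# Following the chain: DSL directory -> (via the MOS) DSL dataset
# -> object set, with the last hop being a direct block pointer.
ds = mos[mos[1].dataset_objnum]
assert ds.objset_blkptr == "<blkptr to fs object set>"
```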
PS: If you want to examine MOS objects with zdb, I think you do it with something like 'zdb -vvv -d ssddata 1', which will get you object number 1 of the MOS, which is the MOS object directory. If you want to ask zdb about an object in the pool's root filesystem, you use 'zdb -vvv -d ssddata/ 1'. You can tell which one you're getting depending on what zdb prints out. If it says 'Dataset mos [META]' you're looking at objects from the MOS; if it says 'Dataset ssddata [ZPL]', you're looking at the pool's root filesystem (where object number 1 is the ZFS master node).
PPS: I was going to write up what changed on a filesystem write, but then I realized that I didn't know how blocks being allocated and freed are reflected in pool structures. So I'll just say that I think that ignoring free space management, only four DMU objects get updated: the file itself, the filesystem's object set, the filesystem's DSL dataset object, and the MOS.
(As usual, doing the research to write this up taught me things that I didn't know about ZFS.)