Wandering Thoughts


How you migrate ZFS filesystems matters

If you want to move a ZFS filesystem around from one host to another, you have two general approaches; you can use 'zfs send' and 'zfs receive', or you can use a user level copying tool such as rsync (or 'tar -cf | tar -xf', or any number of similar options). Until recently, I had considered these two approaches to be more or less equivalent apart from their convenience and speed (which generally tilted in favour of 'zfs send'). It turns out that this is not necessarily the case and there are situations where you will want one instead of the other.

We have had two generations of ZFS fileservers so far, the Solaris ones and the OmniOS ones. When we moved from the first generation to the second generation, we migrated filesystems across using 'zfs send', including the filesystem with my home directory in it (we did this for various reasons). Recently I discovered that some old things in my filesystem didn't have file type information in their directory entries. ZFS has been adding file type information to directories for a long time, but not quite as long as my home directory has been on ZFS.

This illustrates an important difference between the 'zfs send' approach and the rsync approach, which is that zfs send doesn't update or change at least some ZFS on-disk data structures, in the way that re-writing them from scratch from user level does. There are both positives and negatives to this, and a certain amount of rewriting does happen even in the 'zfs send' case (for example, all of the block pointers get changed, and ZFS will re-compress your data as applicable).

I knew that in theory you had to copy things at the user level if you wanted to make sure that your ZFS filesystem and everything in it was fully up to date with the latest ZFS features. But I didn't expect to hit a situation where it mattered in practice until, well, I did. Now I suspect that old files on our old filesystems may be partially missing a number of things, and I'm wondering how much of the various changes in 'zfs upgrade -v' apply even to old data.

(I'd run into this sort of general thing before when I looked into ext3 to ext4 conversion on Linux.)

With all that said, I doubt this will change our plans for migrating our ZFS filesystems in the future (to our third generation fileservers). ZFS sending and receiving is just too convenient, too fast and too reliable to give up. Rsync isn't bad, but it's not the same, and so we only use it when we have to (when we're moving only some of the people in a filesystem instead of all of them, for example).

PS: I was going to try to say something about what 'zfs send' did and didn't update, but having looked briefly at the code I've concluded that I need to do more research before running my keyboard off. In the meantime, you can read the OpenZFS wiki page on ZFS send and receive, which has plenty of juicy technical details.

PPS: Since eliminating all-zero blocks is a form of compression, you can turn zero-filled files into sparse files through a ZFS send/receive if the destination has compression enabled. As far as I know, genuine sparse files on the source will stay sparse through a ZFS send/receive even if they're sent to a destination with compression off.

ZFSSendRecvVsRsync written at 00:17:58


ZFS quietly discards all-zero blocks, but only sometimes

On the ZFS on Linux mailing list, a question came up about whether ZFS discards writes of all-zero blocks (as you'd get from 'dd if=/dev/zero of=...'), turning them into holes in your files or, especially, holes in your zvols. This is especially relevant for zvols, because if ZFS behaves this way it provides you with a way of returning a zvol to a sparse state from inside a virtual machine (or other environment using the zvol):

$ dd if=/dev/zero of=fillfile
[... wait for the disk to fill up ...]
$ rm -f fillfile

The answer turns out to be that ZFS does discard all-zero blocks and turn them into holes, but only if you have some sort of compression turned on (ie, that you don't have the default 'compression=off'). This isn't implemented as part of ZFS ZLE compression (or other compression methods); instead, it's an entirely separate check that looks only for an all-zero block and returns a special marker if that's what it has. As you'd expect, this check is done before ZFS tries whatever main compression algorithm you set.
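As a rough illustration of the ordering described here, the following Python sketch models the write path as "check for an all-zero block first, then run the configured compressor". The function name, the HOLE marker, and the use of zlib as a stand-in compressor are all assumptions made for illustration; this is not how the actual ZFS code is laid out.

```python
import zlib

HOLE = None  # hypothetical marker meaning "store nothing, this block is a hole"


def prepare_block(data: bytes, compression: str):
    """Toy model of the decision described above; all names are illustrative."""
    if compression == "off":
        # With compression=off there is no zero check at all,
        # so an all-zero block is written out in full.
        return data
    if not any(data):
        # The separate all-zero check runs before the main compressor.
        return HOLE
    compressed = zlib.compress(data)  # stand-in for whatever algorithm is set
    # Keep the compressed form only if it actually saves space.
    return compressed if len(compressed) < len(data) else data
```

In this model a 4 KB block of zeros becomes a hole under any compression setting other than 'off', and is stored unchanged when compression is off.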

Interestingly, there is a special compression level called 'empty' (ZIO_COMPRESS_EMPTY) that only does this special 'discard zeros' check. You can't set it from user level with something like 'compression=empty', but it's used internally in the ZFS code for a few things. For instance, if you turn off metadata compression with the zfs_mdcomp_disable tunable, metadata is still compressed with this 'empty' compression. Comments in the current ZFS on Linux source code suggest that ZFS relies on this to do things like discard blocks in dnode object sets where all the dnodes in the block are free (which apparently zeroes out the dnode).

There are two consequences of this. The first is that you should always set at least ZLE compression on zvols, even if their volblocksize is the same as your pool's ashift block size and so they can't otherwise benefit from compression (this would also apply to filesystems if you set an ashift-sized recordsize). The second is that it reinforces how you should basically always turn compression on on filesystems, even if you think you have mostly incompressible data. Not only do you save space at the end of files, but you get to drop any all-zero sections of sparse or pseudo-sparse files.

(Looking back, Richard Laager mentioned this zero block discarding for zvols back in a comment on this entry of mine, but apparently it didn't stick in my mind. Also, now I know the details.)

I took a quick look back through the history of ZFS's code, and as far as I could see, this zero-block discarding has always been there, right back to the beginnings of compression (which I believe came in with ZFS itself). ZIO_COMPRESS_EMPTY doesn't quite date back that far; instead, it was introduced along with zfs_mdcomp_disable, back in 2006.

(All of this is thanks to Gordan Bobic for raising the question in reply to me when I was confidently wrong, which led to me actually looking it up in the code.)

ZFSZeroBlockDiscarding written at 00:33:46


A little bit of the one-time MacOS version still lingers in ZFS

Once upon a time, Apple came very close to releasing ZFS as part of MacOS. Apple did this work in its own copy of the ZFS source base (as far as I know), but the people in Sun knew about it and it turns out that even today there is one little lingering sign of this hoped-for and perhaps prepared-for ZFS port in the ZFS source code. Well, sort of, because it's not quite in code.

Lurking in the function that reads ZFS directories to turn (ZFS) directory entries into the filesystem independent format that the kernel wants is the following comment:

 objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
 /*
  * MacOS X can extract the object type here such as:
  * uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
  */

(Specifically, this is in zfs_readdir in zfs_vnops.c .)

ZFS maintains file type information in directories. This information can't be used on Solaris (and thus Illumos), where the overall kernel doesn't have this in its filesystem independent directory entry format, but it could have been on MacOS ('Darwin'), because MacOS is among the Unixes that support d_type. The comment itself dates all the way back to this 2007 commit, which includes the change 'reserve bits in directory entry for file type', which created the whole setup for this.

I don't know if this file type support was added specifically to help out Apple's MacOS X port of ZFS, but it's certainly possible, and in 2007 it seems likely that this port was at least on the minds of ZFS developers. It's interesting but understandable that FreeBSD didn't seem to have influenced them in the same way, at least as far as comments in the source code go; this file type support is equally useful for FreeBSD, and the FreeBSD ZFS port dates to 2007 too (per this announcement).

Regardless of the exact reason that ZFS picked up maintaining file type information in directory entries, it's quite useful for people on both FreeBSD and Linux that it does so. File type information is useful for any number of things and ZFS filesystems can (and do) provide this information on those Unixes, which helps make ZFS feel like a truly first class filesystem, one that supports all of the expected general system features.

ZFSDTypeAndMacOS written at 21:24:29

How ZFS maintains file type information in directories

As an aside in yesterday's history of file type information being available in Unix directories, I mentioned that it was possible for a filesystem to support this even though its Unix didn't. By supporting it, I mean that the filesystem maintains this information in its on disk format for directories, even though the rest of the kernel will never ask for it. This is what ZFS does.

(One reason to do this in a filesystem is future-proofing it against a day when your Unix might decide to support this in general; another is if you ever might want the filesystem to be a first class filesystem in another Unix that does support this stuff. In ZFS's case, I suspect that the first motivation was larger than the second one.)

The easiest way to see that ZFS does this is to use zdb to dump a directory. I'm going to do this on an OmniOS machine, to make it more convincing, and it turns out that this has some interesting results. Since this is OmniOS, we don't have the convenience of just naming a directory in zdb, so let's find the root directory of a filesystem, starting from dnode 1 (as seen before).

# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
    microzap: 512 bytes, 4 entries
         ROOT = 3 

# zdb -dddd fs3-corestaff-01/h/281 3
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        3    1    16K     1K     8K     1K  100.00  ZFS directory
    microzap: 1024 bytes, 8 entries

         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         ckstst3 = 25069 (type: Directory)
         .demo-file = 5832188 (type: Regular File)
         .peergroup = 12590 (type: not specified)
         cks = 5 (type: not specified)
         cksimap1 = 5247832 (type: Directory)
         .diskuse = 12016 (type: not specified)
         ckstst2 = 12535 (type: not specified)

This is actually an old filesystem (it dates from Solaris 10 and has been transferred around with 'zfs send | zfs recv' since then), but various home directories for real and test users have been created in it over time (you can probably guess which one is the oldest one). Sufficiently old directories and files have no file type information, but more recent ones have this information, including .demo-file, which I made just now so this would have an entry that was a regular file with type information.
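If you want to check a larger directory for old entries without reading the zdb dump by eye, something like the following Python sketch could pick them out. It assumes the zdb output looks exactly like the dump above ('name = object (type: ...)'); the function name is made up, and other zdb versions may format things differently.

```python
import re

# Matches zdb directory entry lines such as:
#   ckstst = 12017 (type: not specified)
ENTRY_RE = re.compile(
    r"^\s*(?P<name>.+?) = (?P<obj>\d+) \(type: (?P<type>[^)]+)\)\s*$"
)


def untyped_entries(zdb_output: str):
    """Return the names of entries whose type is 'not specified'."""
    names = []
    for line in zdb_output.splitlines():
        m = ENTRY_RE.match(line)
        if m and m.group("type") == "not specified":
            names.append(m.group("name"))
    return names


sample = """\
         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         .demo-file = 5832188 (type: Regular File)
         cks = 5 (type: not specified)
"""
print(untyped_entries(sample))  # prints ['ckstst', 'cks']
```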

Once I dug into it, this turned out to be a change introduced (or activated) in ZFS filesystem version 2, which is described in 'zfs upgrade -v' as 'enhanced directory entries'. As an actual change in (Open)Solaris, it dates from mid 2007, although I'm not sure what Solaris release it made it into. The upshot is that if you made your ZFS filesystem any time in the last decade, you'll have this file type information in your directories.

How ZFS stores this file type information is interesting and clever, especially when it comes to backwards compatibility. I'll start by quoting the comment from zfs_znode.h:

 * The directory entry has the type (currently unused on
 * Solaris) in the top 4 bits, and the object number in
 * the low 48 bits.  The "middle" 12 bits are unused.

In yesterday's entry I said that Unix directory entries need to store at least the filename and the inode number of the file. What ZFS is doing here is reusing the 64 bit field used for the 'inode' (the ZFS dnode number) to also store the file type, because it knows that object numbers have only a limited range. This also makes old directory entries compatible, by making type 0 (all 4 bits 0) mean 'not specified'. Since old directory entries only stored the object number and the object number is 48 bits or less, the higher bits are guaranteed to be all zero.

(It seems common to define DT_UNKNOWN to be 0; both FreeBSD and Linux do it.)

The reason this needed a new ZFS filesystem version is now clear. If you tried to read directory entries with file type information on a version of ZFS that didn't know about them, the old version would likely see crazy (and non-existent) object numbers and nothing would work. In order to even read a 'file type in directory entries' filesystem, you need to know to only look at the low 48 bits of the object number field in directory entries.
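To make the bit layout concrete, here is a small Python sketch of packing and unpacking the 64-bit directory entry value as the quoted comment describes it (type in the top 4 bits, object number in the low 48 bits). The function names are mine, and the DT_DIR and DT_REG values are the usual Linux and FreeBSD d_type numbers, which I'm assuming are what gets stored.

```python
TYPE_SHIFT = 60            # the type lives in the top 4 bits
OBJ_MASK = (1 << 48) - 1   # the object number lives in the low 48 bits

# Assumed d_type values (as on Linux and FreeBSD); 0 means 'not specified'.
DT_UNKNOWN, DT_DIR, DT_REG = 0, 4, 8


def pack_dirent(obj: int, dtype: int = DT_UNKNOWN) -> int:
    """Build the 64-bit directory entry value from object number and type."""
    assert 0 <= obj <= OBJ_MASK and 0 <= dtype < 16
    return (dtype << TYPE_SHIFT) | obj


def dirent_type(value: int) -> int:
    return (value >> TYPE_SHIFT) & 0xF


def dirent_obj(value: int) -> int:
    return value & OBJ_MASK


new_style = pack_dirent(4396504, DT_DIR)  # entry written with type information
old_style = pack_dirent(12017)            # entry written before version 2

# An old entry is just the bare object number, so it decodes as type 0.
assert old_style == 12017 and dirent_type(old_style) == DT_UNKNOWN
# A reader that masks correctly recovers both fields from a new entry...
assert dirent_obj(new_style) == 4396504 and dirent_type(new_style) == DT_DIR
# ...while a reader that uses all 64 bits sees a nonsense object number.
assert new_style != 4396504
```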

(As before, I consider this a neat hack that cleverly uses some properties of ZFS and the filesystem to its advantage.)

ZFSAndDirectoryDType written at 00:43:13


Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem

The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.

The symptoms are less severe, in that the fileserver in question only gets fairly unresponsive to NFS (especially to the machine that the writes were coming from) instead of locking up. This was somewhat variable and may have primarily affected the particular filesystem or perhaps the particular pool it's in, instead of all of the filesystems and pools on the fileserver; I didn't attempt to gather this data during the recent incident where I re-observed this, but certainly some machines could still do things like issue dfs against the fileserver.

(This was of course our biggest fileserver.)

During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.

Our DTrace stuff did report (very) long NFS operations and that report eventually led me to the source and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.

How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up to date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.

(We will probably not attempt to investigate this at all on our current fileservers, since our next generation will not run any version of Illumos.)

ZFSNFSFilesystemQuotaProblem written at 00:41:14


Some things on Illumos NFS export permissions

Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.

I'll start with a ZFS sharenfs setting and then break it down. The two most ornate ones we have look like this:


The AAA and BBB netgroups don't overlap with nfs_ssh, but nfs_root and nfs_oldmail are both subsets of nfs_ssh.

The first slightly tricky bit is root=. As the manual page explains in the NOTES section, all that root= does is change the interpretation of UID 0 for clients that are already allowed to read or write to the NFS share. Per the manual page, 'the access the host gets is the same as when the root= option is absent' (and this may include no access). As a corollary, 'root=NG,ro=NG' is basically the same as 'ro=NG,anon=0'. Since our root= netgroups are a subset of our general allowed-access netgroups, we're okay here.

(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)

The next tricky bit is the interaction of rw= and ro=. Before just now I would have told you that rw= took priority over ro= if you had a host that was included in both (via different netgroups), but it turns out that whichever one is first takes priority. We were getting rw-over-ro effects because we always listed rw= first, but I don't think we necessarily understood that when we wrote the second sharenfs setting. The manual page is explicit about this:

If rw= and ro= options are specified in the same sec= clause, and a client is in both lists, the order of the two options determines the access the client gets.

(Note that the behavior is different if you use general ro or rw. See the manpage.)
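The 'first one listed wins' rule can be modelled in a few lines of Python. This is a deliberately simplified sketch: it only understands rw= and ro= with colon-separated netgroup lists, ignores sec= clauses, bare ro and rw, hostnames and negation, and uses made-up netgroup names.

```python
def access_for(client_netgroups, share_options):
    """Return 'rw', 'ro', or None for a client, using first-match ordering.

    client_netgroups: set of netgroup names the client belongs to.
    share_options: a sharenfs-style string such as 'sec=sys,rw=A:B,ro=C'.
    """
    for opt in share_options.split(","):
        key, _, value = opt.partition("=")
        if key in ("rw", "ro") and value:
            # Whichever of rw= or ro= matches first decides the access.
            if client_netgroups & set(value.split(":")):
                return key
    return None


# A client that is in both (hypothetical) netgroups:
both = {"writers", "readers"}
print(access_for(both, "sec=sys,rw=writers,ro=readers"))  # prints rw
print(access_for(both, "sec=sys,ro=readers,rw=writers"))  # prints ro
```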

We would have noticed if we flipped this around for the one filesystem with overlapping ro= and rw= groups, since the machine that was supposed to be able to write to the filesystem would have failed (and the failure would have stalled our mail system). But it's still sort of a narrow escape.

What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.

Sidebar: Our general narrow miss on this

We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:

# global share options 
shareopts nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

# our SAN filesystems:
fs3-corestaff-01   /h/281   rw+=AAA

This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.

The actual code is written in Python and turns all of the NFS exports options into Python dictionary keys and values. Python dictionaries are unordered, so under normal circumstances reassembling the exports options would have put them into some random order, so anything with both rw= and ro= could have wound up in the wrong order. However, conveniently I decided to put the NFS export options into a canonical order when I converted them back to string form, and this put rw= before ro= (and sec=sys before both). There's no sign in my code comments that I knew this was important; it seems to have just been what I thought of as the correct canonical ordering. Possibly I was blindly copying and preserving earlier work where we always had rw= first.
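A minimal sketch of the general idea, not the actual program: parse the options into a dictionary, apply a 'rw+=' style extension, and write them back out in a fixed order. The specific ordering list, the function names, and the assumption that extended netgroups are joined with ':' are all mine; the only things taken from above are that sec= comes before rw= and rw= comes before ro=.

```python
# Hypothetical canonical order; anything not listed sorts after these by name.
CANONICAL_ORDER = ["nosuid", "sec", "rw", "ro", "root"]


def parse_opts(optstr):
    """Turn 'nosuid,sec=sys,rw=a:b' into {'nosuid': None, 'sec': ['sys'], ...}."""
    opts = {}
    for item in optstr.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value.split(":") if sep else None
    return opts


def extend(opts, key, groups):
    """Apply a 'key+=groups' style extension without modifying the original."""
    merged = dict(opts)
    merged[key] = list(merged.get(key) or []) + list(groups)
    return merged


def format_opts(opts):
    """Reassemble the options in canonical order, independent of dict order."""
    def rank(key):
        if key in CANONICAL_ORDER:
            return (CANONICAL_ORDER.index(key), key)
        return (len(CANONICAL_ORDER), key)

    parts = []
    for key in sorted(opts, key=rank):
        value = opts[key]
        parts.append(key if value is None else key + "=" + ":".join(value))
    return ",".join(parts)


base = parse_opts("nosuid,sec=sys,rw=nfs_ssh,root=nfs_root")
print(format_opts(extend(base, "rw", ["AAA"])))
# prints nosuid,sec=sys,rw=nfs_ssh:AAA,root=nfs_root
```

Because the output order comes from the explicit list and not from the dictionary, rw= lands before ro= no matter what order the options were inserted in.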

IllumosNFSExportsPerms written at 23:43:51


Understanding ZFS System Attributes

Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.

Like most filesystems, ZFS puts these attributes in dnodes using some extra space (in what is called the dnode 'bonus buffer'). However, the ZFS trick is that whatever system attributes a dnode has are simply packed into that space without being organized into formal structures with a fixed order of attributes. Code that uses system attributes retrieves them from dnodes indirectly by asking for, say, the ZPL_PARENT of a dnode; it never cares exactly how they're packed into a given dnode. However, obviously something does.

One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.
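Here is a toy Python model of the 'numbered layouts' idea, to show how a dnode can get away with storing only a layout number plus bare values. The class and method names are invented, attribute names are plain strings, and sorting the names stands in for whatever the real defined order is; the actual code in sa.c is considerably more involved.

```python
class LayoutRegistry:
    """Toy per-dataset registry of attribute layouts (all names illustrative)."""

    def __init__(self):
        self._number_for = {}  # tuple of attribute names -> layout number
        self._layouts = []     # layout number -> tuple of attribute names

    def layout_for(self, names):
        """Find the layout number for this attribute set, registering it if new."""
        key = tuple(names)
        if key not in self._number_for:
            self._number_for[key] = len(self._layouts)
            self._layouts.append(key)
        return self._number_for[key]

    def pack(self, attrs):
        """'Write out' a dnode: a layout number plus values, with no keys stored."""
        names = sorted(attrs)  # stand-in for the defined attribute order
        return self.layout_for(names), [attrs[n] for n in names]

    def get(self, packed, name):
        """Look up one attribute indirectly, through the layout."""
        number, values = packed
        return values[self._layouts[number].index(name)]


reg = LayoutRegistry()
file1 = reg.pack({"ZPL_MODE": 0o644, "ZPL_UID": 1000, "ZPL_PARENT": 52906})
file2 = reg.pack({"ZPL_MODE": 0o600, "ZPL_UID": 0, "ZPL_PARENT": 3})
meta = reg.pack({"ZPL_PARENT": 3})

# Two dnodes with the same set of attributes share one layout number;
# a different set of attributes gets a newly registered layout.
assert file1[0] == file2[0] and meta[0] != file1[0]
assert reg.get(file1, "ZPL_PARENT") == 52906
```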

(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)

I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.

As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.

By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.

ZFSSystemAttributes written at 01:11:37


How ZFS makes things like 'zfs diff' report filenames efficiently

As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.

In theory, turning an inode or dnode number into the path to a file is an expensive operation; you basically have to search the entire filesystem until you find it. In practice, if you've ever run 'zfs diff', you've likely noticed that it runs pretty fast. Nor is this the only place that ZFS quickly turns dnode numbers into full paths, as it comes up in 'zpool status' reports about permanent errors. At one level, zfs diff and zpool status do this so rapidly because they ask the ZFS code in the kernel to do it for them. At another level, the question is how the kernel's ZFS code can be so fast.

The interesting and surprising answer is that ZFS cheats, in a way that makes things very fast when it works and almost always works in normal filesystems and with normal usage patterns. The cheat is that ZFS dnodes record their parent's object number. Here, let's show this in zdb:

# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1285414    1   128K    512      0     512    512    0.00  ZFS plain file
       parent  1284472
# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1284472    1   128K    512      0     512    512  100.00  ZFS directory
       parent  52906
       microzap: 512 bytes, 1 entries
          b = 1285414 (type: Regular File)

The b file has a parent field that points to cks/tmp/a, the directory it's in, and the a directory has a parent field that points to cks/tmp, and so on. When the kernel wants to get the name for a given object number, it can just fetch the object, look at parent, and start going back up the filesystem.
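The walk back up can be sketched in Python over a toy in-memory table of objects. The object numbers for b, a and cks/tmp are the ones from the zdb output above; the numbers for cks and the filesystem root are invented. The sketch also assumes that each component's name is recovered by looking through the parent directory's entries, which is a property of this toy model and not a claim about what the kernel code does.

```python
ROOT = 34  # hypothetical object number for the filesystem root

# objnum -> {"parent": objnum, "entries": {name: objnum}} (entries only for dirs)
objects = {
    ROOT:    {"parent": ROOT,    "entries": {"cks": 100}},
    100:     {"parent": ROOT,    "entries": {"tmp": 52906}},   # cks (made up)
    52906:   {"parent": 100,     "entries": {"a": 1284472}},   # cks/tmp
    1284472: {"parent": 52906,   "entries": {"b": 1285414}},   # cks/tmp/a
    1285414: {"parent": 1284472, "entries": {}},               # cks/tmp/a/b
}


def obj_to_path(objects, obj, root=ROOT):
    """Build a path by following parent pointers from obj up to the root."""
    parts = []
    while obj != root:
        parent = objects[obj]["parent"]
        # Toy-model assumption: find our name among the parent's entries.
        name = next(n for n, o in objects[parent]["entries"].items() if o == obj)
        parts.append(name)
        obj = parent
    return "/" + "/".join(reversed(parts))


print(obj_to_path(objects, 1285414))  # prints /cks/tmp/a/b
```

The cost is one step per directory level, not a search of the whole filesystem.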

(If you want to see this sausage being made, look at zfs_obj_to_path and zfs_obj_to_pobj in zfs_znode.c. The parent field is a ZFS dnode system attribute, specifically ZPL_PARENT.)

If you're familiar with the twists and turns of Unix filesystems, you're now wondering how ZFS deals with hardlinks, which can cause a file to be in several directories at once and so have several parents (and then it can be removed from some of the directories). The answer is that ZFS doesn't; a dnode only ever tracks a single parent, and ZFS accepts that this parent information can be inaccurate. I'll quote the comment in zfs_obj_to_pobj:

When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.

Before I get into the details, I want to say that I appreciate the brute force elegance of this cheat. The practical reality is that most Unix files today don't have extra hardlinks, and when they do most hardlinks are done in ways that won't break ZFS's parent stuff. The result is that ZFS has picked an efficient implementation that works almost all of the time; in my opinion, the great benefits we get from having it around are more than worth the infrequent cases where it fails or malfunctions. Both zfs diff and having filenames show up in zpool status permanent error reports are very useful (and there may be other cases where this gets used).

The current details are that any time you hardlink a file to somewhere or rename it, ZFS updates the file's parent to point to the new directory. Often this will wind up with a correct parent even after all of the dust settles; for example, a common pattern is to write a file to an initial location, hardlink it to its final destination, and then remove the initial location version. In this case, the parent will be correct and you'll get the right name. The time when you get an incorrect parent is this sequence:

; mkdir a b; touch a/demo
; ln a/demo b/
; rm b/demo

Here a/demo is the remaining path, but demo's dnode will claim that its parent is b. I believe that zfs diff will even report this as the path, because the kernel doesn't do the extra work to scan the b directory to verify that demo is present in it.
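The bookkeeping behind this can be simulated with a toy model in Python. It implements only the rules as described here (creating or hardlinking sets the recorded parent, removing a link leaves it alone); the class and its methods are invented for illustration and have nothing to do with the real ZFS code.

```python
class ToyFS:
    """Toy model of single-parent tracking; all names are illustrative."""

    def __init__(self):
        self.dirs = {}    # directory name -> {entry name: object number}
        self.parent = {}  # object number -> directory recorded as its parent
        self._next_obj = 1

    def mkdir(self, d):
        self.dirs[d] = {}

    def create(self, d, name):
        obj = self._next_obj
        self._next_obj += 1
        self.dirs[d][name] = obj
        self.parent[obj] = d
        return obj

    def link(self, src_dir, name, dst_dir):
        obj = self.dirs[src_dir][name]
        self.dirs[dst_dir][name] = obj
        self.parent[obj] = dst_dir  # hardlinking updates the recorded parent

    def unlink(self, d, name):
        del self.dirs[d][name]      # the recorded parent is left untouched


# mkdir a b; touch a/demo; ln a/demo b/; rm b/demo
fs = ToyFS()
fs.mkdir("a")
fs.mkdir("b")
demo = fs.create("a", "demo")
fs.link("a", "demo", "b")
fs.unlink("b", "demo")

# The only remaining link is a/demo, but the recorded parent still says b.
assert "demo" in fs.dirs["a"] and "demo" not in fs.dirs["b"]
assert fs.parent[demo] == "b"
```

If you instead remove a/demo at the end (the common 'write, hardlink into place, remove the original' pattern), the recorded parent of b is the right answer.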

(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)

ZFSPathLookupTrick written at 00:51:38


What ZFS block pointers are and what's in them

I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.

The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Just like block numbers but unlike things like ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on disk data format and an in memory structure that shows up in other things. To quote from the (draft and old) ZFS on-disk specification (PDF):

A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.

Block pointers are embedded in any ZFS on disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).

So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:

  • various metadata and flags about what the block pointer is for and what parts of it mean, including what type of object it points to.

  • Up to three DVAs that say where to actually find the data on disk. There can be more than one DVA because you may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).

  • The logical size (size before compression) and 'physical' size (the nominal size after compression) of the disk block. The physical size can do odd things and is not necessarily the asize (allocated size) for the DVA(s).

  • The txgs that the block was born in, both logically and physically (the physical txg is apparently for dva[0]). The physical txg was added with ZFS deduplication but apparently also shows up in vdev removal.

  • The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.

Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.

(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)

There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.

Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).

As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.

However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
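As a rough sketch of why birth txgs make finding changes cheap, here is a toy traversal in Python. The `Node` class and the `changed_since` function are hypothetical, and the tree shape is heavily simplified (a real object set keeps its dnodes in blocks of a dnode array). The idea it illustrates is only the pruning: if a block pointer's birth txg is no later than the starting txg, nothing underneath it can have changed, so the whole subtree is skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    birth_txg: int  # logical birth txg of the pointer to this node
    children: list = field(default_factory=list)


def changed_since(node, from_txg, visited):
    """Return the names of leaf blocks written after from_txg.

    `visited` records every node we had to look at, to show the pruning.
    """
    visited.append(node.name)
    if node.birth_txg <= from_txg:
        return []   # nothing below here changed; do not descend
    if not node.children:
        return [node.name]
    changed = []
    for child in node.children:
        changed.extend(changed_since(child, from_txg, visited))
    return changed


# A file (object 12) was written in txg 20; its directory (object 5) was not.
objset = Node("object set", 20, [
    Node("dnode 5 (directory)", 10, [Node("dnode 5 block 0", 10)]),
    Node("dnode 12 (file)", 20, [
        Node("dnode 12 block 0", 10),
        Node("dnode 12 block 1", 20),
    ]),
])

visited = []
changed = changed_since(objset, 15, visited)
```

Here `changed` contains only the file's rewritten block, and the directory's subtree is never descended into. What comes back is "object 12 changed", not a path name, which is the object-level view described above.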

(ZFS has an interesting hack to make things like 'zfs diff' work far more efficiently than you would expect in light of this, but that's going to take yet another entry to cover.)

ZFSBlockPointers written at 23:19:40

A broad overview of how ZFS is structured on disk

When I wrote yesterday's entry, it became clear that I didn't understand as much about how ZFS is structured on disk as I thought I did (and that this matters, since I had thought that ZFS copy on write updates changed a lot more than they actually do). So today I want to write down my new broad understanding of how this works.

(All of this can be dug out of the old, draft ZFS on-disk format specification, but that spec is written in a very detailed way and things aren't always immediately clear from it.)

Almost everything in ZFS is in a DMU object. All objects are defined by a dnode, and object dnodes are almost always grouped together in an object set. Object sets are themselves DMU objects; they store dnodes as basically a giant array in a 'file', which uses data blocks and indirect blocks and so on, just like anything else. Within a single object set, dnodes have an object number, which is the index of their position in the object set's array of dnodes.

(Because an object number is just the index of the object's dnode in its object set's array of dnodes, object numbers are basically always going to be duplicated between object sets (and they're always relative to an object set). For instance, pretty much every object set is going to have an object number ten, although not all object sets may have enough objects that they have an object number ten thousand. One corollary of this is that if you ask zdb to tell you about a given object number, you have to tell zdb what object set you're talking about. Usually you do this by telling zdb which ZFS filesystem or dataset you mean.)

Each ZFS filesystem has its own object set for objects (and thus dnodes) used in the filesystem. As I discovered yesterday, every ZFS filesystem has a directory hierarchy and it may go many levels deep, but all of this directory hierarchy refers to directories and files using their object number.
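Here is a small Python sketch of these two ideas together: object numbers are just indexes into one object set's dnode array, and directories name their contents by object number. The `ObjectSet` and `Dnode` classes are invented for illustration, the dataset names are made up, and the numbering is simplified (this toy starts handing out object numbers at 0 and ignores special objects such as the master node).

```python
from dataclasses import dataclass, field


@dataclass
class Dnode:
    kind: str                                    # "file" or "directory"
    data: bytes = b""
    entries: dict = field(default_factory=dict)  # name -> object number
    version: int = 0                             # bumped on each rewrite


class ObjectSet:
    def __init__(self, name):
        self.name = name
        self.dnodes = []   # index in this list == object number

    def allocate(self, dnode):
        self.dnodes.append(dnode)
        return len(self.dnodes) - 1

    def write_file(self, objnum, data):
        # Copy on write in miniature: replace the file's dnode, touch nothing else.
        old = self.dnodes[objnum]
        self.dnodes[objnum] = Dnode("file", data=data, version=old.version + 1)


home = ObjectSet("pool/home")        # hypothetical dataset names
scratch = ObjectSet("pool/scratch")

root_dir = home.allocate(Dnode("directory"))
notes = home.allocate(Dnode("file", data=b"v1"))
home.dnodes[root_dir].entries["notes.txt"] = notes

# The same object number in another object set is a completely different object.
other = scratch.allocate(Dnode("file", data=b"unrelated"))

dir_before = home.dnodes[root_dir]
home.write_file(notes, b"v2")
```

After the write, the directory's dnode is the very same one as before and its entry still holds the same object number, even though the file's dnode has been replaced. Also, `root_dir` in `home` and `other` in `scratch` share an object number while being unrelated objects, which is why zdb needs to be told which dataset you mean.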

ZFS organizes and keeps track of filesystems, clones, and snapshots through the DSL (Dataset and Snapshot Layer). The DSL has all sorts of things; DSL directories, DSL datasets, and so on, all of which are objects and many of which refer to object sets (for example, every ZFS filesystem must refer to its current object set somehow). All of these DSL objects are themselves stored as dnodes in another object set, the Meta Object Set, which the uberblock points to. To my surprise, object sets are not stored in the MOS (and as a result do not have 'object numbers'). Object sets are always referred to directly, without indirection, using a block pointer to the object set's dnode.

(I think object sets are referred to directly so that snapshots can freeze their object set very simply.)

The DSL directories and datasets for your pool's set of filesystems form a tree themselves (each filesystem has a DSL directory and at least one DSL dataset). However, just like in ZFS filesystems, all of the objects in this second tree refer to each other indirectly, by their MOS object number. Just as with files in ZFS filesystems, this level of indirection limits the amount of copy on write updates that ZFS has to do when something changes.

PS: If you want to examine MOS objects with zdb, I think you do it with something like 'zdb -vvv -d ssddata 1', which will get you object number 1 of the MOS, which is the MOS object directory. If you want to ask zdb about an object in the pool's root filesystem, use 'zdb -vvv -d ssddata/ 1'. You can tell which one you're getting depending on what zdb prints out. If it says 'Dataset mos [META]' you're looking at objects from the MOS; if it says 'Dataset ssddata [ZPL]', you're looking at the pool's root filesystem (where object number 1 is the ZFS master node).

PPS: I was going to write up what changes on a filesystem write, but then I realized that I didn't know how blocks being allocated and freed are reflected in pool structures. So I'll just say that I think that, ignoring free space management, only four DMU objects get updated: the file itself, the filesystem's object set, the filesystem's DSL dataset object, and the MOS.

(As usual, doing the research to write this up taught me things that I didn't know about ZFS.)

ZFSBroadDiskStructure written at 01:16:58
