When you make changes, ZFS updates much less stuff than I thought

June 22, 2018

In the past, for example in my entry on how ZFS bookmarks can work with reasonable efficiency, I have given what I think of as the standard explanation of how ZFS's copy on write nature forces changes to things like the data in a file to ripple up all the way to the top of the ZFS hierarchy. To quote myself:

If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, [...]

This is wrong. ZFS is structured so that it doesn't have to ripple changes all the way up through the filesystem just because you changed a piece of it down in the depths of a directory hierarchy.

How this works is through the usual CS trick of a level of indirection. All objects in a ZFS filesystem have an object number, which we've seen come up before, for example in ZFS delete queues. Once it's created, the object number of something never changes. Almost everything in a ZFS filesystem refers to other objects in the filesystem by their object number, not by their (current) disk location. For example, directories in your filesystem refer to things by their object numbers:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testdir
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1003162    1   128K    512      0     512    512  100.00  ZFS directory
    microzap: 512 bytes, 1 entries
       ATESTFILE = 1003019 (type: Regular File)

The directory doesn't tell us where ATESTFILE is on the disk, it just tells us that it's object 1003019.

In order to find where objects are, ZFS stores a per filesystem mapping from object number to actual disk locations that we can sort of think of as a big file; these are called object sets. More exactly, each object number maps to a ZFS dnode, and the ZFS dnodes are stored in what is conceptually an on-disk array ('indexed' by the object number). As far as I can tell, an object's dnode is the only thing that knows where its data is located on disk.

So, suppose that we overwrite data in ATESTFILE. ZFS's copy on write property means that we have to write a new version of the data block, possibly a new version of some number of indirect blocks (if the file is big enough), and then a new version of the dnode so that it points to the new data block or indirect block. Because the dnode itself is part of a block of dnodes in the object set, we must write a new copy of that block of dnodes and then ripple the changes up the indirect blocks and so on (eventually reaching the uberblock as part of a transaction group commit). However, we don't have to change any directories in the ZFS filesystem, no matter how deep the file is in them; while we changed the file's dnode (or if you prefer, the data in the dnode), we didn't change its object number, and the directories only refer to it by object number. It was object number 1003019 before we wrote data to it and it's object number 1003019 after we did, so our cks/tmp/testdir directory is untouched.

Once I thought about it, this isn't particularly different from how conventional Unix filesystems work (what ZFS calls an object number is what we conventionally call an inode number). It's especially forced by the nature of a copy on write Unix filesystem, given that due to hardlinks a file may be referred to from multiple directories. If we had to update every directory a file was linked from whenever the file changed, we'd need some way to keep track of them all, and that would cause all sorts of implementation issues.

(Now that I've realized this it all feels obvious and necessary. Yet at the same time I've been casually explaining ZFS copy on write updates wrong for, well, years. And yes, when I wrote "directory metadata" in my earlier entry, I meant the filesystem directory, not the object set's 'directory' of dnodes.)

Sidebar: The other reason to use inode numbers or object numbers

Although modern filesystems may have 512 byte inodes or dnodes, Unix has traditionally used ones that were smaller than a disk block and thus that were packed several to a (512 byte) disk block. If you need to address something smaller than a disk block, you can't just use the disk block number where the thing is; you need either the disk block number plus an index into it, or you can make things more compact by just having a global index number, ie the inode number.

The original Unix filesystems made life even simpler by storing all inodes in one contiguous chunk of disk space toward the start of the filesystem. This made calculating the disk block that held a given inode a pretty simple process. (For the sake of your peace of mind, you probably don't want to know just how simple it was in V7.)

Comments on this page:

From at 2018-06-23 03:15:04:

Apparently XFS developers didn't know that, either...

Also, a mostly-unrelated but still somewhat amusing quote from the same article:

[Metadata] CoW in XFS is different. Because of the B* trees, it cannot do the leaf-to-tip update that CoW filesystems do; it would require updating laterally as well, which in the worst case means updating the entire filesystem. So CoW in XFS is data-only.

Quite the opposite of ZFS, it seems.

By cks at 2018-06-23 14:55:30:

I think the XFS developers did understand this, although it's not completely clear to me. They carefully talk about an 'index tree', not a directory tree, and I think they mean the tree or table of inodes. ZFS does have an index tree that it has to update whenever you change something; it's just that I think it's a shallow tree. When you change what disk blocks a dnode points to, you have to change the object set it's in, and then you have to change wherever the dnode of the object set is, and so on up until you reach the uberblock.

If I'm reading ZFS disk layout stuff correctly, I believe that this is only a two level tree; you update a dnode in an object set, which changes the object set, and then the object set is referred to by a dnode in the Meta Object Set (MOS), which also has to change, and then we hit the uberblock, which points directly to the MOS. All other references to the first object set use its MOS object number, so they don't have to change.

Written on 22 June 2018.
« Running servers (and services) well is not trivial
A broad overview of how ZFS is structured on disk »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jun 22 22:48:33 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.