Understanding ZFS System Attributes

July 15, 2018

Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like an object's parent directory and a file's generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that deals with two problems at once: different types of ZFS dnodes want different sets of attributes, and the set of attributes needs to evolve over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.

Like most filesystems, ZFS puts these attributes in dnodes using some extra space (in what is called the dnode 'bonus buffer'). However, the ZFS trick is that whatever system attributes a dnode has are simply packed into that space without being organized into formal structures with a fixed order of attributes. Code that uses system attributes retrieves them from dnodes indirectly by asking for, say, the ZPL_PARENT of a dnode; it never cares exactly how they're packed into a given dnode. However, obviously something does.
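To make the indirection concrete, here is a minimal Python sketch of looking up an attribute by name in a packed buffer. It is not ZFS's actual code: `pack_attrs`, `lookup_attr`, the fixed 8-byte size for every attribute, and the particular set of attributes are all assumptions made for illustration. The point is only that the caller names the attribute and something else works out where it lives.

```python
import struct

# Assumed for illustration: every attribute is one little-endian 64-bit integer.
ATTR_SIZES = {"ZPL_MODE": 8, "ZPL_UID": 8, "ZPL_GID": 8, "ZPL_PARENT": 8, "ZPL_GEN": 8}


def pack_attrs(layout, values):
    """Pack the values back to back in layout order, with no keys stored."""
    return b"".join(struct.pack("<Q", values[name]) for name in layout)


def lookup_attr(layout, bonus, name):
    """Find one attribute by name by walking the layout to compute its offset."""
    offset = 0
    for attr in layout:
        if attr == name:
            (value,) = struct.unpack_from("<Q", bonus, offset)
            return value
        offset += ATTR_SIZES[attr]
    raise KeyError(name)


# A hypothetical layout for an ordinary file.
file_layout = ("ZPL_MODE", "ZPL_UID", "ZPL_GID", "ZPL_PARENT", "ZPL_GEN")
bonus = pack_attrs(
    file_layout,
    {"ZPL_MODE": 0o100644, "ZPL_UID": 1000, "ZPL_GID": 1000, "ZPL_PARENT": 34, "ZPL_GEN": 7},
)

# The caller asks for ZPL_PARENT and never sees the offset.
parent = lookup_attr(file_layout, bonus, "ZPL_PARENT")
```

The packed buffer holds nothing but values; all knowledge of what sits where is in `file_layout`, which is the thing the rest of this entry is about.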

One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in the order the layout defines). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory; then, when it comes time to write out the dnode in its on-disk format, ZFS checks whether the set of attributes matches a known layout or whether a new attribute layout needs to be set up and registered.
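The register-on-write idea can be sketched in a few lines of Python. This is a toy model, not the real on-disk format: `LayoutRegistry`, `write_dnode`, and `read_dnode` are hypothetical names, and canonicalising a layout by sorting attribute names is an assumption made so that the same set of attributes always maps to the same number.

```python
class LayoutRegistry:
    """Toy per-dataset table mapping attribute layouts to small numbers."""

    def __init__(self):
        self._number_of = {}   # layout tuple -> layout number
        self._layouts = []     # layout number -> layout tuple

    def layout_number(self, attr_names):
        """Return the number for this set of attributes, registering it if new."""
        layout = tuple(sorted(attr_names))  # assumed canonical order
        if layout not in self._number_of:
            self._number_of[layout] = len(self._layouts)
            self._layouts.append(layout)
        return self._number_of[layout]

    def layout(self, number):
        return self._layouts[number]


def write_dnode(registry, attrs):
    """Turn in-memory attributes into (layout number, values in layout order)."""
    number = registry.layout_number(attrs)
    values = tuple(attrs[name] for name in registry.layout(number))
    return number, values


def read_dnode(registry, stored):
    """Rebuild the name -> value mapping from the stored form."""
    number, values = stored
    return dict(zip(registry.layout(number), values))


registry = LayoutRegistry()
file_a = write_dnode(registry, {"ZPL_MODE": 0o644, "ZPL_PARENT": 34})
file_b = write_dnode(registry, {"ZPL_PARENT": 4, "ZPL_MODE": 0o755})
other = write_dnode(registry, {"ZPL_GEN": 1})
```

Here `file_a` and `file_b` share one layout number because they carry the same set of attributes, while `other` triggers registration of a second layout. Neither stored form contains any attribute names, which is the space saving over key/value storage.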

(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
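As a rough illustration of the overflow case, the following Python sketch splits a dnode's attributes between the bonus buffer and a spill block. The first-fit rule used here is an assumption (the source does not say how ZFS decides which attributes spill), and `split_for_spill`, the 32-byte capacity, and `HYPOTHETICAL_BIG_ATTR` are made up for the example.

```python
def split_for_spill(attrs, bonus_capacity):
    """Split name -> bytes attributes into (bonus, spill) dicts.

    Assumed policy: fill the bonus buffer in order; once one attribute
    does not fit, it and everything after it go to the spill block.
    """
    bonus, spill = {}, {}
    used = 0
    for name, data in attrs.items():
        if not spill and used + len(data) <= bonus_capacity:
            bonus[name] = data
            used += len(data)
        else:
            spill[name] = data
    return bonus, spill


attrs = {
    "ZPL_MODE": bytes(8),
    "ZPL_PARENT": bytes(8),
    "HYPOTHETICAL_BIG_ATTR": bytes(100),  # too large for the bonus buffer
    "ZPL_GEN": bytes(8),
}
bonus, spill = split_for_spill(attrs, bonus_capacity=32)

# Each half is described by its own layout, here simply the ordered names.
bonus_layout = tuple(bonus)
spill_layout = tuple(spill)
```

The two halves end up with different layouts, matching the note that spill blocks have their own attribute layouts.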

I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.

As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.

By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.


Comments on this page:

By skeeto at 2018-07-15 07:25:36:

This sounds a lot like V8's hidden classes. When particular object instances have a consistent set of fields throughout their lifetime, V8 creates an internal class to represent them. These objects have a fixed layout so that their fields can be accessed efficiently (no dynamic lookup). A ZFS layout number is like a kind of hidden class.

Written on 15 July 2018.

Last modified: Sun Jul 15 01:11:37 2018