Wandering Thoughts archives

2024-06-18

Some things on how ZFS System Attributes are stored

To summarize, ZFS's System Attributes (SAs) are a way for ZFS to pack a somewhat arbitrary collection of additional information, such as the parent directory of things and symbolic link targets, into ZFS dnodes in a general and flexible way that doesn't hard code the specific combinations of attributes that can be used together. ZFS system attributes are normally stored in extra space in dnodes that's called the bonus buffer, but the system attributes can overflow to a spill block if necessary. I've written more about the high level side of this in my entry on ZFS SAs, but today I'm going to write up some concrete details of what you'd see when you look at a ZFS filesystem with tools like zdb.

When ZFS stores the SAs for a particular dnode, it simply packs all of their values together in a blob of data. It knows which part of the blob is which through an attribute layout, which tells it which attributes are in the layout and in what order. Attribute layouts are created and registered as they are needed, which is to say when some dnode wants to use that particular combination of attributes. Generally there are only a few combinations of system attributes that get used, so a typical ZFS filesystem will not have many SA layouts. System attributes are numbered, but the specific numbering may differ from filesystem to filesystem. In practice it probably mostly won't, since most attributes usually get registered pretty early in the life of a ZFS filesystem and in a predictable order.

(For example, the creation of a ZFS filesystem necessarily means creating a directory dnode for its top level, so all of the system attributes used for directories will immediately get registered, along with an attribute layout.)
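The "blob plus layout" idea can be sketched with a toy model. This is not ZFS's on-disk format: the attribute names and sizes below are hypothetical, and the sketch ignores variable-length attributes and any header ZFS puts in front of the packed values. It only shows that an ordered layout plus known sizes is enough to find each value in an otherwise unlabeled blob.

```python
import struct

# Hypothetical attribute names and fixed sizes in bytes, for illustration only.
SIZES = {"mode": 8, "size": 8, "mtime": 16}

def pack_attrs(layout, values):
    """Concatenate the values in layout order; nothing in the blob marks boundaries."""
    for name in layout:
        if len(values[name]) != SIZES[name]:
            raise ValueError(f"{name} must be {SIZES[name]} bytes")
    return b"".join(values[name] for name in layout)

def unpack_attrs(layout, blob):
    """Walk the layout in order, slicing each value out by its known size."""
    out, offset = {}, 0
    for name in layout:
        out[name] = blob[offset:offset + SIZES[name]]
        offset += SIZES[name]
    return out

layout = ["mode", "size", "mtime"]
values = {
    "mode": struct.pack("<Q", 0o100644),
    "size": struct.pack("<Q", 4096),
    "mtime": struct.pack("<QQ", 1718752421, 0),  # assumed (seconds, nanoseconds) pair
}
blob = pack_attrs(layout, values)
print(len(blob))
print(unpack_attrs(layout, blob) == values)
```

Two dnodes that use the same combination of attributes can share one layout, which is why only the layout's identity needs to travel with the blob.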

The attribute layout for a given dnode is not fixed when the dnode is created; instead, it varies depending on what system attributes that dnode needs at the moment. The high level ZFS code simply sets or clears specific system attributes on the dnode, and the low(er) level system attribute code takes care of either finding or creating an attribute layout that matches the current set of attributes the dnode has. Many system attributes are constant over the life of the dnode, but I think others can come and go, such as the system attributes used for xattrs.

Every ZFS filesystem with system attributes has three special dnodes involved in this process, which zdb will report as the "SA master node", the "SA attr registration" dnode, and the "SA attr layouts" dnode. As far as I know, the SA master node's current purpose is to point to the other two dnodes. The SA attribute registry dnode is where the potentially filesystem specific numbers for attributes are registered, and the SA attribute layouts dnode is where the various layouts in use on the filesystem are tracked. The SA master (d)node itself is pointed to by the "ZFS master node", which is always object 1.

So let's use zdb to take a look at a typical case:

# zdb -dddd fs19-scratch-01/w/430 1
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        1    1   128K    512     8K     512    512  100.00  ZFS master node
[...]
               SA_ATTRS = 32 
[...]
# zdb -dddd fs19-scratch-01/w/430 32
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       32    1   128K    512      0     512    512  100.00  SA master node
[...]
               LAYOUTS = 36 
               REGISTRY = 35 

It's common for the registry and layouts dnodes to have consecutive object numbers, since they're generally allocated at the same time. On most filesystems they will have very low object numbers, since they were created when the filesystem was.

The registry is generally going to be pretty boring looking:

# zdb -dddd fs19-scratch-01/w/430 35
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       35    1   128K  1.50K     8K     512  1.50K  100.00  SA attr registration
[...]
       ZPL_SCANSTAMP =  20030012 : [32:3:18]
       ZPL_RDEV =  800000a : [8:0:10]
       ZPL_FLAGS =  800000b : [8:0:11]
       ZPL_GEN =  8000004 : [8:0:4]
       ZPL_MTIME =  10000001 : [16:0:1]
       ZPL_CTIME =  10000002 : [16:0:2]
       ZPL_XATTR =  8000009 : [8:0:9]
       ZPL_UID =  800000c : [8:0:12]
       ZPL_ZNODE_ACL =  5803000f : [88:3:15]
       ZPL_PROJID =  8000015 : [8:0:21]
       ZPL_ATIME =  10000000 : [16:0:0]
       ZPL_SIZE =  8000006 : [8:0:6]
       ZPL_LINKS =  8000008 : [8:0:8]
       ZPL_PARENT =  8000007 : [8:0:7]
       ZPL_MODE =  8000005 : [8:0:5]
       ZPL_PAD =  2000000e : [32:0:14]
       ZPL_DACL_ACES =  40013 : [0:4:19]
       ZPL_GID =  800000d : [8:0:13]
       ZPL_CRTIME =  10000003 : [16:0:3]
       ZPL_DXATTR =  30014 : [0:3:20]
       ZPL_DACL_COUNT =  8000010 : [8:0:16]
       ZPL_SYMLINK =  30011 : [0:3:17]

The names of these attributes come from the enum of known system attributes in zfs_sa.h. The important bit of each value is the '[16:0:1]' portion, which is a decoded version of the raw number. The format of the raw number is covered in sa_impl.h, but the short version is that the first number is the total length of the attribute's value in bytes, the third is its attribute number within the filesystem, and the middle number is an index of how to byteswap it if necessary (sa.c has a nice comment about the whole scheme at the top).

(The attributes with a listed size of 0 store their data in extra special ways that are beyond the scope of this entry.)
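As a rough sketch, the decoding that zdb does here can be reproduced in a few lines. The bit positions below are an assumption inferred from the sample values in the output above (they reproduce every '[length:bswap:number]' triple shown), not copied from sa_impl.h, so treat that header as the authority. The function name is made up for this example.

```python
def decode_sa_registration(raw: int) -> tuple[int, int, int]:
    """Split a raw registration value into (length, byteswap index, attribute number)."""
    length = (raw >> 24) & 0xFFFF  # assumed: 16 bits starting at bit 24
    bswap = (raw >> 16) & 0xFF     # assumed: 8 bits starting at bit 16
    attr_num = raw & 0xFFFF        # assumed: the low 16 bits
    return length, bswap, attr_num

# Raw values copied from the registry output above; zdb prints them in hex.
samples = {
    "ZPL_MTIME": 0x10000001,
    "ZPL_ZNODE_ACL": 0x5803000F,
    "ZPL_SYMLINK": 0x30011,
}
for name, raw in samples.items():
    length, bswap, num = decode_sa_registration(raw)
    print(f"{name} = {raw:x} : [{length}:{bswap}:{num}]")
```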

The more interesting thing is the SA attribute layouts:

# zdb -dddd fs19-scratch-01/w/430 36
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       36    1   128K    16K    16K     512    32K  100.00  SA attr layouts
[...]
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]
    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]

This particular filesystem has three attribute layouts that have been used by dnodes, and as you can see they are mostly the same. Layout 3 is the common subset, with all of the basic inode attributes you'd expect in a Unix filesystem; layout 2 adds attribute 21 (ZPL_PROJID), and layout 4 adds attribute 17 (ZPL_SYMLINK).
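To make that comparison mechanical, here is a small sketch that takes the layouts and the registry numbers exactly as zdb printed them above and reports what each layout adds beyond the attributes all layouts share. The function names are invented for this example; the data is just retyped from the output.

```python
# Attribute numbers as registered on this filesystem (from the registry output above).
ATTR_NAMES = {
    0: "ZPL_ATIME", 1: "ZPL_MTIME", 2: "ZPL_CTIME", 3: "ZPL_CRTIME",
    4: "ZPL_GEN", 5: "ZPL_MODE", 6: "ZPL_SIZE", 7: "ZPL_PARENT",
    8: "ZPL_LINKS", 9: "ZPL_XATTR", 10: "ZPL_RDEV", 11: "ZPL_FLAGS",
    12: "ZPL_UID", 13: "ZPL_GID", 14: "ZPL_PAD", 15: "ZPL_ZNODE_ACL",
    16: "ZPL_DACL_COUNT", 17: "ZPL_SYMLINK", 18: "ZPL_SCANSTAMP",
    19: "ZPL_DACL_ACES", 20: "ZPL_DXATTR", 21: "ZPL_PROJID",
}

# Layouts from the 'SA attr layouts' dnode above.
LAYOUTS = {
    2: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19],
    4: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 17],
    3: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19],
}

def common_attrs(layouts):
    """Attribute numbers that appear in every layout."""
    return set.intersection(*(set(attrs) for attrs in layouts.values()))

def extras(layouts):
    """For each layout, the names of the attributes beyond the common subset."""
    base = common_attrs(layouts)
    return {
        num: [ATTR_NAMES[a] for a in attrs if a not in base]
        for num, attrs in layouts.items()
    }

for num, added in sorted(extras(LAYOUTS).items()):
    print(num, added)
```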

It's possible to have a lot more layouts than this. Here is the collection of layouts for my home desktop's home directory filesystem (which uses the same registered attribute numbers as the filesystem above, so you can look up there for them):

    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  9 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    7 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19  9 ]
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
    5 = [ 5  6  4  12  13  7  11  0  1  2  3  8  10  16  19 ]
    6 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]

Incidentally, notice how these layout numbers aren't the same as the layout numbers on the first filesystem; layout 3 on the first filesystem is layout 2 on my home directory filesystem, layout 4 (symlinks) is layout 3, and layout 2 (project ID) is layout 6. The additional layouts in my home directory filesystem add xattrs (id 9) or 'rdev' (id 10) to some combination of the other attributes.
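The renumbering can be checked by matching layouts on their contents rather than their numbers. This sketch assumes that two layouts are "the same" when they list the same attribute numbers in the same order, which holds for the data shown here, and it only makes sense because both filesystems use the same registered attribute numbers.

```python
# Layouts as printed by zdb for the two filesystems above.
W430_LAYOUTS = {
    2: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19],
    4: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 17],
    3: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19],
}
HOME_LAYOUTS = {
    4: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 9],
    3: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 17],
    7: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19, 9],
    2: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19],
    5: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 10, 16, 19],
    6: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19],
}

def match_layouts(a, b):
    """Map each layout number in a to the layout number in b with identical contents."""
    by_content = {tuple(attrs): num for num, attrs in b.items()}
    return {num: by_content.get(tuple(attrs)) for num, attrs in a.items()}

print(match_layouts(W430_LAYOUTS, HOME_LAYOUTS))
```

Layouts that exist on only one side map to None, which here would be home directory layouts 4, 5 and 7 if the arguments were swapped.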

One of the interesting aspects of this is that you can use the SA attribute layouts to tell if a ZFS filesystem definitely doesn't have some sort of files in it. For example, we know that there are no device special files or files with xattrs in /w/430, because there are no SA attribute layouts that include those attributes. And neither of these two filesystems has ever had ACLs set on any of its files, because neither of them has a layout with either of the SA ACL attributes.

(Attribute layouts are never removed once created, so a filesystem with a layout with the 'rdev' attribute in it may still not have any device special files in it right now; they could all have been removed.)
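That one-way inference can be sketched as a simple membership test over the layouts zdb printed above. The helper names are invented for this example, and the attribute numbers are the ones registered on these two filesystems; another filesystem could number them differently. A False result only means "possibly present at some point", never "present now".

```python
# Attribute numbers as registered on the two filesystems above.
ZPL_XATTR, ZPL_RDEV, ZPL_DXATTR = 9, 10, 20

W430_LAYOUTS = {
    2: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19],
    4: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 17],
    3: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19],
}
HOME_LAYOUTS = {
    4: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 9],
    3: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19, 17],
    7: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19, 9],
    2: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 16, 19],
    5: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 10, 16, 19],
    6: [5, 6, 4, 12, 13, 7, 11, 0, 1, 2, 3, 8, 21, 16, 19],
}

def attrs_in_any_layout(layouts):
    """Every attribute number that appears in at least one layout."""
    used = set()
    for attrs in layouts.values():
        used.update(attrs)
    return used

def never_used(layouts, attr_num):
    """True if no layout has ever included attr_num, so no dnode can have it now."""
    return attr_num not in attrs_in_any_layout(layouts)

print("w/430 never had device files:", never_used(W430_LAYOUTS, ZPL_RDEV))
print("w/430 never had directory xattrs:", never_used(W430_LAYOUTS, ZPL_XATTR))
print("home never had device files:", never_used(HOME_LAYOUTS, ZPL_RDEV))
```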

Unfortunately, I can't see any obvious way to get zdb to tell you what the current attribute layout is for a specific dnode. At best you have to try to deduce it from what 'zdb -dddd' will print for the dnode's attributes.

(I've recently acquired a reason to dig into the details of ZFS system attributes.)

Sidebar: A brief digression on xattrs in ZFS

As covered in zfsprops(7)'s section on 'xattr=', there are two storage schemes for xattrs in ZFS (well, in OpenZFS on Linux and FreeBSD). At the attribute level, 'ZPL_XATTR' is the older, more general 'store it in directories and files' approach, while 'ZPL_DXATTR' is the 'store it as part of system attributes' one ('xattr=sa'). When dumping a dnode in zdb, zdb will directly print SA xattrs, but for directory xattrs it simply reports 'xattr = <object id>', where the object ID is for the xattr directory. To see the names of the xattrs set on such a file, you need to also dump the xattr directory object with zdb.

(Internally the SA xattrs are stored as an nvlist, because ZFS loves nvlists and nvpairs, more or less because Solaris did at the time.)

solaris/ZFSSystemAttributesStorage written at 23:23:41;

