Wandering Thoughts


Some things on how ZFS System Attributes are stored

To summarize, ZFS's System Attributes (SAs) are a way for ZFS to pack a somewhat arbitrary collection of additional information, such as the parent directory of things and symbolic link targets, into ZFS dnodes in a general and flexible way that doesn't hard code the specific combinations of attributes that can be used together. ZFS system attributes are normally stored in extra space in dnodes that's called the bonus buffer, but the system attributes can overflow to a spill block if necessary. I've written more about the high level side of this in my entry on ZFS SAs, but today I'm going to write up some concrete details of what you'd see when you look at a ZFS filesystem with tools like zdb.

When ZFS stores the SAs for a particular dnode, it simply packs all of their values together in a blob of data. It knows which part of the blob is which through an attribute layout, which tells it which attributes are in the layout and in what order. Attribute layouts are created and registered as they are needed, which is to say when some dnode wants to use that particular combination of attributes. Generally there are only a few combinations of system attributes that get used, so a typical ZFS filesystem will not have many SA layouts. System attributes are numbered, but the specific numbering may differ from filesystem to filesystem. In practice it probably mostly won't, since most attributes usually get registered pretty early in the life of a ZFS filesystem and in a predictable order.

(For example, the creation of a ZFS filesystem necessarily means creating a directory dnode for its top level, so all of the system attributes used for directories will immediately get registered, along with an attribute layout.)

The attribute layout for a given dnode is not fixed when the file is created; instead, it varies depending on what system attributes that dnode needs at the moment. The high level ZFS code simply sets or clears specific system attributes on the dnode, and the low(er) level system attribute code takes care of either finding or creating an attribute layout that matches the current set of attributes the dnode has. Many system attributes are constant over the life of the dnode, but I think others can come and go, such as the system attributes used for xattrs.

Every ZFS filesystem with system attributes has three special dnodes involved in this process, which zdb will report as the "SA master node", the "SA attr registration" dnode, and the "SA attr layouts" dnode. As far as I know, the SA master node's current purpose is to point to the other two dnodes. The SA attribute registry dnode is where the potentially filesystem specific numbers for attributes are registered, and the SA attribute layouts dnode is where the various layouts in use on the filesystem are tracked. The SA master (d)node itself is pointed to by the "ZFS master node", which is always object 1.

So let's use zdb to take a look at a typical case:

# zdb -dddd fs19-scratch-01/w/430 1
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        1    1   128K    512     8K     512    512  100.00  ZFS master node
               SA_ATTRS = 32 
# zdb -dddd fs19-scratch-01/w/430 32
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       32    1   128K    512      0     512    512  100.00  SA master node
               LAYOUTS = 36 
               REGISTRY = 35 

It's common for the registry and the layout to be consecutive, since they're generally allocated at the same time. On most filesystems they will have very low object numbers, since they were created when the filesystem was.

The registry is generally going to be pretty boring looking:

# zdb -dddd fs19-scratch-01/w/430 35
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       35    1   128K  1.50K     8K     512  1.50K  100.00  SA attr registration
       ZPL_SCANSTAMP =  20030012 : [32:3:18]
       ZPL_RDEV =  800000a : [8:0:10]
       ZPL_FLAGS =  800000b : [8:0:11]
       ZPL_GEN =  8000004 : [8:0:4]
       ZPL_MTIME =  10000001 : [16:0:1]
       ZPL_CTIME =  10000002 : [16:0:2]
       ZPL_XATTR =  8000009 : [8:0:9]
       ZPL_UID =  800000c : [8:0:12]
       ZPL_ZNODE_ACL =  5803000f : [88:3:15]
       ZPL_PROJID =  8000015 : [8:0:21]
       ZPL_ATIME =  10000000 : [16:0:0]
       ZPL_SIZE =  8000006 : [8:0:6]
       ZPL_LINKS =  8000008 : [8:0:8]
       ZPL_PARENT =  8000007 : [8:0:7]
       ZPL_MODE =  8000005 : [8:0:5]
       ZPL_PAD =  2000000e : [32:0:14]
       ZPL_DACL_ACES =  40013 : [0:4:19]
       ZPL_GID =  800000d : [8:0:13]
       ZPL_CRTIME =  10000003 : [16:0:3]
       ZPL_DXATTR =  30014 : [0:3:20]
       ZPL_DACL_COUNT =  8000010 : [8:0:16]
       ZPL_SYMLINK =  30011 : [0:3:17]

The names of these attributes come from the enum of known system attributes in zfs_sa.h. The important bit of the values of them is the '[16:0:1]' portion, which is a decoded version of the raw number. The format of the raw number is covered in sa_impl.h, but the short version is that the first number is the total length of the attribute's value, in bytes, the third is its attribute number within the filesystem, and then middle number is an index of how to byteswap it if necessary (and sa.c has a nice comment about the whole scheme at the top).

(The attributes with a listed size of 0 store their data in extra special ways that are beyond the scope of this entry.)

The more interesting thing is the SA attribute layouts:

# zdb -dddd fs19-scratch-01/w/430 36
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       36    1   128K    16K    16K     512    32K  100.00  SA attr layouts
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]
    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]

This particular filesystem has three attribute layouts that have been used by dnodes, and as you can see they are mostly the same. Layout 3 is the common subset, with all of the basic inode attributes you'd expect in a Unix filesystem; layout 2 adds attribute 21 (ZPL_PROJID), and layout 4 adds attribute 17 (ZPL_SYMLINK).

It's possible to have a lot more layouts than this. Here is the collection of layouts for my home desktop's home directory filesystem (which uses the same registered attribute numbers as the filesystem above, so you can look up there for them):

    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  9 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    7 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19  9 ]
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
    5 = [ 5  6  4  12  13  7  11  0  1  2  3  8  10  16  19 ]
    6 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]

Incidentally, notice how these layout numbers aren't the same as the layout numbers on the first filesystem; layout 3 on the first filesystem is layout 2 on my home directory filesystem, layout 4 (symlinks) is layout 3, and layout 2 (project ID) is layout 6. The additional layouts in my home directory filesystem add xattrs (id 9) or 'rdev' (id 10) to some combination of the other attributes.

One of the interesting aspects of this is that you can use the SA attribute layouts to tell if a ZFS filesystem definitely doesn't have some sort of files in it. For example, we know that there are no device special files or files with xattrs in /w/430, because there are no SA attribute layouts that include those attributes. And neither of these two filesystems have ever had ACLs set on any of their files, because neither of them have layouts with either SA ACL attributes.

(Attribute layouts are never removed once created, so a filesystem with a layout with the 'rdev' attribute in it may still not have any device special files in it right now; they could all have been removed.)

Unfortunately, I can't see any obvious way to get zdb to tell you what the current attribute layout is for a specific dnode. At best you have to try to deduce it from what 'zdb -dddd' will print for the dnode's attributes.

(I've recently acquired a reason to dig into the details of ZFS system attributes.)

Sidebar: A brief digression on xattrs in ZFS

As covered in zfsprops(7)'s section on 'xattr=', there are two storage schemes for xattrs in ZFS (well, in OpenZFS on Linux and FreeBSD). At the attribute level, 'ZPL_XATTR' is the older, more general 'store it in directories and files' approach, while 'ZPL_DXATTR' is the 'store it as part of system attributes' one ('xattr=sa'). When dumping a dnode in zdb, zdb will directly print SA xattrs, but for directory xattrs it simply reports 'xattr = <object id>', where the object ID is for the xattr directory. To see the names of the xattrs set on such a file, you need to also dump the xattr directory object with zdb.

(Internally the SA xattrs are stored as a nvlist, because ZFS loves nvlists and nvpairs, more or less because Solaris did at the time.)

ZFSSystemAttributesStorage written at 23:23:41; Add Comment


ZFS's transactional guarantees from a user perspective

I said recently on the Fediverse that ZFS's transactional guarantees were rather complicated both with and without fsync(). I've written about these before in terms of transaction groups and the ZFS Intent Log (ZIL), but that obscured the user visible behavior under the technical details. So here's an attempt at describing just the visible behavior, hopefully in a way that people can follow despite how it gets complicated.

ZFS has two levels of transactional behavior. The basic layer is what happens when you don't use fsync() (or the filesystem is ignoring it). At this level, all changes to a ZFS filesystem are strongly ordered by the time they happened. ZFS may lose some activity at the end, but if you did operation A before operation B and there is a crash, the possible options of what is there afterward is nothing, A, or A and B; you can never have B without A. This strictly time ordered view of filesystem changes is periodically flushed to disk by ZFS; in modern ZFS, such a flush is typically started every five seconds (although completing a flush can take some time). This is generally called a transaction group (txg) commit.

The second layer of transactional behavior comes in if you fsync() something. When you fsync() something (and fsync is enabled on the filesystem, which is the default), all uncommitted metadata changes are immediately flushed to disk along with whatever uncommitted file data changes you requested a fsync() for (if you fsync'd a file instead of a directory). If several processes request fsync()s at once, all of their requests will be merged together, so a single immediate flush may include data for multiple files. Uncommitted file changes that no one requested a fsync() for will not be immediately flushed and will instead wait for the next regular non-fsync() flush (the next txg commit).

(This is relatively normal behavior for fsync(), except that on most filesystems a fsync() doesn't immediately flush all metadata changes. Metadata changes include things like creating, renaming, or removing files.)

A fsync() can break the strict time order of ZFS changes that exists in the basic layer. If you write data to A, write data to B, fsync() B but not A, and ZFS crashes immediately, the data for B will still be there but the change to A may have been lost. In some situations this can result in zero length files even though they were intended to have data. However, if enough time goes by everything from before the fsync() will have been flushed out as part of the non-fsync() flush process.

As a technical detail, ZFS makes it so that all of the changes that are part of a particular periodic flush are tied to each other (if there have been no fsyncs to meddle with the ordering); either all of them will appear after a crash or none of them will. This can be used to create atomic groups of changes that will always appear together (or be lost together), by making sure that all changes are part of the same periodic flush (in ZFS jargon, they are part of the same transaction group (txg)). However, ZFS doesn't give programs any explicit way to do this, and this atomic grouping can be messed up if someone fsync()s at an inconvenient time.

ZFSTransactionalBehavior written at 22:58:10; Add Comment


The flow of activity in the ZFS Intent Log (as I understand it)

The ZFS Intent Log (ZIL) is a confusing thing once you get into the details, and for reasons beyond the scope of this entry I recently needed to sort out the details of some aspects of how it works. So here is what I know about how things flow into the ZIL, both in memory and then on to disk.

(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each dataset (a filesystem or a zvol) has its own logically separate ZIL. We talk about 'the ZIL' as a convenience.)

When you perform activities that modify a ZFS dataset, each activity creates its own ZIL log record (a transaction in ZIL jargon, sometimes called an 'itx', probably short for 'intent transaction') that is put into that dataset's in-memory ZIL log. This includes both straightforward data writes and metadata activity like creating or renaming files. You can see a big list of all of the possible transaction types in zil.h as all of the TX_* definitions (which have brief useful comments). In-memory ZIL transactions aren't necessarily immediately flushed to disk, especially for things like simply doing a write() to a file. The reason that plain write()s to a file are (still) given ZIL transactions is that you may call fsync() on the file later. If you don't call fsync() and the regular ZFS transaction group commits with your write()s, those ZIL transactions will be quietly cleaned out of the in-memory ZIL log (along with all of the other now unneeded ZIL transactions).

(All of this assumes that your dataset doesn't have 'sync=disabled' set, which turns off the in-memory ZIL as one of its effects.)

When you perform an action such as fsync() or sync() that requests that in-memory ZFS state be made durable on disk, ZFS gathers up some or all of those in-memory ZIL transactions and writes them to disk in one go, as a sequence of log (write) blocks ('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL transaction records. This is called a ZIL commit. Depending on various factors, the flushed out data you write() may or may not be included in the log (write) blocks committed to the (dataset's) ZIL. Sometimes your file data will be written directly into its future permanent location in the pool's free space (which is safe) and the ZIL commit will have only a pointer to this location (its DVA).

(For a discussion of this, see the comments about the WR_* constants in zil.h. Also, while in memory, ZFS transactions are classified as either 'synchronous' or 'asynchronous'. Sync transactions are always part of a ZIL commit, but async transactions are only included as necessary. See zil_impl.h and also my entry discussing this.)

It's possible for several processes (or threads) to all call sync() or fsync() at once (well, before the first one finishes committing the ZIL). In this case, their requests can all be merged together into one ZIL commit that covers all of them. This means that fsync() and sync() calls don't necessarily match up one to one with ZIL commits. I believe it's also possible for a fsync() or sync() to not result in a ZIL commit if all of the relevant data has already been written out as part of a regular ZFS transaction group (or a previous request).

Because of all of this, there are various different ZIL related metrics that you may be interested in, sometimes with picky but important differences between them. For example, there is a difference between 'the number of bytes written to the ZIL' and 'the number of bytes written as part of ZIL commits', since the latter would include data written directly to its final space in the main pool. You might care about the latter when you're investigating the overall IO impact of ZIL commits but the former if you're looking at sizing a separate log device (a 'slog' in ZFS terminology).

ZFSZILActivityFlow written at 21:58:13; Add Comment


One reason that ZFS can't turn a directory into a filesystem

One of the wishes that I and other people frequently have for ZFS is the ability to take an existing directory (and everything underneath it) in a ZFS filesystem and turn it into a sub-filesystem of its own. One reason for wanting this is that a number of things are set and controlled on a per-filesystem basis in ZFS, instead of on a per-directory basis; if you have a (sub)directory where you want any special value for those, you need to make it a filesystem of its own. Often you may not immediately realize this before the directory exists and has been populated, and you discover the need for the special setting values. Today I realized that one reason ZFS doesn't have this feature is because of how ZFS filesystems are put together.

ZFS is often described as tree structured, and this is generally true; a lot of things in a ZFS pool are organized into a tree of objects. However, while filesystems are a tree at the logical level of directories and subdirectories, they aren't a tree as represented on disk. Directories in ZFS filesystems don't directly point to the disk addresses of their contents; instead, ZFS filesystems have a flat, global table of object numbers (effectively inode numbers) and all directory entries refer to things by object number. Since ZFS is a copy on write filesystem, this level of indirection is quite important in reducing how much has to be updated when a file or a directory is changed.

If ZFS filesystems used tree structured references at the level of directory entries (and we ignored hardlinks), it would make conceptual sense that you could take a directory object, pull it into a new filesystem, and patch its reference in its parent directory. All of the object references in the tree under the directory would stay the same; they would just be in a new container, the new filesystem. Filesystems would essentially be cut points in the overall object tree.

However, you can't make this model work when filesystems have a single global space of object numbers that are used in directory entries. A new filesystem has its own new table of object numbers, and you would have to move all of the objects referred to by the directory hierarchy into this new table, which means you'd have to walk the directory tree to find them all and then possibly update all of the directories if you changed their object numbers as part of putting them in a new object (number) table. This isn't the sort of work that you should be asking a filesystem to do in the kernel; it's much more suited for a user level tool.

Now that I've thought of this, it's even more understandable how ZFS doesn't have this feature, however convenient for me it would be, and how it never will.

(Hardlinks by themselves probably cause enough heartburn to sink a feature to turn a directory into a filesystem, although I can see ways to deal with them if you try hard enough.)

ZFSWhyNotDirectoryToFilesystem written at 22:43:07; Add Comment


How changing a ZFS filesystem's recordsize affects existing files

The ZFS 'recordsize' property (on filesystems) is a famously confusing ZFS property that is more or less the maximum logical block size of files on the filesystem. However, you're allowed to change it even after the filesystem has been created, so this raises the question of what happens with existing files when you do so. The simple answer is that existing files are unaffected and continue to use the old recordsize. The technical answer can be a lot more complicated, and to understand it I'm going to start from the beginning.

Simplifying slightly, ZFS files (in fact all filesystem objects) are made up of zero or more logical blocks, which are all the same logical size (they may be different physical sizes on disk, for example because of ZFS compression). How big these blocks are is the file's (current) (logical) block size; all files have a logical block size. Normally there are two cases for the block size; either there is one block and it and the logical block size are growing up toward the filesystem's recordsize, or there's more than one logical block and the file's logical block size is frozen; all additional logical blocks added to the file will use the file's logical block size, whatever that is.

Under normal circumstances, a file will get its second logical block (and freeze its logical block size) at the point where its size goes about the filesystem's recordsize, which makes the file's final logical block size be the filesystem's recordsize. Sometimes this will be right away, when the file is created, because you wrote enough data to it on the spot. Sometimes this might be months after it's created, if you're appending a little bit of data to a log file once a day and your filesystem has the default 128 Kbyte recordsize (or a larger one).

So now we have some cases when you change recordsize. The simple one is that all files that have more than one logical block stick at the old recordsize, because that is their final logical block size (assuming you've only changed recordsize once; if you've changed it more than once, you could have a mixture of these sizes).

If a file has only a single logical block, its size (and logical block size) might be either below or above the new recordsize. If the file's single logical block is smaller than the new recordsize, the single block will grow up to that new recordsize and after that create its second logical block and freeze its logical block size. This is just the same as if the file was created from scratch under your new recordsize. This means that if you raise the recordsize, all files that currently have only one block will wind up using your new recordsize.

If the file's single block is larger than your new recordsize, the file will continue growing the logical block size of this single block up until it hits the next power of two size; after that it adds a second logical block and freezes the file's logical block size. So if you started out with a 512k recordsize, wrote a 200k file, set recordsize down to 128k, and continue writing to the file, it will wind up with a 256k logical block size. If you had written exactly 256k before lowering recordsize, the file would not grow its logical block size to 512k, because it would already be at a power of two.

In other words, there is an invariant that ZFS files with more than one block always have a logical block size that is a power of two. This likely exists because it makes it much easier to calculate which logical block a given offset falls into.

This means that if you lower the recordsize, all current files that have only a single block may wind up with an assortment of logical block sizes, depending on their current size and your new recordsize. If you drop recordsize from 128k to 16k, you could wind up with a collection of files that variously have 32k, 64k, and 128k logical block sizes.

(This is another version of an email I wrote to the ZFS mailing list today.)

ZFSRecordsizeChangeEffects written at 23:24:52; Add Comment


The paradox of ZFS ARC non-growth and ARC hit rates

We have one ZFS fileserver that sometimes spends quite a while (many hours) with a shrunken ARC size, one tens of gigabytes below its (shrunken) ARC target size. Despite that, its ARC hit rate is still really high. Well, actually, that's not surprising; that's kind of a paradox of ARC growth (for both actual size and target size). This is because the combination of two obvious things: the ARC only grows when it needs to, and a high ARC hit rate means that the ARC isn't seeing much need to grow. More specifically, for reads the ARC only grows when there is a read ARC miss. If your ARC target size is 90 GB, your current ARC size is 40 GB, and your ARC hit rate is 100%, it doesn't matter than you have 50 GB of spare RAM, because the ARC has pretty much nothing to put in it.

This means that your ARC growth rate will usually be correlated with your ARC miss rate, or rather your ARC miss volume (which unfortunately I don't think there are kstats for). The other thing ARC growth rate can be correlated with is with your write volume (because many writes go into the ARC on their way to disk, although I'm not certain all of them do). However, ARC growth from write volume can be a transient thing; if you write something and then delete it, ZFS will first put it in the ARC and then drop it from the ARC.

(Deleting large amounts of data that was in the ARC is one way to rapidly drop the ARC size. If your ARC size shrinks rapidly without the target size shrinking, this is probably what's happened. This data may have been recently written, or it might have been read and then deleted.)

This is in a sense both obvious and general. All disk caches only increase their size while reading if there are cache misses; if they don't have cache misses, nothing happens. ZFS is only unusual in that we worry and obsess over the size of the ARC and how it fluctuates, rather than assuming that it will all just work (for good reasons, especially on Linux, but even on Solaris and later Illumos, the ZFS ARC size was by default constrained to much less than the regular disk cache might have grown to without ZFS).

ZFSARCGrowthParadox written at 21:48:38; Add Comment


Understanding ZFS ARC hit (and miss) kstat statistics

The ZFS ARC exposes a number of kstat statistics about its hit and miss performance, which are obviously quite relevant for understanding if your ARC size and possibly its failure to grow are badly affecting you, or if your ARC hit rate is fine even with a smaller than expected ARC size. Complicating the picture are things like 'MFU hits' and 'MFU ghost hits', where it may not be clear how they relate to plain 'ARC hits'.

There are a number of different things that live in the ZFS ARC, each of which has its own size. Further, the disk blocks in the ARC (both 'data' and 'metadata') are divided between a Most Recently Used (MRU) portion and a Most Frequently Used (MFU) portion (I believe other things like headers aren't in either the MRU or MFU). As covered in eg ELI5: ZFS Caching, the MFU and MRU also have 'ghost' versions of themselves; to simplify, these track what would be in memory if the MFU (or MRU) portion used all of memory.

The MRU, MFU, and the ghost versions of themselves give us our first set of four hit statistics: 'mru_hits', 'mfu_hits', 'mru_ghost_hits', and 'mfu_ghost_hits'. These track blocks that were found in the real MRU or found in the real MFU, in which case they are actually in RAM, or found in the ghost MRU amd MFU, in which case they weren't in RAM but theoretically could have been. As covered in ELI5: ZFS Caching, ZFS tracks the hit rates of the ghost MRU and MFU as signs for when to change the balance between the size of the MRU and MFU. If a block wasn't even in the ghost MFU or MRU, there is no specific kstat for it and we have to deduce that from comparing MRU and MFU ghost hits with general misses.

However, what we really care about for ARC hits and misses is whether the block actually was in the ARC (in RAM) or whether it had to be read off disk. This is what the general 'hits' and 'misses' kstats track, and they do this independently of the MRU and MFU hits (and ghost 'hits'). At this level, all hits and misses can be broken down into one of four categories; demand data, demand metadata, prefetch data, and prefecth metadata (more on this breakdown is in my entry on ARC prefetch stats). Each of these four has hit and miss kstats associated with them, named things like 'demand_data_misses'. As far as I understand it, a 'prefetch' hit or miss means that ZFS was trying to prefetch something and either already found it in the ARC or didn't. A 'demand' read is from ZFS needing it right away.

(This implies that the same ZFS disk block can be a prefetch miss, which reads it into the ARC from disk, and then later a demand hit, when the prefetching paid off and the actual read found it in the ARC.)

In the latest development version of OpenZFS, which will eventually become 2.2, there is an additional category of 'iohits'. An 'iohit' happens when ZFS wants a disk block that already has active IO issued to read it into the ARC, perhaps because there is active prefetching on it. Like 'hits' and 'misses', this has the four demand vs prefetch and data vs metadata counters associated with it. I'm not quite sure how these iohits are counted in OpenZFS 2.1, and some of them may slip through the cracks depending on the exact properties associated with the read (although the change that introduced iohits suggests that they may previously have been counted as 'hits').

If you want to see how your ARC is doing, you want to look at the overall hits and misses. The MRU and MFU hits, especially the 'ghost' hits (which are really misses) strike me as less interesting. If you have ARC misses happening (which leads to actual read IO) and you want to know roughly why, you want to look at the breakdown of the demand vs prefetch and data vs metadata 'misses' kstats.

It's tempting to look at MRU and MFU ghost 'hits' as a percentage of misses, but I'm not sure this tells you much; it's certainly not very high on our fileservers. Somewhat to my surprise, the sum of MFU and MRU hits is just slightly under the overall number of ARC 'hits' on all of our fileservers (which use ZoL 2.1). However, they're exactly the same on my desktops, which run the development version of ZFS on Linux and so have an 'iohits'. So possibly in 2.1, you can infer the number of 'iohits' from the difference between overall hits and MRU + MFU hits.

(I evidently worked much of this out years ago since our ZFS ARC stats displays in our Grafana ZFS dashboards work this way, but I clearly didn't write it down back then. This time around, I'm fixing that for future me.)

ZFSUnderstandingARCHits written at 23:15:30; Add Comment


The various sizes of the ZFS ARC (as of OpenZFS 2.1)

The ZFS ARC is ZFS's version of a disk cache. Further general information on it can be found in two highly recommended sources, Brendan Gregg's 2012 Activity of the ZFS ARC and Allan Jude's FOSDEM 2019 ELI5: ZFS Caching (also, via). ZFS exposes a lot of information about the state of the ARC through kstats, but there isn't much documentation about what a lot of them mean. Today we're going to talk about some of the kstats related to size of the ARC. I'll generally be using the Linux OpenZFS kstat names exposed in /proc/spl/kstat/zfs/arcstats.

The current ARC total size in bytes is size. The ARC is split into a Most Recently Used (MRU) portion and a Most Frequently Used (MFU) portion; the two sizes of these are mru_size and mfu_size. Note that the ARC may contain more than MRU and MFU data; it also holds other things, so size is not necessarily the same as the sum of mru_size and mfu_size.

The ARC caches both ZFS data (which includes not just file contents but also the data blocks of directories) and metadata (ZFS dnodes and other things). All space used by the ARC falls into one of a number of categories, which are accounted for in the following kstats:

data_size metadata_size
bonus_size dnode_size dbuf_size
hdr_size l2_hdr_size abd_chunk_waste_size

('abd' is short for 'ARC buffered data'. In Linux you can see kstats related to it in /proc/spl/kstat/zfs/abdstats.)

Generally data_size and metadata_size will be the largest two components of the ARC size; I believe they cover data actually read off disk, with the other sizes being ZFS in-RAM data structures that are still included in the ARC. The l2_hdr_size will be zero if you have no L2ARC. There is also an arc_meta_used kstat; this rolls up everything except data_size and abd_chunk_waste_size as one number that is basically 'metadata in some sense'. This combined number is important because it's limited by arc_meta_limit.

(There is also an arc_dnode_limit, which I believe effectively limits dnode_size specifically, although dnode_size can go substantially over it under some circumstances.)

When ZFS reads data from disk, in the normal configuration it stores it straight into the ARC in its on-disk form. This means that it may be compressed; even if you haven't turned on ZFS on disk compression for your data, ZFS uses it for metadata. The ARC has two additional sizes to reflect this; compressed_size is the size in RAM, and uncompressed_size is how much this would expand to if it was all uncompressed. There is also overhead_size, which, well, let's quote include/sys/arc_impl.h:

Number of bytes stored in all the arc_buf_t's. This is classified as "overhead" since this data is typically short-lived and will be evicted from the arc when it becomes unreferenced unless the zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level values have been set (see comment in dbuf.c for more information).

Things counted in overhead_size are not counted in the compressed and uncompressed size; they move back and forth in the code as their state changes. I believe that the compressed size plus the overhead size will generally be equal to data_size + metadata_size, ie both cover 'what is in RAM that has been pulled off disk', but in different forms.

Finally we get to the ARC's famous target size, the famous (or infamous) 'arc_c' or just 'c'. This is the target size of the ARC; if it is larger than size, the ARC will grow as you read (or write) things that aren't in it, and if it's smaller than size the ARC will shrink. The ARC's actual size can shrink for other reasons, but the target size shrinking is a slower and more involved thing to recover from.

In OpenZFS 2.1 and before, there is a second target size statistic, 'arc_p' or 'p' (in arcstats); this is apparently short for 'partition', and is the target size for the Most Recently Used (MRU) portion of the ARC. The target size for the MFU portion is 'c - p' and isn't explicitly put into kstats. How 'c' (and 'p') get changed is a complicated topic that is going in another entry.

(In the current development version of OpenZFS, there's a new and different approach to MFU/MRU balancing (via); this will likely be in OpenZFS 2.2, whenever that is released, and may appear in a system near you before then, depending. The new system is apparently better, but its kstats are more opaque.)

Appendix: The short form version

size Current ARC size in bytes. It is composed of
data_size + metadata_size + bonus_size + dnode_size + dbuf_size + hdr_size + l2_hdr_size + abd_chunk_waste_size
arc_meta_used All of size other than data_size + abd_chunk_waste_size; 'metadata' in a broad sense, as opposed to the narrow sense of metadata_size.
mru_size Size of the MRU portion of the ARC
mfu_size Size of the MFU portion of the ARC
arc_meta_limit Theoretical limit on arc_meta_used
arc_dnode_limit Theoretical limit on dnode_size
c aka arc_c The target for size
p aka arc_p The target for mru_size
c - p The target for mfu_size

I believe that generally the following holds:

compressed_size + overhead_size = data_size + metadata_size

In OpenZFS 2.1 and earlier, there is no explicit target for MRU data as separate from MRU metadata. In OpenZFS 2.2, there will be.

ZFSARCItsVariousSizes written at 23:37:55; Add Comment


An interesting yet ordinary consequence of ZFS using the ZIL

On the Fediverse, Alan Coopersmith recently shared this:

@bsmaalders @cks writing a temp file and renaming it also avoids the failure-to-truncate issues found in screenshot cropping tools recently (#aCropalypse), but as some folks at work recently discovered, you need to be sure to fsync() before the rename, or a failure at the wrong moment can leave you with a zero-length file instead of the old one as the directory metadata can get written before the file contents data on ZFS.

On the one hand, this is perfectly ordinary behavior for a modern filesystem; often renames are synchronous and durable, but if you create a file, write it, and then rename it to something else, you haven't insured that the data you wrote is on disk, just that the renaming is. On the other hand, as someone who's somewhat immersed in ZFS this initially felt surprising to me, because ZFS is one of the rare filesystems that enforces a strict temporal order on all IO operations in its core IO model of ZFS transaction groups.

How this works is that everything that happens in a ZFS filesystem goes into a transaction group (TXG). At any give time there's only one open TXG and TXGs commit in order, so if B is issued after A, either it's in the same TXG as A the two happen together or it's in a TXG after A and so A has already happened. In transaction groups, you can never have B happen but A not happen. In the TXG mental model of ZFS IO, this data loss is impossible, since the rename happened after the data write.

However, all of this strict TXG ordering goes out the window once you introduce the ZFS Intent Log (ZIL), because the ZIL's entire purpose is to persist selected operations to disk before they're committed as part of a transaction group. Renames and file creations always go in the ZIL (along with various other metadata operations), but file data only goes in the ZIL if you fsync() it (this is a slight simplification, and file data isn't necessarily directly in the ZIL).

So once the ZIL was in my mental model I could understand what had happened. In effect the presence of the ZIL had changed ZFS from a filesystem with very strong data ordering properties to one with more ordinary ones, and in such a more ordinary filesystem you do need to fsync() your newly written file data to make it durable.

(And under normal circumstances ZFS always has the ZIL, so I was engaging in a bit of skewed system programmer thinking.)

ZFSNaturalZILConsequence written at 22:48:43; Add Comment


I wouldn't use ZFS for swap (either for swapfiles or with a zvol)

As part of broadly charting how Linux finds where to write and read swap blocks, I recently noted that ZFS on Linux can't be used to hold a swapfile. David Magda noted that you could get around this by creating a zvol and using it for swap. While the Linux kernel will accept this and it works, at least to some extent, I wouldn't rely on it and I wouldn't do it unless I was desperate and had no other choice. Fundamentally, swapping to ZFS is not in accordance with what people (and often Unix kernels) expect from writing pages out to swap.

(On Linux, the Arch wiki has some definite cautions.)

Both the Unix kernel and Unix system administrators expect swapping pages out (and reading them back in) to be low overhead operations, things that are very close to writing a block of memory to some disk blocks or reading disk blocks into (preallocated) memory. This is not how ZFS works, even for writes to zvols. Due to ZFS's fundamental decision to never overwrite data in place, writing out blocks to a zvol requires allocating new space for them in the ZFS pool, collecting all of the relevant changes together, and then writing out a transaction group (or perhaps writing the blocks to a ZFS Intent Log). This more or less intrinsically requires allocating memory for various internal ZFS book-keeping, as well as obtaining various ZFS-related locks (for example, ones that protect the data structures that track free blocks). And it winds up doing a lot more IO than just the direct pages of memory being written to swap.

A lot of the time this will all work. Often you aren't swapping under heavy memory pressure, you're just pushing some unused pages out to swap and it's okay for this to take a while and allocate some extra memory and so on. But, regardless of whether it works, it's much more complicated than swapping is normally supposed to be and that makes it more chancy and less predictable. All of this leaves me feeling that swap to ZFS (and to anything similar to it) is for very unusual situations, not normal operation. If I didn't expect to really ever need swap on a system, I think I'd rather have no swap rather than swap on a zvol.

(ZFS isn't the only swap environment that has this problem. Swapping to a file on NFS has many of the same issues, and is also something that I can't recommend.)

ZFS is good for many things, but not everything, and one of the things it's not good at is very low overhead direct write IO. This is by design, since you can't combine it with copy on write and ZFS decided the latter was more important (and I agree with it).

ZFSForSwapMyViews written at 21:52:35; Add Comment

(Previous 10 or go back to October 2022 at 2022/10/31)

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.