Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem
The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.
The symptoms were less severe, in that the fileserver in question
only became fairly unresponsive to NFS (especially to the machine that
the writes were coming from) instead of locking up. This was somewhat
variable and may have primarily affected the particular filesystem
or perhaps the particular pool it's in, instead of all of the
filesystems and pools on the fileserver; I didn't attempt to gather
this data during the recent incident where I re-observed this, but
certainly some machines could still do things like issue NFS operations
against the fileserver.
(This was of course our biggest fileserver.)
During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.
Our DTrace stuff did report (very) long NFS operations and that report eventually led me to the source and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.
How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up to date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.
(We will probably not attempt to investigate this at all on our current fileservers, since our next generation of fileservers will not run any version of Illumos.)
Some things on Illumos NFS export permissions
Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.
I'll start with a ZFS
sharenfs setting and then break it down.
The two most ornate ones we have look like this:
(The AAA and BBB netgroups don't overlap with each other, and the
netgroups we use for root= are both subsets of the netgroups that
get general access.)
The first slightly tricky bit is
root=. As the manual page explains in the NOTES section, all that
root= does is change the interpretation of UID 0 for
clients that are already allowed to read or write to the NFS share.
Per the manual page, 'the access the host gets is the same as when
the root= option is absent' (and this may include no access).
As a corollary, 'root=NG,ro=NG' is basically the same as
'ro=NG,anon=0'. Since our
root= netgroups are a subset of our
general allowed-access netgroups, we're okay here.
(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)
The next tricky bit is the interaction of
rw= and ro=. Until just now I would have told you that
rw= took priority over ro=
if you had a host that was included in both (via different netgroups),
but it turns out that whichever one is listed first takes priority.
We were getting rw-over-ro effects because we always listed rw=
first, but I don't think we necessarily understood that when we
wrote the second
sharenfs setting. The manual page is explicit: if both rw= and
ro= options are specified in the same
sec= clause, and a client is in both lists, the order of the two options determines the access the client gets.
(Note that the behavior is different if you use general
rw or ro options without a list of hosts or netgroups.
See the manpage.)
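As a toy illustration of this first-option-wins rule, here's a hypothetical Python sketch (the option and netgroup names are made up, and the real NFS share code obviously doesn't model things this way):

```python
# Hypothetical sketch: within a sec= clause, the first of rw= and ro=
# that matches a client determines the access that client gets.

def client_access(options, client_netgroups):
    # options is an ordered list like [("rw", {"nfs_ssh"}), ("ro", {"AAA"})];
    # client_netgroups is the set of netgroups the client is in.
    for access, netgroups in options:
        if netgroups & client_netgroups:
            return access
    return None  # in neither list: no access via these options

# A host that is in both netgroups gets whichever option is listed first.
print(client_access([("rw", {"nfs_ssh"}), ("ro", {"AAA"})], {"nfs_ssh", "AAA"}))  # rw
print(client_access([("ro", {"AAA"}), ("rw", {"nfs_ssh"})], {"nfs_ssh", "AAA"}))  # ro
```

Flipping the order of the two tuples flips the answer, which is exactly the behavior the manual page describes.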
We would have noticed if we flipped this around for the one filesystem
with both ro= and rw= groups, since the machine that was
supposed to be able to write to the filesystem would have failed
(and the failure would have stalled our mail system). But it's still
sort of a narrow escape.
What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.
Sidebar: Our general narrow miss on this
We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:
  # global share options
  shareopts  nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

  # our SAN filesystems:
  fs3-corestaff-01   /h/281    rw+=AAA
This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.
The actual code is written in Python and turns all of the NFS exports
options into Python dictionary keys and values. Python dictionaries
are unordered, so under normal circumstances reassembling the exports
options would have put them into some random order, so anything with
both rw= and ro= could have wound up in the wrong order.
However, conveniently I decided to put the NFS export options into
a canonical order when I converted them back to string form, and
that canonical order put rw= before ro= (and sec=sys before both). There's
no sign in my code comments that I knew this was important; it seems
to have just been what I thought of as the correct canonical ordering.
Possibly I was blindly copying and preserving earlier work where
we always had rw= first.
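A minimal sketch of the canonical-ordering idea (with hypothetical names; our actual program is larger and handles more options):

```python
# Reassemble NFS export options from an (unordered) dict in a fixed
# canonical order, so rw= always comes out ahead of ro= in the result.

CANONICAL_ORDER = ["nosuid", "sec", "rw", "ro", "root", "anon"]

def assemble_options(opts):
    parts = []
    for key in CANONICAL_ORDER:
        if key in opts:
            val = opts[key]
            # bare options (like nosuid) are stored as True
            parts.append(key if val is True else "%s=%s" % (key, val))
    return ",".join(parts)

# However the dict happens to be ordered, rw= precedes ro= in the output:
print(assemble_options({"ro": "AAA", "rw": "nfs_ssh", "sec": "sys", "nosuid": True}))
# -> nosuid,sec=sys,rw=nfs_ssh,ro=AAA
```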
Understanding ZFS System Attributes
Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.
Like most filesystems, ZFS puts these attributes in dnodes using
some extra space (in what is called the dnode 'bonus buffer').
However, the ZFS trick is that whatever system attributes a dnode
has are simply packed into that space without being organized into
formal structures with a fixed order of attributes. Code that uses
system attributes retrieves them from dnodes indirectly by asking
for, say, the ZPL_PARENT of a dnode; it never cares exactly how
they're packed into a given dnode. However, obviously something
has to know how the attributes are packed into any given dnode.
One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.
(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
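The layout-numbering idea can be sketched in miniature (this is an illustrative model, not ZFS's actual sa.c code):

```python
# Each distinct set (and order) of system attributes that ever gets
# written is registered once and given a layout number; a dnode then
# stores only its layout number plus the attribute values, in order,
# instead of storing key/value pairs.

class LayoutRegistry:
    def __init__(self):
        self.layouts = {}  # tuple of attribute names -> layout number

    def layout_for(self, attr_names):
        key = tuple(attr_names)
        if key not in self.layouts:
            # a combination of attributes we've never seen: register it
            self.layouts[key] = len(self.layouts)
        return self.layouts[key]

reg = LayoutRegistry()
plain_file = reg.layout_for(["ZPL_MODE", "ZPL_UID", "ZPL_GID", "ZPL_PARENT"])
metadata = reg.layout_for(["ZPL_PARENT"])  # fewer attributes, its own layout
# A second plain file reuses the existing layout; no attribute keys are
# stored per dnode, just the layout number and the values.
```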
I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.
As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.
By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.
How ZFS makes things like 'zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run 'zfs
diff', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it also comes up in 'zpool status' reports about permanent
errors. At one level, zfs diff and
zpool status do this so rapidly because they ask the ZFS code in
the kernel to do it for them. At another level, the question is how
the kernel's ZFS code can be so fast.
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this in action with zdb:

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1285414    1   128K    512      0     512    512    0.00  ZFS plain file
  [...]
        parent  1284472
  [...]

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1284472    1   128K    512      0     512    512  100.00  ZFS directory
  [...]
        parent  52906
  [...]
        microzap: 512 bytes, 1 entries
                b = 1285414 (type: Regular File)
The b file has a parent field that points to object 1284472, the
a directory it's in, and the a directory has a parent field that
points to the object number of cks/tmp, and so on. When the kernel
wants to get the name for a given object number, it can just fetch
the object, look at its parent, and start going back up the filesystem.
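In miniature, and with a completely hypothetical data model, the upward walk looks something like this:

```python
# Reconstruct a path by walking parent object numbers upward, finding
# our own name by looking ourselves up in each parent directory
# (mirroring the zdb output above, where object 1285414's parent is
# the directory object 1284472).

def object_path(objects, objnum, root=1):
    # objects maps object number -> {"parent": number, "entries": {name: number}}
    parts = []
    while objnum != root:
        parent = objects[objnum]["parent"]
        # find our name by scanning the parent directory's entries
        name = next(n for n, o in objects[parent]["entries"].items() if o == objnum)
        parts.append(name)
        objnum = parent
    return "/".join(reversed(parts))

objs = {
    1: {"parent": 1, "entries": {"a": 2}},   # the filesystem root
    2: {"parent": 1, "entries": {"b": 3}},   # directory a
    3: {"parent": 2, "entries": {}},         # file b
}
print(object_path(objs, 3))  # -> a/b
```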
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the comment in the ZFS code:
When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.
Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do,
most hardlinks are done in ways that won't break ZFS's parent
tracking. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefit
we get from having it around is more than worth the infrequent
cases where it fails or malfunctions. Both zfs diff and having
filenames show up in zpool status permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's parent to point to the new
directory. Often this will wind up being correct
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case the parent will be correct and you'll get the right name.
The time when you get an incorrect parent is this sequence:
  ; mkdir a b
  ; touch a/demo
  ; ln a/demo b/
  ; rm b/demo
Here a/demo is the remaining path, but demo's dnode will claim
that its parent is b. I believe that zfs diff will even report
b/demo as the path, because the kernel doesn't do the extra work
to scan the b directory to verify that demo is present in it.
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)
What ZFS block pointers are and what's in them
I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.
The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Just like block numbers but unlike things like ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on disk data format and an in memory structure that shows up in other things. To quote from the (draft and old) ZFS on-disk specification (PDF):
A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.
Block pointers are embedded in any ZFS on disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).
So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:
- Various metadata and flags about what the block pointer is for and
what parts of it mean, including what type of object it points to.
- Up to three DVAs that say where to actually
find the data on disk. There can be more than one DVA because you
may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).
- The logical size (size before compression) and 'physical' size (the
nominal size after compression) of the disk block. The physical
size can do odd things and is not
necessarily the asize (allocated size) for the DVA(s).
- The txgs that the block was born in, both logically
and physically (the physical txg is apparently tied to the DVAs). The
physical birth txg was added with ZFS deduplication but apparently also
shows up in vdev removal.
- The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.
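As a rough picture of this, here's a simplified model of the fields described above (not the real C layout from spa.h; the field names and example values are mine):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BlockPointer:
    obj_type: int                 # what type of object this points to
    dvas: List[Tuple[int, int]]   # up to three (vdev, offset) addresses
    lsize: int                    # logical size, before compression
    psize: int                    # nominal 'physical' size after compression
    birth_txg: int                # logical birth txg
    phys_birth_txg: int           # physical birth txg
    checksum: bytes               # checksum of the full logical data

# Hypothetical example: a metadata block with two copies (two DVAs),
# compressed from 128K logical down to a nominal 16K physical size.
bp = BlockPointer(obj_type=19, dvas=[(0, 0x4000), (1, 0x8000)],
                  lsize=128 * 1024, psize=16 * 1024,
                  birth_txg=123456, phys_birth_txg=123456,
                  checksum=b"\x00" * 32)
```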
Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.
(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)
There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.
Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).
As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.
However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
(ZFS has an interesting hack to make things like '
zfs diff' work
far more efficiently than you would expect in light of this, but
that's going to take yet another entry to cover.)
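The birth-txg-based change finding described above can be sketched as a tree walk that prunes any subtree whose birth txg is old enough (a toy model, not the actual ZFS traversal code):

```python
# Because copy on write means a block pointer's logical birth txg is
# newer than some txg T only if something under it changed after T,
# a scan for changes can skip entire unchanged subtrees.

def changed_objects(node, since_txg, found=None):
    # node is {"birth": txg, "objnum": n (optional), "children": [...]}
    if found is None:
        found = []
    if node["birth"] <= since_txg:
        return found  # nothing below here changed after since_txg
    if "objnum" in node:
        found.append(node["objnum"])
    for child in node.get("children", []):
        changed_objects(child, since_txg, found)
    return found

tree = {"birth": 200, "children": [
    {"birth": 100, "objnum": 4, "children": []},  # untouched since txg 100
    {"birth": 200, "objnum": 7, "children": []},  # changed in txg 200
]}
print(changed_objects(tree, 150))  # -> [7]
```

This is also why the result is a list of object numbers rather than paths: the walk sees dnodes, not directory entries.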
A broad overview of how ZFS is structured on disk
When I wrote yesterday's entry, it became clear that I didn't understand as much about how ZFS is structured on disk (and that this matters, since I thought that ZFS copy on write updates updated a lot more than they do). So today I want to write down my new broad understanding of how this works.
(All of this can be dug out of the old, draft ZFS on-disk format specification, but that spec is written in a very detailed way and things aren't always immediately clear from it.)
Almost everything in ZFS is a DMU object. All objects are defined by a dnode, and object dnodes are almost always grouped together in an object set. Object sets are themselves DMU objects; they store dnodes as basically a giant array in a 'file', which uses data blocks and indirect blocks and so on, just like anything else. Within a single object set, dnodes have an object number, which is the index of their position in the object set's array of dnodes.
(Because an object number is just the index of the object's dnode
in its object set's array of dnodes, object numbers are basically
always going to be duplicated between object sets (and they're
always relative to an object set). For instance, pretty much every
object set is going to have an object number ten, although not all
object sets may have enough objects to have some arbitrary larger
object number. One corollary of this is that if you ask
zdb to tell you about
a given object number, you have to tell zdb what object set you're
talking about. Usually you do this by telling zdb which ZFS
filesystem or dataset you mean.)
Each ZFS filesystem has its own object set for objects (and thus dnodes) used in the filesystem. As I discovered yesterday, every ZFS filesystem has a directory hierarchy and it may go many levels deep, but all of this directory hierarchy refers to directories and files using their object number.
ZFS organizes and keeps track of filesystems, clones, and snapshots through the DSL (Dataset and Snapshot Layer). The DSL has all sorts of things; DSL directories, DSL datasets, and so on, all of which are objects and many of which refer to object sets (for example, every ZFS filesystem must refer to its current object set somehow). All of these DSL objects are themselves stored as dnodes in another object set, the Meta Object Set, which the uberblock points to. To my surprise, object sets are not stored in the MOS (and as a result do not have 'object numbers'). Object sets are always referred to directly, without indirection, using a block pointer to the object set's dnode.
(I think object sets are referred to directly so that snapshots can freeze their object set very simply.)
The DSL directories and datasets for your pool's set of filesystems form a tree themselves (each filesystem has a DSL directory and at least one DSL dataset). However, just like in ZFS filesystems, all of the objects in this second tree refer to each other indirectly, by their MOS object number. Just as with files in ZFS filesystems, this level of indirection limits the amount of copy on write updates that ZFS has to do when something changes.
PS: If you want to examine MOS objects with zdb, I think you do
it with something like 'zdb -vvv -d ssddata 1', which will get
you object number 1 of the MOS, which is the MOS object directory.
If you want to ask zdb about an object in the pool's root filesystem,
you'd use 'zdb -vvv -d ssddata/ 1'. You can tell which one you're
getting depending on what zdb prints out. If it says 'Dataset
mos [META]' you're looking at objects from the MOS; if it says
'Dataset ssddata [ZPL]', you're looking at the pool's root filesystem
(where object number 1 is the ZFS master node).
PPS: I was going to write up what changed on a filesystem write, but then I realized that I didn't know how blocks being allocated and freed are reflected in pool structures. So I'll just say that I think that ignoring free space management, only four DMU objects get updated; the file itself, the filesystem's object set, the filesystem's DSL dataset object, and the MOS.
(As usual, doing the research to write this up taught me things that I didn't know about ZFS.)
When you make changes, ZFS updates much less stuff than I thought
In the past, for example in my entry on how ZFS bookmarks can work with reasonable efficiency, I have given what I think of as the standard explanation of how ZFS's copy on write nature forces changes to things like the data in a file to ripple up all the way to the top of the ZFS hierarchy. To quote myself:
If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, [...]
This is wrong. ZFS is structured so that it doesn't have to ripple changes all the way up through the filesystem just because you changed a piece of it down in the depths of a directory hierarchy.
How this works is through the usual CS trick of a level of indirection. All objects in a ZFS filesystem have an object number, which we've seen come up before, for example in ZFS delete queues. Once it's created, the object number of something never changes. Almost everything in a ZFS filesystem refers to other objects in the filesystem by their object number, not by their (current) disk location. For example, directories in your filesystem refer to things by their object numbers:
  # zdb -vv -bbbb -O ssddata/homes cks/tmp/testdir
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1003162    1   128K    512      0     512    512  100.00  ZFS directory
  [...]
        microzap: 512 bytes, 1 entries
                ATESTFILE = 1003019 (type: Regular File)
  [...]
The directory doesn't tell us where
ATESTFILE is on the disk, it
just tells us that it's object 1003019.
In order to find where objects are, ZFS stores a per filesystem mapping from object number to actual disk locations that we can sort of think of as a big file; these are called object sets. More exactly, each object number maps to a ZFS dnode, and the ZFS dnodes are stored in what is conceptually an on-disk array ('indexed' by the object number). As far as I can tell, an object's dnode is the only thing that knows where its data is located on disk.
So, suppose that we overwrite data in
ATESTFILE. ZFS's copy on
write property means that we have to write a new version of the
data block, possibly a new version of some number of indirect blocks
(if the file is big enough), and then a new version of the dnode
so that it points to the new data block or indirect block. Because
the dnode itself is part of a block of dnodes in the object set,
we must write a new copy of that block of dnodes and then ripple
the changes up the indirect blocks and so on (eventually reaching
the uberblock as part of a transaction group commit). However, we
don't have to change any directories in the ZFS filesystem, no
matter how deep the file is in them; while we changed the file's
dnode (or if you prefer, the data in the dnode), we didn't change
its object number, and the directories only refer to it by object
number. It was object number 1003019 before we wrote data to it and
it's object number 1003019 after we did, so our testdir
directory is untouched.
Once I thought about it, this isn't particularly different from how conventional Unix filesystems work (what ZFS calls an object number is what we conventionally call an inode number). It's especially forced by the nature of a copy on write Unix filesystem, given that due to hardlinks a file may be referred to from multiple directories. If we had to update every directory a file was linked from whenever the file changed, we'd need some way to keep track of them all, and that would cause all sorts of implementation issues.
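The core of this object-number indirection can be shown with a trivial model (the object numbers are taken from the zdb output above; everything else here is made up for illustration):

```python
# Only the object set maps object numbers to dnodes; directories just
# hold object numbers. So moving a file's data on disk updates its
# dnode but never touches the directory that names it.

object_set = {
    1003019: {"data_at": ("vdev0", 0x1000)},       # ATESTFILE's dnode
    1003162: {"entries": {"ATESTFILE": 1003019}},  # testdir's dnode
}

# Overwriting the file writes its data to a new location and updates
# its dnode (copy on write)...
object_set[1003019] = {"data_at": ("vdev0", 0x9000)}

# ...but the directory entry still just says 'object 1003019', so the
# directory itself is untouched.
print(object_set[1003162]["entries"]["ATESTFILE"])  # -> 1003019
```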
(Now that I've realized this it all feels obvious and necessary. Yet at the same time I've been casually explaining ZFS copy on write updates wrong for, well, years. And yes, when I wrote "directory metadata" in my earlier entry, I meant the filesystem directory, not the object set's 'directory' of dnodes.)
Sidebar: The other reason to use inode numbers or object numbers
Although modern filesystems may have 512 byte inodes or dnodes, Unix has traditionally used ones that were smaller than a disk block and thus that were packed several to a (512 byte) disk block. If you need to address something smaller than a disk block, you can't just use the disk block number where the thing is; you need either the disk block number plus an index into it, or you can make things more compact by just having a global index number, ie the inode number.
The original Unix filesystems made life even simpler by storing all inodes in one contiguous chunk of disk space toward the start of the filesystem. This made calculating the disk block that held a given inode a pretty simple process. (For the sake of your peace of mind, you probably don't want to know just how simple it was in V7.)
What ZFS messages about 'permanent errors in <0x95>:<0x0>' mean
If you use ZFS long enough (or are unlucky enough), one of the things
you may run into is reports in 'zpool status -v' of permanent errors
in something (we've had that happen to us despite redundancy). If you're reasonably lucky, the error message
will have a path in it. If you're unlucky, the error message will look like this:
  errors: Permanent errors have been detected in the following files:

          <0x95>:<0x0>
The short answer of what they mean is, to quote directly:
The first number is the dataset id (index) and the second is the object id. For filesystems, the object id can be the same as the file's "inode" as shown by "ls -i". But a few object ids exist for all datasets. Object id 0 is the DMU dnode.
The dataset here may be a ZFS filesystem, a snapshot, or I believe a few other things. I believe that if it's still in existence, you'll normally get at least its name and perhaps the full path to the object. When it's not in existence any more (perhaps you deleted the snapshot or the whole filesystem in question since the scrub detected it), you get this hex ID and there's also no information about the path.
The reason the information is presented this way is that what the
ZFS code in the kernel saves and returns to the
zpool command is
actually just the dataset and object ID. It's up to
zpool to turn
both of these into names, which it actually does by calling back
into the kernel to find out what they're currently called, if the
kernel knows. Inspecting the relevant ZFS code
says that there are five cases:
<metadata>:<0x...> means corruption in some object in the pool's overall metadata object set.
<0x...>:<0x...> means that the dataset involved can't be identified (and thus ZFS has no hope of identifying the thing inside the dataset).
/some/path/name means you have a corrupted filesystem object (a file, a directory, etc) in a currently mounted dataset and this is its full current path.
(I think that ZFS's determination of the path name for a given ZFS object is pretty reliable; if I'm reading the code right, it appears to be able to scan upward in the filesystem hierarchy starting with the object itself.)
dsname:/some/path means that the dataset is called dsname but it's not currently mounted, and /some/path is the path within it. I think this happens for snapshots.
dsname:<0x...> means that it's in the given dataset dsname (which may or may not be mounted), but the ZFS object in question can't have its path identified for various reasons (including that it's already been deleted).
Only things in ZFS filesystems (and snapshots and so on) have path names, so an error in a ZVOL will always be reported without the path. I'm not sure what the reported dataset names are for ZVOLs, since I don't use ZVOLs.
The final detail is that you may see this error status in 'zpool
status -v' even after you've cleaned it up. To quote Richard Elling again:
Finally, the error buffer for "zpool status" contains information for two scan passes: the current and previous scans. So it is possible to delete an object (eg file) and still see it listed in the error buffer. It takes two scans to completely update the error buffer. This is important if you go looking for a dataset+object tuple with zdb and don't find anything...
PS: There are some cases where
<xattrdir> will appear in the file
path. If I'm reading the code correctly, this happens when the
problem is in an extended attribute instead of the filesystem object itself.
PPS: Richard Elling's message was on the ZFS on Linux mailing list and about an issue someone was having with a ZoL system, but as far as I can see the core code is basically the same in Illumos and I would expect in FreeBSD as well, so this bit of ZFS wisdom should be cross-platform.
ZFS pushes file renamings and other metadata changes to disk quite promptly
One of the general open questions on Unix is when changes like
renaming or creating files are actually durably on disk. Famously,
some filesystems on some Unixes have been willing to delay this for
an unpredictable amount of time unless you did things like fsync()
the containing directory of your renamed file, not just
the file itself. As it happens, ZFS's design means that it offers
some surprisingly strong guarantees about this; specifically, ZFS
persists all metadata changes to disk no later than the next
transaction group commit. In ZFS today, a transaction group commit
generally happens every five seconds, so if you do something like
rename a file, your rename will be fully durable quite soon even if
you do nothing special.
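On filesystems without ZFS's prompt metadata commits, the traditional portable precaution is to fsync() the containing directory after a rename. As a minimal sketch in Python (the fsync_dir helper name is my own):

```python
import os

def fsync_dir(path):
    # Open the directory itself and fsync() it, which forces completed
    # metadata operations inside it (such as renames) out to disk on
    # filesystems that would otherwise delay them.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

On ZFS you don't strictly need this, since the next transaction group commit will persist the rename anyway; the directory fsync() is the belt-and-braces move for code that has to run on unpredictable filesystems.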
However, this doesn't mean that if you create a file, write data
to the file, and then rename it (with no other special operations)
that in five or ten seconds your new file is guaranteed to be present
under its new name with all the data you wrote. Although metadata
operations like creating and renaming files go to ZFS right away
and then become part of the next txg commit, the kernel generally
holds on to written file data for a while before pushing it out.
You need some sort of
fsync() in there to force the kernel to
commit your data, not just your file creation and renaming. Because
of how the ZFS intent log works, you don't need
to do anything more than
fsync() your file here; when you fsync()
a file, all pending metadata changes are flushed out to disk along
with the file data.
(In a 'create new version, write, rename to overwrite current
version' setup, I think you want to
fsync() the file twice, once
after the write and then once after the rename. Otherwise you haven't
necessarily forced the rename itself to be written out. You don't
want to do the rename before a
fsync(), because then I think that
a crash at just the wrong time could give you an empty new file.
But the ice is thin here in portable code, including code that wants
to be portable to different filesystem types.)
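The whole 'create new version, write, rename to overwrite, fsync() twice' sequence can be sketched in Python as follows (the atomic_replace name and the '.tmp' suffix are my own illustrative choices, not anything standard):

```python
import os

def atomic_replace(path, data):
    # Write the new version to a temporary file, then rename it over
    # the current version.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        # First fsync: force the file's data (and, on ZFS, any pending
        # metadata changes) to disk before we do the rename, so a crash
        # can't leave an empty new file under the final name.
        os.fsync(f.fileno())
    os.rename(tmp, path)
    # Second fsync, after the rename: on ZFS this flushes the rename
    # itself out via the intent log, rather than waiting for the next
    # transaction group commit.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

As the text notes, how much of this is actually required varies by filesystem; this ordering is the conservative version for portable code.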
My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.
spare-N spare vdevs in your pool are mirror vdevs
Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:
[...]
    wwn-0x5000cca251b79b98    ONLINE  0  0  0
    spare-8                   ONLINE  0  0  0
      wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
      wwn-0x5000cca2568314fc  ONLINE  0  0  0
    wwn-0x5000cca251ca10b0    ONLINE  0  0  0
[...]
What is this
spare-8 thing, beyond 'a sign that a spare activated
here'? This is sometimes called a 'spare vdev', and the answer is
that spare vdevs are mirror vdevs.
Yes, I know, ZFS says that you can't put one vdev inside another vdev and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface and as you'd expect they're implemented with exactly the same code in the ZFS kernel code.
(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types they run the same code to do various operations. Admittedly, there are some other sections in the ZFS code that check to see whether they're operating on a real mirror vdev or a spare vdev.)
What this means is that these
spare-N vdevs behave like mirror
vdevs. Assuming that both sides are healthy, reads can be satisfied
from either side (and will be balanced back and forth as they are
for mirror vdevs), writes will go to both sides, and a scrub will
check both sides. As a result, if you scrub a pool with a spare-N
vdev and there are no problems reported for either component device,
then both old and new device are fine and contain a full and intact
copy of the data. You can keep either (or both).
As a side note, it's possible to manually create your own spare-N
vdevs even without a fault, because spares activation is actually
a user-level thing in ZFS. Although I haven't
tested this recently, you generally get a
spare-N vdev if you do
'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK>
is configured as a spare in the pool. Abusing this to create long
term mirrors inside raidZ vdevs is left as an exercise to the reader.
(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)
PS: As you might expect, the
replacing-N vdev that you get when
you replace a disk is also a mirror vdev, with the special behavior
that when the resilver finishes, the original device is normally