Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem
The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.
The symptoms were less severe, in that the fileserver in question
only became fairly unresponsive to NFS (especially to the machine that
the writes were coming from) instead of locking up. This was somewhat
variable and may have primarily affected the particular filesystem
or perhaps the particular pool it's in, instead of all of the
filesystems and pools on the fileserver; I didn't attempt to gather
this data during the recent incident where I re-observed this, but
certainly some machines could still do things like issue NFS operations
against the fileserver.
(This was of course our biggest fileserver.)
During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.
Our DTrace stuff did report (very) long NFS operations and that report eventually led me to the source and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.
How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the heavily unsupported OmniOS r151014, and not a completely up to date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.
(We will probably not attempt to investigate this at all on our current fileservers, since our next generation of fileservers will not run any version of Illumos.)
Some things on Illumos NFS export permissions
Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.
I'll start with a ZFS
sharenfs setting and then break it down.
The two most ornate ones we have look like this:
(The AAA and BBB netgroups don't overlap with each other, and the
netgroups we use for root= are both subsets of the netgroups that
get general access.)
The first slightly tricky bit is
root=. As the manual page explains in the NOTES section, all that
root= does is change the interpretation of UID 0 for
clients that are already allowed to read or write to the NFS share.
Per the manual page, 'the access the host gets is the same as when
the root= option is absent' (and this may include no access).
As a corollary, 'root=NG,ro=NG' is basically the same as
'ro=NG,anon=0'. Since our
root= netgroups are a subset of our
general allowed-access netgroups, we're okay here.
(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)
The next tricky bit is the interaction of
rw= and ro=. Until just now I would have told you that
rw= took priority over ro=
if you had a host that was included in both (via different netgroups),
but it turns out that whichever one is listed first takes priority.
We were getting rw-over-ro effects because we always listed rw=
first, but I don't think we necessarily understood that when we
wrote the second
sharenfs setting. The manual page is explicit: if both rw= and
ro= options are specified in the same
sec= clause, and a client is in both lists, the order of the two options determines the access the client gets.
(Note that the behavior is different if you use general
rw or ro options without a list of hosts or netgroups.
See the manpage.)
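As a toy illustration of this first-option-wins rule, here's a hypothetical Python sketch (the option and netgroup names are made up, and the real NFS share code obviously doesn't model things this way):

```python
# Hypothetical sketch: within a sec= clause, the first of rw= and ro=
# that matches a client determines the access that client gets.

def client_access(options, client_netgroups):
    # options is an ordered list like [("rw", {"nfs_ssh"}), ("ro", {"AAA"})];
    # client_netgroups is the set of netgroups the client is in.
    for access, netgroups in options:
        if netgroups & client_netgroups:
            return access
    return None  # in neither list: no access via these options

# A host that is in both netgroups gets whichever option is listed first.
print(client_access([("rw", {"nfs_ssh"}), ("ro", {"AAA"})], {"nfs_ssh", "AAA"}))  # rw
print(client_access([("ro", {"AAA"}), ("rw", {"nfs_ssh"})], {"nfs_ssh", "AAA"}))  # ro
```

Flipping the order of the two tuples flips the answer, which is exactly the behavior the manual page describes.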
We would have noticed if we flipped this around for the one filesystem
with both ro= and rw= groups, since the machine that was
supposed to be able to write to the filesystem would have failed
(and the failure would have stalled our mail system). But it's still
sort of a narrow escape.
What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.
Sidebar: Our general narrow miss on this
We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:
  # global share options
  shareopts  nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

  # our SAN filesystems:
  fs3-corestaff-01   /h/281    rw+=AAA
This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.
The actual code is written in Python and turns all of the NFS exports
options into Python dictionary keys and values. Python dictionaries
are unordered, so under normal circumstances reassembling the exports
options would have put them into some random order, so anything with
both rw= and ro= could have wound up in the wrong order.
However, conveniently I decided to put the NFS export options into
a canonical order when I converted them back to string form, and
that canonical order put rw= before ro= (and sec=sys before both). There's
no sign in my code comments that I knew this was important; it seems
to have just been what I thought of as the correct canonical ordering.
Possibly I was blindly copying and preserving earlier work where
we always had rw= first.
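A minimal sketch of the canonical-ordering idea (with hypothetical names; our actual program is larger and handles more options):

```python
# Reassemble NFS export options from an (unordered) dict in a fixed
# canonical order, so rw= always comes out ahead of ro= in the result.

CANONICAL_ORDER = ["nosuid", "sec", "rw", "ro", "root", "anon"]

def assemble_options(opts):
    parts = []
    for key in CANONICAL_ORDER:
        if key in opts:
            val = opts[key]
            # bare options (like nosuid) are stored as True
            parts.append(key if val is True else "%s=%s" % (key, val))
    return ",".join(parts)

# However the dict happens to be ordered, rw= precedes ro= in the output:
print(assemble_options({"ro": "AAA", "rw": "nfs_ssh", "sec": "sys", "nosuid": True}))
# -> nosuid,sec=sys,rw=nfs_ssh,ro=AAA
```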
Understanding ZFS System Attributes
Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.
Like most filesystems, ZFS puts these attributes in dnodes using
some extra space (in what is called the dnode 'bonus buffer').
However, the ZFS trick is that whatever system attributes a dnode
has are simply packed into that space without being organized into
formal structures with a fixed order of attributes. Code that uses
system attributes retrieves them from dnodes indirectly by asking
for, say, the ZPL_PARENT of a dnode; it never cares exactly how
they're packed into a given dnode. However, obviously something
has to know how the attributes are packed into any given dnode.
One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.
(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
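The layout-numbering idea can be sketched in miniature (this is an illustrative model, not ZFS's actual sa.c code):

```python
# Each distinct set (and order) of system attributes that ever gets
# written is registered once and given a layout number; a dnode then
# stores only its layout number plus the attribute values, in order,
# instead of storing key/value pairs.

class LayoutRegistry:
    def __init__(self):
        self.layouts = {}  # tuple of attribute names -> layout number

    def layout_for(self, attr_names):
        key = tuple(attr_names)
        if key not in self.layouts:
            # a combination of attributes we've never seen: register it
            self.layouts[key] = len(self.layouts)
        return self.layouts[key]

reg = LayoutRegistry()
plain_file = reg.layout_for(["ZPL_MODE", "ZPL_UID", "ZPL_GID", "ZPL_PARENT"])
metadata = reg.layout_for(["ZPL_PARENT"])  # fewer attributes, its own layout
# A second plain file reuses the existing layout; no attribute keys are
# stored per dnode, just the layout number and the values.
```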
I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.
As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.
By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.
How ZFS makes things like 'zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run 'zfs
diff', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it also comes up in 'zpool status' reports about permanent
errors. At one level, zfs diff and
zpool status do this so rapidly because they ask the ZFS code in
the kernel to do it for them. At another level, the question is how
the kernel's ZFS code can be so fast.
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this in action with zdb:

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1285414    1   128K    512      0     512    512    0.00  ZFS plain file
  [...]
        parent  1284472
  [...]

  # zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1284472    1   128K    512      0     512    512  100.00  ZFS directory
  [...]
        parent  52906
  [...]
        microzap: 512 bytes, 1 entries
                b = 1285414 (type: Regular File)
The b file has a parent field that points to object 1284472, the
a directory it's in, and the a directory has a parent field that
points to the object number of cks/tmp, and so on. When the kernel
wants to get the name for a given object number, it can just fetch
the object, look at its parent, and start going back up the filesystem.
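In miniature, and with a completely hypothetical data model, the upward walk looks something like this:

```python
# Reconstruct a path by walking parent object numbers upward, finding
# our own name by looking ourselves up in each parent directory
# (mirroring the zdb output above, where object 1285414's parent is
# the directory object 1284472).

def object_path(objects, objnum, root=1):
    # objects maps object number -> {"parent": number, "entries": {name: number}}
    parts = []
    while objnum != root:
        parent = objects[objnum]["parent"]
        # find our name by scanning the parent directory's entries
        name = next(n for n, o in objects[parent]["entries"].items() if o == objnum)
        parts.append(name)
        objnum = parent
    return "/".join(reversed(parts))

objs = {
    1: {"parent": 1, "entries": {"a": 2}},   # the filesystem root
    2: {"parent": 1, "entries": {"b": 3}},   # directory a
    3: {"parent": 2, "entries": {}},         # file b
}
print(object_path(objs, 3))  # -> a/b
```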
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the comment in the ZFS code:
When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.
Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do,
most hardlinks are done in ways that won't break ZFS's parent
tracking. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefit
we get from having it around is more than worth the infrequent
cases where it fails or malfunctions. Both zfs diff and having
filenames show up in zpool status permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's parent to point to the new
directory. Often this will wind up being correct
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case the parent will be correct and you'll get the right name.
The time when you get an incorrect parent is this sequence:
  ; mkdir a b
  ; touch a/demo
  ; ln a/demo b/
  ; rm b/demo
Here a/demo is the remaining path, but demo's dnode will claim
that its parent is b. I believe that zfs diff will even report
b/demo as the path, because the kernel doesn't do the extra work
to scan the b directory to verify that demo is present in it.
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)
What ZFS block pointers are and what's in them
I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.
The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Just like block numbers but unlike things like ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on disk data format and an in memory structure that shows up in other things. To quote from the (draft and old) ZFS on-disk specification (PDF):
A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.
Block pointers are embedded in any ZFS on disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).
So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:
- Various metadata and flags about what the block pointer is for and
what parts of it mean, including what type of object it points to.
- Up to three DVAs that say where to actually
find the data on disk. There can be more than one DVA because you
may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).
- The logical size (size before compression) and 'physical' size (the
nominal size after compression) of the disk block. The physical
size can do odd things and is not
necessarily the asize (allocated size) for the DVA(s).
- The txgs that the block was born in, both logically
and physically (the physical txg is apparently tied to the DVAs). The
physical birth txg was added with ZFS deduplication but apparently also
shows up in vdev removal.
- The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.
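As a rough picture of this, here's a simplified model of the fields described above (not the real C layout from spa.h; the field names and example values are mine):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BlockPointer:
    obj_type: int                 # what type of object this points to
    dvas: List[Tuple[int, int]]   # up to three (vdev, offset) addresses
    lsize: int                    # logical size, before compression
    psize: int                    # nominal 'physical' size after compression
    birth_txg: int                # logical birth txg
    phys_birth_txg: int           # physical birth txg
    checksum: bytes               # checksum of the full logical data

# Hypothetical example: a metadata block with two copies (two DVAs),
# compressed from 128K logical down to a nominal 16K physical size.
bp = BlockPointer(obj_type=19, dvas=[(0, 0x4000), (1, 0x8000)],
                  lsize=128 * 1024, psize=16 * 1024,
                  birth_txg=123456, phys_birth_txg=123456,
                  checksum=b"\x00" * 32)
```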
Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.
(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)
There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.
Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).
As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.
However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
(ZFS has an interesting hack to make things like '
zfs diff' work
far more efficiently than you would expect in light of this, but
that's going to take yet another entry to cover.)
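The birth-txg-based change finding described above can be sketched as a tree walk that prunes any subtree whose birth txg is old enough (a toy model, not the actual ZFS traversal code):

```python
# Because copy on write means a block pointer's logical birth txg is
# newer than some txg T only if something under it changed after T,
# a scan for changes can skip entire unchanged subtrees.

def changed_objects(node, since_txg, found=None):
    # node is {"birth": txg, "objnum": n (optional), "children": [...]}
    if found is None:
        found = []
    if node["birth"] <= since_txg:
        return found  # nothing below here changed after since_txg
    if "objnum" in node:
        found.append(node["objnum"])
    for child in node.get("children", []):
        changed_objects(child, since_txg, found)
    return found

tree = {"birth": 200, "children": [
    {"birth": 100, "objnum": 4, "children": []},  # untouched since txg 100
    {"birth": 200, "objnum": 7, "children": []},  # changed in txg 200
]}
print(changed_objects(tree, 150))  # -> [7]
```

This is also why the result is a list of object numbers rather than paths: the walk sees dnodes, not directory entries.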
A broad overview of how ZFS is structured on disk
When I wrote yesterday's entry, it became clear that I didn't understand as much about how ZFS is structured on disk (and that this matters, since I thought that ZFS copy on write updates updated a lot more than they do). So today I want to write down my new broad understanding of how this works.
(All of this can be dug out of the old, draft ZFS on-disk format specification, but that spec is written in a very detailed way and things aren't always immediately clear from it.)
Almost everything in ZFS is a DMU object. All objects are defined by a dnode, and object dnodes are almost always grouped together in an object set. Object sets are themselves DMU objects; they store dnodes as basically a giant array in a 'file', which uses data blocks and indirect blocks and so on, just like anything else. Within a single object set, dnodes have an object number, which is the index of their position in the object set's array of dnodes.
(Because an object number is just the index of the object's dnode
in its object set's array of dnodes, object numbers are basically
always going to be duplicated between object sets (and they're
always relative to an object set). For instance, pretty much every
object set is going to have an object number ten, although not all
object sets may have enough objects to have some arbitrary larger
object number. One corollary of this is that if you ask
zdb to tell you about
a given object number, you have to tell zdb what object set you're
talking about. Usually you do this by telling zdb which ZFS
filesystem or dataset you mean.)
Each ZFS filesystem has its own object set for objects (and thus dnodes) used in the filesystem. As I discovered yesterday, every ZFS filesystem has a directory hierarchy and it may go many levels deep, but all of this directory hierarchy refers to directories and files using their object number.
ZFS organizes and keeps track of filesystems, clones, and snapshots through the DSL (Dataset and Snapshot Layer). The DSL has all sorts of things; DSL directories, DSL datasets, and so on, all of which are objects and many of which refer to object sets (for example, every ZFS filesystem must refer to its current object set somehow). All of these DSL objects are themselves stored as dnodes in another object set, the Meta Object Set, which the uberblock points to. To my surprise, object sets are not stored in the MOS (and as a result do not have 'object numbers'). Object sets are always referred to directly, without indirection, using a block pointer to the object set's dnode.
(I think object sets are referred to directly so that snapshots can freeze their object set very simply.)
The DSL directories and datasets for your pool's set of filesystems form a tree themselves (each filesystem has a DSL directory and at least one DSL dataset). However, just like in ZFS filesystems, all of the objects in this second tree refer to each other indirectly, by their MOS object number. Just as with files in ZFS filesystems, this level of indirection limits the amount of copy on write updates that ZFS has to do when something changes.
PS: If you want to examine MOS objects with zdb, I think you do
it with something like 'zdb -vvv -d ssddata 1', which will get
you object number 1 of the MOS, which is the MOS object directory.
If you want to ask zdb about an object in the pool's root filesystem,
you'd use 'zdb -vvv -d ssddata/ 1'. You can tell which one you're
getting depending on what zdb prints out. If it says 'Dataset
mos [META]' you're looking at objects from the MOS; if it says
'Dataset ssddata [ZPL]', you're looking at the pool's root filesystem
(where object number 1 is the ZFS master node).
PPS: I was going to write up what changed on a filesystem write, but then I realized that I didn't know how blocks being allocated and freed are reflected in pool structures. So I'll just say that I think that ignoring free space management, only four DMU objects get updated; the file itself, the filesystem's object set, the filesystem's DSL dataset object, and the MOS.
(As usual, doing the research to write this up taught me things that I didn't know about ZFS.)
When you make changes, ZFS updates much less stuff than I thought
In the past, for example in my entry on how ZFS bookmarks can work with reasonable efficiency, I have given what I think of as the standard explanation of how ZFS's copy on write nature forces changes to things like the data in a file to ripple up all the way to the top of the ZFS hierarchy. To quote myself:
If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, [...]
This is wrong. ZFS is structured so that it doesn't have to ripple changes all the way up through the filesystem just because you changed a piece of it down in the depths of a directory hierarchy.
How this works is through the usual CS trick of a level of indirection. All objects in a ZFS filesystem have an object number, which we've seen come up before, for example in ZFS delete queues. Once it's created, the object number of something never changes. Almost everything in a ZFS filesystem refers to other objects in the filesystem by their object number, not by their (current) disk location. For example, directories in your filesystem refer to things by their object numbers:
  # zdb -vv -bbbb -O ssddata/homes cks/tmp/testdir
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1003162    1   128K    512      0     512    512  100.00  ZFS directory
  [...]
        microzap: 512 bytes, 1 entries
                ATESTFILE = 1003019 (type: Regular File)
  [...]
The directory doesn't tell us where
ATESTFILE is on the disk, it
just tells us that it's object 1003019.
In order to find where objects are, ZFS stores a per filesystem mapping from object number to actual disk locations that we can sort of think of as a big file; these are called object sets. More exactly, each object number maps to a ZFS dnode, and the ZFS dnodes are stored in what is conceptually an on-disk array ('indexed' by the object number). As far as I can tell, an object's dnode is the only thing that knows where its data is located on disk.
So, suppose that we overwrite data in
ATESTFILE. ZFS's copy on
write property means that we have to write a new version of the
data block, possibly a new version of some number of indirect blocks
(if the file is big enough), and then a new version of the dnode
so that it points to the new data block or indirect block. Because
the dnode itself is part of a block of dnodes in the object set,
we must write a new copy of that block of dnodes and then ripple
the changes up the indirect blocks and so on (eventually reaching
the uberblock as part of a transaction group commit). However, we
don't have to change any directories in the ZFS filesystem, no
matter how deep the file is in them; while we changed the file's
dnode (or if you prefer, the data in the dnode), we didn't change
its object number, and the directories only refer to it by object
number. It was object number 1003019 before we wrote data to it and
it's object number 1003019 after we did, so our testdir
directory is untouched.
Once I thought about it, this isn't particularly different from how conventional Unix filesystems work (what ZFS calls an object number is what we conventionally call an inode number). It's especially forced by the nature of a copy on write Unix filesystem, given that due to hardlinks a file may be referred to from multiple directories. If we had to update every directory a file was linked from whenever the file changed, we'd need some way to keep track of them all, and that would cause all sorts of implementation issues.
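The core of this object-number indirection can be shown with a trivial model (the object numbers are taken from the zdb output above; everything else here is made up for illustration):

```python
# Only the object set maps object numbers to dnodes; directories just
# hold object numbers. So moving a file's data on disk updates its
# dnode but never touches the directory that names it.

object_set = {
    1003019: {"data_at": ("vdev0", 0x1000)},       # ATESTFILE's dnode
    1003162: {"entries": {"ATESTFILE": 1003019}},  # testdir's dnode
}

# Overwriting the file writes its data to a new location and updates
# its dnode (copy on write)...
object_set[1003019] = {"data_at": ("vdev0", 0x9000)}

# ...but the directory entry still just says 'object 1003019', so the
# directory itself is untouched.
print(object_set[1003162]["entries"]["ATESTFILE"])  # -> 1003019
```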
(Now that I've realized this it all feels obvious and necessary. Yet at the same time I've been casually explaining ZFS copy on write updates wrong for, well, years. And yes, when I wrote "directory metadata" in my earlier entry, I meant the filesystem directory, not the object set's 'directory' of dnodes.)
Sidebar: The other reason to use inode numbers or object numbers
Although modern filesystems may have 512 byte inodes or dnodes, Unix has traditionally used ones that were smaller than a disk block and thus that were packed several to a (512 byte) disk block. If you need to address something smaller than a disk block, you can't just use the disk block number where the thing is; you need either the disk block number plus an index into it, or you can make things more compact by just having a global index number, ie the inode number.
The original Unix filesystems made life even simpler by storing all inodes in one contiguous chunk of disk space toward the start of the filesystem. This made calculating the disk block that held a given inode a pretty simple process. (For the sake of your peace of mind, you probably don't want to know just how simple it was in V7.)
What ZFS messages about 'permanent errors in <0x95>:<0x0>' mean
If you use ZFS long enough (or are unlucky enough), one of the things
you may run into is reports in 'zpool status -v' of permanent errors
in something (we've had that happen to us despite redundancy). If you're reasonably lucky, the error message
will have a path in it. If you're unlucky, the error message will look like this:
  errors: Permanent errors have been detected in the following files:

          <0x95>:<0x0>
The short answer of what they mean is, to quote directly:
The first number is the dataset id (index) and the second is the object id. For filesystems, the object id can be the same as the file's "inode" as shown by "ls -i". But a few object ids exist for all datasets. Object id 0 is the DMU dnode.
The dataset here may be a ZFS filesystem, a snapshot, or I believe a few other things. I believe that if it's still in existence, you'll normally get at least its name and perhaps the full path to the object. When it's not in existence any more (perhaps you deleted the snapshot or the whole filesystem in question since the scrub detected it), you get this hex ID and there's also no information about the path.
The reason the information is presented this way is that what the
ZFS code in the kernel saves and returns to the
zpool command is
actually just the dataset and object ID. It's up to
zpool to turn
both of these into names, which it actually does by calling back
into the kernel to find out what they're currently called, if the
kernel knows. Inspecting the relevant ZFS code
says that there are five cases:
<metadata>:<0x...> means corruption in some object in the pool's overall metadata object set.
<0x...>:<0x...> means that the dataset involved can't be identified (and thus ZFS has no hope of identifying the thing inside the dataset).
/some/path/name means you have a corrupted filesystem object (a file, a directory, etc) in a currently mounted dataset and this is its full current path.
(I think that ZFS's determination of the path name for a given ZFS object is pretty reliable; if I'm reading the code right, it appears to be able to scan upward in the filesystem hierarchy starting with the object itself.)
dsname:/some/path means that the dataset is called dsname but it's not currently mounted, and /some/path is the path within it. I think this happens for snapshots.
dsname:<0x...> means that it's in the given dataset dsname (which may or may not be mounted), but the ZFS object in question can't have its path identified for various reasons (including that it's already been deleted).
Only things in ZFS filesystems (and snapshots and so on) have path names, so an error in a ZVOL will always be reported without the path. I'm not sure what the reported dataset names are for ZVOLs, since I don't use ZVOLs.
The final detail is that you may see this error status in 'zpool
status -v' even after you've cleaned it up. To quote Richard Elling again:
Finally, the error buffer for "zpool status" contains information for two scan passes: the current and previous scans. So it is possible to delete an object (eg file) and still see it listed in the error buffer. It takes two scans to completely update the error buffer. This is important if you go looking for a dataset+object tuple with zdb and don't find anything...
PS: There are some cases where
<xattrdir> will appear in the file
path. If I'm reading the code correctly, this happens when the
problem is in an extended attribute instead of the filesystem object itself.
PPS: Richard Elling's message was on the ZFS on Linux mailing list and about an issue someone was having with a ZoL system, but as far as I can see the core code is basically the same in Illumos and I would expect in FreeBSD as well, so this bit of ZFS wisdom should be cross-platform.
ZFS pushes file renamings and other metadata changes to disk quite promptly
One of the general open questions on Unix is when changes like
renaming or creating files are actually durably on disk. Famously,
some filesystems on some Unixes have been willing to delay this for
an unpredictable amount of time unless you did things like fsync()
the containing directory of your renamed file, not just
the file itself. As it happens, ZFS's design means that it offers
some surprisingly strong guarantees about this; specifically, ZFS
persists all metadata changes to disk no later than the next
transaction group commit. In ZFS today, a transaction group commit
generally happens every five seconds, so if you do something like
rename a file, your rename will be fully durable quite soon even if
you do nothing special.
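On filesystems without ZFS's prompt metadata commits, the traditional portable precaution is to fsync() the containing directory after a rename. As a minimal sketch in Python (the fsync_dir helper name is my own):

```python
import os

def fsync_dir(path):
    # Open the directory itself and fsync() it, which forces completed
    # metadata operations inside it (such as renames) out to disk on
    # filesystems that would otherwise delay them.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

On ZFS you don't strictly need this, since the next transaction group commit will persist the rename anyway; the directory fsync() is the belt-and-braces move for code that has to run on unpredictable filesystems.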
However, this doesn't mean that if you create a file, write data
to the file, and then rename it (with no other special operations)
that in five or ten seconds your new file is guaranteed to be present
under its new name with all the data you wrote. Although metadata
operations like creating and renaming files go to ZFS right away
and then become part of the next txg commit, the kernel generally
holds on to written file data for a while before pushing it out.
You need some sort of
fsync() in there to force the kernel to
commit your data, not just your file creation and renaming. Because
of how the ZFS intent log works, you don't need
to do anything more than
fsync() your file here; when you fsync()
a file, all pending metadata changes are flushed out to disk along
with the file data.
(In a 'create new version, write, rename to overwrite current
version' setup, I think you want to
fsync() the file twice, once
after the write and then once after the rename. Otherwise you haven't
necessarily forced the rename itself to be written out. You don't
want to do the rename before a
fsync(), because then I think that
a crash at just the wrong time could give you an empty new file.
But the ice is thin here in portable code, including code that wants
to be portable to different filesystem types.)
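The whole 'create new version, write, rename to overwrite, fsync() twice' sequence can be sketched in Python as follows (the atomic_replace name and the '.tmp' suffix are my own illustrative choices, not anything standard):

```python
import os

def atomic_replace(path, data):
    # Write the new version to a temporary file, then rename it over
    # the current version.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        # First fsync: force the file's data (and, on ZFS, any pending
        # metadata changes) to disk before we do the rename, so a crash
        # can't leave an empty new file under the final name.
        os.fsync(f.fileno())
    os.rename(tmp, path)
    # Second fsync, after the rename: on ZFS this flushes the rename
    # itself out via the intent log, rather than waiting for the next
    # transaction group commit.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

As the text notes, how much of this is actually required varies by filesystem; this ordering is the conservative version for portable code.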
My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.
spare-N spare vdevs in your pool are mirror vdevs
Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:
[...]
    wwn-0x5000cca251b79b98    ONLINE  0  0  0
    spare-8                   ONLINE  0  0  0
      wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
      wwn-0x5000cca2568314fc  ONLINE  0  0  0
    wwn-0x5000cca251ca10b0    ONLINE  0  0  0
[...]
What is this
spare-8 thing, beyond 'a sign that a spare activated
here'? This is sometimes called a 'spare vdev', and the answer is
that spare vdevs are mirror vdevs.
Yes, I know, ZFS says that you can't put one vdev inside another vdev and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface and as you'd expect they're implemented with exactly the same code in the ZFS kernel code.
(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types they run the same code to do various operations. Admittedly, there are some other sections in the ZFS code that check to see whether they're operating on a real mirror vdev or a spare vdev.)
What this means is that these
spare-N vdevs behave like mirror
vdevs. Assuming that both sides are healthy, reads can be satisfied
from either side (and will be balanced back and forth as they are
for mirror vdevs), writes will go to both sides, and a scrub will
check both sides. As a result, if you scrub a pool with a spare-N
vdev and there are no problems reported for either component device,
then both old and new device are fine and contain a full and intact
copy of the data. You can keep either (or both).
As a side note, it's possible to manually create your own spare-N
vdevs even without a fault, because spares activation is actually
a user-level thing in ZFS. Although I haven't
tested this recently, you generally get a
spare-N vdev if you do
'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK>
is configured as a spare in the pool. Abusing this to create long
term mirrors inside raidZ vdevs is left as an exercise to the reader.
(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)
PS: As you might expect, the
replacing-N vdev that you get when
you replace a disk is also a mirror vdev, with the special behavior
that when the resilver finishes, the original device is normally