Wandering Thoughts


Our ZFS fileservers aren't happy when you do NFS writes to a full filesystem

The ZFS pools on our fileservers all have overall pool quotas, ultimately because of how we sell storage to people, and we've historically had problems when a pool fills completely up to its quota limit and people keep writing to it. In the past, this has led to fileserver lockups. Today I got a reminder of something I think we've seen before, which is that we can also get problems when just a filesystem fills up to its individual quota limit even if the pool is still under its overall quota.

The symptoms are less severe, in that the fileserver in question only gets fairly unresponsive to NFS (especially to the machine that the writes were coming from) instead of locking up. This was somewhat variable and may have primarily affected the particular filesystem or perhaps the particular pool it's in, instead of all of the filesystems and pools on the fileserver; I didn't attempt to gather this data during the recent incident where I re-observed this, but certainly some machines could still do things like run df against the fileserver.

(This was of course our biggest fileserver.)

During the incident, the fileserver was generally receiving from the network at full line bandwidth; although I don't know for sure, I'm guessing that these were NFS writes. DTrace monitoring showed that it generally had several hundred outstanding NFS requests but wasn't actually doing much successful NFS IO (not surprising, if all of this traffic was writes that were getting rejected because the filesystem had hit its quota limits). Our fileservers used to get badly overloaded from too-fast NFS write IO in general, but that was fixed several years ago; still, this could be related.

Our DTrace stuff did report (very) long NFS operations and that report eventually led me to the source and let me turn it off. When the writes stopped, the fileserver recovered almost immediately and became fully responsive, including to the NFS client machine that was most affected by this.

How relevant this is to current OmniOS CE and Illumos is an open question; we're still running the long-unsupported OmniOS r151014, and not a completely up to date version of it. Nevertheless, I feel like writing it down. Perhaps now I'll remember to check for full filesystems the next time we have a mysterious fileserver problem.

(We will probably not attempt to investigate this at all on our current fileservers, since our next generation of fileservers will not run any version of Illumos.)

ZFSNFSFilesystemQuotaProblem written at 00:41:14


Some things on Illumos NFS export permissions

Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.

I'll start with a ZFS sharenfs setting and then break it down. The two most ornate ones we have look like this:


The AAA and BBB netgroups don't overlap with nfs_ssh, but nfs_root and nfs_oldmail are both subsets of nfs_ssh.
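The literal option strings aren't shown above, so here is a purely illustrative sketch of their shape; the filesystem names and the exact option lists are my guesses, not the real settings:

```shell
# Hypothetical reconstruction, reusing the netgroup names from the
# discussion; the filesystems and precise option lists are made up.
zfs set sharenfs='nosuid,sec=sys,rw=nfs_ssh:AAA,ro=BBB,root=nfs_root' tank/fs1
zfs set sharenfs='nosuid,sec=sys,rw=nfs_oldmail,ro=nfs_ssh,root=nfs_root' tank/fs2
```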

The first slightly tricky bit is root=. As the manual page explains in the NOTES section, all that root= does is change the interpretation of UID 0 for clients that are already allowed to read or write to the NFS share. Per the manual page, 'the access the host gets is the same as when the root= option is absent' (and this may include no access). As a corollary, 'root=NG,ro=NG' is basically the same as 'ro=NG,anon=0'. Since our root= netgroups are a subset of our general allowed-access netgroups, we're okay here.

(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)

The next tricky bit is the interaction of rw= and ro=. Before just now I would have told you that rw= took priority over ro= if you had a host that was included in both (via different netgroups), but it turns out that whichever one is first takes priority. We were getting rw-over-ro effects because we always listed rw= first, but I don't think we necessarily understood that when we wrote the second sharenfs setting. The manual page is explicit about this:

If rw= and ro= options are specified in the same sec= clause, and a client is in both lists, the order of the two options determines the access the client gets.

(Note that the behavior is different if you use general ro or rw. See the manpage.)
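To make the ordering point concrete, here is a hedged sketch reusing the netgroups from above (nfs_oldmail is a subset of nfs_ssh, so its hosts appear in both lists):

```shell
# rw= listed first: hosts in both nfs_oldmail and nfs_ssh get read-write
sharenfs='sec=sys,rw=nfs_oldmail,ro=nfs_ssh'
# ro= listed first: those same hosts would get only read-only access
sharenfs='sec=sys,ro=nfs_ssh,rw=nfs_oldmail'
```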

We would have noticed if we flipped this around for the one filesystem with overlapping ro= and rw= groups, since the machine that was supposed to be able to write to the filesystem would have failed (and the failure would have stalled our mail system). But it's still sort of a narrow escape.

What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.

Sidebar: Our general narrow miss on this

We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:

# global share options 
shareopts nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

# our SAN filesystems:
fs3-corestaff-01   /h/281   rw+=AAA

This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.

The actual code is written in Python and turns all of the NFS exports options into Python dictionary keys and values. Python dictionaries are unordered (at least in the Python versions this code was written for), so under normal circumstances reassembling the exports options would have put them into some random order, and anything with both rw= and ro= could have wound up in the wrong order. However, conveniently I decided to put the NFS export options into a canonical order when I converted them back to string form, and this put rw= before ro= (and sec=sys before both). There's no sign in my code comments that I knew this was important; it seems to have just been what I thought of as the correct canonical ordering. Possibly I was blindly copying and preserving earlier work where we always had rw= first.
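A minimal sketch of that canonicalization step (the option names and the ordering here are illustrative, not our actual code):

```python
# Sketch: reassemble NFS share options from a dict in a fixed canonical
# order, so that rw= always comes out ahead of ro= (which matters for
# hosts that appear in both lists).
CANONICAL_ORDER = ["sec", "rw", "ro", "root", "nosuid", "anon"]

def assemble_opts(opts):
    # opts: dict mapping option name -> value (or None for bare flags)
    def key(item):
        name = item[0]
        # Unknown options sort after the known canonical ones.
        return (CANONICAL_ORDER.index(name) if name in CANONICAL_ORDER
                else len(CANONICAL_ORDER), name)
    parts = []
    for name, value in sorted(opts.items(), key=key):
        parts.append(name if value is None else f"{name}={value}")
    return ",".join(parts)

print(assemble_opts({"ro": "AAA", "nosuid": None,
                     "rw": "nfs_ssh", "sec": "sys"}))
# -> sec=sys,rw=nfs_ssh,ro=AAA,nosuid
```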

IllumosNFSExportsPerms written at 23:43:51


Understanding ZFS System Attributes

Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.

ZFS stores these attributes in dnodes using some extra space (in what is called the dnode 'bonus buffer'). The ZFS trick is that whatever system attributes a dnode has are simply packed into that space without being organized into formal structures with a fixed order of attributes. Code that uses system attributes retrieves them from dnodes indirectly by asking for, say, the ZPL_PARENT of a dnode; it never cares exactly how they're packed into a given dnode. However, obviously something has to know.

One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.

(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
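The layout-numbering idea can be modeled in a toy way (nothing here is the real sa.c code or its on-disk formats; attribute sets are canonicalized by sorting purely for illustration):

```python
# Toy model of ZFS system attribute layouts: each distinct *set* of
# attributes used together gets registered once as a numbered layout,
# and a "dnode" then stores just the layout number plus the values in
# the layout's defined order.
class LayoutRegistry:
    def __init__(self):
        self.layouts = {}      # tuple of attr names -> layout number
        self.by_number = {}    # layout number -> tuple of attr names

    def pack(self, attrs):
        # attrs: dict of attribute name -> value
        names = tuple(sorted(attrs))
        if names not in self.layouts:
            num = len(self.layouts)          # register a new layout
            self.layouts[names] = num
            self.by_number[num] = names
        num = self.layouts[names]
        return (num, [attrs[n] for n in names])

    def unpack(self, packed):
        num, values = packed
        return dict(zip(self.by_number[num], values))

reg = LayoutRegistry()
d1 = reg.pack({"ZPL_PARENT": 1284472, "ZPL_MODE": 0o644})
d2 = reg.pack({"ZPL_PARENT": 52906, "ZPL_MODE": 0o755})
assert d1[0] == d2[0]          # same attribute set -> same layout number
assert reg.unpack(d1)["ZPL_PARENT"] == 1284472
```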

I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.

As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.

By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.

ZFSSystemAttributes written at 01:11:37


How ZFS makes things like 'zfs diff' report filenames efficiently

As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.

In theory, turning an inode or dnode number into the path to a file is an expensive operation; you basically have to search the entire filesystem until you find it. In practice, if you've ever run 'zfs diff', you've likely noticed that it runs pretty fast. Nor is this the only place that ZFS quickly turns dnode numbers into full paths, as it comes up in 'zpool status' reports about permanent errors. At one level, zfs diff and zpool status do this so rapidly because they ask the ZFS code in the kernel to do it for them. At another level, the question is how the kernel's ZFS code can be so fast.

The interesting and surprising answer is that ZFS cheats, in a way that makes things very fast when it works and almost always works in normal filesystems and with normal usage patterns. The cheat is that ZFS dnodes record their parent's object number. Here, let's show this in zdb:

# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1285414    1   128K    512      0     512    512    0.00  ZFS plain file
       parent  1284472
# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1284472    1   128K    512      0     512    512  100.00  ZFS directory
       parent  52906
       microzap: 512 bytes, 1 entries
          b = 1285414 (type: Regular File)

The b file has a parent field that points to cks/tmp/a, the directory it's in, and the a directory has a parent field that points to cks/tmp, and so on. When the kernel wants to get the name for a given object number, it can just fetch the object, look at parent, and start going back up the filesystem.

(If you want to see this sausage being made, look at zfs_obj_to_path and zfs_obj_to_pobj in zfs_znode.c. The parent field is a ZFS dnode system attribute, specifically ZPL_PARENT.)
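The upward walk can be sketched in Python with toy data structures (this is not the real zfs_znode.c logic, and the object numbers are just the ones from the zdb output above):

```python
# Toy object set: object number -> (parent object number, directory
# entries). Reconstruct a path by following parent pointers upward and
# finding our own name in each parent directory, in the spirit of
# zfs_obj_to_path()/zfs_obj_to_pobj().
ROOT = 1
objset = {
    ROOT:    (ROOT, {"cks": 10}),
    10:      (ROOT, {"tmp": 20}),
    20:      (10,   {"a": 1284472}),
    1284472: (20,   {"b": 1285414}),
    1285414: (1284472, {}),        # a plain file
}

def obj_to_path(obj):
    parts = []
    while obj != ROOT:
        parent, _ = objset[obj]
        # find this object's name in its parent's directory entries
        pdirents = objset[parent][1]
        name = next(n for n, o in pdirents.items() if o == obj)
        parts.append(name)
        obj = parent
    return "/" + "/".join(reversed(parts))

print(obj_to_path(1285414))   # -> /cks/tmp/a/b
```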

If you're familiar with the twists and turns of Unix filesystems, you're now wondering how ZFS deals with hardlinks, which can cause a file to be in several directories at once and so have several parents (and then it can be removed from some of the directories). The answer is that ZFS doesn't; a dnode only ever tracks a single parent, and ZFS accepts that this parent information can be inaccurate. I'll quote the comment in zfs_obj_to_pobj:

When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.

Before I get into the details, I want to say that I appreciate the brute force elegance of this cheat. The practical reality is that most Unix files today don't have extra hardlinks, and when they do most hardlinks are done in ways that won't break ZFS's parent stuff. The result is that ZFS has picked an efficient implementation that works almost all of the time; in my opinion, the great benefit we get from having it around is more than worth the infrequent cases where it fails or malfunctions. Both zfs diff and having filenames show up in zpool status permanent error reports are very useful (and there may be other cases where this gets used).

The current details are that any time you hardlink a file to somewhere or rename it, ZFS updates the file's parent to point to the new directory. Often this will wind up with a correct parent even after all of the dust settles; for example, a common pattern is to write a file to an initial location, hardlink it to its final destination, and then remove the initial location version. In this case, the parent will be correct and you'll get the right name. The time when you get an incorrect parent is this sequence:

; mkdir a b; touch a/demo
; ln a/demo b/
; rm b/demo

Here a/demo is the remaining path, but demo's dnode will claim that its parent is b. I believe that zfs diff will even report this as the path, because the kernel doesn't do the extra work to scan the b directory to verify that demo is present in it.

(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)

ZFSPathLookupTrick written at 00:51:38


What ZFS block pointers are and what's in them

I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.

The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Just like block numbers but unlike things like ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on disk data format and an in memory structure that shows up in other things. To quote from the (draft and old) ZFS on-disk specification (PDF):

A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.

Block pointers are embedded in any ZFS on disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).

So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:

  • Various metadata and flags about what the block pointer is for and what parts of it mean, including what type of object it points to.

  • Up to three DVAs that say where to actually find the data on disk. There can be more than one DVA because you may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).

  • The logical size (size before compression) and 'physical' size (the nominal size after compression) of the disk block. The physical size can do odd things and is not necessarily the asize (allocated size) for the DVA(s).

  • The txgs that the block was born in, both logically and physically (the physical txg is apparently for dva[0]). The physical txg was added with ZFS deduplication but apparently also shows up in vdev removal.

  • The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.
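As a rough picture only, the fields above can be laid out as a sketch; the names and types here paraphrase the description, not the real blkptr_t in spa.h:

```python
# Toy, illustrative model of a ZFS block pointer's contents; the real
# 128-byte on-disk encoding in spa.h is far more compact.
from dataclasses import dataclass
from typing import List

@dataclass
class DVA:
    vdev: int      # which vdev the allocation is on
    offset: int    # offset within the vdev
    asize: int     # allocated size on disk

@dataclass
class BlockPointer:
    obj_type: str              # plus other metadata and flags
    checksum_alg: str
    compress_alg: str
    dvas: List[DVA]            # 1 to 3 copies of the data
    lsize: int                 # logical (uncompressed) size
    psize: int                 # 'physical' (nominal compressed) size
    birth_txg: int             # logical birth txg
    phys_birth_txg: int        # physical birth txg (for dva[0])
    checksum: bytes            # covers the full logical data

bp = BlockPointer("ZFS plain file", "fletcher4", "lz4",
                  [DVA(0, 0x1000, 4096)], 16384, 4096, 500, 500,
                  b"\x00" * 32)
assert 1 <= len(bp.dvas) <= 3
```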

Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.

(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)

There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.

Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).

As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.

However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
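A sketch of how birth txgs let a traversal prune unchanged subtrees when looking for changes since some txg (a toy tree, not real ZFS code):

```python
# Toy block-pointer tree: each node has a birth txg and children.
# Because of copy on write, a subtree can only contain changes newer
# than txg T if its root block pointer's birth txg is > T, so whole
# unchanged subtrees can be skipped.
class BP:
    def __init__(self, name, birth, children=()):
        self.name, self.birth, self.children = name, birth, list(children)

def changed_since(bp, txg):
    if bp.birth <= txg:
        return []                 # nothing below here is newer: prune
    found = [bp.name]
    for child in bp.children:
        found.extend(changed_since(child, txg))
    return found

old = BP("old-data", 100)
new = BP("new-data", 205)
indirect = BP("indirect", 205, [old, new])
root = BP("root", 205, [indirect, BP("untouched", 150)])
print(changed_since(root, 200))   # -> ['root', 'indirect', 'new-data']
```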

(ZFS has an interesting hack to make things like 'zfs diff' work far more efficiently than you would expect in light of this, but that's going to take yet another entry to cover.)

ZFSBlockPointers written at 23:19:40

A broad overview of how ZFS is structured on disk

When I wrote yesterday's entry, it became clear that I didn't understand as much about how ZFS is structured on disk (and that this matters, since I thought that ZFS copy on write updates updated a lot more than they do). So today I want to write down my new broad understanding of how this works.

(All of this can be dug out of the old, draft ZFS on-disk format specification, but that spec is written in a very detailed way and things aren't always immediately clear from it.)

Almost everything in ZFS is a DMU object. All objects are defined by a dnode, and object dnodes are almost always grouped together in an object set. Object sets are themselves DMU objects; they store dnodes as basically a giant array in a 'file', which uses data blocks and indirect blocks and so on, just like anything else. Within a single object set, dnodes have an object number, which is the index of their position in the object set's array of dnodes.

(Because an object number is just the index of the object's dnode in its object set's array of dnodes, object numbers are basically always going to be duplicated between object sets (and they're always relative to an object set). For instance, pretty much every object set is going to have an object number ten, although not all object sets may have enough objects that they have an object number ten thousand. One corollary of this is that if you ask zdb to tell you about a given object number, you have to tell zdb what object set you're talking about. Usually you do this by telling zdb which ZFS filesystem or dataset you mean.)

Each ZFS filesystem has its own object set for objects (and thus dnodes) used in the filesystem. As I discovered yesterday, every ZFS filesystem has a directory hierarchy and it may go many levels deep, but all of this directory hierarchy refers to directories and files using their object number.

ZFS organizes and keeps track of filesystems, clones, and snapshots through the DSL (Dataset and Snapshot Layer). The DSL has all sorts of things; DSL directories, DSL datasets, and so on, all of which are objects and many of which refer to object sets (for example, every ZFS filesystem must refer to its current object set somehow). All of these DSL objects are themselves stored as dnodes in another object set, the Meta Object Set, which the uberblock points to. To my surprise, object sets are not stored in the MOS (and as a result do not have 'object numbers'). Object sets are always referred to directly, without indirection, using a block pointer to the object set's dnode.

(I think object sets are referred to directly so that snapshots can freeze their object set very simply.)

The DSL directories and datasets for your pool's set of filesystems form a tree themselves (each filesystem has a DSL directory and at least one DSL dataset). However, just like in ZFS filesystems, all of the objects in this second tree refer to each other indirectly, by their MOS object number. Just as with files in ZFS filesystems, this level of indirection limits the amount of copy on write updates that ZFS has to do when something changes.

PS: If you want to examine MOS objects with zdb, I think you do it with something like 'zdb -vvv -d ssddata 1', which will get you object number 1 of the MOS, which is the MOS object directory. If you want to ask zdb about an object in the pool's root filesystem, use 'zdb -vvv -d ssddata/ 1'. You can tell which one you're getting depending on what zdb prints out. If it says 'Dataset mos [META]' you're looking at objects from the MOS; if it says 'Dataset ssddata [ZPL]', you're looking at the pool's root filesystem (where object number 1 is the ZFS master node).

PPS: I was going to write up what changed on a filesystem write, but then I realized that I didn't know how blocks being allocated and freed are reflected in pool structures. So I'll just say that I think that ignoring free space management, only four DMU objects get updated: the file itself, the filesystem's object set, the filesystem's DSL dataset object, and the MOS.

(As usual, doing the research to write this up taught me things that I didn't know about ZFS.)

ZFSBroadDiskStructure written at 01:16:58


When you make changes, ZFS updates much less stuff than I thought

In the past, for example in my entry on how ZFS bookmarks can work with reasonable efficiency, I have given what I think of as the standard explanation of how ZFS's copy on write nature forces changes to things like the data in a file to ripple up all the way to the top of the ZFS hierarchy. To quote myself:

If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, [...]

This is wrong. ZFS is structured so that it doesn't have to ripple changes all the way up through the filesystem just because you changed a piece of it down in the depths of a directory hierarchy.

How this works is through the usual CS trick of a level of indirection. All objects in a ZFS filesystem have an object number, which we've seen come up before, for example in ZFS delete queues. Once it's created, the object number of something never changes. Almost everything in a ZFS filesystem refers to other objects in the filesystem by their object number, not by their (current) disk location. For example, directories in your filesystem refer to things by their object numbers:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testdir
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  1003162    1   128K    512      0     512    512  100.00  ZFS directory
    microzap: 512 bytes, 1 entries
       ATESTFILE = 1003019 (type: Regular File)

The directory doesn't tell us where ATESTFILE is on the disk, it just tells us that it's object 1003019.

In order to find where objects are, ZFS stores a per filesystem mapping from object number to actual disk locations that we can sort of think of as a big file; these are called object sets. More exactly, each object number maps to a ZFS dnode, and the ZFS dnodes are stored in what is conceptually an on-disk array ('indexed' by the object number). As far as I can tell, an object's dnode is the only thing that knows where its data is located on disk.

So, suppose that we overwrite data in ATESTFILE. ZFS's copy on write property means that we have to write a new version of the data block, possibly a new version of some number of indirect blocks (if the file is big enough), and then a new version of the dnode so that it points to the new data block or indirect block. Because the dnode itself is part of a block of dnodes in the object set, we must write a new copy of that block of dnodes and then ripple the changes up the indirect blocks and so on (eventually reaching the uberblock as part of a transaction group commit). However, we don't have to change any directories in the ZFS filesystem, no matter how deep the file is in them; while we changed the file's dnode (or if you prefer, the data in the dnode), we didn't change its object number, and the directories only refer to it by object number. It was object number 1003019 before we wrote data to it and it's object number 1003019 after we did, so our cks/tmp/testdir directory is untouched.
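A toy illustration of why the directory is untouched: object numbers act as stable keys, and only the object set's idea of where things live changes (the names and 'disk locations' here are made up):

```python
# Toy model: directories map names -> object numbers; the object set
# maps object numbers -> the current "disk location" of the dnode.
# Rewriting a file's data changes its entry in the object set, but no
# directory entry anywhere needs to change.
objset = {1003019: "disk-loc-A"}          # ATESTFILE's dnode lives here
directory = {"ATESTFILE": 1003019}        # cks/tmp/testdir

before = dict(directory)
objset[1003019] = "disk-loc-B"            # copy-on-write rewrite moves it
assert directory == before                # the directory is untouched
assert objset[directory["ATESTFILE"]] == "disk-loc-B"
```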

Once I thought about it, this isn't particularly different from how conventional Unix filesystems work (what ZFS calls an object number is what we conventionally call an inode number). It's especially forced by the nature of a copy on write Unix filesystem, given that due to hardlinks a file may be referred to from multiple directories. If we had to update every directory a file was linked from whenever the file changed, we'd need some way to keep track of them all, and that would cause all sorts of implementation issues.

(Now that I've realized this it all feels obvious and necessary. Yet at the same time I've been casually explaining ZFS copy on write updates wrong for, well, years. And yes, when I wrote "directory metadata" in my earlier entry, I meant the filesystem directory, not the object set's 'directory' of dnodes.)

Sidebar: The other reason to use inode numbers or object numbers

Although modern filesystems may have 512 byte inodes or dnodes, Unix has traditionally used ones that were smaller than a disk block and thus that were packed several to a (512 byte) disk block. If you need to address something smaller than a disk block, you can't just use the disk block number where the thing is; you need either the disk block number plus an index into it, or you can make things more compact by just having a global index number, ie the inode number.

The original Unix filesystems made life even simpler by storing all inodes in one contiguous chunk of disk space toward the start of the filesystem. This made calculating the disk block that held a given inode a pretty simple process. (For the sake of your peace of mind, you probably don't want to know just how simple it was in V7.)
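With inodes in one contiguous area, the calculation is just integer division and remainder; a hedged sketch with made-up sizes:

```python
# With fixed-size inodes packed into disk blocks and stored
# contiguously, an inode number alone locates the inode:
#   block  = start + (inum * isize) // bsize
#   offset =         (inum * isize) %  bsize
# (All of these numbers are made up for illustration.)
INODE_SIZE = 64          # bytes per inode
BLOCK_SIZE = 512         # bytes per disk block
INODES_START = 2         # first block of the inode area

def inode_location(inum):
    byte_off = inum * INODE_SIZE
    return (INODES_START + byte_off // BLOCK_SIZE, byte_off % BLOCK_SIZE)

print(inode_location(10))   # -> (3, 128)
```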

ZFSDirectoriesAndChanges written at 22:48:33


What ZFS messages about 'permanent errors in <0x95>:<0x0>' mean

If you use ZFS long enough (or are unlucky enough), one of the things you may run into are reports in zpool status -v of permanent errors in something (we've had that happen to us despite redundancy). If you're reasonably lucky, the error message will have a path in it. If you're unlucky, the error message will say something like:

errors: Permanent errors have been detected in the following files:

        <0x95>:<0x0>

This is a mysterious and frustrating message. On the ZFS on Linux mailing list, Richard Elling recently shared some extremely useful information about what they mean in this message.

The short answer of what they mean is, to quote directly:

The first number is the dataset id (index) and the second is the object id. For filesystems, the object id can be the same as the file's "inode" as shown by "ls -i". But a few object ids exist for all datasets. Object id 0 is the DMU dnode.

The dataset here may be a ZFS filesystem, a snapshot, or I believe a few other things. I believe that if it's still in existence, you'll normally get at least its name and perhaps the full path to the object. When it's not in existence any more (perhaps you deleted the snapshot or the whole filesystem in question since the scrub detected it), you get this hex ID and there's also no information about the path.

The reason the information is presented this way is that what the ZFS code in the kernel saves and returns to the zpool command is actually just the dataset and object ID. It's up to zpool to turn both of these into names, which it actually does by calling back into the kernel to find out what they're currently called, if the kernel knows. Inspecting the relevant ZFS code says that there are five cases:

  • <metadata>:<0x...> means corruption in some object in the pool's overall metadata object set.

  • <0x...>:<0x...> means that the dataset involved can't be identified (and thus ZFS has no hope of identifying the thing inside the dataset).

  • /some/path/name means you have a corrupted filesystem object (a file, a directory, etc) in a currently mounted dataset and this is its full current path.

    (I think that ZFS's determination of the path name for a given ZFS object is pretty reliable; if I'm reading the code right, it appears to be able to scan upward in the filesystem hierarchy starting with the object itself.)

  • dsname:/some/path means that the dataset is called dsname but it's not currently mounted, and /some/path is the path within it. I think this happens for snapshots.

  • dsname:<0x...> means that it's in the given dataset dsname (which may or may not be mounted), but the ZFS object in question can't have its path identified for various reasons (including that it's already been deleted).
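The five forms above are distinguishable purely by their shape, so you can sort a list of reported errors mechanically. An illustrative sketch (the tag strings are mine, not anything ZFS emits):

```python
import re

def classify_zfs_error(err):
    """Classify the five forms of error path that 'zpool status -v'
    can report, as described above.  Returns a descriptive tag."""
    if err.startswith('<metadata>:'):
        return 'pool metadata object'
    if re.fullmatch(r'<0x[0-9a-fA-F]+>:<0x[0-9a-fA-F]+>', err):
        return 'unidentifiable dataset and object'
    if err.startswith('/'):
        return 'path in a mounted dataset'
    ds, sep, rest = err.partition(':')
    if sep and rest.startswith('<0x'):
        return 'dataset known, object unidentifiable'
    if sep:
        return 'path in an unmounted dataset or snapshot'
    return 'unknown form'
```

For the unidentifiable cases, 'zdb -d <POOL>' will list datasets with their ids, which can help you match up the first hex number.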

Only things in ZFS filesystems (and snapshots and so on) have path names, so an error in a ZVOL will always be reported without the path. I'm not sure what the reported dataset names are for ZVOLs, since I don't use ZVOLs.

The final detail is that you may see this error status in 'zpool status -v' even after you've cleaned it up. To quote Richard Elling again:

Finally, the error buffer for "zpool status" contains information for two scan passes: the current and previous scans. So it is possible to delete an object (eg file) and still see it listed in the error buffer. It takes two scans to completely update the error buffer. This is important if you go looking for a dataset+object tuple with zdb and don't find anything...

PS: There are some cases where <xattrdir> will appear in the file path. If I'm reading the code correctly, this happens when the problem is in an extended attribute instead of the filesystem object itself.

(See also this, this, and this.)

PPS: Richard Elling's message was on the ZFS on Linux mailing list and about an issue someone was having with a ZoL system, but as far as I can see the core code is basically the same in Illumos and I would expect in FreeBSD as well, so this bit of ZFS wisdom should be cross-platform.

ZFSPermanentErrorsMeaning written at 22:58:26


ZFS pushes file renamings and other metadata changes to disk quite promptly

One of the general open questions on Unix is when changes like renaming or creating files are actually durably on disk. Famously, some filesystems on some Unixes have been willing to delay this for an unpredictable amount of time unless you did things like fsync() the containing directory of your renamed file, not just fsync() the file itself. As it happens, ZFS's design means that it offers some surprisingly strong guarantees about this; specifically, ZFS persists all metadata changes to disk no later than the next transaction group commit. In ZFS today, a transaction group commit generally happens every five seconds, so if you do something like rename a file, your rename will be fully durable quite soon even if you do nothing special.

However, this doesn't mean that if you create a file, write data to the file, and then rename it (with no other special operations) that in five or ten seconds your new file is guaranteed to be present under its new name with all the data you wrote. Although metadata operations like creating and renaming files go to ZFS right away and then become part of the next txg commit, the kernel generally holds on to written file data for a while before pushing it out. You need some sort of fsync() in there to force the kernel to commit your data, not just your file creation and renaming. Because of how the ZFS intent log works, you don't need to do anything more than fsync() your file here; when you fsync() a file, all pending metadata changes are flushed out to disk along with the file data.

(In a 'create new version, write, rename to overwrite current version' setup, I think you want to fsync() the file twice, once after the write and then once after the rename. Otherwise you haven't necessarily forced the rename itself to be written out. You don't want to do the rename before a fsync(), because then I think that a crash at just the wrong time could give you an empty new file. But the ice is thin here in portable code, including code that wants to be portable to different filesystem types.)
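The 'write, fsync, rename, fsync' dance above can be sketched in Python. This is a generic portability sketch, not ZFS-specific code; on filesystems without ZFS's behavior you may also want to fsync the containing directory after the rename:

```python
import os

def atomic_replace(path, data):
    """Replace path's contents so that after a crash you see either
    the old version or the new one, never a truncated mix."""
    tmp = path + '.tmp'
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        os.write(fd, data)
        os.fsync(fd)          # force the data (and file creation) out
        os.rename(tmp, path)
        os.fsync(fd)          # on ZFS, this flushes the rename too
    finally:
        os.close(fd)
```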

My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.

ZFSWhenMetadataSynced written at 22:47:51


ZFS spare-N spare vdevs in your pool are mirror vdevs

Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:

   wwn-0x5000cca251b79b98    ONLINE  0  0  0
   spare-8                   ONLINE  0  0  0
     wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
     wwn-0x5000cca2568314fc  ONLINE  0  0  0
   wwn-0x5000cca251ca10b0    ONLINE  0  0  0

What is this spare-8 thing, beyond 'a sign that a spare activated here'? This is sometimes called a 'spare vdev', and the answer is that spare vdevs are mirror vdevs.

Yes, I know, ZFS says that you can't put one vdev inside another vdev and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface and as you'd expect they're implemented with exactly the same code in the ZFS kernel code.

(If you're being sufficiently technical, these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types, they run the same code for various operations. Admittedly, there are some other sections of the ZFS code that check whether they're operating on a real mirror vdev or a spare vdev.)

What this means is that these spare-N vdevs behave like mirror vdevs. Assuming that both sides are healthy, reads can be satisfied from either side (and will be balanced back and forth as they are for mirror vdevs), writes will go to both sides, and a scrub will check both sides. As a result, if you scrub a pool with a spare-N vdev and there are no problems reported for either component device, then both old and new device are fine and contain a full and intact copy of the data. You can keep either (or both).
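If you want to script detection of activated spares, the spare-N vdevs and their member devices can be picked out of the indented device listing shown earlier. A quick-and-dirty sketch only; robust parsing of 'zpool status' output is more involved than this:

```python
def find_spares(status_text):
    """Map each spare-N vdev in a 'zpool status' device listing to
    the list of member devices indented beneath it."""
    spares = {}
    current = None
    cur_indent = 0
    for line in status_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        name = line.split()[0]
        if name.startswith('spare-'):
            current, cur_indent = name, indent
            spares[current] = []
        elif current is not None and indent > cur_indent:
            # member devices are indented under their spare-N line
            spares[current].append(name)
        else:
            current = None
    return spares
```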

As a side note, it's possible to manually create your own spare-N vdevs even without a fault, because spare activation is actually a user-level thing in ZFS. Although I haven't tested this recently, you generally get a spare-N vdev if you do 'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK> is configured as a spare in the pool. Abusing this to create long term mirrors inside raidZ vdevs is left as an exercise for the reader.

(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)

PS: As you might expect, the replacing-N vdev that you get when you replace a disk is also a mirror vdev, with the special behavior that when the resilver finishes, the original device is normally automatically detached.

ZFSSparesAreMirrors written at 22:44:19
