Wandering Thoughts

2018-02-14

Some things about ZFS block allocation and ZFS (file) record sizes

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. For a file under the recordsize, the block size turns out to be a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.

Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.

To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:

 0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here, in hex and L for logical). This is what grows in 512-byte units up to the recordsize and so on.

The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size, and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens, this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 KB units (so you wind up with a compressed size of '1000P', ie 4 KB).

The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').

(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
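
To make the rounding concrete, here is a quick sketch in Python of how I believe the allocated size relates to the physical size and the vdev's ashift (I'm assuming the example pool here is ashift=12, which is what the numbers imply):

def allocated_size(psize, ashift):
    # Round the (compressed) physical size up to the vdev's
    # allocation unit, which is 2**ashift bytes.
    unit = 1 << ashift
    return (psize + unit - 1) // unit * unit

psize = 0x4200                           # the 16.5 KB 'physical' size
print(hex(allocated_size(psize, 12)))    # 0x5000, the 20 KB DVA asize
print(hex(allocated_size(psize, 9)))     # 0x4200, already a multiple of 512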

For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.

On the one hand, how much this matters depends on how compressible your data is, and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.

PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?

(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)

ZFSLogicalVsPhysicalBlockSizes written at 00:49:29; Add Comment

2018-02-04

A surprise in how ZFS grows a file's record size (at least for me)

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. If a file has more than one block, all blocks are recordsize, no more and no less. If a file is a single block, the size of this block is based on how much data has been written to the file (or technically the maximum offset that's been written to the file). However, how the block size grows as you write data to the file turns out to be somewhat surprising (which makes me very glad that I actually did some experiments to verify what I thought I knew before I wrote this entry, because I was very wrong).

Rather than involving the ashift or growing in powers of two, ZFS always grows the (logical) block size in 512-byte chunks until it reaches the filesystem recordsize. The actual physical space allocated on disk is in ashift sized units, as you'd expect, but this is not directly related to the (logical) block size used at the file level. For example, here is a 16896 byte file (of incompressible data) on an ashift=12 pool:

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
4780566    1   128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
[...]
0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The DVA records a 0x5000 byte allocation (20 Kb), but the logical and physical sizes are only 0x4200 bytes (16.5 Kb).
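
To restate my understanding in another form, here is a small Python sketch (not actual ZFS code) of how the logical block size of a single-block file seems to be picked, using the default 128 Kb recordsize:

def file_block_size(max_offset, recordsize=128 * 1024):
    # My understanding: a single-block file's (logical) block size grows
    # in 512-byte steps until it reaches the dataset's recordsize.
    if max_offset >= recordsize:
        return recordsize
    return (max_offset + 511) // 512 * 512

print(file_block_size(16896))   # 16896 (0x4200), ie 16.5 Kb, as in the example
print(file_block_size(16897))   # 17408 (0x4400), the next 512-byte step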

In thinking about it, this makes a certain amount of sense because the ashift is really a vdev property, not a pool property, and can vary from vdev to vdev within a single pool. As a result, the actual allocated size of a given block may vary from vdev to vdev (and a block may be written to multiple vdevs if you have copies set to more than 1 or it's metadata). The file's current block size thus can't be based on the ashift, because ZFS doesn't necessarily have a single ashift to base it on; instead ZFS bases it on 512-byte sectors, even if this has to be materialized differently on different vdevs.

Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:

; dd if=/dev/zero of=testfile bs=128k count=1
[...]
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956361    1   128K   128K      0     512   128K    0.00  ZFS plain file
[...]

This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956029    1   128K   128K     1K     512   128K  100.00  ZFS plain file
[...]
0 L0 DVA[0]=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]

We have a compressed size of 1 Kb (and a 1 Kb allocation on disk, as this is an ashift=9 vdev), but once again the file block size is 128 Kb.

(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)

What this means is that ZFS has much less wasted space than I thought it did for files that are under the recordsize. Since such files grow their logical block size in 512-byte chunks, even with no compression they waste at most almost all of one physical block on disk (if you have a file that is, say, 32 Kb plus one byte, you'll have a physical block on disk with only one byte used). This has some implications for other areas of ZFS, but those are for another entry.
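
Here is a little Python sketch of that worst case waste for a small incompressible single-block file, under my understanding of the rules above (the ashift=12 and 128 Kb recordsize here are just example values):

def wasted_space(file_size, ashift=12, recordsize=128 * 1024):
    # Only meaningful for single-block files under the recordsize.
    logical = (file_size + 511) // 512 * 512         # 512-byte growth
    unit = 1 << ashift
    allocated = (logical + unit - 1) // unit * unit  # rounded to ashift units
    return allocated - file_size

print(wasted_space(32 * 1024 + 1))   # 4095 bytes: almost all of one 4 Kb block
print(wasted_space(32 * 1024))       # 0 bytes: an exact fit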

(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)

ZFSRecordsizeGrowth written at 22:28:55; Add Comment

2018-01-06

What ZFS gang blocks are and why they exist

If you read up on ZFS internals, sooner or later you will run across references to 'gang blocks'. For instance, they came up when I talked about what's in a DVA, where DVAs have a flag to say that they point to a gang block instead of a regular block. Gang blocks are vaguely described as being a way of fragmenting a large logical block into a bunch of separate sub-blocks.

A more on-point description can be found in the (draft) ZFS on-disk specification (PDF, via) or the source code comments about them in zio.c. I'll selectively quote from zio.c because it's easier to follow:

A gang block is a collection of small blocks that looks to the DMU like one large block. When zio_dva_allocate() cannot find a block of the requested size, due to either severe fragmentation or the pool being nearly full, it calls zio_write_gang_block() to construct the block from smaller fragments.

A gang block consists of a gang header and up to three gang members. The gang header is just like an indirect block: it's an array of block pointers. It consumes only one sector and hence is allocatable regardless of fragmentation. The gang header's bps point to its gang members, which hold the data.

[...]

Gang blocks can be nested: a gang member may itself be a gang block. Thus every gang block is a tree in which root and all interior nodes are gang headers, and the leaves are normal blocks that contain user data. The root of the gang tree is called the gang leader.

A 'gang header' contains three full block pointers, some padding, and then a trailing checksum. The whole thing is sized so that it takes up only a single 512-byte sector; I believe this means that gang headers in ashift=12 vdevs waste a bunch of space, or at least leave the remaining 3.5 Kb unused.
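
As a sanity check on that, here is the arithmetic in Python; the 128-byte block pointer and the 40-byte embedded checksum trailer are my reading of the on-disk format, so treat them as assumptions:

SECTOR = 512            # SPA_MINBLOCKSIZE, the size of a gang header
BLKPTR_SIZE = 128       # sizeof(blkptr_t), assumed
CHECKSUM_TRAILER = 40   # sizeof(zio_eck_t), the trailing checksum, assumed

pointers = (SECTOR - CHECKSUM_TRAILER) // BLKPTR_SIZE
padding = SECTOR - CHECKSUM_TRAILER - pointers * BLKPTR_SIZE
print(pointers, padding)   # 3 block pointers and 88 bytes of padding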

To understand more about gang blocks, we need to understand why they're needed. As far as I know, this comes down to the fact that ZFS files only ever have a single (logical) block size. If a file is less than the recordsize (usually 128 Kb), it's in a single logical block of the appropriate size; once it hits recordsize or greater, it's in a number of recordsize'd blocks. This means that writing new data to most files normally requires allocating some size of contiguous block (up to 128 Kb, but less if the data you're writing is compressible).

(I believe that there is also metadata that's always unfragmented and may be in blocks up to 128 Kb.)

However, ZFS doesn't guarantee that a pool always has free 128 Kb chunks available, or in fact any particular size of chunk. Instead, free space can be fragmented; you might be unfortunate enough to have many gigabytes of free space, but all of it in fragments that are, say, 32 Kb and smaller. This is where ZFS needs to resort to gang blocks, basically in order to lie to itself about still writing single large blocks.

(Before I get too snarky, I should say that this lie probably simplifies the life of higher level code a fair bit. Rather than have a whole bunch of data and metadata handling code that has to deal with all sorts of fragmentation, most of ZFS can ignore the issue and then lower level IO code quietly makes it all work. Actually using gang blocks should be uncommon.)

All of this explains why the gang block bit is a property of the DVA, not of anything else. The DVA is where space gets allocated, so the DVA is where you may need to shim in a gang block instead of getting a contiguous chunk of space. Since different vdevs generally have different levels of fragmentation, whether or not you have a contiguous chunk of the necessary size will often vary from vdev to vdev, which is the DVA level again.

One quiet complication created by gang blocks is that according to comments in the source code, the gang members may not wind up on the same vdev as the gang header (although ZFS tries to keep them on the same vdev because it makes life easier). This is different from regular blocks, which are always only on a single vdev (although they may be spread across multiple disks if they're on a raidz vdev).

Gang blocks have some space overhead compared to regular blocks (in addition to being more fragmented on disk), but how much is quite dependent on the situation. Because each gang header can only point to three gang member blocks, you may wind up needing multiple levels of nested gang blocks if you have an unlucky combination of fragmented free space and a large block to write. As an example, suppose that you need to write a 128 Kb block and the pool only has 32 Kb chunks free. 128 Kb requires four 32 Kb chunks, which is more than a single gang header can point to, so you need a nested gang block; your overhead is two sectors for the two gang headers needed. If the pool was more heavily fragmented, you'd need more nested gang blocks and the overhead would go up. If the pool had a single 64 Kb chunk left, you could have written the 128 Kb with two 32 Kb chunks and the 64 Kb chunk and thus not needed the nested gang block with its additional gang header.
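
To illustrate the arithmetic, here is a toy Python model of that overhead. It deliberately assumes that every free chunk is the same size and that only one member per header has to become a nested gang block, which is not necessarily how zio_write_gang_block() really divides things up:

import math

def gang_headers_needed(nbytes, fragment):
    # How many 512-byte gang headers the toy model needs to store nbytes
    # when every available free chunk is 'fragment' bytes. Each header
    # holds three pointers; one of them may point to a nested gang block.
    pieces = math.ceil(nbytes / fragment)
    headers = 1
    while pieces > 3:
        pieces -= 2      # two data pieces at this level, one nested header
        headers += 1
    return headers

print(gang_headers_needed(128 * 1024, 32 * 1024))  # 2, as in the example above
print(gang_headers_needed(128 * 1024, 16 * 1024))  # 4, with worse fragmentation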

(Because ZFS only uses a gang block when the space required isn't available in a contiguous block, gang blocks are absolutely sure to be scattered on the disk.)

PS: As far as I can see, a pool doesn't keep any statistics on how many times gang blocks have been necessary or how many there currently are in the pool.

ZFSGangBlocks written at 02:55:39; Add Comment

2018-01-05

Confirming the behavior of file block sizes in ZFS

ZFS filesystems have a property called their recordsize, which is usually described as something like the following (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

A while back I wrote about using zdb to peer into how ZFS stores files on disk, where I looked into how ZFS stored a 160 Kb file and specifically if it really did use two 128 Kb blocks to hold it, instead of a 128 Kb block and a 32 Kb block. The answer was yes, with some additional discoveries about ZFS compression and partial blocks.

Today I wound up wondering once again if that informal description of how ZFS behaves was really truly the case. Specifically, I wondered if there were situations where ZFS could wind up with a mixture of block sizes, say a 4 Kb block that was written initially at the start of the file and then a larger block written later after a big hole in the file. If ZFS really always stored sufficiently large files with only recordsize blocks, it would have to go back to rewrite the initial 4 Kb block, which seemed a bit odd to me given ZFS's usual reluctance to rewrite things.

So I did this experiment. We start out with a 4 Kb file, sync it, verify (with zdb) that it's there on disk and looks like we expect, and then extend the file with a giant hole, writing 32 Kb at an offset of 608 Kb into the file:

dd if=/dev/urandom of=testfile bs=4k count=1
sync
[wait, check with zdb]
dd if=/dev/urandom of=testfile bs=32k seek=19 count=1 conv=notrunc
sync

The first write creates a testfile that has a ZFS file block size of 4 Kb (which zdb prints as the dblk field); these are the initial conditions we expect. We can also see a single 4 Kb data block at offset 0:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
[...]
Indirect blocks:
     0 L0 0:204ea46a00:1000 1000L/1000P F=1 B=5401327/5401327

After writing the additional 32 Kb, zdb reports that the file's block size has jumped up to 128 Kb, the standard ZFS dataset recordsize; this again is what we expect. However, it also reports a change in the indirect blocks. They are now:

Indirect blocks:
     0 L1  0:200fdf4200:400 20000L/400P F=2 B=5401362/5401362
     0  L0 0:200fdf2e00:1400 20000L/1400P F=1 B=5401362/5401362
 80000  L0 0:200fdeaa00:8400 20000L/8400P F=1 B=5401362/5401362

The L0 indirect block that starts at file offset 0 has changed. It's been rewritten from a 4 Kb logical / 4 Kb physical block to being 128 Kb logical and 5 Kb physical (this is still an ashift=9 pool), and the TXG it was created in (the B= field) is the same as the other blocks.
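
As a cross check on the offsets involved, here is a quick Python sketch of which recordsize-aligned L0 blocks I'd expect to hold data for this experiment (sparse regions get no blocks; the 128 Kb recordsize is the dataset default):

def expected_l0_blocks(written_ranges, recordsize=128 * 1024):
    # Which recordsize-aligned blocks contain at least one written byte.
    blocks = set()
    for start, length in written_ranges:
        first = start // recordsize
        last = (start + length - 1) // recordsize
        blocks.update(range(first, last + 1))
    return [hex(b * recordsize) for b in sorted(blocks)]

# 4 Kb written at offset 0, then 32 Kb at offset 608 Kb (seek=19, bs=32k).
print(expected_l0_blocks([(0, 4096), (19 * 32 * 1024, 32 * 1024)]))
# ['0x0', '0x80000'], matching the two L0 blocks that zdb reports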

So what everyone says about the ZFS recordsize is completely true. ZFS files only ever have one (logical) block size, which starts out as small as it can be and then expands out as the file gets more data (or, more technically, as the maximum offset of data in the file increases). If you push it, ZFS will rewrite existing data you're not touching in order to expand the (logical) block size out to the dataset recordsize.

If you think about it, this rewriting is not substantially different from what happens if you write 4 Kb and then write another 4 Kb after it. Just as here, ZFS will replace your initial 4 Kb data block with an 8 Kb data block; it just feels a bit more expected because both the old and the new data fall within the first full 128 Kb recordsize block of the file.

(Apparently, every so often something in ZFS feels sufficiently odd to me that I have to go confirm it for myself, just to be sure and so I can really believe in it without any lingering doubts.)

ZFSFileRecordsizeGrowth written at 01:33:44; Add Comment

2017-12-30

Some details of ZFS DVAs and what some of their fields store

One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. DVAs are generally embedded into 'block pointers', and you can find a big comment laying out the entire structure of all of this in spa.h. The two fields of a DVA that I'm interested in today are the vdev and the offset.

(The other three fields are a reserved field called GRID, a bit to say whether the DVA is for a gang block, and asize, the allocated size of the block on its vdev. The allocated size has to be a per-DVA field for various reasons. The logical size of the block and its physical size after various sorts of compression are not DVA or vdev dependent, so they're part of the overall block pointer.)

The vdev field of a DVA is straightforward; it is the index of the vdev that the block is on, starting from zero for the first vdev and counting up. Note that this is not the GUID of the vdev involved, which is what you might sort of expect given a comment that calls it the 'virtual device ID'. Using the index means that ZFS can never shuffle the order of vdevs inside a pool, since these indexes are burned into DVAs stored on disk (as far as I know, and this matches what zdb prints, eg).

The offset field tells you where to find the start of the block on the vdev in question. Because this is an offset into the vdev, not a device, different sorts of vdevs have different ways of translating this into specific disk addresses. Specifically, RAID-Z vdevs must generally translate a single incoming IO at a single offset to the offsets on multiple underlying disk devices for multiple IOs.

At this point we arrive at an interesting question, namely what units the offset is in (since there are a bunch of possible options). As far as I can tell from looking at the ZFS kernel source code, the answer is that the DVA offset is in bytes. Some sources say that it's in 512-byte sectors, but as far as I can tell this is not correct (and it's certainly not in larger units, such as the vdev's ashift).

(This doesn't restrict the size of vdevs in any important way, since the offset is a 63-bit field.)
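
For what it's worth, here is a small Python helper for pulling apart a DVA the way zdb prints it. As far as I can tell zdb shows the vdev index in decimal and the offset and allocated size as hex byte values, but treat that as my reading rather than gospel:

def parse_zdb_dva(dva):
    # 'vdev:offset:asize', eg '0:444bbc000:5000' from zdb output.
    vdev, offset, asize = dva.split(":")
    return int(vdev), int(offset, 16), int(asize, 16)

vdev, offset, asize = parse_zdb_dva("0:444bbc000:5000")
print(vdev, hex(offset), asize)   # 0 0x444bbc000 20480 (a 20 KB allocation)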

One potentially important consequence of this is that DVA offsets are independent of the sector size of the underlying disks in vdevs. Provided that your vdev's ashift is large enough, it doesn't matter if you use disks with 512-byte logical sectors or the generally rarer disks with real 4k sectors (both physical and logical), and you can replace one with the other. Well, in theory, as there may be other bits of ZFS that choke on this (I don't know if ZFS's disk labels care, for example). But DVAs won't, which means that almost everything in the pool (metadata and data both) should be fine.

PS: There are additional complications for ZFS gang blocks and so on, but I'm omitting that in the interests of keeping this manageable.

ZFSDVAOffsetVdevDetails written at 01:49:19; Add Comment

2017-12-23

Our next generation of fileservers will not be based on Illumos

Our current generation of ZFS NFS fileservers is based on OmniOS. We've slowly been working on the design of our next generation for the past few months, and one of the decisions we've made is that unless something really unusual happens, we won't be using any form of Illumos as the base operating system. While we're going to continue using ZFS, we'll be basing our fileservers on either ZFS on Linux or FreeBSD (preferably ZoL, because we already run lots of Linux machines and we don't have any FreeBSD ones).

This is not directly because of uncertainties around OmniOS CE's future (or the then lack of an LTS release that I wrote about here, because it now has one). There is really no single cause that could change our minds if it was fixed or changed; instead there are multiple contributing factors. Ultimately we made our decision because we are not in love with OmniOS and we no longer think we need to run it in order to get what we really want, which is ZFS with solid NFS fileservice.

However, I feel I need to mention some major contributing factors. The largest single factor is our continued lack of confidence in Illumos's support for Intel 10G-T chipsets. As far as I can tell from the master Illumos source, nothing substantial has changed here since back in 2014, and certainly I don't consider it a good sign that the ixgbe driver still does kernel busy-waits for milliseconds at a time. We consider 10G-T absolutely essential for our next generation of fileservers and we don't want to take chances.

(If you want to see how those busy-waits happen, look at the definition of msec_delay in ixgbe_osdep.h. drv_usecwait is specifically defined to busy-wait; it's designed to be used for microsecond durations, not millisecond ones.)

Another significant contributing factor is our frustrations with OmniOS's KYSTY minimalism, which makes dealing with our OmniOS machines more painful than dealing with our Linux ones (even the Linux ones that aren't Ubuntu based). And yes, having differently named commands does matter. It's possible that another Illumos based distribution could do better here, but I don't think there's a better one for our needs and it would still leave us with our broad issues with Illumos.

It's undeniable that we have more confidence in Linux on the whole than we do in Illumos. Linux is far more widely and heavily used, generally supports more hardware (and does so more promptly), and we've already seen that Intel 10G-T cards work fine in it (we have them in a number of our existing Linux machines, where they run great). Basically the only risk area is ZFS on Linux, and we have FreeBSD as a fallback.

There are some aspects of OmniOS that I will definitely miss, most notably DTrace. Modern Linux may have more or less functional equivalents, but I don't think there's anything that's half as usable. However, I have no sentimental attachment to Solaris or Illumos; I don't hate it, but on the whole I won't miss it, and an all-Linux environment will make my life simpler.

(This decision is only partly related to our decision not to use a SAN in the next generation of fileservers. While we could probably use OmniOS with the local disk setup that we want, not having to worry about Illumos's hardware support for various controller hardware does make our lives simpler.)

IllumosNoFutureHere written at 00:11:10; Add Comment

2017-11-25

Sequential scrubs and resilvers are coming for (open-source) ZFS

Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering (and scrubbing) (this apparently first appeared in Solaris 11.2, per here and here). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors (although we managed to speed that up).

Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.

(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)

For how it works, I'll just quote from the commit message:

This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
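
As a purely conceptual illustration of those two phases (a toy Python sketch, nothing like the actual ZoL code):

import random

def sequential_scrub(blocks, batch_limit=4):
    # Phase 1: walk the metadata and queue up (offset, size) pairs.
    # Phase 2: issue the scrub reads sorted by disk offset, flushing
    # the queue in batches when it would otherwise use too much memory.
    issued = []
    queue = []
    for offset, size in blocks:
        queue.append((offset, size))
        if len(queue) >= batch_limit:
            issued.extend(sorted(queue))
            queue.clear()
    issued.extend(sorted(queue))
    return issued

# Metadata order is roughly random with respect to disk layout.
metadata_order = [(random.randrange(2**30), 128 * 1024) for _ in range(8)]
print(sequential_scrub(metadata_order))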

My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and ten minutes.

Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.

(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)

PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.

ZFSSequentialScrubIsComing written at 00:08:24; Add Comment

2017-11-03

Illumos mountd caches netgroup lookups (relatively briefly)

Last time I covered how the Illumos NFS server caches filesystem access permissions. However, this is not the only level of caching that's possibly going on in the overall NFS server ecosystem, because the Illumos kernel NFS server ultimately calls up to mountd to find out about permissions, and mountd can have its own caching.

Specifically, mountd caches netgroup membership checks for 60 seconds. Well, sort of. What it really caches is the result of whether a host is in a specific list of netgroups, not whether or not a host is in any particular netgroup. This may sound like a silly distinction, but consider an NFS export (in ZFS format) of:

nosuid,rw=group1:group2,root=group1

This export will always generate two cache entries, one for the rw= set of two groups and one for the root= single group. This is true even if a host is in group1 (and so gets a positive answer in both entries). On the one hand, this probably doesn't matter too much, as the cache has no size limits. On the other hand, the cache is also a simple linked list, so let's hope it never grows too big.

(As you might guess from this, the cache is pretty brute force. That's probably okay.)
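
As a toy model of the caching behavior being described (a Python sketch, not mountd's actual C code; the lookup function is a stand-in for the real netgroup check):

import time

CACHE_TTL = 60      # seconds, per the mountd behavior described above
_cache = {}         # (host, netgroup list) -> (result, timestamp)

def host_in_netgroups(host, netgroups, lookup):
    # The cache key is the host plus the entire netgroup list from the
    # share option, not the individual netgroups.
    key = (host, tuple(netgroups))
    now = time.time()
    if key in _cache and now - _cache[key][1] < CACHE_TTL:
        return _cache[key][0]
    result = lookup(host, netgroups)
    _cache[key] = (result, now)
    return result

# With 'rw=group1:group2,root=group1', one mount request from a host makes
# two entries: (host, ('group1', 'group2')) and (host, ('group1',)).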

For NFSv3, mountd and thus this netgroup cache gets involved in two different situations. First you'll have the actual NFS mount request itself from the client, which goes straight to mountd, which checks the exports and returns appropriate information to the client. Then when the client tries to actually do an NFS operation with its shiny new mount, the kernel may or perhaps will upcall back to mountd for another permission check.

This matters to us because of our custom NFS mount authorization scheme, which does its magic by hooking into netgroup lookups. Both negative and positive caching in mountd are a potential problem for us, although negative caching is usually worse since it means that a host with a verification glitch now has to wait roughly a minute before it can usefully retry a mount request. At the same time, some caching is definitely useful; as the comment in the source code says, mount requests often come in close bursts from the same machine (as it mounts a whole bunch of filesystems with the same export permissions), and only doing expensive things once for that burst is a clear win.

(Interested parties who want to see this particular sausage being made can look in the relevant Illumos source code. It looks like this code hasn't changed for a very long time.)

IllumosMountdNetgroupCache written at 01:09:29; Add Comment

2017-10-30

The Illumos NFS server's caching of filesystem access permissions

Years ago I wrote The Solaris 10 NFS server's caching of filesystem access permissions. I was recently digging in this area of the Illumos source code and discovered that there have been a few changes, so here is a brief update. The background is that the Illumos NFS server code, like basically all modern NFS servers, does not maintain a full list of what clients are authorized to access what filesystems. Instead it maintains a cache and upcalls to user level code whenever it feels that the cache has insufficient information.

As in Solaris 10, the Illumos kernel NFS authorization cache holds both positive and negative entries on a per-filesystem basis. However, in Illumos this cache now sort of has a timeout; if a cache entry is older than 600 seconds (ten minutes), the kernel will try to refresh it the next time the entry is used. This attempt to refresh the entry doesn't immediately cause it to expire or be revalidated; instead, it's added to a queue for the refresh thread to process. Until the refresh queue gets around to processing the entry (and gets an answer back from its upcall), the kernel will continue to use the current cached state as the best current answer.
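
Here is a toy Python sketch of that refresh behavior as I understand it (the upcall function is a stand-in, and the real kernel code obviously looks nothing like this):

import time
from collections import deque

REFRESH_AGE = 600          # ten minutes, per the description above
auth_cache = {}            # (filesystem, client) -> (allowed, timestamp)
refresh_queue = deque()    # entries waiting for the refresh thread

def check_access(filesystem, client):
    key = (filesystem, client)
    entry = auth_cache.get(key)
    if entry is None:
        allowed = upcall_to_mountd(filesystem, client)   # direct upcall
        auth_cache[key] = (allowed, time.time())
        return allowed
    allowed, when = entry
    if time.time() - when > REFRESH_AGE:
        refresh_queue.append(key)   # refreshed out of band, later
    return allowed                  # the stale answer is still used now

def upcall_to_mountd(filesystem, client):
    # Stand-in for the real user-level permission check via mountd.
    return True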

(As in Solaris 10, the cache for a filesystem is discarded entirely if the filesystem is unshared or reshared, including being reshared with exactly the same settings.)

As far as I can tell, this refreshing only happens when the entry is used. There doesn't appear to be anything that runs around trying to revalidate old entries. So you can try a mount once, get a failure, have that failure cached in the kernel, come back a day later, try the mount again, and for at least the first access the kernel will still use that day-old cached entry unless memory pressure has pushed it out in the mean time.

(The easiest way for this to happen is for a client to try a NFS mount before it's been added to the netgroup that controls access. Merely updating the netgroup membership doesn't re-export the filesystem and thus doesn't flush the authorization cache for it.)

As far as I can tell, the refresh process is single-threaded; only one refresh thread is started, and it only makes one upcall at a time. The initial upcalls to mountd (when there's no existing authorization cache entry for a client/filesystem combination) are done directly in the NFS authorization lookup and so there can be several of them at once, although presumably there are limits on simultaneous requests and so on.

The cache size continues to be unlimited and shrinks only under memory pressure (if that ever happens; it doesn't appear to on our OmniOS NFS servers). During shrinking, only cache entries that have been unused for at least 60 minutes are candidates to be discarded; entries in active use are never dropped. Entries are kept active by clients doing NFS operations to filesystems, so if you never touch a particular filesystem from a particular client, the cache entry may eventually become a candidate for eviction.

(But note that this is any NFS operation, including things like df.)

Sidebar: Illumos NFS authorization cache stats

As in Solaris 10, the easiest way to get access to cache stats is with mdb -k. Illumos has added some additional stats beyond nfsauth_cache_hit, nfsauth_cache_miss, and nfsauth_cache_reclaim. nfsauth_cache_refresh counts how many refreshes have been queued up; exi_cache_auth_reclaim_failed and exi_cache_clnt_reclaim_failed appear to count a couple of ways that reclaims due to kernel memory pressure can fail.

There are a number of DTrace probes embedded in this whole process. I haven't looked into this enough to say anything about them, so you're going to need to read the source code.

IllumosNFSAuthCaching written at 01:10:11; Add Comment

2017-10-24

Our frustrations with OmniOS's 'KYSTY' minimalism

OmniOS famously follows a principle called KYSTY, where OmniOS itself ships with minimal amounts of software (and the versions can be out of date). As far as I know, OmniOS CE has continued this practice, which has an obvious appeal for people trying to maintain an OS distribution with limited amounts of time (especially an LTS version, where you might be stuck patching old versions of programs that aren't supported upstream any more). All of this is well and good, but in practice the results of this KYSTY approach have been one of our significant points of frustration with OmniOS.

As sysadmins operating servers (primarily Linux ones), we have come to expect that our systems will have a certain basic collection of workable standard programs that we use for basic system management. For instance, we want every system to be able to send us email, and we really want to do this with Postfix (Exim is an acceptable substitute). Almost every system needs a program that can talk to disks to get SMART information, and while there are alternatives to tcpdump, we have tcpdump everywhere else and we really want one standard program. I could go on; there's an entire collection of things that we consider standard that just aren't there on a baseline OmniOS machine.

(I can't not mention top, though.)

We were able to mostly fix this with various third party package sources, but the result is complicated, requires a large magic $PATH in order to work relatively seamlessly, has gaps, and is quietly fragile over the long term. As an example of something that has quietly worried me, at this point there's probably no way to exactly reproduce one of our fileservers because it's very likely that at least some of the third party package sources we use have moved on from the package versions we installed. Does this matter? Probably not, which is why we didn't spend a significant amount of effort to figure out how to get and freeze local copies of all those packages.

(The exact version of top that's installed is probably not important for our NFS fileservers. We could even live without top at all, although it would be annoying.)

I sympathize with OmniOS here in the abstract, but in the concrete it was and is a point of friction when we work with our OmniOS machines. They're different, and from our biased perspective, gratuitously so. The result makes our life harder and leaves us less happy with OmniOS.

(I think that a great deal of the problems could be removed if there was an OmniOS CE equivalent of Ubuntu's 'universe' repository and it could easily be enabled. The main OmniOS CE developers wouldn't be responsible for maintaining software there; instead it would be open for reasonably vetted community contributions. Officially embracing pkgsrc might be another option, but I don't like that as much for various reasons.)

OmniOSMinimalismFrustration written at 00:41:36; Add Comment
