Wandering Thoughts


Sequential scrubs and resilvers are coming for (open-source) ZFS

Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering (and scrubbing) (this apparently first appeared in Solaris 11.2, per here and here). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors (although we managed to speed that up).

Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.

(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)

For how it works, I'll just quote from the commit message:

This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
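
To make the two phases concrete, here is a toy Python sketch. It is not how the ZFS code works internally; the (offset, size) pairs are invented and I've simplified the sorting down to plain on-disk offset order. All it tries to show is why queueing I/Os first and then issuing them in sorted order cuts down on seeking.

```python
# Toy model of a two-phase scrub. The (offset, size) pairs are made up
# and the sort is a simplification (plain on-disk offset order).

def scan_metadata(block_pointers):
    """Phase 1: walk the pool structure in logical order, only queueing I/Os."""
    return list(block_pointers)

def issue_order(queue):
    """Phase 2: decide the order to issue the queued I/Os, by on-disk offset."""
    return sorted(queue, key=lambda io: io[0])

def total_seek_distance(ios):
    """Sum of the gaps the disk head has to jump between consecutive I/Os."""
    distance = 0
    position = 0
    for offset, size in ios:
        distance += abs(offset - position)
        position = offset + size
    return distance

# Blocks in the order a walk of the pool's structure happens to find them.
logical_order = [(900, 10), (100, 10), (500, 10), (110, 10), (910, 10)]
queued = scan_metadata(logical_order)
sequential = issue_order(queued)

print(total_seek_distance(logical_order), total_seek_distance(sequential))
```

With these made-up numbers, issuing in logical order seeks across 3290 units of disk while the sorted order seeks across 870.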

My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and ten minutes.

Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.

(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)

PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.

ZFSSequentialScrubIsComing written at 00:08:24


Illumos mountd caches netgroup lookups (relatively briefly)

Last time I covered how the Illumos NFS server caches filesystem access permissions. However, this is not the only level of caching that's possibly going on in the overall NFS server ecosystem, because the Illumos NFS kernel code ultimately calls up to mountd to find out about permissions and mountd can have its own caching.

Specifically, mountd caches netgroup membership checks for 60 seconds. Well, sort of. What it really caches is the result of whether a host is in a specific list of netgroups, not whether or not a host is in any particular netgroup. This may sound like a silly distinction, but consider a NFS export (in ZFS format) of:


This export will always generate two cache entries, one for the rw= set of two groups and one for the root= single group. This is true even if a host is in group1 (and so gets a positive entry in both entries). On the one hand, this probably doesn't matter too much, as the cache has no size limits. On the other hand, the cache is also a simple linked list, so let's hope it never grows too big.

(As you might guess from this, the cache is pretty brute force. That's probably okay.)
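
Here is a minimal Python sketch of a cache that behaves the way I've described. It is my own model, not a translation of the mountd source; the class, the lookup function, and the 'group2' netgroup name are all made up for illustration. The important bit is that the cache key is the host plus the whole list of netgroups, not the host plus a single netgroup.

```python
import time

CACHE_SECONDS = 60  # how long an answer is cached, per the description above

class NetgroupListCache:
    """Caches 'is this host in any of this list of netgroups' answers."""

    def __init__(self, lookup, clock=time.monotonic):
        self.lookup = lookup   # hypothetical slow check: lookup(host, groups) -> bool
        self.clock = clock
        self.entries = []      # plain list, searched linearly like a linked list

    def check(self, host, groups):
        key = (host, tuple(groups))   # keyed on the entire netgroup list
        now = self.clock()
        # Drop anything that has outlived the cache lifetime.
        self.entries = [e for e in self.entries if now - e[2] < CACHE_SECONDS]
        for cached_key, answer, _stamp in self.entries:
            if cached_key == key:
                return answer
        answer = self.lookup(host, groups)   # positive or negative, both are cached
        self.entries.append((key, answer, now))
        return answer

calls = []
def slow_lookup(host, groups):
    calls.append((host, tuple(groups)))
    return host == "client1" and "group1" in groups

fake_now = [0.0]
cache = NetgroupListCache(slow_lookup, clock=lambda: fake_now[0])
cache.check("client1", ["group1", "group2"])   # the rw= list
cache.check("client1", ["group1"])             # the root= list
print(len(cache.entries))
```

Even though client1 is in group1 in both cases, the two different lists produce two separate entries and two separate lookups.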

For NFSv3, mountd and thus this netgroup cache gets involved in two different situations. First you'll have the actual NFS mount request itself from the client, which will go straight to mountd, check the exports, and return appropriate information to the client. Then when the client tries to actually do an NFS operation with its shiny new mount, the kernel may or perhaps will upcall back to mountd for another permission check.

This matters to us because of our custom NFS mount authorization scheme, which does its magic by hooking into netgroup lookups. Both negative and positive caching in mountd are a potential problem for us, although negative caching is usually worse since it means that a host with a verification glitch now has to wait roughly a minute before it can usefully retry a mount request. At the same time, some caching is definitely useful; as the comment in the source code says, mount requests often come in close bursts from the same machine (as it mounts a whole bunch of filesystems with the same export permissions), and only doing expensive things once for that burst is a clear win.

(Interested parties who want to see this particular sausage being made can look in the relevant Illumos source code. It looks like this code hasn't changed for a very long time.)

IllumosMountdNetgroupCache written at 01:09:29


The Illumos NFS server's caching of filesystem access permissions

Years ago I wrote The Solaris 10 NFS server's caching of filesystem access permissions. I was recently digging in this area of the Illumos source code and discovered that there have been a few changes, so here is a brief update. The background is that the Illumos NFS server code, like basically all modern NFS servers, does not maintain a full list of what clients are authorized to access what filesystems. Instead it maintains a cache and upcalls to user level code whenever it feels that the cache has insufficient information.

As in Solaris 10, the Illumos kernel NFS authorization cache holds both positive and negative entries on a per-filesystem basis. However, in Illumos this cache now sort of has a timeout; if a cache entry is older than 600 seconds (ten minutes), the kernel will try to refresh it the next time the entry is used. This attempt to refresh the entry doesn't immediately cause it to expire or be revalidated; instead, it's added to a queue for the refresh thread to process. Until the refresh queue gets around to processing the entry (and gets an answer back from its upcall), the kernel will continue to use the current cached state as the best current answer.

(As in Solaris 10, the cache for a filesystem is discarded entirely if the filesystem is unshared or reshared, including being reshared with exactly the same settings.)

As far as I can tell, this refreshing only happens when the entry is used. There doesn't appear to be anything that runs around trying to revalidate old entries. So you can try a mount once, get a failure, have that failure cached in the kernel, come back a day later, try the mount again, and for at least the first access the kernel will still use that day-old cached entry unless memory pressure has pushed it out in the meantime.
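
The behaviour is easier to see in a small model. This Python sketch is only my reading of the behaviour described above, with invented names throughout; it is not the kernel's data structures. A stale entry gets queued for refresh when it's used, but the caller still gets the old answer.

```python
REFRESH_AFTER = 600  # seconds; entries older than this get queued for refresh

class AuthCache:
    """Toy model of a refresh-on-use authorization cache."""

    def __init__(self):
        self.entries = {}        # (client, filesystem) -> [allowed, fetched_at]
        self.refresh_queue = []  # keys waiting for the (single) refresh thread

    def lookup(self, key, now, upcall):
        entry = self.entries.get(key)
        if entry is None:
            # No entry at all: do the upcall directly and cache the answer.
            allowed = upcall(key)
            self.entries[key] = [allowed, now]
            return allowed
        if now - entry[1] > REFRESH_AFTER and key not in self.refresh_queue:
            self.refresh_queue.append(key)
        return entry[0]          # stale or not, the cached answer is used

    def run_refresh_once(self, now, upcall):
        """Process one queued entry, as the refresh thread would."""
        if self.refresh_queue:
            key = self.refresh_queue.pop(0)
            self.entries[key] = [upcall(key), now]

permitted = set()
def mountd(key):                 # hypothetical stand-in for the real upcall
    return key in permitted

cache = AuthCache()
key = ("client1", "/some/fs")
first = cache.lookup(key, now=0, upcall=mountd)          # failure gets cached
permitted.add(key)                                       # netgroup updated later
day_later = cache.lookup(key, now=86400, upcall=mountd)  # still the old answer
cache.run_refresh_once(now=86401, upcall=mountd)
after_refresh = cache.lookup(key, now=86402, upcall=mountd)
print(first, day_later, after_refresh)
```

The day-old negative answer is what the first access sees; only after the queued refresh has been processed does the client get in.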

(The easiest way for this to happen is for a client to try a NFS mount before it's been added to the netgroup that controls access. Merely updating the netgroup membership doesn't re-export the filesystem and thus doesn't flush the authorization cache for it.)

As far as I can tell, the refresh process is single-threaded; only one refresh thread is started, and it only makes one upcall at a time. The initial upcalls to mountd (when there's no existing authorization cache entry for a client/filesystem combination) are done directly in the NFS authorization lookup and so there can be several of them at once, although presumably there are limits on simultaneous requests and so on.

The cache size continues to be unlimited and shrinks only under memory pressure (if that ever happens; it doesn't appear to on our OmniOS NFS servers). During shrinking, only cache entries that have been unused for at least 60 minutes are candidates to be discarded; entries in active use are never dropped. Entries are kept active by clients doing NFS operations to filesystems, so if you never touch a particular filesystem from a particular client, the cache entry may eventually become a candidate for eviction.

(But note that this is any NFS operation, including things like df.)
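
As a sketch of the eviction rule, assuming (which I haven't verified) that a reclaim throws away every eligible entry, the logic amounts to this. The function and the client and filesystem names are made up.

```python
IDLE_SECONDS = 60 * 60  # entries unused for at least this long may be discarded

def entries_surviving_shrink(last_used, now):
    """Return the entries a reclaim would keep.

    last_used maps (client, filesystem) -> time of the last NFS operation.
    Assumes the reclaim discards every eligible entry, which the real
    kernel may not do.
    """
    return {key: t for key, t in last_used.items() if now - t < IDLE_SECONDS}

last_used = {("clientA", "/fs1"): 0, ("clientB", "/fs1"): 3000}
print(entries_surviving_shrink(last_used, now=3600))
```

Here clientA's entry has been idle for exactly 60 minutes and so is eligible to go, while clientB's entry (idle for ten minutes) stays.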

Sidebar: Illumos NFS authorization cache stats

As in Solaris 10, the easiest way to get access to cache stats is with mdb -k. Illumos has added some additional stats beyond nfsauth_cache_hit, nfsauth_cache_miss, and nfsauth_cache_reclaim. nfsauth_cache_refresh counts how many refreshes have been queued up; exi_cache_auth_reclaim_failed and exi_cache_clnt_reclaim_failed appear to count a couple of ways that reclaims due to kernel memory pressure can fail.

There are a number of DTrace probes embedded in this whole process. I haven't looked into this enough to say anything about them, so you're going to need to read the source code.

IllumosNFSAuthCaching written at 01:10:11


Our frustrations with OmniOS's 'KYSTY' minimalism

OmniOS famously follows a principle called KYSTY, where OmniOS itself ships with minimal amounts of software (and the versions can be out of date). As far as I know, OmniOS CE has continued this practice, which has an obvious appeal for people trying to maintain an OS distribution on limited amounts of time (especially a LTS version, where you might be stuck patching old versions of programs that aren't supported upstream any more). All of this is well and good, but in practice the results of this KYSTY approach have been one of our significant points of frustration with OmniOS.

As sysadmins operating servers (primarily Linux ones), we have come to expect that our systems will have a certain basic collection of workable standard programs that we use for basic system management. For instance, we want every system to be able to send us email, and we really want to do this with Postfix (Exim is an acceptable substitute). Almost every system needs a program that can talk to disks to get SMART information, and while there are alternatives to tcpdump, we have tcpdump everywhere else and we really want one standard program. I could go on; there's an entire collection of things that we consider standard that just aren't there on a baseline OmniOS machine.

(I can't not mention top, though.)

We were able to mostly fix this with various third party package sources, but the result is complicated, requires a large magic $PATH in order to work relatively seamlessly, has gaps, and is quietly fragile over the long term. As an example of something that has quietly worried me, at this point there's probably no way to exactly reproduce one of our fileservers because it's very likely that at least some of the third party package sources we use have moved on from the package versions we installed. Does this matter? Probably not, which is why we didn't spend a significant amount of effort to figure out how to get and freeze local copies of all those packages.

(The exact version of top that's installed is probably not important for our NFS fileservers. We could even live without top at all, although it would be annoying.)

I sympathize with OmniOS here in the abstract, but in the concrete it was and is a point of friction when we work with our OmniOS machines. They're different, and from our biased perspective, gratuitously so. The result makes our life harder and leaves us less happy with OmniOS.

(I think that a great deal of the problems could be removed if there was an OmniOS CE equivalent of Ubuntu's 'universe' repository and it could easily be enabled. The main OmniOS CE developers wouldn't be responsible for maintaining software there; instead it would be open for reasonably vetted community contributions. Officially embracing pkgsrc might be another option, but I don't like that as much for various reasons.)

OmniOSMinimalismFrustration written at 00:41:36


ZFS's recordsize, holes in files, and partial blocks

Yesterday I wrote about using zdb to peer into ZFS's on-disk storage of files, and in particular I wondered if you wrote a 160 Kb file, would ZFS really use two 128 Kb blocks for it. The answer appeared to be 'no', but I was a little bit confused by some things I was seeing. In a comment, Robert Milkowski set me right:

In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).

He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.

The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:

    0  L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
20000  L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

     segment [0000000000000000, 0000000000040000) size  256K

These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.

One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
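
The arithmetic here is simple enough to write down. This Python sketch only covers files bigger than one record, and the 'with compression' figure is just the two block sizes that zdb reported, not something the code computes.

```python
K = 1024
RECORDSIZE = 128 * K

def uncompressed_allocation(file_size, recordsize=RECORDSIZE):
    """Data space for a multi-record file with compression off: whole records."""
    if file_size <= recordsize:
        raise ValueError("this sketch only covers files larger than one record")
    records = -(-file_size // recordsize)   # ceiling division
    return records * recordsize

file_size = 160 * K
without_compression = uncompressed_allocation(file_size)  # two full records
with_compression = 0x20000 + 0x8400                       # block sizes zdb reported

print(without_compression // K, with_compression // K)
```

That works out to 256 Kb of data blocks without compression against 161 Kb with it, for a file with only 160 Kb of actual data.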

A more interesting test file has holes that cover an entire recordsize block. Let's make one that has 128 Kb of data, skips the second 128 Kb block entirely, has 32 Kb of data at the end of the third 128 Kb block, skips the fourth 128 Kb block, and has 32 Kb of data at the end of the fifth 128 Kb block. Set up with dd, this is:

dd if=/dev/urandom of=testfile2 bs=128k count=1
dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
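
To double-check which recordsize blocks those dd commands touch, here is a small Python sketch that turns each dd's bs and seek into a byte range and maps it onto 128 Kb blocks. It only does the offset arithmetic; it knows nothing about ZFS itself.

```python
K = 1024
RECORDSIZE = 128 * K

def touched_blocks(writes, recordsize=RECORDSIZE):
    """Given (byte_offset, length) writes, return the record indexes written to."""
    blocks = set()
    for offset, length in writes:
        first = offset // recordsize
        last = (offset + length - 1) // recordsize
        blocks.update(range(first, last + 1))
    return blocks

# (seek * bs, bs * count) for each of the three dd commands
writes = [(0, 128 * K), (11 * 32 * K, 32 * K), (19 * 32 * K, 32 * K)]
written = touched_blocks(writes)

file_end = max(offset + length for offset, length in writes)
file_blocks = (file_end + RECORDSIZE - 1) // RECORDSIZE
holes = [hex(i * RECORDSIZE) for i in range(file_blocks) if i not in written]

print(sorted(written), holes)
```

This says that blocks 0, 2, and 4 get data, and that the blocks starting at 0x20000 and 0x60000 are never written to at all.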

Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:

# zdb -vv -O ssddata/homes cks/tmp/testfile2
Indirect blocks:
     0 L1  0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
     0  L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
 80000  L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

     segment [0000000000000000, 0000000000020000) size  128K
     segment [0000000000040000, 0000000000060000) size  128K
     segment [0000000000080000, 00000000000a0000) size  128K

The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
     0 L1  DVA[0]=<0:8a2c4e2c00:400> DVA[1]=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
     0  L0 DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
 40000  L0 DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
 80000  L0 DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]

(That we need to use both -vv and -bbbb here is due to how zdb's code is set up, and it's rather a hack to get what we want. I had to read the zdb source code to work out how to make it work.)

Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.

This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).

It's sometimes possible to dump the block contents of things like L1 indirect blocks, so you can see a more direct representation of them. This is where it's important to know that metadata is compressed, so we can ask zdb to decompress it with a magic argument:

# zdb -R ssddata 0:8a2c4e2c00:400:id
DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
HOLE [L0 unallocated] size=200L birth=0L

Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)

PS: I'm not using 'zdb -ddddd' today because when I dug deeper into zdb, I discovered that 'zdb -O' would already report this information when given the right arguments, thereby saving me an annoying step.

Sidebar: Why you can't always dump blocks with zdb -R

To decompress a (ZFS) block, you need to know what it's compressed with and its uncompressed size. This information is stored in whatever metadata points to the block, not in the block itself, and so currently zdb -R simply guesses repeatedly until it gets a result that appears to work out right:

# zdb -R ssddata 0:8a2c4e2c00:400:id
Found vdev type: mirror
Trying 00400 -> 00600 (inherit)
Trying 00400 -> 00600 (on)
Trying 00400 -> 00600 (uncompressed)
Trying 00400 -> 00600 (lzjb)
Trying 00400 -> 00600 (empty)
Trying 00400 -> 00600 (gzip-1)
Trying 00400 -> 00600 (gzip-2)
Trying 00400 -> 20000 (lz4)
DVA[0]=<0:8a4afe7e00:20000> [...]

The result that zdb -R gets may or may not actually be correct, and thus may or may not give you the actual decompressed block data. Here it worked; at other times I've tried it, not so much. The last 'Trying' that zdb -R prints is the one it thinks is correct, so you can at least see if it got it right (here, for example, we know that it did, since it picked lz4 with a true logical size of 0x20000 and that's what the metadata we have about the L1 indirect block says it is).

Ideally zdb -R would gain a way of specifying the compression algorithm and the logical size for the d block flag. Perhaps someday.
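
For illustration, here is the general 'guess until something works' approach in Python. This is an analogy only, not zdb's code; the three standard library compressors are stand-ins for ZFS's algorithms, and unlike zdb this doesn't have to guess an output size.

```python
import bz2
import lzma
import zlib

# Stand-in algorithms, tried in a fixed order.
CANDIDATES = [
    ("zlib", zlib.decompress),
    ("bz2", bz2.decompress),
    ("lzma", lzma.decompress),
]

def guess_decompress(raw):
    """Try each algorithm in turn; take the first that doesn't fail."""
    for name, decompress in CANDIDATES:
        try:
            return name, decompress(raw)
        except Exception:
            continue   # wrong guess, try the next algorithm
    return "uncompressed", raw

data = b"ZFS block contents " * 100
print(guess_decompress(lzma.compress(data))[0])
```

The weakness is the same as with zdb -R. 'Didn't fail' is not the same as 'is right', so a wrong algorithm that happens to produce output without an error will be accepted.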

ZFSFilePartialAndHoleStorage written at 00:14:11


Using zdb to peer into how ZFS stores files on disk

If you've read much about ZFS and ZFS performance tuning, one of the things you'll have run across is the ZFS recordsize. The usual way it's described is, for example (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.

It turns out that we can use zdb to find out the answer to this question and other interesting ones like it, and it's not even all that painful. My starting point was Bruning Questions: ZFS Record Size, which has an example of using zdb on a file in a test ZFS pool. We can actually do this with a test file on a regular pool, like so:

  • Create a test file:
    cd $HOME/tmp
    dd if=/dev/urandom of=testfile bs=160k count=1

    I'm using /dev/urandom here to defeat ZFS compression.

  • Use zdb -O to determine the object number of this file:
    ; zdb -O ssddata/homes cks/tmp/testfile
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1075431    2   128K   128K   163K     512   256K  100.00  ZFS plain file

    (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)

  • Use zdb -ddddd to dump detailed information on the object:
    # zdb -ddddd ssddata/homes 1075431
         0  L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
     20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
         segment [0000000000000000, 0000000000040000) size  256K

    See Bruning Questions: ZFS Record Size for information on what the various fields mean.

    (How many ds to use with the -d option for zdb is sort of like explosives; if it doesn't solve your problem, add more -ds until it does. This number of ds works with ZFS on Linux for me but you might need more.)

What we have here is two on-disk blocks. One is 0x20000 bytes long, or 128 Kb; the other is 0x8400 bytes long, or 33 Kb. I don't know why it's 33 Kb instead of 32 Kb, especially since zdb will also report that the file has a size of 163840 (bytes), which is exactly 160 Kb as expected. It's not the ashift on this pool, because this is the pool I made a little setup mistake on so it has an ashift of 9.

Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.
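
If you don't want to do the hex conversions in your head, a few lines of Python will pull the logical and physical sizes out of a zdb block line. The regular expression is my guess at the format based only on the output shown here, so treat it as a sketch.

```python
import re

def sizes_kb(zdb_line):
    """Return (logical, physical) size in Kb from a zdb block line.

    Looks for the '<hex>L/<hex>P' field, e.g. '20000L/8400P'.
    """
    m = re.search(r"\b([0-9a-f]+)L/([0-9a-f]+)P\b", zdb_line)
    if m is None:
        raise ValueError("no size field found")
    logical, physical = (int(field, 16) for field in m.groups())
    return logical / 1024, physical / 1024

first = "    0  L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003"
second = "20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003"
print(sizes_kb(first), sizes_kb(second))
```

For the two blocks above this gives 128 Kb logical and physical for the first, and 128 Kb logical but 33 Kb physical for the second.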

That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:

; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
# zdb -ddddd ssddata/homes 1075431
     0  L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
 20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into smaller chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).
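
As back-of-the-envelope arithmetic, the logical IO cost of a small overwrite looks like this in Python. It assumes every record the write touches already exists and has to be rewritten whole, which is a simplification of mine.

```python
K = 1024

def rewritten_bytes(write_offset, write_len, recordsize=128 * K):
    """Logical bytes rewritten when overwriting part of an existing file."""
    first = write_offset // recordsize
    last = (write_offset + write_len - 1) // recordsize
    return (last - first + 1) * recordsize

# The dd above: bs=16k seek=4, so 16 Kb written at offset 64 Kb.
cost = rewritten_bytes(4 * 16 * K, 16 * K)
amplification = cost / (16 * K)
print(cost // K, amplification)
```

So that 16 Kb write turns into 128 Kb of logical write IO, eight times what we asked for.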

Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.

; dd if=/dev/urandom of=testfile2 bs=16k count=1
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361

      segment [0000000000000000, 0000000000004000) size   16K

Now we add the hole:

; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377

      segment [0000000000000000, 000000000001c000) size  112K

The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.

(The same thing happens if you write the second 16 Kb block at 48 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)

Now that I've worked out how to use zdb for this sort of exploration, there's a number of questions about how ZFS stores files on disks that I want to look into at some point, including how compression interacts with recordsize and block sizes.

(I should probably also do some deeper exploration of what the various information zdb is reporting means. I've poked around with zdb before, but always in very 'heads down' and limited ways that didn't involve really understanding ZFS on-disk structures.)

Update: As pointed out by Robert Milkowski in the comments, I'm mistaken here and being fooled by compression being on in this filesystem. See ZFS's recordsize, holes in files, and partial blocks for the illustrated explanation of what's really going on.

ZFSZdbForFileAnalysis written at 01:18:03


Looking back at my mixed and complicated feelings about Solaris

So Oracle killed Solaris (and SPARC) a couple of weeks ago. I can't say this is surprising, although it's certainly sudden and underhanded in the standard Oracle way. Back when Oracle killed Sun, I was sad for the death of a dream, despite having had ups and downs with Sun over the years. My views about the death of Solaris are more mixed and complicated, but I will summarize them by saying that I don't feel very sad about Solaris itself (although there are things to be sad about).

To start with, Solaris has been dead for me for a while, basically ever since Oracle bought Sun and certainly since Oracle closed the Solaris source. The Solaris that the CS department used for years in a succession of fileservers was very much a product of Sun the corporation, and I could never see Oracle's Solaris as the same thing or as a successor to it. Hearing that Oracle was doing things with Solaris was distant news; it had no relevance for us and pretty much everyone else.

(Every move Oracle made after absorbing Sun came across to me as a 'go away, we don't want your business or to expand Solaris usage' thing.)

But that's the smaller piece, because I have some personal baggage and biases around Solaris itself due to my history. I started using Sun hardware in the days of SunOS, where SunOS 3 was strikingly revolutionary and worked pretty well for the time. It was followed by SunOS 4, which was also quietly revolutionary even if the initial versions had some unfortunate performance issues on our servers (we ran SunOS 4.1 on a 4/490, complete with an unfortunate choice of disk interconnect). Then came Solaris 2, which I've described as a high speed collision between SunOS 4 and System V R4.

To people reading this today, more than a quarter century removed, this probably sounds like a mostly neutral thing or perhaps just messy (since I did call it a collision). But at the time it was a lot more. In the old days, Unix was split into two sides, the BSD side and the AT&T System III/V side, and I was firmly on the BSD side along with many other people at universities; SunOS 3 and SunOS 4 and the version of Sun that produced them were basically our standard bearers, not only for BSD's superiority at the time but also their big technical advances like NFS and unified virtual memory. When Sun turned around and produced Solaris 2, it was viewed as being tilted towards being a System V system, not a BSD system. Culturally, there was a lot of feeling that this was a betrayal and Sun had debased the nice BSD system they'd had by getting a lot of System V all over it. It didn't help that Sun was unbundling the compilers around this time, in an echo of the damage AT&T's Unix unbundling did.

(Solaris 2 was Sun's specific version of System V Release 4, which itself was the product of Sun and AT&T getting together to slam System V and BSD together into a unified hybrid. The BSD side saw System V R4 as 'System V with some BSD things slathered over top', as opposed to 'BSD with some System V things added'. This is probably an unfair characterization at a technical level, especially since SVR4 picked up a whole bunch of important BSD features.)

Had I actually used Solaris 2, I might have gotten over this cultural message and come to like and feel affection for Solaris. But I never did; our 4/490 remained on SunOS 4 and we narrowly chose SGI over Sun, sending me on a course to use Irix until we started switching to Linux in 1999 (at which point Sun wasn't competitive and Solaris felt irrelevant as a result). By the time I dealt with Solaris again in 2005, open source Unixes had clearly surpassed it for sysadmin usability; they had better installers, far better package management and patching, and so on. My feelings about Solaris never really improved from there, despite increasing involvement and use, although there were aspects I liked and of course I am very happy that Sun created ZFS, put it into Solaris 10, and then released it to the world as open source so that it could survive the death of Sun and Solaris.

The summary of all of that is that I'm glad that Sun created a number of technologies that wound up in successive versions of Solaris and I'm glad that Sun survived long enough to release them into the world, but I don't have fond feelings about Solaris itself the way that many people who were more involved with it do. I cannot mourn the death of Solaris itself the way I could for Sun, because for me Solaris was never a part of any dream.

(One part of that is that my dream of Unix was the dream of workstations, not the dream of servers. By the time Sun was doing interesting things with Solaris 10, it was clearly not the operating system of the Unix desktop any more.)

(On Solaris's death in general, see this and this.)

SolarisMixedFeelings written at 23:34:48


The three different names ZFS stores for each vdev disk (on Illumos)

I sort of mentioned yesterday that ZFS keeps information on several different ways of identifying disks in pools. To be specific, it keeps three different names or ways of identifying each disk. You can see this with 'zdb -C' on a pool, so here's a representative sample:

# zdb -C rpool
MOS Configuration:
    type: 'disk'
    id: 0
    guid: 15557853432972548123
    path: '/dev/dsk/c3t0d0s0'
    devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
    phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'

The guid is ZFS's internal identifier for the disk, and is stored on the disk itself as part of the disk label. Since you have to find the disk to read it, it's not something that ZFS uses to find disks, although it is part of verifying that ZFS has found the right one. The three actual names for the disk are reported here as path, devid aka 'device id', and phys_path aka 'physical path'.
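As a rough illustration of pulling these fields out programmatically, here is a minimal Python sketch that parses the sample stanza above. The function name `disk_names` is my own invention, and it assumes the text contains only a single disk's stanza; real 'zdb -C' output for a multi-disk pool has one such stanza per disk and would need to be split up first.

```python
SAMPLE = """\
MOS Configuration:
    type: 'disk'
    id: 0
    guid: 15557853432972548123
    path: '/dev/dsk/c3t0d0s0'
    devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
    phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'
"""

def disk_names(zdb_text):
    """Return the guid and the three names from one disk's stanza.

    Hypothetical helper: assumes 'key: value' lines as in the sample above.
    """
    wanted = {"guid", "path", "devid", "phys_path"}
    found = {}
    for line in zdb_text.splitlines():
        # Split on the first ': ' only; values like '...disk@0,0:a' contain
        # a bare ':' but no ': ', so they survive intact.
        key, sep, value = line.strip().partition(": ")
        if sep and key in wanted:
            found[key] = value.strip("'")
    return found

names = disk_names(SAMPLE)
for key in ("guid", "path", "devid", "phys_path"):
    print(key, "=", names[key])
```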

The path is straightforward; it's the filesystem path to the device, which here is a conventional OmniOS (Illumos, Solaris) cNtNdNsN name typical of a plain, non-multipathed disk. As this is a directly attached SATA disk, the phys_path shows us the PCI information about the controller for the disk in the form of a PCI device name. If we pulled this disk and replaced it with a new one, both of those would stay the same, since with a directly attached disk they're based on physical topology and neither has changed. However, the devid is clearly based on the disk's identity information; it has the vendor name, the 'product id', and the serial number (as returned by the disk itself in response to SATA inquiry commands). This will be the same more or less regardless of where the disk is connected to the system, so ZFS (and anything else) can find the disk wherever it is.

(I believe that the 'id1,sd@' portion of the devid is simply giving us a namespace for the rest of it. See 'prtconf -v' for another representation of all of this information and much more.)

Multipathed disks (such as the iSCSI disks on our fileservers) look somewhat different. For them, the filesystem device name (and thus path) looks like c5t<long identifier>d0s0, the physical path is /scsivhci/disk@g<long identifier>, and the devid is not particularly useful in finding the specific physical disk because our iSCSI targets generate synthetic disk 'serial numbers' based on their slot position (and the target's hostname, which at least lets me see which target a particular OmniOS-level multipathed disk is supposed to be coming from). As it happens, I already know that OmniOS multipathing identifies disks only by their device ids, so all three names are functionally the same thing, just expressed in different forms.

If you remove a disk entirely, all three of these names go away for both plain directly attached disks and multipath disks. If you replace a plain disk with a new or different one, the filesystem path and physical path will normally still work but the devid of the old disk is gone; ZFS can open the disk but will report that it has a missing or corrupt label. If you replace a multipathed disk with a new one and the true disk serial number is visible to OmniOS, all of the old names go away since they're all (partly) based on the disk's serial number, and ZFS will report the disk as missing entirely (often simply reporting it by GUID).

Sidebar: Which disk name ZFS uses when bringing up a pool

Which name or form of device identification ZFS uses is a bit complicated. To simplify a complicated situation (see vdev_disk_open in vdev_disk.c) as best I can, the normal sequence is that ZFS starts out by trying the filesystem path but verifying the devid. If this fails, it tries the devid, the physical path, and finally the filesystem path again (but without verifying the devid this time).

Since ZFS verifies the disk label's GUID and other details after opening the disk, there is no risk that finding a random disk this way (for example by the physical path) will confuse ZFS. It'll just cause ZFS to report things like 'missing or corrupt disk label' instead of 'missing device'.
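The lookup order described in this sidebar can be sketched in Python as follows. This is only a toy model of my simplified description, not a translation of vdev_disk_open; `open_vdev_disk`, `FakeSystem`, and the device values are all made up for illustration, and the later label and GUID check is left out.

```python
class FakeSystem:
    """Hypothetical stand-in for the set of disks currently attached."""

    def __init__(self, devices):
        self.devices = devices  # list of dicts with path, devid, phys_path

    def _find(self, key, value):
        for dev in self.devices:
            if dev[key] == value:
                return dev
        return None

    def by_path(self, path):
        return self._find("path", path)

    def by_devid(self, devid):
        return self._find("devid", devid)

    def by_phys_path(self, phys_path):
        return self._find("phys_path", phys_path)


def open_vdev_disk(vdev, system):
    """Return (device, how_found), or (None, None) if nothing opens."""
    # 1. Filesystem path, accepted only if the devid also matches.
    dev = system.by_path(vdev["path"])
    if dev is not None and dev["devid"] == vdev["devid"]:
        return dev, "path+devid"
    # 2. The devid alone, wherever the disk is now attached.
    dev = system.by_devid(vdev["devid"])
    if dev is not None:
        return dev, "devid"
    # 3. The physical path.
    dev = system.by_phys_path(vdev["phys_path"])
    if dev is not None:
        return dev, "phys_path"
    # 4. The filesystem path again, without verifying the devid.
    dev = system.by_path(vdev["path"])
    if dev is not None:
        return dev, "path"
    return None, None


vdev = {"path": "/dev/dsk/c3t0d0s0", "devid": "serial-A", "phys_path": "/pci/disk@0"}
moved = FakeSystem([{"path": "/dev/dsk/c3t1d0s0", "devid": "serial-A", "phys_path": "/pci/disk@1"}])
replaced = FakeSystem([{"path": "/dev/dsk/c3t0d0s0", "devid": "serial-B", "phys_path": "/pci/disk@0"}])
print(open_vdev_disk(vdev, moved)[1])
print(open_vdev_disk(vdev, replaced)[1])
```

In this model a disk that has moved to a different port is still found through its devid, while a replacement disk in the old slot is opened through the physical path; it is the label check afterward that would then produce the 'missing or corrupt disk label' report.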

ZFSDiskNames written at 23:47:46

Things I do and don't know about how ZFS brings pools up during boot

If you import a ZFS pool explicitly, through 'zpool import', the user-mode side of the process normally searches through all of the available disks in order to find the component devices of the pool. Because it does this explicit search, it will find pool devices even if they've been shuffled around in a way that causes them to be renamed, or even (I think) drastically transformed, for example by being dd'd to a new disk. This is pretty much what you'd expect, since ZFS can't really read what the pool thinks its configuration is until it assembles the pool. When it imports such a pool, I believe that ZFS rewrites the information kept about where to find each device so that it's correct for the current state of your system.

This is not what happens when the system boots. To the best of my knowledge, for non-root pools the ZFS kernel module directly reads /etc/zfs/zpool.cache during module initialization and converts it into a series of in-memory pool configurations, which are all in an unactivated state. At some point, magic things attempt to activate some or all of these pools, which causes the kernel to attempt to open all of the devices listed as part of the pool configuration and verify that they are indeed part of the pool. The process of opening devices only uses the names and other identification of the devices that's in the pool configuration; however, one identification is a 'devid', which for many devices is basically the model and serial number of the disk. So I believe that under at least some circumstances the kernel will still be able to find disks that have been shuffled around, because it will basically seek out that model plus serial number wherever it's (now) connected to the system.

(See vdev_disk_open in vdev_disk.c for the gory details, but you also need to understand Illumos devids. The various device information available for disks in a pool can be seen with 'zdb -C <pool>'.)

To the best of my knowledge, this in-kernel activation makes no attempt to hunt around on other disks to complete the pool's configuration the way that 'zpool import' will. In theory, assuming that finding disks by their devid works, this shouldn't matter most or basically all of the time; if that disk is there at all, it should be reporting its model and serial number and I think the kernel will find it. But I don't know for sure. I also don't know how the kernel acts if some disks take a while to show up, for example iSCSI disks.

(I suspect that the kernel only makes one attempt at pool activation and doesn't retry things if more devices show up later. But this entire area is pretty opaque to me.)

These days you also have your root filesystems on a ZFS pool, the root pool. There are definitely some special code paths that seem to be invoked during boot for a ZFS root pool, but I don't have enough knowledge of the Illumos boot time environment to understand how they work and how they're different from the process of loading and starting non-root pools. I used to hear that root pools were more fragile if devices moved around and you might have to boot from alternate media in order to explicitly 'zpool import' and 'zpool export' the root pool in order to reset its device names, but that may be only folklore and superstition at this point.

ZFSPoolBootUnknowns written at 00:36:46


There will be no LTS release of the OmniOS Community Edition

At the end of my entry on how I was cautiously optimistic about OmniOS CE, I said:

[...] For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).

Well, I asked, and the answer is a pretty unambiguous 'no'. The OmniOS CE core team has no interest in maintaining an LTS release; any such extended support would have to come from someone else doing the work. The current OmniOS CE support plans are:

What we intend, is to support the current and previous release with an emphasis on the current release going forward from r151022.

OmniOS CE releases are planned to come out roughly every 26 weeks, ie every six months, so supporting the current and previous release means that you get a nominal year of security updates and so on (in practice less than a year).
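To make the 'less than a year' arithmetic concrete, here is a small Python sketch. It assumes releases land exactly every 26 weeks and that a release loses support the moment the release after next comes out; both are my simplifications rather than stated OmniOS CE policy, and the function name is made up.

```python
RELEASE_INTERVAL_WEEKS = 26  # a release roughly every 26 weeks
SUPPORTED_RELEASES = 2       # the current release and the previous one


def support_weeks_left(weeks_since_release):
    """Weeks of updates left if you install a release this long after it came out."""
    nominal = RELEASE_INTERVAL_WEEKS * SUPPORTED_RELEASES
    return max(0, nominal - weeks_since_release)


# Installing on release day gives the nominal year; waiting two months
# to qualify the release eats directly into that.
print(support_weeks_left(0))
print(support_weeks_left(8))
```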

I can't blame the OmniOS CE core team for this (and I'm not anything that I'd describe as 'disappointed'; getting not just an OmniOS CE but an OmniOS CE LTS was always a long shot). People work on what interests them, and the CE core team just doesn't use LTS releases or plan to. They're doing enough as it is to keep OmniOS alive. And for most people, upgrading from release to release is probably not a big deal.

In the short term, this means that we are not going to bother to try to upgrade from OmniOS r151014 to either the current or the next version of OmniOS CE, because the payoff of relatively temporary security support doesn't seem worth the effort. We've already been treating our fileservers as sealed appliances, so this is not something we consider a big change.

(The long term is beyond the scope of this entry.)

OmniOSCENoLTSVersion written at 01:09:13


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.