Sequential scrubs and resilvers are coming for (open-source) ZFS
Oracle has made a number of changes and improvements to Solaris ZFS since they took it closed source. Mostly I've been indifferent to their changes, but the one improvement I've long envied is their sequential resilvering and scrubbing (which apparently first appeared in Solaris 11.2). That ZFS scrubs and resilvers aren't sequential has long been a quiet pain point for a lot of people. Apparently it's especially bad for RAID-Z pools (perhaps because of the usual RAID-Z random read issue), but it's been an issue for us in the past with mirrors (although we managed to speed that up).
Well, there's great news here for all open source ZFS implementations, including Illumos distributions, because an implementation of sequential scrubs and resilvers just landed in ZFS on Linux in this commit (apparently it'll be included in ZoL 0.8 whenever that's released). The ZFS on Linux work was done by Tom Caputi of Datto, building on work done by Saso Kiselkov of Nexenta. Saso Kiselkov's work was presented at the 2016 OpenZFS developer summit and got an OpenZFS wiki summary page; Tom Caputi presented at the 2017 summit. Both have slides (and talk videos) if you want more information on how this works.
(It appears that the Nexenta work may be 'NEX-6068', included in NexentaStor 5.0.3. I can't find a current public source tree for Nexenta, so I don't know anything more than that.)
For how it works, I'll just quote from the commit message:
This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
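As a rough illustration of the two phases the commit message describes, here is a minimal Python sketch. It is not the ZFS code: the names (`ScrubIO`, `scan_metadata`, `issue_order`, `seek_distance`) are made up for this example, it sorts only by disk offset (the commit message says size is also involved), and it ignores things like memory limits that a real implementation has to care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrubIO:
    offset: int  # byte offset of the block on disk
    size: int    # block size in bytes


def scan_metadata(block_pointers):
    """Phase 1: walk the pool structure and queue one I/O per block.

    `block_pointers` stands in for the (offset, size) pairs found while
    traversing metadata; they arrive in logical order, not disk order.
    """
    return [ScrubIO(offset, size) for offset, size in block_pointers]


def issue_order(queue):
    """Phase 2: issue the queued I/Os in ascending disk offset."""
    return sorted(queue, key=lambda io: io.offset)


def seek_distance(ios):
    """Total head movement if the I/Os are issued in the given order."""
    pos, total = 0, 0
    for io in ios:
        total += abs(io.offset - pos)
        pos = io.offset + io.size
    return total


# Four 8 Kb blocks discovered in an order that jumps around the disk.
pointers = [(0x8000, 0x2000), (0x0, 0x2000), (0x4000, 0x2000), (0x2000, 0x2000)]
queue = scan_metadata(pointers)
print("unsorted:", seek_distance(queue))
print("sorted:  ", seek_distance(issue_order(queue)))
```

In this toy case, issuing the I/Os in discovery order moves the imaginary disk head twelve times further than issuing them in sorted order.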
My early experience with this in the current ZoL git tree has been very positive. I saw a single-vdev mirror pool on HDs with 293 GB used go from a scrub time of two hours and 25 minutes to one hour and ten minutes.
Although this is very early days for this feature even in ZFS on Linux, I'd expect it to get pushed (or pulled) upstream later and thus go into Illumos. I have no idea when that might happen; it might be reasonable to wait until ZFS on Linux has included it in an actual release so that it sees some significant testing in the field. Or people could find this an interesting and important enough change that they actively work to bring it upstream, if only for testing there.
(At this point I haven't spotted any open issues about this in the Illumos issue tracker, but as mentioned I don't really expect that yet unless someone wants to get a head start.)
PS: Unlike Oracle's change for Solaris 11.2, which apparently needed a pool format change (Oracle version 35, according to Wikipedia), the ZFS on Linux implementation needs no new pool feature and so is fully backward compatible. I'd expect this to be true for any eventual Illumos version unless people find some hard problem that forces the addition of a new pool feature.
mountd caches netgroup lookups (relatively briefly)
Last time I covered how the Illumos NFS server caches filesystem
access permissions. However, this is not
the only level of caching that's possibly going on in the overall
NFS server ecosystem, because the Illumos NFS kernel ultimately
calls up to mountd to find out about permissions, and mountd can
have its own caching.
mountd caches netgroup membership checks for 60
seconds. Well, sort of. What it really caches is the result of
whether a host is in a specific list of netgroups, not whether or
not a host is in any particular netgroup. This may sound like a
silly distinction, but consider a NFS export (in ZFS format) whose
rw= option lists two netgroups, group1 and one other, and whose
root= option lists only group1.
This export will always generate two cache entries, one for the
rw= set of two groups and one for the root= single group. This
is true even if a host is in group1 (and so gets a positive result
in both entries). On the one hand, this probably doesn't matter too
much, as the cache has no size limits. On the other hand, the cache
is also a simple linked list, so let's hope it never grows too big.
(As you might guess from this, the cache is pretty brute force. That's probably okay.)
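To make the 'list of netgroups, not a single netgroup' distinction concrete, here is a small Python sketch of a cache keyed on the whole list. Everything here is an assumption for illustration: the class and function names are invented, `group2` is a made-up second netgroup, and I've used a dictionary where the text says mountd uses a simple linked list.

```python
import time

CACHE_TTL = 60  # seconds, per the text above


class NetgroupListCache:
    """Caches 'is this host in any of these netgroups?' answers."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup   # hypothetical: (host, groups) -> bool
        self._clock = clock
        self._entries = {}      # (host, groups) -> (answer, stored_at)

    def check(self, host, groups):
        key = (host, tuple(groups))  # the whole list is the key
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < CACHE_TTL:
            return hit[0]
        answer = self._lookup(host, groups)
        self._entries[key] = (answer, now)
        return answer


calls = []


def fake_lookup(host, groups):
    """Stand-in for a real (expensive) netgroup lookup."""
    calls.append(tuple(groups))
    return "group1" in groups


cache = NetgroupListCache(fake_lookup)
cache.check("client1", ["group1", "group2"])  # the rw= list
cache.check("client1", ["group1"])            # the root= list
print(len(cache._entries), "cache entries after", len(calls), "lookups")
```

Even though the host is in group1 both times, the two checks can't share a cache entry because they're asking about different lists.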
mountd and thus this netgroup cache gets involved in
two different situations. First you'll have
the actual NFS mount request itself from the client, which will go
straight to mountd, check the exports, and return appropriate
information to the client. Then when the client tries to actually
do an NFS operation with its shiny new mount, the kernel may or
perhaps will upcall back to mountd for another permission check.
This matters to us because of our custom NFS mount authorization scheme, which does its magic by hooking into netgroup lookups. Both negative and positive caching in mountd are a potential problem for us, although negative caching is usually worse since it means that a host with a verification glitch now has to wait roughly a minute before it can usefully retry a mount request. At the same time, some caching is definitely useful; as the comment in the source code says, mount requests often come in close bursts from the same machine (as it mounts a whole bunch of filesystems with the same export permissions), and only doing expensive things once for that burst is a clear win.
(Interested parties who want to see this particular sausage being made can look in the relevant Illumos source code. It looks like this code hasn't changed for a very long time.)
The Illumos NFS server's caching of filesystem access permissions
Years ago I wrote The Solaris 10 NFS server's caching of filesystem access permissions. I was recently digging in this area of the Illumos source code and discovered that there have been a few changes, so here is a brief update. The background is that the Illumos NFS server code, like basically all modern NFS servers, does not maintain a full list of what clients are authorized to access what filesystems. Instead it maintains a cache and upcalls to user level code whenever it feels that the cache has insufficient information.
As in Solaris 10, the Illumos kernel NFS authorization cache holds both positive and negative entries on a per-filesystem basis. However, in Illumos this cache now sort of has a timeout; if a cache entry is older than 600 seconds (ten minutes), the kernel will try to refresh it the next time the entry is used. This attempt to refresh the entry doesn't immediately cause it to expire or be revalidated; instead, it's added to a queue for the refresh thread to process. Until the refresh queue gets around to processing the entry (and gets an answer back from its upcall), the kernel will continue to use the current cached state as the best current answer.
(As in Solaris 10, the cache for a filesystem is discarded entirely if the filesystem is unshared or reshared, including being reshared with exactly the same settings.)
As far as I can tell, this refreshing only happens when the entry is used. There doesn't appear to be anything that runs around trying to revalidate old entries. So you can try a mount once, get a failure, have that failure cached in the kernel, come back a day later, try the mount again, and for at least the first access the kernel will still use that day-old cached entry unless memory pressure has pushed it out in the meantime.
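The 'use the stale answer now, refresh it later' behaviour can be sketched in a few lines of Python. This is only a model of what's described above, not the kernel code; `AuthCache`, `run_refresh_once` and the `upcall` callable are invented names, and not queueing the same entry twice is my assumption.

```python
import time
from collections import deque

REFRESH_AGE = 600  # seconds; older entries get queued for a refresh


class AuthCache:
    def __init__(self, upcall, clock=time.monotonic):
        self._upcall = upcall         # hypothetical stand-in for asking mountd
        self._clock = clock
        self._entries = {}            # (client, fs) -> [allowed, fetched_at]
        self.refresh_queue = deque()  # keys waiting for the refresh thread

    def lookup(self, client, fs):
        key = (client, fs)
        entry = self._entries.get(key)
        if entry is None:
            # No entry at all: upcall directly in the lookup path.
            entry = [self._upcall(client, fs), self._clock()]
            self._entries[key] = entry
            return entry[0]
        stale = self._clock() - entry[1] > REFRESH_AGE
        if stale and key not in self.refresh_queue:
            self.refresh_queue.append(key)
        # Stale or not, the current cached answer is what gets used.
        return entry[0]

    def run_refresh_once(self):
        """What a single refresh thread would do for one queued entry."""
        if not self.refresh_queue:
            return
        key = self.refresh_queue.popleft()
        self._entries[key] = [self._upcall(*key), self._clock()]


# A client that is denied, then added to the netgroup, then retries a day later.
now = [0.0]
permitted = {"value": False}
cache = AuthCache(lambda client, fs: permitted["value"], clock=lambda: now[0])
first = cache.lookup("client1", "/h/100")
permitted["value"] = True
now[0] = 24 * 60 * 60.0
day_later = cache.lookup("client1", "/h/100")   # still the old cached answer
cache.run_refresh_once()
after_refresh = cache.lookup("client1", "/h/100")
print(first, day_later, after_refresh)
```

The interesting case is the middle lookup: the entry is a day old, but all the lookup does is queue a refresh and hand back the cached denial.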
(The easiest way for this to happen is for a client to try a NFS mount before it's been added to the netgroup that controls access. Merely updating the netgroup membership doesn't re-export the filesystem and thus doesn't flush the authorization cache for it.)
As far as I can tell, the refresh process is single-threaded; only
one refresh thread is started, and it only makes one upcall at a
time. The initial upcalls to
mountd (when there's no existing
authorization cache entry for a client/filesystem combination) are
done directly in the NFS authorization lookup and so there can be
several of them at once, although presumably there are limits on
simultaneous requests and so on.
The cache size continues to be unlimited and shrinks only under memory pressure (if that ever happens; it doesn't appear to on our OmniOS NFS servers). During shrinking, only cache entries that have been unused for at least 60 minutes are candidates to be discarded; entries in active use are never dropped. Entries are kept active by clients doing NFS operations to filesystems, so if you never touch a particular filesystem from a particular client, the cache entry may eventually become a candidate for eviction.
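The shrinking rule is simple enough to write down as a short Python sketch. The function name and the data layout (a dictionary of last-used times) are assumptions made for illustration only.

```python
RECLAIM_IDLE = 60 * 60  # seconds an entry must be unused before it can go


def reclaim(entries, now):
    """Drop idle entries in place; `entries` maps key -> last-used time.

    Returns how many entries were discarded. This models what happens under
    memory pressure; nothing shrinks the cache otherwise.
    """
    idle = [key for key, last_used in entries.items()
            if now - last_used >= RECLAIM_IDLE]
    for key in idle:
        del entries[key]
    return len(idle)


entries = {("clientA", "/h/100"): 0, ("clientB", "/h/100"): 3000}
dropped = reclaim(entries, now=3600)
print(dropped, "dropped;", len(entries), "kept")
```

Here clientA's entry has been idle for a full hour and goes away, while clientB's entry was used ten minutes ago and stays.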
(But note that this is any NFS operation, including things like
Sidebar: Illumos NFS authorization cache stats
As in Solaris 10, the easiest way to get access to cache stats is
mdb -k. Illumos has added some additional stats beyond
nfsauth_cache_refresh counts how
many refreshes have been queued up;
exi_cache_clnt_reclaim_failed appear to count a couple of ways
that reclaims due to kernel memory pressure can fail.
There are a number of DTrace probes embedded in this whole process. I haven't looked into this enough to say anything about them, so you're going to need to read the source code.
Our frustrations with OmniOS's 'KYSTY' minimalism
OmniOS famously follows a principle called KYSTY, where OmniOS itself ships with minimal amounts of software (and the versions can be out of date). As far as I know, OmniOS CE has continued this practice, which has an obvious appeal for people trying to maintain an OS distribution on limited amounts of time (especially a LTS version, where you might be stuck patching old versions of programs that aren't supported upstream any more). All of this is well and good, but in practice the results of this KYSTY approach have been one of our significant points of frustration with OmniOS.
As sysadmins operating servers (primarily Linux ones), we have come
to expect that our systems will have a certain basic collection of
workable standard programs that we use for basic system management.
For instance, we want every system to be able to send us email, and we really want to do this
with Postfix (Exim is an acceptable substitute). Almost every system
needs a program that can talk to disks to get SMART information,
and while there are alternatives to tcpdump, we have it
everywhere else and we really want one standard program. I could
go on; there's an entire collection of things that we consider
standard that just aren't there on a baseline OmniOS machine.
(I can't not mention
We were able to mostly fix this with various third party package
sources, but the result is complicated, requires a large magic
$PATH in order to work relatively seamlessly, has gaps, and is
quietly fragile over the long term. As an example of something that
has quietly worried me, at this point there's probably no way to
exactly reproduce one of our fileservers
because it's very likely that at least some of the third party
package sources we use have moved on from the package versions we
installed. Does this matter? Probably not, which is why we didn't
spend a significant amount of effort to figure out how to get and
freeze local copies of all those packages.
(The exact version of
top that's installed is probably not important
for our NFS fileservers. We could even live without
top at all,
although it would be annoying.)
I sympathize with OmniOS here in the abstract, but in the concrete it was and is a point of friction when we work with our OmniOS machines. They're different, and from our biased perspective, gratuitously so. The result makes our life harder and leaves us less happy with OmniOS.
(I think that a great deal of the problems could be removed if there was an OmniOS CE equivalent of Ubuntu's 'universe' repository and it could easily be enabled. The main OmniOS CE developers wouldn't be responsible for maintaining software there; instead it would be open for reasonably vetted community contributions. Officially embracing pkgsrc might be another option, but I don't like that as much for various reasons.)
ZFS's recordsize, holes in files, and partial blocks
Yesterday I wrote about using
zdb to peer into ZFS's on-disk
storage of files, and in particular I
wondered if you wrote a 160 Kb file, would ZFS really use two
128 Kb blocks for it. The answer appeared to be 'no', but I was
a little bit confused by some things I was seeing. In a comment,
Robert Milkowski set me right:
In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).
He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.
The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:
    0 L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
    20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368
    segment [0000000000000000, 0000000000040000) size 256K
These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.
One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
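To put numbers on this, here is a small Python sketch that decodes zdb's hexadecimal size fields and adds up the physical space for the same 160 Kb file with compression off and on. The helper names are made up for this example; the size fields are the ones zdb reported for the two versions of the file.

```python
def parse_sizes(field):
    """Split a zdb size field such as '20000L/8400P' into byte counts."""
    logical, physical = field.split("/")
    return int(logical.rstrip("L"), 16), int(physical.rstrip("P"), 16)


def kb(nbytes):
    return nbytes / 1024


uncompressed_fs = ["20000L/20000P", "20000L/20000P"]
compressed_fs = ["20000L/20000P", "20000L/8400P"]

for name, blocks in (("compression off", uncompressed_fs),
                     ("compression on", compressed_fs)):
    physical = sum(parse_sizes(block)[1] for block in blocks)
    print(f"{name}: {kb(physical):g} Kb on disk")
```

With compression off the file takes 256 Kb of disk space; with compression on it takes 161 Kb, even though none of the actual data is compressible.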
A more interesting test file has holes that cover an entire recordsize
block. Let's make one that has 128 Kb of data, skips the second 128 Kb
block entirely, has 32 Kb of data at the end of the third 128 Kb block,
skips the fourth 128 Kb block, and has 32 Kb of data at the end of the
fifth 128 Kb block. Set up with
dd, this is:
    dd if=/dev/urandom of=testfile2 bs=128k count=1
    dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
    dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
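If you want to double check that those seek values land where the description says, the arithmetic is easy to do in Python. This is just a sketch with invented helper names; it assumes dd's usual behaviour that seek= counts in units of bs= on the output file.

```python
RECORDSIZE = 128 * 1024


def dd_write_range(bs, seek, count=1):
    """Byte range [start, end) that 'dd bs=... seek=... count=...' writes."""
    start = bs * seek
    return start, start + bs * count


def records_touched(start, end, recordsize=RECORDSIZE):
    """Zero-based indexes of the recordsize blocks a byte range overlaps."""
    return list(range(start // recordsize, (end - 1) // recordsize + 1))


writes = [
    dd_write_range(128 * 1024, seek=0),
    dd_write_range(32 * 1024, seek=11),
    dd_write_range(32 * 1024, seek=19),
]
for start, end in writes:
    print(hex(start), hex(end), records_touched(start, end))
```

The three writes end at 0x20000, 0x60000 and 0xa0000, which are the ends of the first, third and fifth 128 Kb blocks (indexes 0, 2 and 4 when counting from zero).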
Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:
    # zdb -vv -O ssddata/homes cks/tmp/testfile2
    [...]
    Indirect blocks:
    0 L1 0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
    0 L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
    40000 L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
    80000 L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016
    segment [0000000000000000, 0000000000020000) size 128K
    segment [0000000000040000, 0000000000060000) size 128K
    segment [0000000000080000, 00000000000a0000) size 128K
The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:
    # zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
    [...]
    0 L1 DVA=<0:8a2c4e2c00:400> DVA=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
    0 L0 DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
    40000 L0 DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
    80000 L0 DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
(That we need to use both -vv and -bbbb here is due to how
zdb's code is set up, and it's rather a hack to get what we want.
I had to read the zdb source code to work out how to make it work.)
Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.
This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
It's sometimes possible to dump the block contents of things like
L1 indirect blocks, so you can see a more direct representation
of them. This is where it's important to know that metadata is
compressed, so we can ask
zdb to decompress it with a magic block flag:
    # zdb -R ssddata 0:8a2c4e2c00:400:id
    [...]
    DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
    HOLE [L0 unallocated] size=200L birth=0L
    DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
    HOLE [L0 unallocated] size=200L birth=0L
    DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
    HOLE [L0 unallocated] size=200L birth=0L
    [...]
Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
PS: I'm not using 'zdb -ddddd' today because when I dug deeper into
zdb, I discovered that 'zdb -O' would already report this
information when given the right arguments, thereby saving me an
extra step.
Sidebar: Why you can't always dump blocks with zdb -R
To decompress a (ZFS) block, you need to know what it's compressed
with and its uncompressed size. This information is stored in
whatever metadata points to the block, not in the block itself, and
zdb -R simply guesses repeatedly until it gets a
result that appears to work out right:
    # zdb -R ssddata 0:8a2c4e2c00:400:id
    Found vdev type: mirror
    Trying 00400 -> 00600 (inherit)
    Trying 00400 -> 00600 (on)
    Trying 00400 -> 00600 (uncompressed)
    Trying 00400 -> 00600 (lzjb)
    Trying 00400 -> 00600 (empty)
    Trying 00400 -> 00600 (gzip-1)
    Trying 00400 -> 00600 (gzip-2)
    [...]
    Trying 00400 -> 20000 (lz4)
    DVA=<0:8a4afe7e00:20000> [...]
The result that
zdb -R gets may or may not actually be correct,
and thus may or may not give you the actual decompressed block data.
Here it worked; at other times I've tried it, not so much. The last
attempt that zdb -R prints is the one it thinks is correct, so
you can at least see if it got it right (here, for example, we know
that it did, since it picked lz4 with a true logical size of 0x20000
and that's what the metadata we have about the L1 indirect block says).
Perhaps someday zdb -R will gain a way of specifying the compression
algorithm and the logical size for the d block flag.
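The general 'guess until something appears to work' approach looks something like the following Python sketch. This is not zdb's code and the function name is invented. Python's standard library has no lz4 or lzjb, so zlib, bz2 and lzma stand in as the candidate algorithms, and 'works' here means only that decompression succeeds and produces the size being guessed.

```python
import bz2
import lzma
import zlib

CANDIDATES = [
    ("zlib", zlib.decompress),
    ("bz2", bz2.decompress),
    ("lzma", lzma.decompress),
]
DECOMPRESS_ERRORS = (zlib.error, lzma.LZMAError, OSError, ValueError)


def guess_decompress(raw, candidate_sizes):
    """Try every (size, algorithm) pair; return the first that 'works'."""
    for size in candidate_sizes:
        for name, decompress in CANDIDATES:
            try:
                data = decompress(raw)
            except DECOMPRESS_ERRORS:
                continue
            if len(data) == size:
                return name, size, data
    return None


# An 8 Kb block of zeros, compressed without recording how or how big it was.
block = zlib.compress(bytes(0x2000))
result = guess_decompress(block, [0x600, 0x2000])
print(result[0], hex(result[1]))
```

A guesser like this has nothing to check its answer against except 'it decompressed to a plausible size', which is why the result has to be compared to what the metadata says.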
Using zdb to peer into how ZFS stores files on disk
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.
It turns out that we can use
zdb to find out the answer to this
question and other interesting ones like it, and it's not even all
that painful. My starting point was Bruning Questions: ZFS Record
Size, which has an example of using
zdb on a file in a test ZFS pool.
We can actually do this with a test file on a regular pool, like
this:
- Create a test file:
    cd $HOME/tmp
    dd if=/dev/urandom of=testfile bs=160k count=1
  I'm using /dev/urandom here to defeat ZFS compression.
- Use zdb -O to determine the object number of this file:
    ; zdb -O ssddata/homes cks/tmp/testfile
    Object lvl iblk dblk dsize dnsize lsize %full type
    1075431 2 128K 128K 163K 512 256K 100.00 ZFS plain file
  (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)
- Use zdb -ddddd to dump detailed information on the object:
    # zdb -ddddd ssddata/homes 1075431
    [...]
    0 L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
    20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
    segment [0000000000000000, 0000000000040000) size 256K
  See Bruning Questions: ZFS Record Size for information on what the various fields mean.
  (The number of ds to use with zdb is sort of like explosives; if it doesn't solve your problem, add more
  -ds until it does. This number of ds works with ZFS on Linux for me but you might need more.)
What we have here is two on-disk blocks. One is 0x20000 bytes long,
or 128 Kb; the other is 0x8400 bytes long, or 33 Kb. I don't know
why it's 33 Kb instead of 32 Kb, especially since
zdb will also
report that the file has a
size of 163840 (bytes), which is exactly
160 Kb as expected. It's not the
ashift on this pool, because
this is the pool I made a little setup mistake on so it has an
ashift of 9.
Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.
That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:
    ; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
    # zdb -ddddd ssddata/homes 1075431
    [...]
    0 L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
    20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into some chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).
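The cost of that read-modify-write is easy to estimate. The following Python sketch is only arithmetic under a simplifying assumption, namely that the affected part of the file is made of full recordsize blocks; the function name is made up.

```python
RECORDSIZE = 128 * 1024


def rewrite_cost(offset, length, recordsize=RECORDSIZE):
    """Logical bytes rewritten when overwriting [offset, offset+length)
    inside a file made of full recordsize blocks."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return (last - first + 1) * recordsize


# dd bs=16k seek=4 count=1: 16 Kb written at offset 64 Kb.
cost = rewrite_cost(offset=4 * 16 * 1024, length=16 * 1024)
print(cost // 1024, "Kb rewritten for a 16 Kb write,",
      cost // (16 * 1024), "times the data")
```

A 16 Kb write that happens to straddle two 128 Kb blocks would be worse still, since both blocks would have to be rewritten.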
Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1
    # zdb -ddddd ssddata/homes 1078183
    [...]
    0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361
    segment [0000000000000000, 0000000000004000) size 16K
Now we add the hole:
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
    [...]
    # zdb -ddddd ssddata/homes 1078183
    [...]
    0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377
    segment [0000000000000000, 000000000001c000) size 112K
The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.
(The same thing happens if you write the second 16 Kb block at 56 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)
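If you don't have dd handy, the same sort of file with a hole can be made from Python by seeking past the end of the file before writing. This is a sketch, not part of the original experiment; `make_sparse` is an invented name, it writes to a temporary directory rather than a ZFS filesystem, and how much space actually gets allocated depends entirely on the filesystem underneath.

```python
import os
import tempfile

CHUNK = 16 * 1024


def make_sparse(path):
    """Write 16 Kb at offset 0 and 16 Kb at offset 96 Kb, like the dd runs."""
    with open(path, "wb") as f:
        f.write(os.urandom(CHUNK))
        f.seek(6 * CHUNK)  # same position as dd's seek=6 with bs=16k
        f.write(os.urandom(CHUNK))
    return os.stat(path)


with tempfile.TemporaryDirectory() as tmp:
    st = make_sparse(os.path.join(tmp, "testfile2"))
    print("apparent size:", st.st_size, hex(st.st_size))
    # st_blocks is in 512-byte units on POSIX systems; whether it comes to
    # less than the apparent size depends on the filesystem.
    print("allocated bytes:", getattr(st, "st_blocks", 0) * 512)
```

The apparent size comes out as 0x1c000 bytes (112 Kb), which matches the logical size that zdb reports for the block.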
Now that I've worked out how to use
zdb for this sort of exploration,
there's a number of questions about how ZFS stores files on disks
that I want to look into at some point, including how compression
interacts with recordsize and block sizes.
(I should probably also do some deeper exploration of what the
information zdb is reporting means. I've poked around with
zdb before, but always in very 'heads
down' and limited ways that didn't involve really understanding
ZFS on-disk structures.)
Update: As pointed out by Robert Milkowski in the comments,
I'm mistaken here and being fooled by compression being on in this
filesystem. See ZFS's
recordsize, holes in files, and partial blocks for the illustrated explanation of
what's really going on.
Looking back at my mixed and complicated feelings about Solaris
So Oracle killed Solaris (and SPARC) a couple of weeks ago. I can't say this is surprising, although it's certainly sudden and underhanded in the standard Oracle way. Back when Oracle killed Sun, I was sad for the death of a dream, despite having had ups and downs with Sun over the years. My views about the death of Solaris are more mixed and complicated, but I will summarize them by saying that I don't feel very sad about Solaris itself (although there are things to be sad about).
To start with, Solaris has been dead for me for a while, basically ever since Oracle bought Sun and certainly since Oracle closed the Solaris source. The Solaris that the CS department used for years in a succession of fileservers was very much a product of Sun the corporation, and I could never see Oracle's Solaris as the same thing or as a successor to it. Hearing that Oracle was doing things with Solaris was distant news; it had no relevance for us and pretty much everyone else.
(Every move Oracle made after absorbing Sun came across to me as a 'go away, we don't want your business or to expand Solaris usage' thing.)
But that's the smaller piece, because I have some personal baggage and biases around Solaris itself due to my history. I started using Sun hardware in the days of SunOS, where SunOS 3 was strikingly revolutionary and worked pretty well for the time. It was followed by SunOS 4, which was also quietly revolutionary even if the initial versions had some unfortunate performance issues on our servers (we ran SunOS 4.1 on a 4/490, complete with an unfortunate choice of disk interconnect). Then came Solaris 2, which I've described as a high speed collision between SunOS 4 and System V R4.
To people reading this today, more than a quarter century removed, this probably sounds like a mostly neutral thing or perhaps just messy (since I did call it a collision). But at the time it was a lot more. In the old days, Unix was split into two sides, the BSD side and the AT&T System III/V side, and I was firmly on the BSD side along with many other people at universities; SunOS 3 and SunOS 4 and the version of Sun that produced them were basically our standard bearers, not only for BSD's superiority at the time but also their big technical advances like NFS and unified virtual memory. When Sun turned around and produced Solaris 2, it was viewed as being tilted towards being a System V system, not a BSD system. Culturally, there was a lot of feeling that this was a betrayal and Sun had debased the nice BSD system they'd had by getting a lot of System V all over it. It didn't help that Sun was unbundling the compilers around this time, in an echo of the damage AT&T's Unix unbundling did.
(Solaris 2 was Sun's specific version of System V Release 4, which itself was the product of Sun and AT&T getting together to slam System V and BSD together into a unified hybrid. The BSD side saw System V R4 as 'System V with some BSD things slathered over top', as opposed to 'BSD with some System V things added'. This is probably an unfair characterization at a technical level, especially since SVR4 picked up a whole bunch of important BSD features.)
Had I actually used Solaris 2, I might have gotten over this cultural message and come to like and feel affection for Solaris. But I never did; our 4/490 remained on SunOS 4 and we narrowly chose SGI over Sun, sending me on a course to use Irix until we started switching to Linux in 1999 (at which point Sun wasn't competitive and Solaris felt irrelevant as a result). By the time I dealt with Solaris again in 2005, open source Unixes had clearly surpassed it for sysadmin usability; they had better installers, far better package management and patching, and so on. My feelings about Solaris never really improved from there, despite increasing involvement and use, although there were aspects I liked and of course I am very happy that Sun created ZFS, put it into Solaris 10, and then released it to the world as open source so that it could survive the death of Sun and Solaris.
The summary of all of that is that I'm glad that Sun created a number of technologies that wound up in successive versions of Solaris and I'm glad that Sun survived long enough to release them into the world, but I don't have fond feelings about Solaris itself the way that many people who were more involved with it do. I cannot mourn the death of Solaris itself the way I could for Sun, because for me Solaris was never a part of any dream.
(One part of that is that my dream of Unix was the dream of workstations, not the dream of servers. By the time Sun was doing interesting things with Solaris 10, it was clearly not the operating system of the Unix desktop any more.)
The three different names ZFS stores for each vdev disk (on Illumos)
I sort of mentioned yesterday that ZFS keeps
information on several different ways of identifying disks in pools.
To be specific, it keeps three different names or ways of identifying
each disk. You can see this with '
zdb -C' on a pool, so here's
a representative sample:
    # zdb -C rpool
    MOS Configuration:
    [...]
    children:
    type: 'disk'
    id: 0
    guid: 15557853432972548123
    path: '/dev/dsk/c3t0d0s0'
    devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
    phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'
    [...]
The guid is ZFS's internal identifier for the disk,
and is stored on the disk itself as part of the disk label. Since
you have to find the disk to read it, it's not something that ZFS
uses to find disks, although it is part of verifying that ZFS has
found the right one. The three actual names for the disk are reported
as path, devid aka 'device id', and phys_path aka 'physical path'.
The path is straightforward; it's the filesystem path to the
device, which here is a conventional OmniOS (Illumos, Solaris)
cNtNdNsN name typical of a plain, non-multipathed disk. As this
is a directly attached SATA disk, the
phys_path shows us the
PCI information about the controller for the disk in the form of
a PCI device name. If we pulled this
disk and replaced it with a new one, both of those would stay the
same, since with a directly attached disk they're based on physical
topology and neither has changed. However, the
devid is clearly
based on the disk's identity information; it has the vendor name,
the 'product id', and the serial number (as returned by the disk
itself in response to SATA inquiry commands). This will be the same
more or less regardless of where the disk is connected to the system,
so ZFS (and anything else) can find the disk wherever it is.
(I believe that the 'id1,sd@' portion of the devid is simply
giving us a namespace for the rest of it. See 'prtconf -v' for
another representation of all of this information and much more.)
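If you want to pull these names out of 'zdb -C' output mechanically, something like the following may do as a starting point. This is a minimal sketch, not a real parser: it assumes the simple "key: value" line layout of the sample above, the helper names (parse_vdev_fields, disk_names) are made up for illustration, and the sample text is just the fields from the output shown earlier.

```python
import re

# The per-disk fields from the 'zdb -C rpool' sample shown earlier.
SAMPLE = """\
type: 'disk'
id: 0
guid: 15557853432972548123
path: '/dev/dsk/c3t0d0s0'
devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'
"""

# Matches "key: value" and "key: 'value'" lines; the quotes are optional.
FIELD_RE = re.compile(r"^\s*(\w+):\s*'?([^']*?)'?\s*$")


def parse_vdev_fields(text):
    """Return a dict of the key/value pairs found in zdb-style text.

    Assumption: one field per line, as in the sample. Lines that do not
    look like "key: value" (for example '[...]') are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def disk_names(fields):
    """Pick out just the three names ZFS keeps for the disk."""
    return {k: fields.get(k) for k in ("path", "devid", "phys_path")}


if __name__ == "__main__":
    fields = parse_vdev_fields(SAMPLE)
    print("guid:", fields["guid"])
    for name, value in disk_names(fields).items():
        print(name, "=", value)
```

Note that the guid is kept as a string here; it is only an identifier, so there is no need to treat it as a number.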
Multipathed disks (such as the iSCSI disks on our fileservers) look somewhat different. For them, the
filesystem device name (and thus
path) looks like
identifier>d0s0, the physical path is
, and the devid is not particularly useful in finding
the specific physical disk because our iSCSI targets generate
synthetic disk 'serial numbers' based on their slot position (and
the target's hostname, which at least lets me see which target a
particular OmniOS-level multipathed disk is supposed to be coming
from). As it happens, I already know that OmniOS multipathing
identifies disks only by their device ids,
so all three names are functionally the same thing, just expressed
in different forms.
If you remove a disk entirely, all three of these names go away for both plain directly attached disks and multipath disks. If you replace a plain disk with a new or different one, the filesystem path and physical path will normally still work but the devid of the old disk is gone; ZFS can open the disk but will report that it has a missing or corrupt label. If you replace a multipathed disk with a new one and the true disk serial number is visible to OmniOS, all of the old names go away since they're all (partly) based on the disk's serial number, and ZFS will report the disk as missing entirely (often simply reporting it by GUID).
Sidebar: Which disk name ZFS uses when bringing up a pool
Which name or form of device identification ZFS uses is a bit
complicated. To simplify a complicated situation (see vdev_disk.c)
as best I can, the normal sequence is that ZFS starts out by trying
the filesystem path but verifying the devid. If this fails, it tries
the devid, the physical path, and finally the filesystem path again
(but without verifying the devid this time).
Since ZFS verifies the disk label's GUID and other details after opening the disk, there is no risk that finding a random disk this way (for example by the physical path) will confuse ZFS. It'll just cause ZFS to report things like 'missing or corrupt disk label' instead of 'missing device'.
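The lookup order and the label check described above can be modelled in a few lines. This is a toy simulation of my simplified description, not the kernel's actual code; the Disk and VdevConfig classes, the function names, and the short 'OLDDISK'/'NEWDISK' devids are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Disk:
    """A disk currently visible to the system (hypothetical model)."""
    path: str
    devid: str
    phys_path: str
    label_guid: Optional[int]  # None models a disk with no ZFS label


@dataclass
class VdevConfig:
    """The guid and three names the pool configuration records for a disk."""
    guid: int
    path: str
    devid: str
    phys_path: str


def find_disk(cfg: VdevConfig, disks: List[Disk]) -> Tuple[Optional[Disk], str]:
    """Try the names in the order described in the text."""
    by_path = {d.path: d for d in disks}
    by_devid = {d.devid: d for d in disks}
    by_phys = {d.phys_path: d for d in disks}

    # First attempt: filesystem path, accepted only if the devid matches.
    d = by_path.get(cfg.path)
    if d is not None and d.devid == cfg.devid:
        return d, "path (devid verified)"

    # Fallbacks: devid, then physical path, then the path again
    # without checking the devid this time.
    fallbacks = (
        ("devid", by_devid, cfg.devid),
        ("phys_path", by_phys, cfg.phys_path),
        ("path (devid not verified)", by_path, cfg.path),
    )
    for how, table, key in fallbacks:
        d = table.get(key)
        if d is not None:
            return d, how
    return None, "nothing"


def open_vdev(cfg: VdevConfig, disks: List[Disk]) -> str:
    """Find a disk, then verify the label guid as ZFS does after opening."""
    disk, how = find_disk(cfg, disks)
    if disk is None:
        return "missing device"
    if disk.label_guid != cfg.guid:
        return "missing or corrupt disk label"
    return "ok via " + how


# Hypothetical configuration, loosely based on the rpool sample.
CFG = VdevConfig(
    guid=15557853432972548123,
    path="/dev/dsk/c3t0d0s0",
    devid="id1,sd@OLDDISK/a",
    phys_path="/pci@0,0/pci15d9,714@1f,2/disk@0,0:a",
)

if __name__ == "__main__":
    swapped = Disk(CFG.path, "id1,sd@NEWDISK/a", CFG.phys_path, None)
    print(open_vdev(CFG, [swapped]))
    print(open_vdev(CFG, []))
```

In this model a fresh disk swapped into the same slot is found by physical path and then fails the label check, while an empty slot gives 'missing device', which matches the behaviour described for plain directly attached disks.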
Things I do and don't know about how ZFS brings pools up during boot
If you import a ZFS pool explicitly, through 'zpool import', the
user-mode side of the process normally searches through all of the
available disks in order to find the component devices of the pool.
Because it does this explicit search, it will find pool devices
even if they've been shuffled around in a way that causes them to
be renamed, or even (I think) drastically transformed, for example
dd'd to a new disk. This is pretty much what you'd expect,
since ZFS can't really read what the pool thinks its configuration
is until it assembles the pool. When it imports such a pool, I
believe that ZFS rewrites the information kept about where to
find each device so that it's correct for the current
state of your system.
This is not what happens when the system boots. To the best of
my knowledge, for non-root pools the ZFS kernel
module directly reads
/etc/zfs/zpool.cache during module
initialization and converts it into a series of in-memory pool
configurations, which are all in an unactivated state.
At some point, magic things attempt to activate some or all of these
pools, which causes the kernel to attempt to open all of the devices
listed as part of the pool configuration and verify that they are
indeed part of the pool. The process of opening devices only uses
the names and other identification of the devices that's in the
pool configuration; however, one identification is a 'devid', which
for many devices is basically the model and serial number of the
disk. So I believe that under at least some circumstances the kernel
will still be able to find disks that have been shuffled around,
because it will basically seek out that model plus serial number
wherever it's (now) connected to the system.
(See vdev_disk.c for the gory details,
but you also need to understand Illumos devids. The various device
information available for disks in a pool can be seen with 'zdb -C' on the pool.)
To the best of my knowledge, this in-kernel activation makes no
attempt to hunt around on other disks to complete the pool's
configuration the way that 'zpool import' will. In theory, assuming
that finding disks by their devid works, this shouldn't matter most
or basically all of the time; if that disk is there at all, it
should be reporting its model and serial number and I think the
kernel will find it. But I don't know for sure. I also don't know
how the kernel acts if some disks take a while to show up, for
example iSCSI disks.
(I suspect that the kernel only makes one attempt at pool activation and doesn't retry things if more devices show up later. But this entire area is pretty opaque to me.)
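To make the contrast between the two approaches concrete, here is a toy model of them. It reflects only my understanding as described above, not how the kernel or 'zpool import' is really implemented; every name in it (Disk, CachedVdev, import_by_search, activate_from_cache) is hypothetical, and the cached lookup is deliberately cut down to just devid and path.

```python
from typing import Dict, List, NamedTuple, Optional


class Disk(NamedTuple):
    """A disk visible to the system, with what its ZFS label says (if any)."""
    path: str
    devid: str
    pool_guid: Optional[int]
    vdev_guid: Optional[int]


class CachedVdev(NamedTuple):
    """What a cached pool configuration remembers about one disk."""
    vdev_guid: int
    path: str
    devid: str


def import_by_search(pool_guid: int, disks: List[Disk]) -> Dict[int, str]:
    """'zpool import' style: read every disk's label, keep the matches."""
    return {d.vdev_guid: d.path for d in disks if d.pool_guid == pool_guid}


def activate_from_cache(cached: List[CachedVdev], disks: List[Disk]) -> Dict[int, str]:
    """Boot style: only try the names recorded in the cached configuration."""
    by_devid = {d.devid: d for d in disks}
    by_path = {d.path: d for d in disks}
    found = {}
    for v in cached:
        d = by_devid.get(v.devid) or by_path.get(v.path)
        # The label is still checked, so a wrong disk is never accepted.
        if d is not None and d.vdev_guid == v.vdev_guid:
            found[v.vdev_guid] = d.path
    return found


# Hypothetical two-disk pool with pool guid 1 and vdev guids 10 and 11.
CACHED = [
    CachedVdev(10, "/dev/dsk/c3t0d0s0", "id1,sd@DISK_A/a"),
    CachedVdev(11, "/dev/dsk/c3t1d0s0", "id1,sd@DISK_B/a"),
]

# Disk A has moved to a new path but keeps its devid. Disk B's contents
# were dd'd to a different disk, so both the devid and the path are new.
DISKS_NOW = [
    Disk("/dev/dsk/c3t4d0s0", "id1,sd@DISK_A/a", 1, 10),
    Disk("/dev/dsk/c3t5d0s0", "id1,sd@DISK_C/a", 1, 11),
]

if __name__ == "__main__":
    print("search:", import_by_search(1, DISKS_NOW))
    print("cache: ", activate_from_cache(CACHED, DISKS_NOW))
```

In this model the explicit search finds both disks, while the cache-driven activation finds only the disk whose devid still matches.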
These days you also have your root filesystems on a ZFS pool, the
root pool. There are definitely some special code paths that seem
to be invoked during boot for a ZFS root pool, but I don't have
enough knowledge of the Illumos boot time environment to understand
how they work and how they're different from the process of loading
and starting non-root pools. I used to hear that root pools were
more fragile if devices moved around and you might have to boot
from alternate media in order to explicitly 'zpool import' and
'zpool export' the root pool in order to reset its device names,
but that may be only folklore and superstition at this point.
There will be no LTS release of the OmniOS Community Edition
At the end of my entry on how I was cautiously optimistic about OmniOS CE, I said:
[...] For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).
Well, I asked, and the answer is a pretty unambiguous 'no'. The OmniOS CE core team has no interest in maintaining an LTS release; any such extended support would have to come from someone else doing the work. The current OmniOS CE support plans are:
What we intend, is to support the current and previous release with an emphasis on the current release going forward from r151022.
OmniOS CE releases are planned to come out roughly every 26 weeks, ie every six months, so supporting the current and previous release means that you get a nominal year of security updates and so on (in practice less than a year).
I can't blame the OmniOS CE core team for this (and I'm not anything that I'd describe as 'disappointed'; getting not just an OmniOS CE but an OmniOS CE LTS was always a long shot). People work on what interests them, and the CE core team just doesn't use LTS releases or plan to. They're doing enough as it is to keep OmniOS alive. And for most people, upgrading from release to release is probably not a big deal.
In the short term, this means that we are not going to bother to try to upgrade from OmniOS r151014 to either the current or the next version of OmniOS CE, because the payoff of relatively temporary security support doesn't seem worth the effort. We've already been treating our fileservers as sealed appliances, so this is not something we consider a big change.
(The long term is beyond the scope of this entry.)