Wandering Thoughts


ZFS's recordsize, holes in files, and partial blocks

Yesterday I wrote about using zdb to peer into ZFS's on-disk storage of files, and in particular I wondered if you wrote a 160 Kb file, would ZFS really use two 128 Kb blocks for it. The answer appeared to be 'no', but I was a little bit confused by some things I was seeing. In a comment, Robert Milkowski set me right:

In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).

He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.

The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:

    0  L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
20000  L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

     segment [0000000000000000, 0000000000040000) size  256K

These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.

One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
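The space arithmetic for the uncompressed case can be sketched in a few lines of Python. This is a deliberately simplified model (no compression, no ashift rounding, no metadata), and the function name `data_blocks` is my own invention for illustration, not anything from ZFS.

```python
RECORDSIZE = 128 * 1024  # the default ZFS recordsize, in bytes

def data_blocks(file_size, recordsize=RECORDSIZE):
    """Return the logical sizes of the L0 data blocks for a file.

    Simplified model (an assumption, not ZFS code): a file that fits in
    one record gets a single block of its own size; anything larger is
    stored entirely in full recordsize blocks.
    """
    if file_size <= recordsize:
        return [file_size]
    count = -(-file_size // recordsize)  # ceiling division
    return [recordsize] * count

blocks = data_blocks(160 * 1024)
allocated = sum(blocks)
print(len(blocks), allocated // 1024)    # number of blocks, Kb allocated
print((allocated - 160 * 1024) // 1024)  # Kb of trailing zero padding
```

For the 160 Kb file this gives two blocks and 256 Kb allocated, of which 96 Kb is the run of zeros that compression squeezes away.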

A more interesting test file has holes that cover an entire recordsize block. Let's make one that has 128 Kb of data, skips the second 128 Kb block entirely, has 32 Kb of data at the end of the third 128 Kb block, skips the fourth 128 Kb block, and has 32 Kb of data at the end of the fifth 128 Kb block. Set up with dd, this is:

dd if=/dev/urandom of=testfile2 bs=128k count=1
dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
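As a sanity check on those seek values, here is a small Python sketch that turns dd's bs and seek into byte ranges and works out which 128 Kb record each write lands in. The helper names are hypothetical and it assumes a 128 Kb recordsize.

```python
RECORDSIZE = 128 * 1024

def dd_write_range(bs, seek, count=1):
    """Byte range [start, end) written by dd with the given bs/seek/count."""
    start = bs * seek
    return start, start + bs * count

def records_touched(start, end, recordsize=RECORDSIZE):
    """Zero-based indexes of the recordsize blocks overlapping [start, end)."""
    return list(range(start // recordsize, (end - 1) // recordsize + 1))

K = 1024
for bs, seek in [(128 * K, 0), (32 * K, 11), (32 * K, 19)]:
    start, end = dd_write_range(bs, seek)
    print(hex(start), hex(end), records_touched(start, end))
```

The three writes land in records 0, 2, and 4 (the first, third, and fifth blocks), with the two 32 Kb writes ending exactly at 0x60000 and 0xa0000.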

Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:

# zdb -vv -O ssddata/homes cks/tmp/testfile2
Indirect blocks:
     0 L1  0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
     0  L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
 80000  L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

     segment [0000000000000000, 0000000000020000) size  128K
     segment [0000000000040000, 0000000000060000) size  128K
     segment [0000000000080000, 00000000000a0000) size  128K

The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
     0 L1  DVA[0]=<0:8a2c4e2c00:400> DVA[1]=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
     0  L0 DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
 40000  L0 DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
 80000  L0 DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]

(That we need to use both -vv and -bbbb here is due to how zdb's code is set up, and it's rather a hack to get what we want. I had to read the zdb source code to work out how to make it work.)

Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.

This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
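If you want to total this up without doing hex arithmetic by hand, a rough Python sketch like the following will do. The regular expression is my guess at a pattern that fits the zdb lines shown in this entry; it is not a documented zdb output format and may need adjusting for other versions.

```python
import re

# Matches lines such as " 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 ..."
L0_LINE = re.compile(
    r"^\s*([0-9a-f]+)\s+L0\s+\S+\s+([0-9a-f]+)L/([0-9a-f]+)P\b"
)

def parse_l0(lines):
    """Return (offset, logical_size, physical_size) for each L0 line."""
    out = []
    for line in lines:
        m = L0_LINE.match(line)
        if m:
            out.append(tuple(int(g, 16) for g in m.groups()))
    return out

sample = """\
     0 L1  0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
     0  L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
 80000  L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016
""".splitlines()

blocks = parse_l0(sample)
physical = sum(p for _, _, p in blocks)
logical = sum(l for _, l, _ in blocks)
print(physical // 1024, logical // 1024)  # Kb on disk, Kb if uncompressed
```

For this file it reports 194 Kb of physical data blocks, against the 384 Kb that three full 128 Kb blocks would take with compression off.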

It's sometimes possible to dump the block contents of things like L1 indirect blocks, so you can see a more direct representation of them. This is where it's important to know that metadata is compressed, so we can ask zdb to decompress it with a magic argument:

# zdb -R ssddata 0:8a2c4e2c00:400:id
DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
HOLE [L0 unallocated] size=200L birth=0L

Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
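To connect this back to file offsets, here is a minimal Python sketch that assumes each entry in the L1 block covers one 128 Kb slice of the file, in order. That assumption matches the offsets zdb printed earlier, but the list and the `describe` helper are my own illustration, not a ZFS data structure.

```python
RECORDSIZE = 0x20000  # 128 Kb, the logical size of each L0 block above

# The six entries zdb -R printed, in order; None stands for a HOLE entry
# and the numbers are the physical sizes of the allocated blocks.
entries = [0x20000, None, 0x8400, None, 0x8400, None]

def describe(entries, recordsize=RECORDSIZE):
    """Pair each entry with the file offset it covers."""
    return [
        (index * recordsize, "hole" if psize is None else "data")
        for index, psize in enumerate(entries)
    ]

for offset, kind in describe(entries):
    print(f"{offset:>6x} {kind}")
```

The holes come out at 0x20000 and 0x60000, the same offsets that were missing from the earlier block listing, plus one at 0xa0000 that is past the end of the file.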

PS: I'm not using 'zdb -ddddd' today because when I dug deeper into zdb, I discovered that 'zdb -O' would already report this information when given the right arguments, thereby saving me an annoying step.

Sidebar: Why you can't always dump blocks with zdb -R

To decompress a (ZFS) block, you need to know what it's compressed with and its uncompressed size. This information is stored in whatever metadata points to the block, not in the block itself, and so currently zdb -R simply guesses repeatedly until it gets a result that appears to work out right:

# zdb -R ssddata 0:8a2c4e2c00:400:id
Found vdev type: mirror
Trying 00400 -> 00600 (inherit)
Trying 00400 -> 00600 (on)
Trying 00400 -> 00600 (uncompressed)
Trying 00400 -> 00600 (lzjb)
Trying 00400 -> 00600 (empty)
Trying 00400 -> 00600 (gzip-1)
Trying 00400 -> 00600 (gzip-2)
Trying 00400 -> 20000 (lz4)
DVA[0]=<0:8a4afe7e00:20000> [...]

The result that zdb -R gets may or may not actually be correct, and thus may or may not give you the actual decompressed block data. Here it worked; at other times I've tried it, not so much. The last 'Trying' that zdb -R prints is the one it thinks is correct, so you can at least see if it got it right (here, for example, we know that it did, since it picked lz4 with a true logical size of 0x20000 and that's what the metadata we have about the L1 indirect block says it is).

Ideally zdb -R would gain a way of specifying the compression algorithm and the logical size for the d block flag. Perhaps someday.
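The general shape of this guessing game is easy to sketch in Python. To be clear, this is an analogy and not zdb's code: the codecs are Python standard library ones standing in for the ZFS algorithms, and `guess_decompress` is a hypothetical helper.

```python
import bz2
import lzma
import zlib

# Stand-ins for the algorithms zdb cycles through; these are Python stdlib
# codecs chosen for illustration, not the ones ZFS actually uses.
CANDIDATES = [
    ("bz2", bz2.decompress),
    ("zlib", zlib.decompress),
    ("lzma", lzma.decompress),
]

def guess_decompress(raw):
    """Try each codec in turn and return (name, data) for the first success.

    Returns (None, raw) if nothing works. Success only means the codec did
    not reject the input, not that the guess is the right one.
    """
    for name, decompress in CANDIDATES:
        try:
            return name, decompress(raw)
        except Exception:
            continue  # wrong guess; move on to the next candidate
    return None, raw

block = zlib.compress(b"\0" * 4096 + b"payload")
name, data = guess_decompress(block)
print(name, len(data))
```

The weakness is the same one zdb -R has: the first candidate that does not fail wins, whether or not it is what the block was really written with.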

ZFSFilePartialAndHoleStorage written at 00:14:11


Using zdb to peer into how ZFS stores files on disk

If you've read much about ZFS and ZFS performance tuning, one of the things you'll have run across is the ZFS recordsize. The usual way it's described is, for example (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.

It turns out that we can use zdb to find out the answer to this question and other interesting ones like it, and it's not even all that painful. My starting point was Bruning Questions: ZFS Record Size, which has an example of using zdb on a file in a test ZFS pool. We can actually do this with a test file on a regular pool, like so:

  • Create a test file:
    cd $HOME/tmp
    dd if=/dev/urandom of=testfile bs=160k count=1

    I'm using /dev/urandom here to defeat ZFS compression.

  • Use zdb -O to determine the object number of this file:
    ; zdb -O ssddata/homes cks/tmp/testfile
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1075431    2   128K   128K   163K     512   256K  100.00  ZFS plain file

    (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)

  • Use zdb -ddddd to dump detailed information on the object:
    # zdb -ddddd ssddata/homes 1075431
         0  L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
     20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
         segment [0000000000000000, 0000000000040000) size  256K

    See Bruning Questions: ZFS Record Size for information on what the various fields mean.

    (How many ds to use with the -d option for zdb is sort of like explosives; if it doesn't solve your problem, add more -ds until it does. This number of ds works with ZFS on Linux for me but you might need more.)

What we have here is two on-disk blocks. One is 0x20000 bytes long, or 128 Kb; the other is 0x8400 bytes long, or 33 Kb. I don't know why it's 33 Kb instead of 32 Kb, especially since zdb will also report that the file has a size of 163840 (bytes), which is exactly 160 Kb as expected. It's not the ashift on this pool, because this is the pool I made a little setup mistake on so it has an ashift of 9.

Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.

That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:

; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
# zdb -ddddd ssddata/homes 1075431
     0  L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
 20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into some chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).
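The cost of this is simple to put numbers on. The following Python sketch assumes the simplified model from this entry, where every record a write overlaps is rewritten in full; `rewrite_cost` is a made-up name for illustration.

```python
RECORDSIZE = 128 * 1024

def rewrite_cost(offset, length, recordsize=RECORDSIZE):
    """Logical bytes rewritten when `length` bytes at `offset` are
    overwritten inside a file made of full recordsize blocks.

    Simplified model: every record the write overlaps is rewritten whole.
    """
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return (last - first + 1) * recordsize

K = 1024
# dd bs=16k seek=4 count=1 overwrites 16 Kb starting at 64 Kb.
cost = rewrite_cost(4 * 16 * K, 16 * K)
print(cost // K, cost // (16 * K))  # Kb rewritten, multiple of data written
```

Here a 16 Kb write costs 128 Kb of logical IO, an 8x amplification, and a write that straddles two records would cost 256 Kb.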

Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.

; dd if=/dev/urandom of=testfile2 bs=16k count=1
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361

      segment [0000000000000000, 0000000000004000) size   16K

Now we add the hole:

; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377

      segment [0000000000000000, 000000000001c000) size  112K

The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.

(The same thing happens if you write the second 16 Kb block at 48 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)

Now that I've worked out how to use zdb for this sort of exploration, there's a number of questions about how ZFS stores files on disks that I want to look into at some point, including how compression interacts with recordsize and block sizes.

(I should probably also do some deeper exploration of what the various information zdb is reporting means. I've poked around with zdb before, but always in very 'heads down' and limited ways that didn't involve really understanding ZFS on-disk structures.)

Update: As pointed out by Robert Milkowski in the comments, I'm mistaken here and being fooled by compression being on in this filesystem. See ZFS's recordsize, holes in files, and partial blocks for the illustrated explanation of what's really going on.

ZFSZdbForFileAnalysis written at 01:18:03


Looking back at my mixed and complicated feelings about Solaris

So Oracle killed Solaris (and SPARC) a couple of weeks ago. I can't say this is surprising, although it's certainly sudden and underhanded in the standard Oracle way. Back when Oracle killed Sun, I was sad for the death of a dream, despite having had ups and downs with Sun over the years. My views about the death of Solaris are more mixed and complicated, but I will summarize them by saying that I don't feel very sad about Solaris itself (although there are things to be sad about).

To start with, Solaris has been dead for me for a while, basically ever since Oracle bought Sun and certainly since Oracle closed the Solaris source. The Solaris that the CS department used for years in a succession of fileservers was very much a product of Sun the corporation, and I could never see Oracle's Solaris as the same thing or as a successor to it. Hearing that Oracle was doing things with Solaris was distant news; it had no relevance for us and pretty much everyone else.

(Every move Oracle made after absorbing Sun came across to me as a 'go away, we don't want your business or to expand Solaris usage' thing.)

But that's the smaller piece, because I have some personal baggage and biases around Solaris itself due to my history. I started using Sun hardware in the days of SunOS, where SunOS 3 was strikingly revolutionary and worked pretty well for the time. It was followed by SunOS 4, which was also quietly revolutionary even if the initial versions had some unfortunate performance issues on our servers (we ran SunOS 4.1 on a 4/490, complete with an unfortunate choice of disk interconnect). Then came Solaris 2, which I've described as a high speed collision between SunOS 4 and System V R4.

To people reading this today, more than a quarter century removed, this probably sounds like a mostly neutral thing or perhaps just messy (since I did call it a collision). But at the time it was a lot more. In the old days, Unix was split into two sides, the BSD side and the AT&T System III/V side, and I was firmly on the BSD side along with many other people at universities; SunOS 3 and SunOS 4 and the version of Sun that produced them were basically our standard bearers, not only for BSD's superiority at the time but also their big technical advances like NFS and unified virtual memory. When Sun turned around and produced Solaris 2, it was viewed as being tilted towards being a System V system, not a BSD system. Culturally, there was a lot of feeling that this was a betrayal and Sun had debased the nice BSD system they'd had by getting a lot of System V all over it. It didn't help that Sun was unbundling the compilers around this time, in an echo of the damage AT&T's Unix unbundling did.

(Solaris 2 was Sun's specific version of System V Release 4, which itself was the product of Sun and AT&T getting together to slam System V and BSD together into a unified hybrid. The BSD side saw System V R4 as 'System V with some BSD things slathered over top', as opposed to 'BSD with some System V things added'. This is probably an unfair characterization at a technical level, especially since SVR4 picked up a whole bunch of important BSD features.)

Had I actually used Solaris 2, I might have gotten over this cultural message and come to like and feel affection for Solaris. But I never did; our 4/490 remained on SunOS 4 and we narrowly chose SGI over Sun, sending me on a course to use Irix until we started switching to Linux in 1999 (at which point Sun wasn't competitive and Solaris felt irrelevant as a result). By the time I dealt with Solaris again in 2005, open source Unixes had clearly surpassed it for sysadmin usability; they had better installers, far better package management and patching, and so on. My feelings about Solaris never really improved from there, despite increasing involvement and use, although there were aspects I liked and of course I am very happy that Sun created ZFS, put it into Solaris 10, and then released it to the world as open source so that it could survive the death of Sun and Solaris.

The summary of all of that is that I'm glad that Sun created a number of technologies that wound up in successive versions of Solaris and I'm glad that Sun survived long enough to release them into the world, but I don't have fond feelings about Solaris itself the way that many people who were more involved with it do. I cannot mourn the death of Solaris itself the way I could for Sun, because for me Solaris was never a part of any dream.

(One part of that is that my dream of Unix was the dream of workstations, not the dream of servers. By the time Sun was doing interesting things with Solaris 10, it was clearly not the operating system of the Unix desktop any more.)

(On Solaris's death in general, see this and this.)

SolarisMixedFeelings written at 23:34:48


The three different names ZFS stores for each vdev disk (on Illumos)

I sort of mentioned yesterday that ZFS keeps information on several different ways of identifying disks in pools. To be specific, it keeps three different names or ways of identifying each disk. You can see this with 'zdb -C' on a pool, so here's a representative sample:

# zdb -C rpool
MOS Configuration:
    type: 'disk'
    id: 0
    guid: 15557853432972548123
    path: '/dev/dsk/c3t0d0s0'
    devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
    phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'

The guid is ZFS's internal identifier for the disk, and is stored on the disk itself as part of the disk label. Since you have to find the disk to read it, it's not something that ZFS uses to find disks, although it is part of verifying that ZFS has found the right one. The three actual names for the disk are reported here as path, devid aka 'device id', and phys_path aka 'physical path'.

The path is straightforward; it's the filesystem path to the device, which here is a conventional OmniOS (Illumos, Solaris) cNtNdNsN name typical of a plain, non-multipathed disk. As this is a directly attached SATA disk, the phys_path shows us the PCI information about the controller for the disk in the form of a PCI device name. If we pulled this disk and replaced it with a new one, both of those would stay the same, since with a directly attached disk they're based on physical topology and neither has changed. However, the devid is clearly based on the disk's identity information; it has the vendor name, the 'product id', and the serial number (as returned by the disk itself in response to SATA inquiry commands). This will be the same more or less regardless of where the disk is connected to the system, so ZFS (and anything else) can find the disk wherever it is.

(I believe that the 'id1,sd@' portion of the devid is simply giving us a namespace for the rest of it. See 'prtconf -v' for another representation of all of this information and much more.)

Multipathed disks (such as the iSCSI disks on our fileservers) look somewhat different. For them, the filesystem device name (and thus path) looks like c5t<long identifier>d0s0, the physical path is /scsivhci/disk@g<long identifier>, and the devid is not particularly useful in finding the specific physical disk because our iSCSI targets generate synthetic disk 'serial numbers' based on their slot position (and the target's hostname, which at least lets me see which target a particular OmniOS-level multipathed disk is supposed to be coming from). As it happens, I already know that OmniOS multipathing identifies disks only by their device ids, so all three names are functionally the same thing, just expressed in different forms.

If you remove a disk entirely, all three of these names go away for both plain directly attached disks and multipath disks. If you replace a plain disk with a new or different one, the filesystem path and physical path will normally still work but the devid of the old disk is gone; ZFS can open the disk but will report that it has a missing or corrupt label. If you replace a multipathed disk with a new one and the true disk serial number is visible to OmniOS, all of the old names go away since they're all (partly) based on the disk's serial number, and ZFS will report the disk as missing entirely (often simply reporting it by GUID).

Sidebar: Which disk name ZFS uses when bringing up a pool

Which name or form of device identification ZFS uses is a bit complicated. To simplify a complicated situation (see vdev_disk_open in vdev_disk.c) as best I can, the normal sequence is that ZFS starts out by trying the filesystem path but verifying the devid. If this fails, it tries the devid, the physical path, and finally the filesystem path again (but without verifying the devid this time).

Since ZFS verifies the disk label's GUID and other details after opening the disk, there is no risk that finding a random disk this way (for example by the physical path) will confuse ZFS. It'll just cause ZFS to report things like 'missing or corrupt disk label' instead of 'missing device'.
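The lookup order in this sidebar can be written out as a short Python sketch. This only models the sequence as I've described it; `open_vdev_disk` and the `try_open` callback are hypothetical names and are not how vdev_disk_open is actually structured.

```python
def open_vdev_disk(names, try_open):
    """Follow the lookup order described above and return (method, handle).

    `names` is a dict with 'path', 'devid', and 'phys_path' keys, and
    `try_open(kind, value, verify_devid)` is a hypothetical callback that
    returns a handle or None. Neither is a real ZFS interface.
    """
    attempts = [
        ("path", True),        # filesystem path, but check the devid
        ("devid", False),      # look the disk up by its device id
        ("phys_path", False),  # then by physical path
        ("path", False),       # finally the path again, unverified
    ]
    for kind, verify_devid in attempts:
        value = names.get(kind)
        if value is None:
            continue
        handle = try_open(kind, value, verify_devid)
        if handle is not None:
            return kind, handle
    return None, None

names = {
    "path": "/dev/dsk/c3t0d0s0",
    "devid": "id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a",
    "phys_path": "/pci@0,0/pci15d9,714@1f,2/disk@0,0:a",
}

def replaced_disk(kind, value, verify_devid):
    # A new disk sits in the old slot: the old devid is gone, so the
    # verified path open and the devid lookup both fail.
    if kind == "devid" or verify_devid:
        return None
    return "handle-for-" + kind

print(open_vdev_disk(names, replaced_disk))
```

With a replaced disk in the same slot, the sketch falls through to the physical path, which is the situation where ZFS would then go on to complain about the label instead of a missing device.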

ZFSDiskNames written at 23:47:46

Things I do and don't know about how ZFS brings pools up during boot

If you import a ZFS pool explicitly, through 'zpool import', the user-mode side of the process normally searches through all of the available disks in order to find the component devices of the pool. Because it does this explicit search, it will find pool devices even if they've been shuffled around in a way that causes them to be renamed, or even (I think) drastically transformed, for example by being dd'd to a new disk. This is pretty much what you'd expect, since ZFS can't really read what the pool thinks its configuration is until it assembles the pool. When it imports such a pool, I believe that ZFS rewrites the information kept about where to find each device so that it's correct for the current state of your system.

This is not what happens when the system boots. To the best of my knowledge, for non-root pools the ZFS kernel module directly reads /etc/zfs/zpool.cache during module initialization and converts it into a series of in-memory pool configurations, which are all in an unactivated state. At some point, magic things attempt to activate some or all of these pools, which causes the kernel to attempt to open all of the devices listed as part of the pool configuration and verify that they are indeed part of the pool. The process of opening devices only uses the names and other identification of the devices that's in the pool configuration; however, one identification is a 'devid', which for many devices is basically the model and serial number of the disk. So I believe that under at least some circumstances the kernel will still be able to find disks that have been shuffled around, because it will basically seek out that model plus serial number wherever it's (now) connected to the system.

(See vdev_disk_open in vdev_disk.c for the gory details, but you also need to understand Illumos devids. The various device information available for disks in a pool can be seen with 'zdb -C <pool>'.)

To the best of my knowledge, this in-kernel activation makes no attempt to hunt around on other disks to complete the pool's configuration the way that 'zpool import' will. In theory, assuming that finding disks by their devid works, this shouldn't matter most or basically all of the time; if that disk is there at all, it should be reporting its model and serial number and I think the kernel will find it. But I don't know for sure. I also don't know how the kernel acts if some disks take a while to show up, for example iSCSI disks.

(I suspect that the kernel only makes one attempt at pool activation and doesn't retry things if more devices show up later. But this entire area is pretty opaque to me.)

These days you also have your root filesystems on a ZFS pool, the root pool. There are definitely some special code paths that seem to be invoked during boot for a ZFS root pool, but I don't have enough knowledge of the Illumos boot time environment to understand how they work and how they're different from the process of loading and starting non-root pools. I used to hear that root pools were more fragile if devices moved around and you might have to boot from alternate media in order to explicitly 'zpool import' and 'zpool export' the root pool in order to reset its device names, but that may be only folklore and superstition at this point.

ZFSPoolBootUnknowns written at 00:36:46


There will be no LTS release of the OmniOS Community Edition

At the end of my entry on how I was cautiously optimistic about OmniOS CE, I said:

[...] For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).

Well, I asked, and the answer is a pretty unambiguous 'no'. The OmniOS CE core team has no interest in maintaining an LTS release; any such extended support would have to come from someone else doing the work. The current OmniOS CE support plans are:

What we intend, is to support the current and previous release with an emphasis on the current release going forward from r151022.

OmniOS CE releases are planned to come out roughly every 26 weeks, ie every six months, so supporting the current and previous release means that you get a nominal year of security updates and so on (in practice less than a year).

I can't blame the OmniOS CE core team for this (and I'm not anything that I'd describe as 'disappointed'; getting not just an OmniOS CE but an OmniOS CE LTS was always a long shot). People work on what interests them, and the CE core team just doesn't use LTS releases or plan to. They're doing enough as it is to keep OmniOS alive. And for most people, upgrading from release to release is probably not a big deal.

In the short term, this means that we are not going to bother to try to upgrade from OmniOS r151014 to either the current or the next version of OmniOS CE, because the payoff of relatively temporary security support doesn't seem worth the effort. We've already been treating our fileservers as sealed appliances, so this is not something we consider a big change.

(The long term is beyond the scope of this entry.)

OmniOSCENoLTSVersion written at 01:09:13


Trying to understand the ZFS l2arc_noprefetch tunable

If you read the Illumos ZFS source code or perhaps some online guides to ZFS performance tuning, you may run across mention of a tunable called l2arc_noprefetch. There are various explanations of what this tunable means; for example, the current Illumos source code for arc.c says:

boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */

As you can see, this defaults to being turned on in Illumos (and in ZFS on Linux). You can find various tuning guides online that suggest turning it to off for better L2ARC performance, and when I had an L2ARC in my Linux system I ran this way for a while. One tuning guide I found describes it this way:

This tunable determines whether streaming data is cached or not. The default is not to cache streaming data. [...]

This makes it sound like you should absolutely change the default so that streaming data gets cached, but not so fast. The ZFS on Linux manpage on these things describes it this way:

Do not write buffers to L2ARC if they were prefetched but not used by applications.

That sounds a lot more reasonable, and especially it sounds reasonable to have it turned on by default. ZFS prefetching can still be overly aggressive, and (I believe) it still doesn't slow itself down if the prefetched data is never actually read. If you are having prefetch misses, under normal circumstances you probably don't want those misses taking up L2ARC space; you'd rather have L2ARC space go to things that you actually did read.

As far as I can decode the current Illumos code, this description also seems to match the actual behavior. If an ARC header is flagged as a prefetch, it is marked as not eligible for the L2ARC; however, if a normal read is found in the ARC and the read is eligible for L2ARC, the found ARC header is then marked as eligible (in arc_read()). So if you prefetch then get a read hit, the ARC buffer is initially ineligible but becomes eligible.

If I'm reading the code correctly, l2arc_noprefetch also has a second, somewhat subtle effect on reads. If the L2ARC contains the block but ZFS is performing a prefetch, then the prefetch will not read the block from the L2ARC but will instead fall through to doing a real read. I'm not sure why this is done; it may be that it simplifies other parts of the code, or it may be a deliberate attempt to preserve L2ARC bandwidth for uses that are considered more likely to be productive. If you set l2arc_noprefetch to off, prefetches will read from the L2ARC and count as L2ARC hits, even if they are not actually used for a real read.
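Both effects can be captured in a toy Python model. This reflects my reading of the behavior as described in this entry and nothing more; the class and function names are invented and do not correspond to the real ARC structures or functions.

```python
class ArcHeader:
    """Toy stand-in for an ARC buffer header; not the real kernel struct."""

    def __init__(self, prefetch, l2arc_noprefetch=True):
        self.prefetch = prefetch
        # With noprefetch on, a prefetched buffer starts out ineligible.
        self.l2cache = not (prefetch and l2arc_noprefetch)

def demand_read_hit(hdr, read_is_l2_eligible=True):
    """A normal read finds the buffer in the ARC and marks it eligible."""
    hdr.prefetch = False
    if read_is_l2_eligible:
        hdr.l2cache = True

def may_read_from_l2arc(is_prefetch, l2arc_noprefetch=True):
    """Second effect: with noprefetch on, prefetches bypass the L2ARC."""
    return not (l2arc_noprefetch and is_prefetch)

hdr = ArcHeader(prefetch=True)
print(hdr.l2cache)                        # ineligible after the prefetch
demand_read_hit(hdr)
print(hdr.l2cache)                        # eligible after a real read hit
print(may_read_from_l2arc(True))          # prefetch skips the L2ARC
print(may_read_from_l2arc(True, False))   # unless noprefetch is off
```

The model prints False, True, False, True, which is the sequence of states described in the last two paragraphs.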

Note that this second subtle effect makes it hard to evaluate the true effects of turning off l2arc_noprefetch. I think you can't go from L2ARC hits alone because L2ARC hits could be inflated by prefetching, putting the prefetched data in L2ARC, throwing it away before it's used for a real read, re-prefetching the same data and getting it from L2ARC, then throwing it away again, still unused.

ZFSL2ARCNoprefetchTunable written at 02:06:57


I'm cautiously optimistic about the new OmniOS Community Edition

You may recall that back in April, OmniTI suspended active development of OmniOS, leaving its future in some doubt and causing me to wonder what we'd do about our current generation of fileservers. There was a certain amount of back and forth on the OmniOS mailing list, but in general nothing concrete happened about, say, updates to the current OmniOS release, and people started to get nervous. Then just over a week ago, the OmniOS Community Edition was announced, complete with OmniOSCE.org. Since then, they've already released one weekly update (r151022i) with various fixes.

All of this leaves me cautiously optimistic for our moderate term needs, where we basically need a replacement for OmniOS r151014 (the previous LTS release) that gets security updates. I'm optimistic for the obvious reason, which is that things are really happening here; I'm cautious because maintaining a distribution of anything is a bunch of work over time and it's easy to burn out people doing it. I'm hopeful that the initial people behind OmniOS CE will be able to get more people to help and spread the load out, making it more viable over the months to come.

(I won't be one of the people helping, for previously discussed reasons.)

We're probably not in a rush to try to update from r151014 to the OmniOS CE version of r151022. Building out a new version of OmniOS and testing it takes a bunch of effort, the process of deployment is disruptive, and there's probably no point in doing too much of that work until the moderate term situation with OmniOS CE is clearer. For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).

For our longer term needs, ie the next generation of fileservers, a lot of things are up in the air. If we move to smaller fileservers we will probably move to directly attached disks, which means we now care about SAS driver support, and in general there's been the big question of good Illumos support for 10G-T Ethernet hardware (which I believe is still not there today for Intel 10G-T cards, or at least I haven't really seen any big update to the ixgbe driver). What will happen with OmniOS CE over the longer term is merely one of the issues in play; it may turn out to be important, or it may turn out to be irrelevant because our decision is forced by other things.

OmniOSCECautiousOptimism written at 23:47:29


The difference between ZFS scrubs and resilvers

At one level, asking what the difference is between scrubs and resilvers sounds silly; resilvering is replacing disks, while scrubbing is checking disks. But if you look at the ZFS code in the kernel things become much less clear, because both scrubs and resilvers use a common set of code to do all their work and it's not at all easy to tell what happens differently between them. Since I dug through this section of the ZFS code just the other day, I want to write down what the differences are while I remember them.

Both scrubs and resilvers traverse all of the metadata in the pool (in a non-sequential order), and both wind up reading all data. However, scrubs do this more thoroughly for mirrored vdevs; scrubs read all sides of a mirror, while resilvers only read one copy of the data (well, one intact copy). On raidz vdevs there is no difference here, as both scrubs and resilvers read both the data blocks and the parity blocks. This implies that a scrub on mirrored vdevs does more IO and (importantly) more checking than a resilver does. After a resilver of mirrored vdevs, you know that you have at least one intact copy of every piece of the pool, while after an error-free scrub of mirrored vdevs, you know that all ZFS metadata and data on all disks is fully intact.
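As a rough illustration of the mirror case, here is a toy Python function that reports which sides of a mirror get read for a single block. It is a sketch of the behavior described above, not of the real code; the function name and the dictionary representation of a mirror are invented for the example.

```python
def toy_copies_read(operation, mirror):
    """Return which sides of a mirror are read for one block.

    `mirror` maps a disk name to True if that disk holds an intact copy.
    """
    if operation == "scrub":
        # A scrub reads every side of the mirror so that each copy is checked.
        return sorted(mirror)
    if operation == "resilver":
        # A resilver only needs one intact copy to rebuild from.
        intact = sorted(disk for disk, ok in mirror.items() if ok)
        return intact[:1]
    raise ValueError("unknown operation: %r" % (operation,))


healthy = {"disk0": True, "disk1": True}
replacing = {"disk0": True, "new-disk1": False}
print(toy_copies_read("scrub", healthy))       # ['disk0', 'disk1']
print(toy_copies_read("resilver", replacing))  # ['disk0']
```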

For resilvers but not scrubs (at least normally), ZFS will sort of attempt to write everything back to the disks again, as I covered in the sidebar of the last entry. As far as I can tell from the code, ZFS always skips even trying to write things back to disks that are either known to have good data (for example they were read from) or that are believed to be good because their DTL says that the data is clean on them (I believe that this case only really applies for mirrored vdevs). For scrubs, the only time ZFS submits writes to the disks is if it detects actual errors.

(Although I don't entirely understand how you get into this situation, it appears that a scrub of a pool with a 'replacing' disk behaves a lot like a resilver as far as that disk is concerned. As you would expect and want, the scrub doesn't try to read from the new disk and, like resilvers, it tries to write everything back to the disk.)

Since we only have mirrored vdevs on our fileservers, what really matters to us here is the difference in what gets read between scrubs and resilvers. On the one hand, resilvers put less read load on the disks, which is good for reducing their impact. On the other hand, resilvering isn't quite as thorough a check of the pool's total health as a scrub is.

PS: I'm not sure if either a scrub or a resilver reads the ZIL. Based on some code in zil.c, I suspect that it's checked only during scrubs, which would make sense. Alternately, zil.c is reusing ZIO_FLAG_SCRUB for non-scrub IO for some of its side effects, but that would be a bit weird.

ZFSResilversVsScrubs written at 00:44:26


Resilvering multiple disks at once in a ZFS pool adds no real extra overhead

Suppose, not entirely hypothetically, that you have a multi-disk, multi-vdev ZFS pool (in our case using mirrored vdevs) and you need to migrate this pool from one set of disks to another set of disks. If you care about doing relatively little damage to pool performance for the duration of the resilvering, are you better off replacing one disk at a time (with repeated resilvers of the pool), or doing a whole bunch of disk replacements at once in one big resilver?

As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.
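The arithmetic behind this is simple enough to sketch. The following toy calculation assumes, as described above, that every resilver reads the whole pool once no matter how many disks it covers; the function name and the pool size are hypothetical numbers picked for illustration.

```python
def toy_total_resilver_reads(pool_data_gb, disks_to_replace, disks_per_resilver):
    """Total data read if every resilver pass reads the whole pool once."""
    # Ceiling division: how many separate resilvers are needed.
    resilvers = -(-disks_to_replace // disks_per_resilver)
    return resilvers * pool_data_gb


# Replacing ten disks in a pool holding 1000 GB of data:
print(toy_total_resilver_reads(1000, 10, 1))   # 10000 (one disk at a time)
print(toy_total_resilver_reads(1000, 10, 10))  # 1000 (all ten at once)
```

Under this model the one-at-a-time approach does not save any reads per resilver; it simply repeats the full read once per disk.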

(This is unlike conventional RAID10, where replacing multiple drives at once will probably add additional read IO, which will affect array performance.)

For metadata, this is not particularly surprising. Since metadata is stored all across the pool, metadata located in one vdev can easily point to things located on another one. Given this, you have to read all metadata in the pool in order to even find out what is on a disk that's being resilvered. In theory I think that ZFS could optimize handling leaf data to skip stuff that's known to be entirely on unaffected vdevs; in practice, I can't find any sign in the code that it attempts this optimization for resilvers, and there are rational reasons to skip it anyway.

(As things are now, after a successful resilver you know that the pool has no permanent damage anywhere. If ZFS optimized resilvers by skipping reading and thus checking data on theoretically unaffected vdevs, you wouldn't have this assurance; you'd have to resilver and then scrub to know for sure. Resilvers are somewhat different from scrubs, but they're close.)

This doesn't mean that replacing multiple disks at once won't have any impact on your overall system, because your system may have overall IO capacity limits that are affected by adding more writes. For example, our ZFS fileservers each have a total write bandwidth of 200 Mbytes/sec across all their ZFS pool disks (since we have two 1G iSCSI networks). At least in theory we could saturate this total limit with resilver write traffic alone, and certainly enough resilver write traffic might delay normal user write traffic (especially at times of high write volume). Of course this is what ZFS scrub and resilver tunables are about, so maybe you want to keep an eye on that.

(This also ignores any potential multi-tenancy issues, which definitely affect us at least some of the time.)

ZFS does optimize resilvering disks that were only temporarily offline, using what ZFS calls a dirty time log. The DTL can be used to optimize walking the ZFS metadata tree in much the same way as how ZFS bookmarks work.

Sidebar: How ZFS resilvers (seem to) repair data, especially in RAIDZ

If you look at the resilvering related code in vdev_mirror.c and especially vdev_raidz.c, how things actually get repaired seems pretty mysterious. Especially in the RAIDZ case, ZFS appears to just go off and issue a bunch of ZFS writes to everything without paying any attention to new disks versus old, existing disks. It turns out that the important magic seems to be in zio_vdev_io_start in zio.c, where ZFS quietly discards resilver write IOs to disks unless the target disk is known to require it. The best detailed explanation is in the code's comments:

 * If this is a repair I/O, and there's no self-healing involved --
 * that is, we're just resilvering what we expect to resilver --
 * then don't do the I/O unless zio's txg is actually in vd's DTL.
 * This prevents spurious resilvering with nested replication.
 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
 * A is out of date, we'll read from C+D, then use the data to
 * resilver A+B -- but we don't actually want to resilver B, just A.
 * The top-level mirror has no way to know this, so instead we just
 * discard unnecessary repairs as we work our way down the vdev tree.
 * The same logic applies to any form of nested replication:
 * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.

It appears that preparing and queueing all of these IOs that will then be discarded does involve some amount of CPU, memory allocation for internal data structures, and so on. Presumably this is not a performance issue in practice, especially if you assume that resilvers are uncommon in the first place.

(And certainly the code is better structured this way, especially in the RAIDZ case.)
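The rule in that comment can be sketched as a small Python predicate. This is my simplified reading, not a translation of the real C code: the function name is made up, and a DTL is modeled here as a plain list of inclusive (first txg, last txg) ranges that the disk missed.

```python
def toy_should_issue_write(is_repair, is_self_heal, txg, vdev_dtl):
    """Decide whether a write IO to one leaf disk should really be done.

    `vdev_dtl` is a list of inclusive (first_txg, last_txg) ranges for
    which this disk is known to be missing data.
    """
    if not is_repair or is_self_heal:
        # Ordinary writes and self-healing repairs always go through.
        return True
    # A plain resilver repair only happens if the disk missed this txg.
    return any(first <= txg <= last for first, last in vdev_dtl)


# The (A+B)+(C+D) example from the comment, with only A out of date.
dtls = {"A": [(100, 200)], "B": []}
for disk in ("A", "B"):
    print(disk, toy_should_issue_write(True, False, 150, dtls[disk]))
# A True
# B False
```

Here the repair write is generated for both A and B, but the one aimed at B is discarded because B's DTL says it never missed that txg.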

ZFSMultidiskResilversFree written at 22:29:35
