Wandering Thoughts

2018-06-09

What ZFS messages about 'permanent errors in <0x95>:<0x0>' mean

If you use ZFS long enough (or are unlucky enough), one of the things you may run into is a report in zpool status -v of permanent errors in something (we've had that happen to us despite redundancy). If you're reasonably lucky, the error message will have a path in it. If you're unlucky, the error message will say something like:

errors: Permanent errors have been detected in the following files:
        <0x95>:<0x0>

This is a mysterious and frustrating message. On the ZFS on Linux mailing list, Richard Elling recently shared some extremely useful information about what this message means.

The short answer of what the two numbers mean is, to quote directly:

The first number is the dataset id (index) and the second is the object id. For filesystems, the object id can be the same as the file's "inode" as shown by "ls -i". But a few object ids exist for all datasets. Object id 0 is the DMU dnode.

The dataset here may be a ZFS filesystem, a snapshot, or I believe a few other things. I believe that if it's still in existence, you'll normally get at least its name and perhaps the full path to the object. When it's not in existence any more (perhaps you deleted the snapshot or the whole filesystem in question since the scrub detected it), you get this hex ID and there's also no information about the path.
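
If you want to go looking for the object with zdb, it helps to have the IDs as ordinary decimal numbers (0x95 is dataset id 149, for instance). Here's a minimal Python sketch of converting an error entry; the specific entry string is just an example, and mapping the dataset id back to a dataset name is a separate exercise.

  # Turn a '<0x95>:<0x0>' permanent-error entry into decimal IDs,
  # which is the form you'd normally feed to something like zdb.
  def parse_error_entry(entry):
      dataset_part, object_part = entry.split(":")
      dataset_id = int(dataset_part.strip("<>"), 16)
      object_id = int(object_part.strip("<>"), 16)
      return dataset_id, object_id

  ds, obj = parse_error_entry("<0x95>:<0x0>")
  # 0x95 is dataset (objset) id 149; object id 0 is the DMU dnode itself.
  print("dataset id %d, object id %d" % (ds, obj))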

The reason the information is presented this way is that what the ZFS code in the kernel saves and returns to the zpool command is actually just the dataset and object ID. It's up to zpool to turn both of these into names, which it actually does by calling back into the kernel to find out what they're currently called, if the kernel knows. Inspecting the relevant ZFS code shows that there are five cases:

  • <metadata>:<0x...> means corruption in some object in the pool's overall metadata object set.

  • <0x...>:<0x...> means that the dataset involved can't be identified (and thus ZFS has no hope of identifying the thing inside the dataset).

  • /some/path/name means you have a corrupted filesystem object (a file, a directory, etc) in a currently mounted dataset and this is its full current path.

    (I think that ZFS's determination of the path name for a given ZFS object is pretty reliable; if I'm reading the code right, it appears to be able to scan upward in the filesystem hierarchy starting with the object itself.)

  • dsname:/some/path means that the dataset is called dsname but it's not currently mounted, and /some/path is the path within it. I think this happens for snapshots.

  • dsname:<0x...> means that it's in the given dataset dsname (which may or may not be mounted), but the ZFS object in question can't have its path identified for various reasons (including that it's already been deleted).

Only things in ZFS filesystems (and snapshots and so on) have path names, so an error in a ZVOL will always be reported without the path. I'm not sure what the reported dataset names are for ZVOLs, since I don't use ZVOLs.

The final detail is that you may see this error status in 'zpool status -v' even after you've cleaned it up. To quote Richard Elling again:

Finally, the error buffer for "zpool status" contains information for two scan passes: the current and previous scans. So it is possible to delete an object (eg file) and still see it listed in the error buffer. It takes two scans to completely update the error buffer. This is important if you go looking for a dataset+object tuple with zdb and don't find anything...

PS: There are some cases where <xattrdir> will appear in the file path. If I'm reading the code correctly, this happens when the problem is in an extended attribute instead of the filesystem object itself.

(See also this, this, and this.)

PPS: Richard Elling's message was on the ZFS on Linux mailing list and about an issue someone was having with a ZoL system, but as far as I can see the core code is basically the same in Illumos and I would expect in FreeBSD as well, so this bit of ZFS wisdom should be cross-platform.

ZFSPermanentErrorsMeaning written at 22:58:26; Add Comment

2018-05-27

ZFS pushes file renamings and other metadata changes to disk quite promptly

One of the general open questions on Unix is when changes like renaming or creating files are actually durably on disk. Famously, some filesystems on some Unixes have been willing to delay this for an unpredictable amount of time unless you did things like fsync() the containing directory of your renamed file, not just fsync() the file itself. As it happens, ZFS's design means that it offers some surprisingly strong guarantees about this; specifically, ZFS persists all metadata changes to disk no later than the next transaction group commit. In ZFS today, a transaction group commit generally happens every five seconds, so if you do something like rename a file, your rename will be fully durable quite soon even if you do nothing special.

However, this doesn't mean that if you create a file, write data to the file, and then rename it (with no other special operations) that in five or ten seconds your new file is guaranteed to be present under its new name with all the data you wrote. Although metadata operations like creating and renaming files go to ZFS right away and then become part of the next txg commit, the kernel generally holds on to written file data for a while before pushing it out. You need some sort of fsync() in there to force the kernel to commit your data, not just your file creation and renaming. Because of how the ZFS intent log works, you don't need to do anything more than fsync() your file here; when you fsync() a file, all pending metadata changes are flushed out to disk along with the file data.

(In a 'create new version, write, rename to overwrite current version' setup, I think you want to fsync() the file twice, once after the write and then once after the rename. Otherwise you haven't necessarily forced the rename itself to be written out. You don't want to do the rename before a fsync(), because then I think that a crash at just the wrong time could give you an empty new file. But the ice is thin here in portable code, including code that wants to be portable to different filesystem types.)
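
To make this concrete, here's a minimal Python sketch of the whole 'write a new version, fsync(), rename it into place, fsync() again' sequence. The file name and the temporary-file naming scheme are made up for illustration; the point is the ordering of the operations.

  import os

  def atomic_replace(path, data):
      # Write data to a temporary file and rename it over path, with an
      # fsync() after the write and another after the rename.
      tmp = path + ".new"      # hypothetical temp-file naming scheme
      fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
      try:
          os.write(fd, data)
          os.fsync(fd)         # force the file data (and pending metadata) out
          os.rename(tmp, path)
          os.fsync(fd)         # force the rename itself out too
      finally:
          os.close(fd)
      # (On other filesystems you may also need to fsync() the containing
      # directory; on ZFS the second fsync() should suffice, as described above.)

  atomic_replace("somefile.current", b"new contents\n")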

My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.

ZFSWhenMetadataSynced written at 22:47:51; Add Comment

2018-05-18

ZFS spare-N spare vdevs in your pool are mirror vdevs

Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:

[...]
   wwn-0x5000cca251b79b98    ONLINE  0  0  0
   spare-8                   ONLINE  0  0  0
     wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
     wwn-0x5000cca2568314fc  ONLINE  0  0  0
   wwn-0x5000cca251ca10b0    ONLINE  0  0  0
[...]

What is this spare-8 thing, beyond 'a sign that a spare activated here'? This is sometimes called a 'spare vdev', and the answer is that spare vdevs are mirror vdevs.

Yes, I know, ZFS says that you can't put one vdev inside another vdev and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface and as you'd expect they're implemented with exactly the same code in the ZFS kernel code.

(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types they run the same code to do various operations. Admittedly, there are some other sections in the ZFS code that check to see whether they're operating on a real mirror vdev or a spare vdev.)

What this means is that these spare-N vdevs behave like mirror vdevs. Assuming that both sides are healthy, reads can be satisfied from either side (and will be balanced back and forth as they are for mirror vdevs), writes will go to both sides, and a scrub will check both sides. As a result, if you scrub a pool with a spare-N vdev and there are no problems reported for either component device, then both old and new device are fine and contain a full and intact copy of the data. You can keep either (or both).

As a side note, it's possible to manually create your own spare-N vdevs even without a fault, because spare activation is actually a user-level thing in ZFS. Although I haven't tested this recently, you generally get a spare-N vdev if you do 'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK> is configured as a spare in the pool. Abusing this to create long term mirrors inside raidZ vdevs is left as an exercise for the reader.

(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)

PS: As you might expect, the replacing-N vdev that you get when you replace a disk is also a mirror vdev, with the special behavior that when the resilver finishes, the original device is normally automatically detached.

ZFSSparesAreMirrors written at 22:44:19; Add Comment

2018-05-02

An interaction of low ZFS recordsize, compression, and advanced format disks

Suppose that you have something with a low ZFS recordsize; a classical example is zvols, where people often use an 8 Kb volblocksize. You have compression turned on, and you are using a pool (or vdev) with ashift=12 because it's on 'advanced format' drives or you're preparing for that possibility. This seems especially likely on SSDs, some of which are already claiming to be 4K physical sector drives.

In this situation, you will probably get much lower compression ratios than you expect, even with reasonably compressible data. There are two reasons for this, the obvious one and the inobvious one. The obvious one is that ZFS compresses each logical block separately, and your logical blocks are small. Generally the larger the things you compress at once, the better most compression algorithms do, up to a reasonable size; if you use a small size, you get worse results and less compression.

(The lz4 command line compression program doesn't even have an option to compress in less than 64 Kb blocks (cf), which shows you what people think of the idea. The lz4 algorithm can be applied to smaller blocks, and ZFS does, but presumably the results are not as good.)

The inobvious problem is how a small recordsize interacts with a large physical block size (ie, a large ashift). In order to save any space on disk, compression has to shrink the data enough so that it uses fewer disk blocks. With 4 Kb disk blocks (an ashift of 12), this means you need to compress things down by at least 4 Kb; when you're starting with 8 Kb logical blocks because of your 8 Kb recordsize, this means you need at least 50% compression in order to save any space at all. If your data is compressible but not that compressible, you can't save any allocated space.

A larger recordsize gives you more room to at least save some space. With a 128 Kb recordsize, you need only compress a bit (to 120 Kb, about 7% compression) in order to save one 4 Kb disk block. Further increases in compression can get you more savings, bit by bit, because you have more disk blocks to shave away.

(An ashift=9 pool similarly gives you more room to get wins from compression because you can save space in 512 byte increments, instead of needing to come up with 4 Kb of space savings at a time.)
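
To put some numbers on this, here's a little Python sketch of the allocation rounding involved (my illustration, not ZFS code, and it ignores the separate question of when ZFS decides to keep a compressed version at all):

  def allocated_bytes(compressed_size, ashift):
      # Round a block's compressed size up to the vdev's allocation unit.
      unit = 1 << ashift
      blocks = (compressed_size + unit - 1) // unit
      return blocks * unit

  # An 8 Kb logical block that compresses to 5 Kb on an ashift=12 vdev
  # still allocates 8 Kb; you needed 50% compression to save anything.
  print(allocated_bytes(5 * 1024, 12))    # -> 8192

  # The same data compressed by 37.5% in a 128 Kb logical block (to 80 Kb)
  # saves twelve 4 Kb disk blocks.
  print(allocated_bytes(80 * 1024, 12))   # -> 81920 instead of 131072

  # On an ashift=9 vdev you save space in 512-byte steps instead.
  print(allocated_bytes(5 * 1024, 9))     # -> 5120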

(Writing this up as an entry was sparked by this ZFS lobste.rs discussion.)

PS: I believe that this implies that if your recordsize (or volblocksize) is the same as the disk physical block size (or ashift size), compression will never do anything for you. I'm not sure if ZFS will even try to run the compression code or if it will silently pretend that you have compression=off set.

ZFSRecordsizeAndCompression written at 01:27:30; Add Comment

2018-04-23

ZFS's recordsize as an honest way of keeping checksum overhead down

One of the classical tradeoffs of using checksums to verify the integrity of something (as ZFS does) is the choice of how large a chunk of data to cover with a single checksum. A large chunk size keeps the checksum overhead down, but it means that you have to process a large amount of data at once in order to verify or create the checksum. A large size also limits how specific you can be about what piece of data is damaged, which is important if you want to be able to recover some of your data.

(Recovery has two aspects. One is simply giving you access to as much of the undamaged data as possible. The other is how much data you have to process in order to heal corrupted data using various redundancy schemes. If you checksum over 16 Kbyte chunks and you have a single corrupted byte in a 1 Mbyte file, you can read 1008 Kbytes immediately and you only have to process the span of 16 Kbytes of data to recover from the corruption. If you checksum over 1 Mbyte chunks and have the same corruption, the entire file is unreadable and you're processing the span of 1 Mbyte of data to recover.)
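
As a rough illustration of the overhead side of the tradeoff, here's a Python sketch that counts checksums for a 1 Mbyte file at various chunk sizes. It counts only the 256-bit checksum itself and ignores the rest of the per-block metadata, so treat the percentages as lower bounds.

  def checksum_count(file_size, chunk_size):
      # How many checksum chunks a file of this size needs.
      return -(-file_size // chunk_size)    # ceiling division

  one_mb = 1024 * 1024
  for chunk in (16 * 1024, 128 * 1024, one_mb):
      chunks = checksum_count(one_mb, chunk)
      overhead = chunks * 32                # 32 bytes per 256-bit checksum
      # With 16 Kbyte chunks a single bad byte costs you one 16 Kbyte chunk;
      # with 1 Mbyte chunks it costs you the whole file.
      print("%7d-byte chunks: %3d checksums, %5d bytes (%.3f%%)"
            % (chunk, chunks, overhead, 100.0 * overhead / one_mb))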

If you're serious about checksums, you have to verify them on read and always create and update them on writes. This means that you have to operate on the entire checksum chunk size for these operations (even on partial chunk updates, depending on the checksum algorithm). Regardless of how the data is stored on disk, you have to have all of the chunk available in memory to compute and recompute the checksum. So if you want to have a relatively large checksum chunk size in order to keep overhead down, you might as well make this your filesystem block size, because you're forced to do a lot of IO in the checksum chunk size no matter what.

This is effectively what ZFS does for files that have grown to their full recordsize. The checksum chunk size is recordsize and so is the filesystem (logical) block size; ZFS stores one checksum for every recordsize chunk of the file (well, that actually exists). This keeps the overhead of checksums down nicely, and setting the logical filesystem block size to the checksum chunk size is honest about what IO is actually happening (especially in a copy on write filesystem).

If the ZFS logical block size was always recordsize, this could be a serious problem for small files. Ignoring compression, they would allocate far more space than they needed, creating huge amounts of inefficiency (you could have a 4 Kbyte file that had to allocate 128 Kbytes of disk space). So instead ZFS has what is in effect a variable checksum chunk size for small files, and with it a variable logical block size, in order to store such files reasonably efficiently. As we've seen, ZFS works fairly hard to only store the minimum amount of data it has to for small files (which it defines as files below recordsize).

(This model of why ZFS recordsize exists and operates the way it does didn't occur to me until I wrote yesterday's entry, but now that it has, I think I may finally have the whole thing sorted out in my head.)

ZFSRecordsizeAndChecksums written at 02:08:13; Add Comment

2018-04-22

Thinking about why ZFS only does IO in recordsize blocks, even random IO

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a file, both for reads and especially for writes. Since the default recordsize is 128 Kb, this means that many files of interest have recordsize blocks and thus all IO to them is done in 128 Kb units, even if you're only reading or writing a small amount of data.

On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.

I wrote before about the ZFS DVA, which is ZFS's equivalent of a block number and tells you where to find data. ZFS DVAs are embedded into 'block pointers', which you can find described in spa.h. One of the fields of the block pointer is the ZFS block checksum. Since this is part of the block pointer, it is a checksum over all of the (logical) data in the block, which is up to recordsize. Once a file reaches recordsize bytes long, all blocks are the same size, the recordsize.

Since the ZFS checksum is over the entire logical block, ZFS has to fetch the entire logical block in order to verify the checksum on reads, even if you're only asking for 4 Kbytes out of it. For writes, even if ZFS allowed you to have different sized logical blocks in a file, you'd need to have the original recordsize block available in order to split it and you'd have to write all of it back out (both because ZFS never overwrites in place and because the split creates new logical blocks, which need new checksums). Since you need to add new logical blocks, you might have a ripple effect in ZFS's equivalent of indirect blocks, where they must expand and shuffle things around.

(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
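
As an illustration of the resulting IO amplification, here's a Python sketch of what a copy on write filesystem with whole-block checksums has to do for a partial write to a file that's all recordsize blocks. This is my model of the behavior described above, not ZFS code, and it assumes the block isn't already cached in memory.

  RECORDSIZE = 128 * 1024   # the default ZFS recordsize

  def partial_write_io(offset, length, recordsize=RECORDSIZE):
      # Every logical block the write touches has to be read in full (to
      # verify its checksum and get the unmodified bytes) and then written
      # out in full as a new block.
      first = offset // recordsize
      last = (offset + length - 1) // recordsize
      blocks = last - first + 1
      return {"read_bytes": blocks * recordsize,
              "write_bytes": blocks * recordsize}

  # An 8 Kb write into the middle of a big file touches one 128 Kb block:
  # 128 Kb read plus 128 Kb written, for 8 Kb of new data.
  print(partial_write_io(offset=300 * 1024, length=8 * 1024))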

In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.

(The one thing that might make ZFS nicer in the face of some access patterns where this matters is the ability to set the recordsize on a per-file basis instead of just a per-filesystem basis. But I'm not sure how important this would be; the kind of environments where it really matters are probably already doing things like putting database tables on their own filesystems anyway.)

PS: This feels like an obvious thing once I've written this entry all the way through, but the ZFS recordsize issue has been one of my awkward spots for years, where I didn't really understand why it all made sense and had to be the way it was.

PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.

ZFSWhyIOInRecordsize written at 01:07:47; Add Comment

2018-03-19

Some exciting ZFS features that are in OmniOS CE's (near) future

I recently wrote about how much better ZFS pool recovery is coming, which reported on Pavel Zakharov's Turbocharging ZFS Data Recovery. In that, Zakharov said that the first OS to get it would likely be OmniOS CE, although he didn't have a timeline. Since I just did some research on this, let's run down some exciting ZFS features that are almost certainly in OmniOS CE's near future, and where they are.

There are two big ZFS features from Delphix that have recently landed in the main Illumos tree, and a third somewhat smaller one:

  • This better ZFS pool recovery, which landed as a series of changes culminating in issue 9075 in February or so. Although I can't be sure, I believe that a recovered pool is fully compatible with older ZFS versions, although for major damage you're going to be copying data out of the pool to a new one.

  • The long awaited feature of shrinking ZFS pools by removing vdevs, which landed as issue 7614 in January. Using this will add a permanent feature flag to your pool that makes it fully incompatible with older ZFS versions.

  • A feature for checkpointing the overall ZFS pool state before you do potentially dangerous operations, so that you can then rewind to the checkpoint, issue 9166, which landed just a few days ago. Since one of the purposes of better ZFS pool recovery is to provide (better) recovery over pool configuration changes, I suspect that this new pool checkpointing helps out with it. This makes your pool relatively incompatible with older ZFS versions while a checkpoint exists.

(Apparently Delphix was only able to push these upstream from their own code base to Illumos and OpenZFS relatively recently.)

All of these features have been pulled into the 'master' branch of the OmniOS CE repository from the main Illumos repo where they landed. Unless something unusual happens, I would expect them all to be included in the next release of OmniOS CE, which their release schedule says is to be r151026, expected some time this May. This will not be an LTS release; if you want to wait for an LTS release to have these features, you're waiting until next year. Given the likely magnitude of these changes and the relatively near future release of r151026, I wouldn't expect OmniOS CE to include these in a future update to the current r151024 or especially r151022 LTS.

Since OmniOS CE Bloody integrates kernel and user updates on a regular basis, I suspect that it already has many of these features and will pick up the most recent one very soon. If this is so, it gives OmniOS people a clear path if they need to recover a damaged pool; you can boot a Bloody install media or otherwise temporarily run it, repair or import the pool, possibly copying the data to another pool, and then probably revert back to running your current OmniOS with the recovered pool.

Sidebar: How to determine this sort of stuff

The most convenient way is to look at the git log for commits that involve, say, usr/src/uts/common/fs/zfs, the kernel ZFS code, in the OmniOS CE repo. In the Github interface, this is drilling down to that directory and then picking the 'History' option; a convenient link for this for the 'master' branch is here.

Each OmniOS CE release gets its own branch in the repo, named in the obvious way, and each branch thus has its own commit history for ZFS. Here is the version for r151024. Usefully, Github forms the URLs for these things in a very predictable way, making it very easy to hand-write your own URLs for specific things (eg, the same for r151022 LTS, which shows only a few recent ZFS changes).
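
Because the URL layout is so predictable, you can construct these links mechanically. Here's a small Python sketch that does so; it assumes the OmniOS CE repository lives at github.com/omniosorg/illumos-omnios and that release branches are named as described above.

  REPO = "https://github.com/omniosorg/illumos-omnios"   # assumed location
  ZFS_DIR = "usr/src/uts/common/fs/zfs"

  def zfs_history_url(branch):
      # The Github 'History' view for the kernel ZFS code on a branch.
      return "%s/commits/%s/%s" % (REPO, branch, ZFS_DIR)

  for branch in ("master", "r151024", "r151022"):
      print(zfs_history_url(branch))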

There's no branch for OmniOS CE Bloody, so I believe that it's simply built from the 'master' branch. It is the bleeding edge version, after all.

ZFSOmniosCEComingChanges written at 01:24:50; Add Comment

2018-03-17

Much better ZFS pool recovery is coming (in open source ZFS)

One of the long standing issues in ZFS has been that while it's usually very resilient, it can also be very fragile if the wrong things get damaged. Classically, ZFS has had two modes of operation; either it would repair any damage or it would completely explode. There was no middle ground of error recovery, and this isn't a great experience; as I wrote once, panicing the system is not an error recovery strategy. In early versions of ZFS there was no recovery at all (you restored from backups); in later versions, ZFS added a feature where you could attempt to recover from damaged metadata by rewinding time, which was better than nothing but not a complete fix.

The good news is that that's going to change, and probably not too long from now. What you want to read about this is Turbocharging ZFS Data Recovery, by Pavel Zakharov of Delphix, which covers a bunch of work that he's done to make ZFS more resilient and more capable of importing various sorts of damaged pools. Of particular interest is the ability to likely recover at least some data from a pool that's lost an entire vdev. You can't get everything back, obviously, but ZFS metadata is usually replicated on multiple vdevs so losing a single vdev will hopefully leave you with enough left to at least get the rest of the data out of the pool.

(I saw this article via, itself via a retweet by @richardelling.)

All of this is really great news. ZFS has long needed better options for recovery from various pool problems, as well as better diagnostics for failed pool imports, and I'm quite happy that the situation is finally going to be improving.

The article is also interesting for its discussion of the current low level issues involved in pool importing. For example, until I read it I had no idea about how potentially dangerous a ZFS pool vdev change was due to how pool configurations are handled during the import process. I'd love to read more details on how pool importing really works and what the issues are (it's a long standing interest of mine), but sadly I suspect that no one with that depth of ZFS expertise has the kind of time it would take to write such an article.

As far as the timing of these features being available in your ZFS-using OS of choice goes, his article says this:

As of March 2018, it has landed on OpenZFS and Illumos but not yet on FreeBSD and Linux, where I’d expect it to be upstreamed in the next few months. The first OS that will get this feature will probably be OmniOS Community Edition, although I do not have an exact timeline.

If you have a sufficiently important damaged pool, under some circumstances it may be good enough if there is some OS, any OS, that can bring up the pool to recover the data in it. For all that I've had my issues with OmniOS's hardware support, OmniOS CE does have fairly decent hardware support and you can probably get it going on most modern hardware in an emergency.

(And if OmniOS can't talk directly to your disk hardware, there's always iSCSI, as we can testify. There's probably also other options for remote disk access that OmniOS ZFS and zdb can deal with.)

PS: If you're considering doing this in the future and your normal OS is something other than Illumos, you might want to pay attention to the ZFS feature flags you allow to be set on your pool, since this won't necessarily work if your pool uses features that OmniOS CE doesn't (yet) support. This is probably not going to be an issue for FreeBSD but might be an issue for ZFS on Linux. You probably want to compare the ZoL manpage on ZFS pool features with the Illumos version or even the OmniOS CE version.

Sidebar: Current ZFS pool feature differences

The latest Illumos tree has three new ZFS pool features from Delphix: device_removal, obsolete_counts (which enhances device removal), and zpool_checkpoint. These are all fairly recent additions; they appear to have landed in the Illumos tree this January and just recently, although the commits that implement them are dated from 2016.

ZFS on Linux has four new pool features: large_dnode, project_quota, userobj_accounting, and encryption. Both large dnodes and encryption have to be turned on explicitly, and the other two are read-only compatible, so in theory OmniOS can bring a pool up read-only even with them enabled (and you're going to want to have the pool read-only anyway).

ZFSPoolRecoveryComing written at 00:54:35; Add Comment

2018-02-14

Some things about ZFS block allocation and ZFS (file) record sizes

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. For a file under the recordsize, the block size turns out to be in a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.

Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.

To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:

 0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here, in hex and L for logical). This is what grows in 512-byte units up to the recordsize and so on.

The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 Kb units (so you wind up with a compressed size of '1000P', ie 4 Kb).

The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').

(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)

For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.
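
Here's a small Python sketch of the three sizes as I've described them; it reproduces the numbers from the example block, but it's my reading of the rules rather than ZFS's actual code.

  def block_sizes(logical_size, compressed_size, ashift):
      unit = 1 << ashift
      if compressed_size >= logical_size:
          # Compression didn't help: the physical size equals the logical
          # size, which goes in 512-byte units.
          psize = logical_size
      else:
          # Compression helped: the physical size appears to go in
          # ashift-sized units.
          psize = -(-compressed_size // unit) * unit
      # The DVA allocated size is always in ashift-sized units.
      asize = -(-psize // unit) * unit
      return logical_size, psize, asize

  # 0x4200 bytes (16.5 Kb) of incompressible data on an ashift=12 vdev:
  # lsize 0x4200, psize 0x4200, asize 0x5000 (20 Kb), as in the zdb output.
  print([hex(n) for n in block_sizes(0x4200, 0x4200, 12)])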

On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.

PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?

(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)

ZFSLogicalVsPhysicalBlockSizes written at 00:49:29; Add Comment

2018-02-04

A surprise in how ZFS grows a file's record size (at least for me)

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. If a file has more than one block, all blocks are recordsize, no more and no less. If a file is a single block, the size of this block is based on how much data has been written to the file (or technically the maximum offset that's been written to the file). However, how the block size grows as you write data to the file turns out to be somewhat surprising (which makes me very glad that I actually did some experiments to verify what I thought I knew before I wrote this entry, because I was very wrong).

Rather than involving the ashift or growing in powers of two, ZFS always grows the (logical) block size in 512-byte chunks until it reaches the filesystem recordsize. The actual physical space allocated on disk is in ashift sized units, as you'd expect, but this is not directly related to the (logical) block size used at the file level. For example, here is a 16896 byte file (of incompressible data) on an ashift=12 pool:

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
4780566    1   128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
[...]
0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The DVA records a 0x5000 byte allocation (20 Kb), but the logical and physical sizes are only 0x4200 bytes (16.5 Kb).

In thinking about it, this makes a certain amount of sense because the ashift is really a vdev property, not a pool property, and can vary from vdev to vdev within a single pool. As a result, the actual allocated size of a given block may vary from vdev to vdev (and a block may be written to multiple vdevs if you have copies set to more than 1 or it's metadata). The file's current block size thus can't be based on the ashift, because ZFS doesn't necessarily have a single ashift to base it on; instead ZFS bases it on 512-byte sectors, even if this has to be materialized differently on different vdevs.

Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:

; dd if=/dev/zero of=testfile bs=128k count=1
[...]
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956361    1   128K   128K      0     512   128K    0.00  ZFS plain file
[...]

This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956029    1   128K   128K     1K     512   128K  100.00  ZFS plain file
[...]
0 L0 DVA[0]=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]

We have a compressed size of 1 Kb (and a 1 Kb allocation on disk, as this is an ashift=9 vdev), but once again the file block size is 128 Kb.

(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)

What this means is that ZFS has much less wasted space than I thought it did for files that are under the recordsize. Since such files grow their logical block size in 512-byte chunks, even with no compression they waste at most almost all of one physical block on disk (if you have a file that is, say, 32 Kb plus one byte, you'll have a physical block on disk with only one byte used). This has some implications for other areas of ZFS, but those are for another entry.
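
Here's a Python sketch of the growth rule and the worst-case waste as I understand them (my model of the behavior, not ZFS code):

  def logical_block_size(file_size, recordsize=128 * 1024):
      # The logical block size of a single-block file: the highest written
      # offset rounded up to 512 bytes, capped at the recordsize.
      if file_size >= recordsize:
          return recordsize
      return -(-file_size // 512) * 512

  def worst_case_waste(file_size, ashift=12):
      # Space wasted for an incompressible single-block file: the on-disk
      # allocation is in ashift-sized units even though the logical block
      # size grows in 512-byte steps.
      lsize = logical_block_size(file_size)
      unit = 1 << ashift
      asize = -(-lsize // unit) * unit
      return asize - file_size

  print(logical_block_size(16896))         # -> 16896 (16.5 Kb, as in the zdb output)
  print(worst_case_waste(32 * 1024 + 1))   # -> 4095, nearly a whole 4 Kb block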

(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)

ZFSRecordsizeGrowth written at 22:28:55; Add Comment
