Wandering Thoughts archives

2018-05-27

ZFS pushes file renamings and other metadata changes to disk quite promptly

One of the general open questions on Unix is when changes like renaming or creating files are actually durably on disk. Famously, some filesystems on some Unixes have been willing to delay this for an unpredictable amount of time unless you did things like fsync() the containing directory of your renamed file, not just fsync() the file itself. As it happens, ZFS's design means that it offers some surprisingly strong guarantees about this; specifically, ZFS persists all metadata changes to disk no later than the next transaction group commit. In ZFS today, a transaction group commit generally happens every five seconds, so if you do something like rename a file, your rename will be fully durable quite soon even if you do nothing special.
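
As an aside, on Linux with OpenZFS the transaction group commit interval is the zfs_txg_timeout tunable, which defaults to five seconds. Here's a minimal Python sketch to check it; the sysfs path is a Linux-specific assumption, and other platforms expose the same tunable through other means:

    # Read the current txg commit interval, in seconds.
    # /sys/module/zfs/parameters is specific to ZFS on Linux; elsewhere
    # zfs_txg_timeout is a kernel tunable reached through other means.
    with open('/sys/module/zfs/parameters/zfs_txg_timeout') as f:
        print('txg commit interval:', f.read().strip(), 'seconds')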

However, this doesn't mean that if you create a file, write data to it, and then rename it (with no other special operations), your new file is guaranteed to be present under its new name with all of the data you wrote within five or ten seconds. Although metadata operations like creating and renaming files go to ZFS right away and then become part of the next txg commit, the kernel generally holds on to written file data for a while before pushing it out. You need some sort of fsync() in there to force the kernel to commit your data, not just your file creation and renaming. Because of how the ZFS intent log works, you don't need to do anything more than fsync() your file here; when you fsync() a file, all pending metadata changes are flushed out to disk along with the file data.

(In a 'create new version, write, rename to overwrite the current version' setup, I think you want to fsync() the file twice, once after the write and once after the rename; otherwise you haven't necessarily forced the rename itself to be written out. You don't want to do the rename before the first fsync(), because then I think a crash at just the wrong time could leave you with an empty new file. But the ice is thin here in portable code, including code that wants to be portable across different filesystem types.)
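
To make the pattern concrete, here's a minimal Python sketch of the 'write new version, fsync, rename, fsync' sequence described above. The file names are made up for the example, and the final directory fsync() is the extra belt-and-suspenders step that portable code wants on filesystems other than ZFS:

    import os

    def replace_file_durably(path, data):
        # Write the new version under a temporary name in the same directory.
        tmp = path + '.new'
        with open(tmp, 'wb') as f:
            f.write(data)
            f.flush()
            # First fsync(): force the file data (and, on ZFS, the pending
            # file creation) to disk before we rename over the old version.
            os.fsync(f.fileno())
        # Atomically replace the current version with the new one.
        os.rename(tmp, path)
        # Second fsync(): on ZFS this is enough to push the rename out via
        # the intent log.
        with open(path, 'rb') as f:
            os.fsync(f.fileno())
        # For portability to other filesystems, also fsync() the directory.
        dfd = os.open(os.path.dirname(path) or '.', os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)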

My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.

ZFSWhenMetadataSynced written at 22:47:51

2018-05-18

ZFS spare-N spare vdevs in your pool are mirror vdevs

Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:

[...]
   wwn-0x5000cca251b79b98    ONLINE  0  0  0
   spare-8                   ONLINE  0  0  0
     wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
     wwn-0x5000cca2568314fc  ONLINE  0  0  0
   wwn-0x5000cca251ca10b0    ONLINE  0  0  0
[...]

What is this spare-8 thing, beyond 'a sign that a spare activated here'? This is sometimes called a 'spare vdev', and the answer is that spare vdevs are mirror vdevs.

Yes, I know, ZFS says that you can't put one vdev inside another vdev, and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface, and as you'd expect they're implemented with exactly the same code in the ZFS kernel.

(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types they run the same code to do various operations. Admittedly, there are some other sections in the ZFS code that check to see whether they're operating on a real mirror vdev or a spare vdev.)

What this means is that these spare-N vdevs behave like mirror vdevs. Assuming that both sides are healthy, reads can be satisfied from either side (and will be balanced back and forth as they are for mirror vdevs), writes will go to both sides, and a scrub will check both sides. As a result, if you scrub a pool with a spare-N vdev and there are no problems reported for either component device, then both the old and the new device are fine and contain a full, intact copy of the data. You can keep either (or both).

As a side note, it's possible to manually create your own spare-N vdevs even without a fault, because spare activation is actually a user-level thing in ZFS. Although I haven't tested this recently, you generally get a spare-N vdev if you do 'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK> is configured as a spare in the pool. Abusing this to create long-term mirrors inside raidZ vdevs is left as an exercise for the reader.

(One possible reason to have a relatively long term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same if you don't entirely trust the new disk and want to run it in parallel before pulling the old one.)

PS: As you might expect, the replacing-N vdev that you get when you replace a disk is also a mirror vdev, with the special behavior that when the resilver finishes, the original device is normally automatically detached.

ZFSSparesAreMirrors written at 22:44:19

2018-05-02

An interaction of low ZFS recordsize, compression, and advanced format disks

Suppose that you have something with a low ZFS recordsize; a classical example is zvols, where people often use an 8 Kb volblocksize. You have compression turned on, and you are using a pool (or vdev) with ashift=12 because it's on 'advanced format' drives or you're preparing for that possibility. This seems especially likely on SSDs, some of which are already claiming to be 4K physical sector drives.

In this situation, you will probably get much lower compression ratios than you expect, even with reasonably compressible data. There are two reasons for this, the obvious one and the inobvious one. The obvious one is that ZFS compresses each logical block separately, and your logical blocks are small. Generally, the larger the chunks you compress at once, the better most compression algorithms do, up to a reasonable size; if you compress in small chunks, you get worse results and less compression.

(The lz4 command line compression program doesn't even have an option to compress in less than 64 Kb blocks (cf), which shows you what people think of the idea. The lz4 algorithm can be applied to smaller blocks, and ZFS does, but presumably the results are not as good.)
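
As a rough illustration of the block size effect, here's a small Python sketch that compresses the same data once as a single 128 Kb chunk and once as sixteen separate 8 Kb chunks. It uses zlib simply because it's in the standard library; lz4 differs in the details, but the direction of the effect is the same:

    # Compare compressing one 128 Kb chunk against sixteen 8 Kb chunks
    # of the same data. The data here is just a stand-in example.
    import zlib

    data = (b'some reasonably compressible example text ' * 4096)[:128 * 1024]

    whole = len(zlib.compress(data))
    chunked = sum(len(zlib.compress(data[i:i + 8192]))
                  for i in range(0, len(data), 8192))

    print('one 128 Kb chunk:   ', whole, 'bytes')
    print('sixteen 8 Kb chunks:', chunked, 'bytes')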

The inobvious problem is how a small recordsize interacts with a large physical block size (ie, a large ashift). In order to save any space on disk, compression has to shrink the data enough so that it uses fewer disk blocks. With 4 Kb disk blocks (an ashift of 12), this means you need to compress things down by at least 4 Kb; when you're starting with 8 Kb logical blocks because of your 8 Kb recordsize, this means you need at least 50% compression in order to save any space at all. If your data is compressible but not that compressible, you can't save any allocated space.

A larger recordsize gives you more room to at least save some space. With a 128 Kb recordsize, you need only compress a bit (to 120 Kb, about 7% compression) in order to save one 4 Kb disk block. Further increases in compression can get you more savings, bit by bit, because you have more disk blocks to shave away.

(An ashift=9 pool similarly gives you more room to get wins from compression because you can save space in 512 byte increments, instead of needing to come up with 4 Kb of space savings at a time.)
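
To make the arithmetic concrete, here's a small sketch of how a compressed logical block's allocated size rounds up to whole disk blocks. This is a simplified model that ignores ZFS metadata and ZFS's own rules about when compression is considered worth keeping:

    # Allocated space for a compressed logical block, given the disk
    # block size implied by ashift (a simplified model).
    def allocated(compressed_size, ashift=12):
        sector = 1 << ashift
        return -(-compressed_size // sector) * sector  # round up to a whole sector

    print(allocated(6 * 1024))             # 8 Kb record, 25% compression: still 8192
    print(allocated(4 * 1024))             # 8 Kb record, 50% compression: 4096
    print(allocated(120 * 1024))           # 128 Kb record compressed to 120 Kb: 122880
    print(allocated(121 * 1024, ashift=9)) # ashift=9 allocates in 512-byte steps: 123904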

(Writing this up as an entry was sparked by this ZFS lobste.rs discussion.)

PS: I believe that this implies that if your recordsize (or volblocksize) is the same as the disk physical block size (or ashift size), compression will never do anything for you. I'm not sure if ZFS will even try to run the compression code or if it will silently pretend that you have compression=off set.

ZFSRecordsizeAndCompression written at 01:27:30
