Wandering Thoughts archives

2018-05-18

ZFS spare-N spare vdevs in your pool are mirror vdevs

Here's something that comes up every so often in ZFS and is not as well publicized as perhaps it should be (I most recently saw it here). Suppose that you have a pool, there's been an issue with one of the drives, and you've had a spare activate. In some situations, you'll wind up with a pool configuration that may look like this:

[...]
   wwn-0x5000cca251b79b98    ONLINE  0  0  0
   spare-8                   ONLINE  0  0  0
     wwn-0x5000cca251c7b9d8  ONLINE  0  0  0
     wwn-0x5000cca2568314fc  ONLINE  0  0  0
   wwn-0x5000cca251ca10b0    ONLINE  0  0  0
[...]

What is this spare-8 thing, beyond 'a sign that a spare activated here'? This is sometimes called a 'spare vdev', and the answer is that spare vdevs are mirror vdevs.

Yes, I know, ZFS says that you can't put one vdev inside another vdev, and these spare-N vdevs are inside other vdevs. ZFS is not exactly wrong, since it doesn't let you and me do this, but ZFS itself can break its own rules and it's doing so here. These really are mirror vdevs under the surface, and as you'd expect they're implemented with exactly the same code in the ZFS kernel.

(If you're being sufficiently technical these are actually a slightly different type of mirror vdev, which you can see being defined in vdev_mirror.c. But while they have different nominal types, they run the same code for various operations. Admittedly, there are some other places in the ZFS code that check whether they're operating on a real mirror vdev or a spare vdev.)

What this means is that these spare-N vdevs behave like mirror vdevs. Assuming that both sides are healthy, reads can be satisfied from either side (and will be balanced back and forth, as they are for mirror vdevs), writes go to both sides, and a scrub checks both sides. As a result, if you scrub a pool with a spare-N vdev and no problems are reported for either component device, then both the old and the new device are fine and each contains a full and intact copy of the data. You can keep either (or both).
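
To make this concrete, here's a sketch of how you might resolve the spare-8 example above once a scrub has come back clean. The pool name isn't shown in the excerpt, so I'm writing it as <POOL>; the disk names are the ones from the status output. Detaching the original disk promotes the spare to a permanent member of the vdev, while detaching the spare instead returns it to the pool's list of available spares.

   zpool scrub <POOL>
   zpool status <POOL>     # verify that neither side of spare-8 reports errors

   # either keep the spare as the permanent disk:
   zpool detach <POOL> wwn-0x5000cca251c7b9d8
   # or keep the original disk and release the spare:
   zpool detach <POOL> wwn-0x5000cca2568314fc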

As a side note, it's possible to manually create your own spare-N vdevs even without a fault, because spare activation is actually a user-level thing in ZFS. Although I haven't tested this recently, you generally get a spare-N vdev if you do 'zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>' and <NEW-DISK> is configured as a spare in the pool. Abusing this to create long-term mirrors inside raidZ vdevs is left as an exercise for the reader.
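
As a sketch (with <POOL>, <ACTIVE-DISK>, and <NEW-DISK> standing in for your real names), the manual route looks something like this:

   # make the new disk a configured spare in the pool, if it isn't already
   zpool add <POOL> spare <NEW-DISK>

   # 'replace' a perfectly healthy disk with the spare, creating a spare-N mirror
   zpool replace <POOL> <ACTIVE-DISK> <NEW-DISK>

Because <NEW-DISK> is a spare rather than an ordinary replacement disk, the spare-N vdev sticks around after the resilver finishes, until you detach one side or the other yourself.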

(One possible reason to have a relatively long-term mirror inside a raidZ vdev is if you don't entirely trust one disk but don't want to pull it immediately, and you also have a handy spare disk. Here you're effectively pre-deploying a spare in case the first disk explodes on you. You could also do the same thing if you don't entirely trust a new disk and want to run it in parallel before pulling the old one.)

PS: As you might expect, the replacing-N vdev that you get when you replace a disk is also a mirror vdev, with the special behavior that when the resilver finishes, the original device is normally detached automatically.
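
As a sketch, a plain 'zpool replace <POOL> <OLD-DISK> <NEW-DISK>' with a non-spare <NEW-DISK> shows up in 'zpool status' much like the spare-8 excerpt above, just under a different name:

   replacing-8                 ONLINE  0  0  0
     <OLD-DISK>                ONLINE  0  0  0
     <NEW-DISK>                ONLINE  0  0  0

When the resilver finishes, <OLD-DISK> drops out and <NEW-DISK> simply takes its place in the normal vdev listing.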

solaris/ZFSSparesAreMirrors written at 22:44:19

How I usually divide up NFS (operation) metrics

When you're trying to generate metrics for local disk IO, life is relatively simple. Everyone knows that you usually want to track reads separately from writes, especially these days when they may have significantly different performance characteristics on SSDs. While there are sometimes additional operations issued to physical disks, they're generally not important. If you have access to OS-level information, it can be useful to split your reads and writes into synchronous versus asynchronous ones.

Life with NFS is not so simple. NFS has (data) read and write operations, like disks do, but it also has a large collection of additional protocol operations that do various things (although some of these protocol operations are strongly related to data writes, for example the COMMIT operation, and should probably be counted as data writes in some way). If you're generating NFS statistics, how do you want to break up or aggregate these other operations?

One surprisingly popular option is to ignore all of them on the grounds that they're obviously unimportant. My view is that this is a mistake in general, because these NFS operations can have an IO impact on the NFS server and create delays on the NFS clients if they're not satisfied fast enough. But if we want to say something about these and we don't want to go to the extreme of presenting per-operation statistics (which is probably too much information, and in any case can hide patterns in noise), we need some sort of breakdown.

The breakdown that I generally use is to split up NFS operations into four categories: data reads, data writes (including COMMIT), operations that cause metadata writes such as MKDIR and REMOVE, and all other operations (which are generally metadata reads, for example READDIRPLUS and GETATTR). This split is not perfect, partly because some metadata read operations are far more common (and are far more cached on the server) than other operations; specifically, GETATTR and ACCESS are often the backbone of a lot of NFS activity, and it's common to see GETATTR as by far the most common single operation.
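
To make the split concrete, here's one plausible way the NFS v3 operations might land in the four categories. This grouping is only a sketch (NULL is omitted), and you could argue about where one or two of these belong:

   data reads:       READ
   data writes:      WRITE, COMMIT
   metadata writes:  CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, SETATTR
   everything else:  GETATTR, ACCESS, LOOKUP, READLINK, READDIR, READDIRPLUS, FSSTAT, FSINFO, PATHCONF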

(I'm also not entirely convinced that this is the right split; as with other metrics wrestling, it may just be a convenient one that feels logical.)

Sidebar: Why this comes up less with local filesystems and devices

If what you care about is the impact that IO load is having on the system (and how much IO load there is), you don't entirely care why an IO request was issued, only that it was. From the disk drive's perspective, a 16 KB read is a 16 KB read; it takes as much work to read 16 KB of a file as it does 16 KB of a directory or of a free space map. This doesn't work for NFS because NFS is more abstracted, and neither the number of operations nor the number of bytes that flow over the wire necessarily gives you a true picture of the impact on the server.

Of course, in these days of SSDs and complicated disk systems, just having IO read and write information may not give you a true picture either. With SSDs especially, we know that bursts of writes are different from sustained writes, that writing to a full disk is often different from writing to an empty one, and that apparently giving drives some idle time to do background processing and literally cool down may change their performance. But many metrics are simplifications, so we do the best we can.

(Actual read and write performance is a 'true picture' in one sense, in that it tells you what results the OS is getting from the drive. But it doesn't necessarily tell you why, or what you can do to improve the situation.)

tech/NFSMyMetricsSplit written at 01:44:01

