Wandering Thoughts archives


Doing frequent ZFS scrubs lets you discover problems close to when they happened

Somewhat recently, the ZFS on Linux mailing list had a discussion of how frequently you should do ZFS scrubs, with a number of people suggesting that modern drives only really need relatively infrequent scrubs. As I was reading through the thread as part of trying to catch up on the list, it struck me that there is a decent reason for scrubbing frequently despite this. If we assume that scrubs surface existing problems that had previously been silent (instead of creating new ones), doing frequent scrubs lowers the mean time before you detect such problems.

Lowering the mean time to detection has the same advantage it does in programming (with things like unit tests), which is that it significantly narrows down when the underlying problem could have happened. If you scrub data once a month and you find a problem in a scrub, the problem could have really happened any time in the past month; if you scrub every week and find a problem, you know it happened in the past week. Relatedly, the sooner you detect that a problem happened in the recent past, the more likely you are to still have logs, traces, metrics, and other information that might let you look for anomalies and find a potential cause (beyond 'the drive had a glitch', because that's not always the problem).
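
To put rough numbers on this, here is a minimal Python sketch. It assumes that a silent problem appears at a uniformly random time between two scrubs, that the next scrub always catches it, and that a scrub takes no time; the function name and the 30-day stand-in for 'a month' are mine, purely for illustration.

```python
def detection_window_days(scrub_interval_days):
    """Return (worst_case, mean) days that a silent problem stays undetected.

    Assumption: the problem appears at a uniformly random moment between
    two scrubs, and the next scrub always finds it.
    """
    worst_case = scrub_interval_days
    mean = scrub_interval_days / 2
    return worst_case, mean


for label, interval in (("monthly", 30), ("weekly", 7)):
    worst_case, mean = detection_window_days(interval)
    print(f"{label}: up to {worst_case} days, {mean} days on average")
```

Under these assumptions the window you have to search shrinks in direct proportion to the scrub interval, which is the whole point.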

In a modern ZFS environment with sequential scrubs (or just SSDs), scrubs are generally fast and low impact (although it depends on your IO load), so the impact of doing them every week for all of your data is probably low. I try to scrub the pools on my personal machines every week, and I generally don't notice. Now that I'm thinking about scrubs this way, I'm going to try to be more consistent about weekly scrubs.

(Our fileservers scrub each pool once every four weeks on a rotating basis. We could lower this, even down to once a week, but despite what I've written here I suspect that we're not going to bother. We don't see checksum errors or other problems very often, and we probably aren't going to do deep investigation of anything that turns up. If we can trace a problem to a disk IO error or correlate it with an obvious and alarming SMART metric, we're likely to replace the disk; otherwise, we're likely to clear the error and see if it comes back.)

ZFSFrequentScrubsBenefit written at 23:25:33


What we do to enable us to grow our ZFS pools over time

In my entry on why ZFS isn't good at growing and reshaping pools, I mentioned that we go to quite some lengths in our ZFS environment to be able to incrementally expand our pools. Today I want to put together all of the pieces of that in one place to discuss what those lengths are.

Our big constraint is that not only do we need to add space to pools over time, but we have a fairly large number of pools and which pools will have space added to them is unpredictable. We need a solution to pool expansion that leaves us with as much flexibility as possible for as long as possible. This pretty much requires being able to expand pools in relatively small increments of space.

The first thing we do, or rather don't do, is that we don't use raidz. Raidz is potentially attractive on SSDs (where the raidz read issue has much less impact), but since you can't expand a raidz vdev, the minimum expansion for a pool using raidz vdevs is at least three or four separate 'disks' to make a new raidz vdev (and in practice you'd normally want to use more than that to reduce the raidz overhead, because a four disk raidz2 vdev is basically a pair of mirrors with slightly more redundancy but more awkward management and some overheads). This requires adding relatively large blocks of space at once, which isn't feasible for us. So we have to do ZFS mirroring instead of the more space efficient raidz.

(A raidz2 vdev is also potentially more resilient than a bunch of mirror vdevs, because you can lose any arbitrary two disks without losing the pool.)

However, plain mirroring of whole disks would still not work for us because that would mean growing pools by relatively large amounts of space at a time (and strongly limit how many pools we can put on a single fileserver). To enable growing pools by smaller increments of space than a whole disk, we partition all of our disks into smaller chunks, currently four chunks on a 2 TB disk, and then do ZFS mirror vdevs using chunks instead of whole disks. This is not how you're normally supposed to set up ZFS pools, and on our older fileservers using HDs over iSCSI it caused visible performance problems if a pool ever used two chunks from the same physical disk. Fortunately those seem to be gone on our new SSD-based fileservers.

Even with all of this we can't necessarily let people expand existing pools by a lot of space, because the fileserver their pool is on may not have enough free space left (especially if we want other pools on that fileserver to still be able to expand). When people buy enough space at once, we generally wind up starting another ZFS pool on a different fileserver, which somewhat cuts against the space flexibility that ZFS offers. People may not have to decide up front how much space they want their filesystems to have, but they may have to figure out which pool a new filesystem should go into and then balance usage across all of their pools (or have us move filesystems).

(Another thing we do is that we sell pool space to people in 1 GB increments, although usually they buy more at once. This is implemented using a pool quota, and of course that means that we don't even necessarily have to grow the pool's space when people buy space; we can just increase the quota.)

Although we can grow pools relatively readily (when we need to), we still have the issue that adding a new vdev to a ZFS pool doesn't rebalance space usage across all of the pool's vdevs; it just mostly writes new data to the new vdev. In a SSD world where seeks are essentially free and we're unlikely to saturate the SSD's data transfer rates on any regular basis, this imbalance probably doesn't matter too much. It does make me wonder if nearly full pool vdevs interact badly with ZFS's issues with coming near quota limits (and a followup).

ZFSHowWeGrowPools written at 23:23:14


Some effects of the ZFS DVA format on data layout and growing ZFS pools

One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. The short summary of what fields DVAs have and what they mean is that DVAs tell us how to find blocks by giving us their vdev (by number) and their byte offset into that particular vdev (and then their size). A typical DVA might say that you find what it's talking about on vdev 0 at byte offset 0x53a40ed000. There are some consequences of this that I hadn't really thought about until the other day.
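
As a rough illustration of those fields, here is a small Python sketch that pulls a DVA apart. It assumes a textual form of 'vdev:offset:size' with a decimal vdev number and hexadecimal offset and size, which is modeled on how zdb prints DVAs; the DVA class, the parse_dva name, and the 0x1000 size in the example are all made up for illustration.

```python
from typing import NamedTuple


class DVA(NamedTuple):
    vdev: int    # which vdev in the pool the data lives on
    offset: int  # byte offset into that vdev
    asize: int   # allocated size in bytes


def parse_dva(text):
    """Parse an assumed 'vdev:offset:asize' string, such as '<0:53a40ed000:1000>'.

    Assumption: the vdev number is decimal, and the offset and size are
    hexadecimal, in the style of zdb output.
    """
    vdev, offset, asize = text.strip("<>").split(":")
    return DVA(int(vdev), int(offset, 16), int(asize, 16))


dva = parse_dva("<0:53a40ed000:1000>")
print(f"vdev {dva.vdev}, byte offset {dva.offset:#x}, {dva.asize} bytes")
```

Nothing in the parsed result says which disk to look at; that is left entirely to the vdev, which is what the rest of this entry is about.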

Right away we can see why ZFS has a problem removing a vdev; the vdev's number is burned into every DVA that refers to data on it. If there's no vdev 0 in the pool, ZFS has no idea where to even start looking for data because all addressing is relative to the vdev. ZFS pool shrinking gets around this by adding a translation layer that says where to find the portions of vdev 0 that you care about after it's been removed.

In a mirror vdev, any single disk must be enough by itself to recover all data. Since the DVA simply specifies a byte offset within the vdev, this implies that in ZFS mirror vdevs, all copies of a block are at the same place on each disk, contrary to what I once thought might be the case. If vdev 0 is a mirror vdev, our DVA says that we can find our data at byte offset 0x53a40ed000 on each and every disk.

In a RAID-Z vdev, our data lives across multiple disks (with parity) but we only have the byte offset to its start (and then its size). The first implication of this is that in a RAID-Z vdev, a block is always striped sequentially across your disks at basically the same block offsets. ZFS doesn't find one bit of free space on disk 1, a separate bit on disk 2, a third bit on disk 3, and so on, and join them all together; instead it finds a contiguous stripe of free space starting on some disk, and uses it. This space can be short or long, it doesn't have to start on the first disk in the RAID-Z vdev, and it can wrap around (possibly repeatedly).

(This makes it easier for me to understand why ZFS rounds raidzN write sizes up to multiples of N+1 blocks. Possibly I understood this at some point, but if so I'd forgotten it since.)

Another way to put this is that for RAID-Z vdevs, the DVA vdev byte addresses snake across all of the vdev's disks in sequence, switching to a new disk every asize bytes. In a vdev with a 4k asize, vdev bytes 0 to 4095 are on the first disk, vdev bytes 4096 to 8191 are on the second disk, and so on. The unfortunate implication of this is that the number of disks in a RAID-Z vdev is an implicit part of the addresses of data in it. The mapping from vdev byte offset to the disk and the disk's block where the block's stripe starts depends on how many disks are in the RAID-Z vdev.
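
The snaking can be sketched in a few lines of Python. This is a simplified model rather than the real ZFS code: it only finds where a stripe starts, it ignores parity placement and the label space at the front of each disk, and raidz_locate is a name I made up.

```python
def raidz_locate(vdev_offset, ndisks, asize=4096):
    """Map a RAID-Z vdev byte offset to (disk index, byte offset on that disk).

    Simplified model: vdev addresses advance across the disks in units of
    asize bytes, wrapping back to the first disk after the last one.
    """
    block = vdev_offset // asize          # which asize-sized unit of the vdev
    disk = block % ndisks                 # the disk that this unit lands on
    disk_offset = (block // ndisks) * asize
    return disk, disk_offset


offset = 0x53a40ed000
for ndisks in (4, 5):
    disk, disk_offset = raidz_locate(offset, ndisks)
    print(f"{ndisks} disks: disk {disk}, disk byte offset {disk_offset:#x}")
```

The same vdev offset resolves to a different disk and a different place on that disk as soon as the disk count changes, so every existing DVA would point at the wrong data.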

(I'm pretty certain this means that I was wrong in my previous explanation of why ZFS can't allow you to add disks to raidz vdevs. The real problem is not inefficiency in the result, it's that it would blow up your ability to access all data in your vdev.)

ZFS can grow both mirror vdevs and raidz vdevs if you replace the disks with larger ones because in both cases this is just adding more available bytes of space at the top of ZFS's per-vdev byte address range for DVAs. You have to replace all of the disks because in both cases, all disks participate in the addressing. In mirror vdevs this is because you write new data at the same offset into each disk, and in raidz vdevs it's because the addressable space is striped across all disks and you can't have holes in it.

(You can add entire new vdevs because that doesn't change the interpretation of any existing DVAs, since the vdev number is part of the DVA and the byte address is relative to the vdev, not the pool as a whole. This feels obvious right now but I want to write it down for my future self, since someday it probably won't be as clear.)

ZFSDVAFormatAndGrowth written at 22:41:19

Why ZFS is not good at growing and reshaping pools (or shrinking them)

I recently read Mark McBride's Five Years of Btrfs (via), which has a significant discussion of why McBride chose Btrfs over ZFS that boils down to ZFS not being very good at evolving your pool structure. You might doubt this judgment from a Btrfs user, so let me say as both a fan of ZFS and a long term user of it that this is unfortunately quite true; ZFS is not a good choice if you want to modify your pool disk layout significantly over time. ZFS works best if the only change in your pools that you do is replacing drives with bigger drives. In our ZFS environment we go to quite some lengths to be able to expand pools incrementally over time, and while this works it both leaves us with unbalanced pools and means that we're basically forced to use mirroring instead of RAIDZ.

(An unbalanced pool is one where some vdevs and disks have much more data than others. This is less of an issue for us now that we're using SSDs instead of HDs.)

You might sensibly ask why ZFS is not good at this, despite being many years old (and people having had this issue with ZFS for a long time). One fundamental reason is that ZFS is philosophically and practically opposed to rewriting existing data on disk; once written, it wants everything to be completely immutable (apart from copying it to replacement disks, and more or less). But any sort of restructuring or re-balancing of a pool of storage (whether ZFS or Btrfs or whatever) necessarily involves shifting data around; data that used to live on this disk must be rewritten so that it now lives on that disk (and all of this has to be kept track of, directly or indirectly). It's rather difficult to have immutable data but mutable storage layouts.

(In the grand tradition of computer science we can sort of solve this problem with a layer of indirection, where the top layer stays immutable but the bottom layer mutates. This is awkward and doesn't entirely satisfy either side, and is in fact how ZFS's relatively new pool shrinking works.)

This is also the simpler approach for ZFS to take. Not having to support reshaping your storage requires less code and less design (for instance, you don't have to figure out how to reliably keep track of how far along a reshaping operation is). Less code also means fewer bugs, and bugs in reshaping operations can be catastrophic. Since ZFS was not designed to support any real sort of reshaping, adding it would be a lot of work (in both design and code) and raise a lot of questions, which is a good part of why no one has really tackled this for all of the years that ZFS has been around.

(The official party line of ZFS's design is more or less that you should get your storage right the first time around, or to put it another way, that ZFS was designed for locally attached storage where you start out with a fully configured system rather than incrementally expanding to full capacity over time.)

(This is an aspect of how ZFS is not a universal filesystem. Just as ZFS is not good for all workloads, it's not good for all patterns of growth and system evolution.)

ZFSWhyNoRealReshaping written at 00:20:22


A retrospective on our OmniOS ZFS-based NFS fileservers

Our OmniOS fileservers have now been out of service for about six months, which makes it somewhat past time for a retrospective on them. Our OmniOS fileservers followed on our Solaris fileservers, which I wrote a two part retrospective on (part 1, part 2), and have now been replaced by our Linux fileservers. To be honest, I have been sitting on my hands about writing this retrospective because we have mixed feelings about our OmniOS fileservers.

I will put the summary up front. OmniOS worked reasonably well for us over its lifespan here and looking back I think it was almost certainly the right choice for us at the time we made that choice (which was 2013 and 2014). However it was not without issues that marred our experience with it in practice, although not enough to make me regret that we ran it (and ran it for as long as we did). Part of our issues are likely due to a design mistake in making our fileservers too big, although this design mistake was probably magnified when we were unable to use Intel 10G-T networking in OmniOS.

On the one hand, our OmniOS fileservers worked, almost always reliably. Like our Solaris fileservers before them, they ran quietly for years without needing much attention, delivering NFS fileservice to our Ubuntu servers; specifically, we ran them for about five years (2014 through 2019, although we started migrating away at the end of 2018). Over this time we had only minor hardware issues and not all that many disk failures, and we suffered no data loss (with ZFS checksums likely saving us several times, and certainly providing good reassurances). Our overall environment was easy to manage and was pretty much problem free in the face of things like failed disks. I'm pretty sure that our users saw a NFS environment that was solid, reliable, and performed well pretty much all of the time, which is the important thing. So OmniOS basically delivered the fileserver environment we wanted.

(Our Linux iSCSI backends ran so problem free that I almost forgot to mention them here; we basically got to ignore them the entire time we ran our OmniOS fileserver environment. I think that they routinely had multi-year uptimes; certainly they didn't go down outside of power shutdowns (scheduled or unscheduled).)

On the other hand, we ran into real limitations with OmniOS and our fileservers were always somewhat brittle under unusual conditions. The largest limitation was the lack of working 10G-T Ethernet (with Intel hardware); now that we have Linux fileservers with 10G-T, it's fairly obvious what we were missing and that it did really matter. Our OmniOS fileservers were also not fully reliable; they would lock up, reboot, or perform very badly under an array of fortunately exceptional conditions to a far greater degree than we liked (for example, filesystems that hit quota limits). We also had periodic issues from having two iSCSI networks, where OmniOS would decide to use only one of them for one or more iSCSI targets and we had to fiddle things in magic ways to restore our redundancy. It says something that our OmniOS fileservers were by far the most crash-prone systems we operated, even if they didn't crash very often. Some of the causes of these issues were identified, much like our 10G-T problems, but they were never addressed in the OmniOS and Illumos kernel to the best of my knowledge.

(To be clear here, I did not expect them to be; the Illumos community only has so many person-hours available, and some of what we uncovered are hard problems in things like the kernel memory management.)

Our OmniOS fileservers were also harder for us to manage for an array of reasons that I mostly covered when I wrote about how our new fileservers wouldn't be based on Illumos, and in general there are costs we paid for not using a mainstream OS (costs that would be higher today). With that said, there are some things that I currently do miss about OmniOS, such as DTrace and our collection of DTrace scripts. Ubuntu may someday have an equivalent through eBPF tools, but Ubuntu 18.04 doesn't today.

In the final summary I don't regret us running our OmniOS servers when we did and for as long as we did, but on the whole I'm glad that we're not running them any more and I think our current fileserver architecture is better overall. I'm thankful for OmniOS's (and thus Illumos') faithful service here without missing it.

PS: Some of our OmniOS issues may have been caused by using iSCSI instead of directly attached disks, and certainly using directly attached disks would have made for smaller fileservers, but I suspect that we'd have found another set of problems with directly attached disks under OmniOS. And some of our problems, such as with filesystems that hit quota limits, are very likely to be independent of how disks were attached.

OmniOSFileserverRetrospective written at 22:14:24


Some additional information on ZFS performance as you approach quota limits

My first entry on this subject got some additional information from Allan Jude and others on Twitter, which I'm going to replicate here:

@alanjude: re: <my entry> - Basically, when you are close to the quota limit, ZFS will rate-limit incoming writes as it has to be sure you won't go over your quota. You end up having to wait for the pending transactions to flush to find out how much room you have left

I was turned on to the issue by @garrett_wollman who uses quotas at a large institution similar to yours. I expect you won't see the worst of it until you are within 100s of MB of the quota. So it isn't being over 95% or something, so much as being 'a few transactions' from full

@garrett_wollman: Turning off compression when the dataset gets near-full clears the backlog (obviously at a cost), as does increasing the quota if you have the free space for it.

@thatcks: Oh interesting! We have compression off on most of our datasets; does that significantly reduce the issue (although presumably not completely eliminate it)?

(Sadly we have people who (sometimes) run pools and filesystems that close to their quota limits.)

@garrett_wollman: I don't know; all I can say is that turning compression off on a wedged NFS server clears the backlog so requests for other datasets are able to be serviced.

All of this makes a bunch of sense, given the complexity of enforcing filesystem size limits, and it especially makes sense that compression might cause issues here; any sort of compression creates a very uncertain difference between the nominal size and the actual on-disk size, and ZFS quotas are applied to the physical space used, not the logical space.

(I took a quick look in the ZFS on Linux source code but I couldn't spot anything that was obviously different when there was a lot of quota room left.)

ZFSFullQuotaPerformanceIssueII written at 16:49:13


ZFS performance really does degrade as you approach quota limits

Every so often (currently monthly), there is an "OpenZFS leadership meeting". What this really means is 'lead developers from the various ZFS implementations get together to talk about things'. Announcements and meeting notes from these meetings get sent out to various mailing lists, including the ZFS on Linux ones. In the September meeting notes, I read a very interesting (to me) agenda item:

  • Relax quota semantics for improved performance (Allan Jude)
    • Problem: As you approach quotas, ZFS performance degrades.
    • Proposal: Can we have a property like quota-policy=strict or loose, where we can optionally allow ZFS to run over the quota as long as performance is not decreased.

(The video of the September meeting is here and the rolling agenda document is here; you want the 9/17 portion.)

This is very interesting to me because of two reasons. First, in the past we have definitely seen significant problems on our OmniOS machines, both when an entire pool hits a quota limit and when a single filesystem hits a refquota limit. It's nice to know that this wasn't just our imagination and that there is a real issue here. Even better, it might someday be improved (and perhaps in a way that we can use at least some of the time).

Second, any number of people here run very close to and sometimes at the quota limits of both filesystems and pools, fundamentally because people aren't willing to buy more space. We have in the past assumed that this was relatively harmless and would only make people run out of space. If this is a known issue that causes serious performance degradation, well, I don't know if there's anything we can do, but at least we're going to have to think about it and maybe push harder at people. The first step will have to be learning the details of what's going on at the ZFS level to cause the slowdown.

(It's apparently similar to what happens when the pool is almost full, but I don't know the specifics of that either.)

With that said, we don't seem to have seen clear adverse effects on our Linux fileservers, and they've definitely run into quota limits (repeatedly). One possible reason for this is that having lots of RAM and SSDs makes the effects mostly go away. Another possible reason is that we haven't been looking closely enough to see that we're experiencing global slowdowns that correlate to filesystems hitting quota limits. We've had issues before with somewhat subtle slowdowns that we didn't understand (cf), so I can't discount that we're having it happen again.

ZFSFullQuotaPerformanceIssue written at 00:32:45


ZFS is not a universal filesystem that is always good for all workloads

Every so often, people show up on various ZFS mailing lists with problems where ZFS is performing not just a bit worse than other filesystems or the raw disks, but a lot worse. Often although not always, these people are using raidz on hard disks and trying to do random IO, which doesn't work very well because of various ZFS decisions. When this happens, whatever their configuration and workload, the people who are trying out ZFS are surprised, and this surprise is reasonable. Most filesystems today are generally good and also generally have relatively flat performance characteristics, where you can't make them really bad unless you have very unusual and demanding workloads.

Unfortunately, ZFS is not like this today. For all that I like it a lot, I have to accept the reality that ZFS is not a universal filesystem that works fine in all reasonable configurations and under all reasonable workloads. ZFS usually works great for many real world workloads (ours included), but there are perfectly reasonable setups where it will fall down, especially if you're using hard drives instead of SSDs. Raidz is merely an unusually catastrophic case (and an unusually common one, partly because no one expects RAID-5/6 to have that kind of drawback).

(Many of the issues that cause ZFS problems are baked into its fundamental design, but as storage gets faster and faster their effects are likely to diminish a lot for most systems. There is a difference between 10,000 IOPs a second and 100,000, but it may not matter as much as a difference between 100 a second and 1,000. And not all of the issues are about performance; there is also, for example, that there's no great solution to shrinking a ZFS pool. In some environments that will matter a lot.)

People sometimes agonize about this and devote a lot of effort to pushing water uphill. It's a natural reaction, especially among fans of ZFS (which includes me), but I've come to think that it's better to quickly identify situations where ZFS is not a good fit and recommend that people move to another filesystem and storage system. Sometimes we can make ZFS fit better with some tuning, but I'm not convinced that even that is a good idea; tuning is often fragile, partly because it's often relatively specific to your current workload. Sometimes the advantages of ZFS are worth going through the hassle and risk of tuning things like ZFS's recordsize, but not always.

(Having to tune has all sorts of operational impacts, especially since some things can only be tuned on a per-filesystem or even per-pool basis.)

PS: The obvious question is what ZFS is and isn't good for, and that I don't have nice convenient answers for. I know some pain points, such as raidz on HDs with random IO and the lack of shrinking, and others you can spot by looking for 'you should tune ZFS if you're doing <X>' advice, but that's not a complete set. And of course some of the issues today are simply problems with current implementations and will get better over time. Anything involving memory usage is probably one of them, for obvious reasons.

ZFSNotUniversal written at 22:06:47


What happens in ZFS when you have 4K sector disks in an ashift=9 vdev

Suppose, not entirely hypothetically, that you've somehow wound up with some 4K 'advanced format' disks (disks with a 4 KByte physical sector size but 512 byte emulated (aka logical) sectors) in a ZFS pool (or vdev) that has an ashift of 9 and thus expects disks with a 512 byte sector size. If you import or otherwise bring up the pool, you get slightly different results depending on the ZFS implementation.

In ZFS on Linux, you'll get one ZFS Event Daemon (zed) event for each disk, with a class of vdev.bad_ashift. I don't believe this event carries any extra information about the mismatch; it's up to you to use the information on the specific disk and the vdev in the event to figure out who has what ashift values. In the current Illumos source, it looks like you get a somewhat more straightforward message, although I'm not sure how it trickles out to user level. At the kernel level it says:

Disk, '<whatever>', has a block alignment that is larger than the pool's alignment.

This error is not completely correct, since it's the vdev ashift that matters here, not the pool ashift, and it also doesn't tell you what the vdev ashift or the device ashift are; you're once again left to look those up yourself.

(I was going to say that the only likely case is a 4K advanced format disk in an ashift=9 vdev, but these days you might find some SSDs or NVMe drives that advertise a physical sector size larger than 4K.)

This is explicitly a warning, not an error. Both the ZFS on Linux and Illumos code have a comment to this effect (differing only in 'post an event' versus 'issue a warning'):

 * Detect if the alignment requirement has increased.
 * We don't want to make the pool unavailable, just
 * post an event instead.

This is a warning despite the fact that your disks can accept IO for 512-byte sectors because what ZFS cares about (for various reasons) is the physical sector size, not the logical one. A vdev with ashift=9 really wants to be used on disks with real 512-byte physical sectors, not on disks that just emulate them.
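
The comparison behind the warning can be sketched in Python. An ashift is the base-2 logarithm of a sector size, so ashift=9 is 512 bytes and ashift=12 is 4096 bytes. The function names here are hypothetical and this is not the actual ZFS logic, just the arithmetic as I understand it.

```python
def ashift_for(sector_bytes):
    """Return the ashift (log2) for a power-of-two sector size in bytes."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


def ashift_mismatch(vdev_ashift, physical_sector_bytes):
    """True if the disk wants a larger alignment than the vdev uses.

    Only the physical sector size is considered; the emulated 512-byte
    logical sector size plays no part in this check.
    """
    return ashift_for(physical_sector_bytes) > vdev_ashift


print(ashift_mismatch(9, 4096))   # 4K physical disk in an ashift=9 vdev
print(ashift_mismatch(12, 4096))  # the same disk in an ashift=12 vdev
```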

(In a world of SSDs and NVMe drives that have relatively opaque and complex internal sizes, this is rather less of an issue than it is (or was) with spinning rust. Your SSD is probably lying to you no matter what nominal physical sector size it advertises.)

The good news is that as far as I can tell, this warning has no further direct effect on pool operation. At least in ZFS on Linux, the actual disk's ashift is only looked up in one place, when the disk is opened as part of a vdev, and the general 'open a vdev' code discards it after this warning; it doesn't get saved anywhere for later use. So I believe that ZFS IO, space allocations, and even uberblock writes will continue as before.

(Interested parties can look at vdev_open in vdev.c. Disks are opened in vdev_disk.c.)

That ZFS continues operating after this warning doesn't mean that life is great, at least if you're using HDs. Since no ZFS behavior changes here, using disks with 4K physical sectors in an ashift=9 vdev will likely leave your disk (or disks) doing a lot of read/modify/write operations when ZFS does unaligned writes (as it can often do). This both performs relatively badly and leaves you potentially exposed to damage to unrelated data if there's a power loss part way through.
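
Here is a rough Python sketch of which writes force the disk into read/modify/write. It assumes that a write is only 'free' if it starts on a physical sector boundary and covers whole physical sectors; the function name is made up and real drive firmware may well be smarter than this.

```python
def needs_read_modify_write(offset, length, physical_sector=4096):
    """True if a write covers only part of some physical sector.

    Assumption: the disk can only write whole physical sectors, so a
    partially covered sector must be read, patched, and written back.
    """
    starts_unaligned = offset % physical_sector != 0
    ends_unaligned = (offset + length) % physical_sector != 0
    return starts_unaligned or ends_unaligned


# An ashift=9 vdev is free to issue writes aligned only to 512 bytes.
print(needs_read_modify_write(offset=512, length=512))
# A write that an ashift=12 vdev would make is always 4K aligned.
print(needs_read_modify_write(offset=8192, length=4096))
```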

(But, as before, it's a lot better than not being able to replace old dying disks with new working ones. You just don't want to wind up in this situation if you have a choice, which is a good part of why I advocate for creating basically all pools as 'ashift=12' from the start.)

PS: ZFS events are sort of documented in the zfs-events manpage, but the current description of vdev.bad_ashift is not really helpful. Also, I wish that the ZFS on Linux project itself had the current manpages online (well, apart from as manpage source in the Github repo, since most people find manpages in their raw form to be not easy to read).

ZFS4KDiskWithAshift9 written at 21:29:48


Some things on the GUID checksum in ZFS pool uberblocks

When I talked about how 'zpool import' generates its view of a pool's configuration, I mentioned that an additional kernel check of the pool configuration is that ZFS uberblocks have a simple 'checksum' of all of the GUIDs of the vdev tree. When the kernel is considering a pool configuration, it rejects it if the sum of the GUIDs in the vdev tree doesn't match the GUID sum from the uberblock.

(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
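
The check itself is simple enough to sketch in Python. This assumes the 'checksum' is a plain sum of 64-bit GUIDs that wraps around on overflow, and it sums every vdev in the tree in line with my reading of the code; the dictionary layout and the GUID values are invented for illustration and are not the real nvlist format.

```python
MASK64 = (1 << 64) - 1  # GUIDs are 64-bit, so the sum is assumed to wrap


def guid_sum(vdev):
    """Sum the GUIDs of a vdev and everything below it, modulo 2**64."""
    total = vdev["guid"]
    for child in vdev.get("children", []):
        total += guid_sum(child)
    return total & MASK64


def config_matches(vdev_tree, uberblock_guid_sum):
    """True if the vdev tree's GUID sum agrees with the uberblock's."""
    return guid_sum(vdev_tree) == uberblock_guid_sum


# A made-up pool: a root vdev with one mirror of two disks.
tree = {
    "guid": 0x1111,
    "children": [
        {"guid": 0x2222, "children": [{"guid": 0x3333}, {"guid": 0x4444}]},
    ],
}
stored = guid_sum(tree)
print(config_matches(tree, stored))

# Swapping in a disk with a different GUID changes the sum.
tree["children"][0]["children"][1]["guid"] = 0x5555
print(config_matches(tree, stored))
```

A sum like this can tell you that something in the tree changed but not what, which fits with it being a cheap sanity check rather than a real checksum.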

I was all set to write about how this interacts with the vdev configurations that are in ZFS labels, but as it turns out this is no longer applicable. In versions of ZFS that have better ZFS pool recovery, the vdev tree that's used is the one that's read from the pool's Meta Object Set (MOS), not the pool configuration that was passed in from user level by 'zpool import'. Any mismatch between the uberblock GUID sum and the vdev tree GUID sum likely indicates a serious consistency problem somewhere.

(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)

The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.

(The GUID sum changes if any rearrangement of the vdev tree happens. This includes replacing one disk with another, since each disk has a unique GUID. In case you're curious, the ZFS disk label always has the full tree for a top level vdev, including the special 'replacing' and 'spare' sub-vdevs that show up during these operations.)

PS: My guess from a not very extensive look through the kernel code is that it's very hard to tell from user level if you have a genuine uberblock GUID sum mismatch or another problem that returns the same extended error code to user level. The good news is that I think the only other case that returns VDEV_AUX_BAD_GUID_SUM is if you have missing log device(s).

ZFSUberblockGUIDSumNotes written at 22:51:41

How 'zpool import' generates its view of a pool's configuration

Full bore ZFS pool import happens in two stages, where 'zpool import' puts together a vdev configuration for the pool, passes it to the kernel, and then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does this is outlined at a high level by a comment in zutil_import.c; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device. There is an important limitation to this process, which is that the ZFS label only contains information on the vdev configuration, not on the overall pool configuration.

To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

   txg: 5059313
   pool_guid: 756813639445667425
   top_guid: 4603657949260704837
   guid: 13307730581331167197
   vdev_children: 5
       type: 'mirror'
       id: 3
       guid: 4603657949260704837
       is_log: 0
           type: 'disk'
           id: 0
           guid: 7328257775812323847
           path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
           type: 'disk'
           id: 1
           guid: 13307730581331167197
           path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'

(For many more details that are somewhat out of date, see the ZFS On-Disk Specifications [pdf].)

Based on this label, 'zpool import' knows what the GUID of this vdev is, which disk of the vdev it's dealing with and where the other disk or disks in it are supposed to be found, the pool's GUID, how many vdevs the pool has in total (it has 5) and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).
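As a rough illustration of how little a single label tells you, here is a small Python sketch that pulls those facts out of the label excerpt above. The summarize_label function and its result keys are made up for this example. The real 'zpool import' reads labels as nvlists through the ZFS libraries rather than parsing 'zdb -l' text, so this only makes the reasoning concrete.

```python
LABEL_TEXT = """\
   txg: 5059313
   pool_guid: 756813639445667425
   top_guid: 4603657949260704837
   guid: 13307730581331167197
   vdev_children: 5
       type: 'mirror'
       id: 3
       guid: 4603657949260704837
       is_log: 0
           type: 'disk'
           id: 0
           guid: 7328257775812323847
           path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
           type: 'disk'
           id: 1
           guid: 13307730581331167197
           path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'
"""


def summarize_label(text):
    """Pull out what one label says about its pool and its vdev."""
    first = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        # Keys repeat further down the tree; the first occurrence is the
        # outermost one, which is the one we want here.
        if sep and key not in first:
            first[key] = value.strip("'")
    return {
        "txg": int(first["txg"]),
        "pool_guid": int(first["pool_guid"]),
        "vdev_guid": int(first["top_guid"]),
        "device_guid": int(first["guid"]),
        "vdev_index": int(first["id"]),
        "vdev_count": int(first["vdev_children"]),
    }


info = summarize_label(LABEL_TEXT)
```

Here info says this is vdev index 3 out of 5 vdevs, but nothing in it describes vdevs 0, 1, 2, or 4.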

When zpool assembles the pool configuration, it will use the best information it has for each vdev, where the 'best' is taken to be the vdev label with the highest txg (transaction group number). The label with the highest txg for the entire pool is used to determine how many vdevs the pool is supposed to have. Note that there's no check that the best label for a particular vdev has a txg that is anywhere near the pool's (assumed) current txg. This means that if all of the modern devices for a particular vdev disappear and a very old device for it reappears, it's possible for zpool to assemble a (user-level) configuration that claims that the old device is that vdev (or the only component available for that vdev, which might be enough if the vdev is a mirror).

If zpool can't find any labels for a particular vdev, all it can do in the configuration is fill in an artificial 'there is a vdev missing' marker; it doesn't even know whether it was a raidz or a mirrored vdev, or how much data is on it. When 'zpool import' prints the resulting configuration, it doesn't explicitly show these missing vdevs; if I'm reading the code right, your only clue as to where they are is that the pool configuration will abruptly skip from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
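The assembly rules from the last two paragraphs can be sketched in Python as follows. This is my simplified model, not the code in zutil_import.c; the Label class, the MISSING marker, and the function names are all hypothetical, and the 'type-N' naming is only an approximation of what 'zpool import' prints.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical subset of one device's ZFS label."""
    txg: int
    vdev_id: int        # which top level vdev this device belongs to
    vdev_children: int  # how many top level vdevs the pool has
    vdev_type: str


MISSING = "missing"  # placeholder for a vdev that no label was found for


def assemble_config(labels):
    """Build a user-level list of top level vdevs from device labels."""
    if not labels:
        return []
    # The highest-txg label overall decides how many vdevs the pool has.
    newest = max(labels, key=lambda lb: lb.txg)
    # For each vdev, keep its highest-txg label. Nothing checks that this
    # txg is anywhere near the newest one, so a stale label can win.
    best = {}
    for lb in labels:
        cur = best.get(lb.vdev_id)
        if cur is None or lb.txg > cur.txg:
            best[lb.vdev_id] = lb
    return [best.get(i, MISSING) for i in range(newest.vdev_children)]


def display_names(config):
    """Name vdevs as 'type-N', silently skipping the missing ones."""
    return [f"{vd.vdev_type}-{i}"
            for i, vd in enumerate(config) if vd is not MISSING]


labels = [
    Label(txg=5059313, vdev_id=0, vdev_children=3, vdev_type="mirror"),
    Label(txg=5059313, vdev_id=2, vdev_children=3, vdev_type="mirror"),
    Label(txg=100, vdev_id=2, vdev_children=3, vdev_type="mirror"),
]
config = assemble_config(labels)
names = display_names(config)
```

In this example no label was found for vdev 1, so config has the missing marker in the middle and names jumps from 'mirror-0' to 'mirror-2'. If the only label supplied for some vdev were the txg 100 one, it would be used for that vdev without complaint.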

There's an additional requirement for a working pool configuration, although it's only checked by the kernel, not zpool. The pool uberblocks have a ub_guid_sum field, which must match the sum of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll get one of those frustrating 'a device is missing somewhere' errors on pool import. An entirely missing vdev naturally forces this to happen, since all of its GUIDs are unknown and obviously not contributing what they should be to this sum. I don't know how this interacts with better ZFS pool recovery.

ZFSZpoolImportAssembly written at 01:18:58
