Wandering Thoughts archives

2017-12-30

Some details of ZFS DVAs and what some of their fields store

One piece of ZFS terminology is DVA (plural DVAs), which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. DVAs are generally embedded into 'block pointers', and you can find a big comment laying out the entire structure of all of this in spa.h. The two fields of a DVA that I'm interested in today are the vdev and the offset.

(The other three fields are a reserved field called GRID, a bit to say whether the DVA is for a gang block, and asize, the allocated size of the block on its vdev. The allocated size has to be a per-DVA field for various reasons. The logical size of the block and its physical size after various sorts of compression are not DVA or vdev dependent, so they're part of the overall block pointer.)
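
To make the five fields concrete, here is a minimal Python sketch that unpacks them from the two 64-bit words of a DVA. The bit positions and widths are my reading of the layout comment in spa.h and should be treated as assumptions to check against your own copy of the header (in particular, the width of the vdev field has not been the same in every version). The names `RawDVA`, `bits` and `unpack_dva` are made up for this illustration, and the offset and asize are left as the raw stored values without any unit conversion.

```python
from typing import NamedTuple

# Assumed layout (check against the comment in your spa.h):
#   word 0: bits 0-23 asize, bits 24-31 GRID, bits 32-55 vdev, bits 56-63 padding
#   word 1: bits 0-62 offset, bit 63 gang flag


class RawDVA(NamedTuple):
    vdev: int        # index of the vdev in the pool, counting from zero
    grid: int        # reserved GRID field
    asize_raw: int   # allocated size, as stored (no unit conversion applied)
    gang: bool       # True if this DVA is for a gang block
    offset_raw: int  # offset into the vdev, as stored (no unit conversion applied)


def bits(word: int, low: int, length: int) -> int:
    """Extract `length` bits of `word`, starting at bit position `low`."""
    return (word >> low) & ((1 << length) - 1)


def unpack_dva(word0: int, word1: int) -> RawDVA:
    """Split the two 64-bit DVA words into their (assumed) fields."""
    return RawDVA(
        vdev=bits(word0, 32, 24),
        grid=bits(word0, 24, 8),
        asize_raw=bits(word0, 0, 24),
        gang=bool(bits(word1, 63, 1)),
        offset_raw=bits(word1, 0, 63),
    )
```

For example, `unpack_dva((2 << 32) | 0x10, 0x1234)` describes a non-gang block on the third vdev of the pool (index 2).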

The vdev field of a DVA is straightforward; it is the index of the vdev that the block is on, starting from zero for the first vdev and counting up. Note that this is not the GUID of the vdev involved, which is what you might sort of expect given a comment that calls it the 'virtual device ID'. Using the index means that ZFS can never shuffle the order of vdevs inside a pool, since these indexes are burned into DVAs stored on disk (as far as I know, and this matches what zdb prints).
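
As an illustration of the zdb side of this, here is a small Python sketch that pulls the vdev index out of a DVA as zdb prints it. It assumes the usual `DVA[n]=<vdev:offset:asize>` form, with the vdev index in decimal and the offset and asize in hexadecimal; the sample string in the usage note is invented, not real zdb output, and `parse_zdb_dva` is a hypothetical helper name.

```python
import re

# Matches the <vdev:offset:asize> part of a zdb DVA. Assumption: the vdev
# index is printed in decimal and the other two fields in hexadecimal.
DVA_RE = re.compile(r"<(\d+):([0-9a-fA-F]+):([0-9a-fA-F]+)>")


def parse_zdb_dva(text: str) -> tuple[int, int, int]:
    """Return (vdev index, offset, asize) from the first DVA found in text."""
    m = DVA_RE.search(text)
    if m is None:
        raise ValueError(f"no DVA found in {text!r}")
    vdev = int(m.group(1))         # small index such as 0 or 1, not a GUID
    offset = int(m.group(2), 16)
    asize = int(m.group(3), 16)
    return vdev, offset, asize
```

With an invented line such as `DVA[0]=<1:1e1a2a00:2000>`, this returns a vdev index of 1, meaning the second vdev in the pool.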

The offset field tells you where to find the start of the block on the vdev in question. Because this is an offset into the vdev, not a device, different sorts of vdevs have different ways of translating this into specific disk addresses. Specifically, RAID-Z vdevs must generally translate a single incoming IO at a single offset to the offsets on multiple underlying disk devices for multiple IOs.
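
To give a feel for what this translation involves, here is a heavily simplified Python sketch of how a single vdev offset can fan out across the disks of a RAID-Z vdev. It follows my understanding of the general scheme (divide the vdev offset into ashift-sized units, then stripe those units across the child disks), but it is not the kernel code: it ignores which columns hold parity, skipped sectors, and everything else that the real mapping has to deal with. The function names and parameters are invented for this illustration.

```python
def raidz_start(offset: int, ashift: int, ndisks: int) -> tuple[int, int]:
    """Map a vdev byte offset to (first child disk index, byte offset on it)."""
    b = offset >> ashift                    # vdev offset in units of 1 << ashift
    first_child = b % ndisks                # which child disk the IO starts on
    child_offset = (b // ndisks) << ashift  # byte offset within that child
    return first_child, child_offset


def raidz_column_starts(offset: int, ncols: int, ashift: int,
                        ndisks: int) -> list[tuple[int, int]]:
    """List (child index, child byte offset) for each column of one IO.

    Assumes ncols <= ndisks. Columns that wrap past the last disk continue
    on disk 0, one ashift-sized unit further in.
    """
    if ncols > ndisks:
        raise ValueError("an IO cannot have more columns than child disks")
    first_child, child_offset = raidz_start(offset, ashift, ndisks)
    starts = []
    for c in range(ncols):
        child = first_child + c
        coff = child_offset
        if child >= ndisks:
            child -= ndisks
            coff += 1 << ashift
        starts.append((child, coff))
    return starts
```

The point is only that one (vdev, offset) pair turns into several (disk, offset) pairs, and that a mirror or a plain disk vdev would use an entirely different (and much simpler) mapping for the same DVA offset.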

At this point we arrive at an interesting question, namely what units the offset is in (since there are a bunch of possible options). As far as I can tell from looking at the ZFS kernel source code, the answer is that the DVA offset is in bytes. Some sources say that it's in 512-byte sectors, but as far as I can tell this is not correct (and it's certainly not in larger units, such as the vdev's ashift).

(This doesn't restrict the size of vdevs in any important way, since the offset is a 63-bit field.)

Update, August 31st 2022: On disk, ZFS DVA offsets are actually stored as (512-byte) blocks, but code apparently mostly deals with them as byte offsets. See ZFSDVAOffsetsInBytesII for more details.
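
A small Python sketch of the relationship described in the update, assuming the stored value is simply the byte offset divided by 512 (a shift of 9 bits) and that the stored field is 63 bits wide. The helper names are made up for this illustration.

```python
SECTOR_SHIFT = 9         # assumed: stored offsets count 512-byte units
OFFSET_FIELD_BITS = 63   # width of the DVA offset field


def stored_to_bytes(stored: int) -> int:
    """Convert an on-disk DVA offset into the byte offset that code works with."""
    return stored << SECTOR_SHIFT


def bytes_to_stored(byte_offset: int) -> int:
    """Convert a byte offset to its on-disk form; it must be 512-byte aligned."""
    if byte_offset % (1 << SECTOR_SHIFT) != 0:
        raise ValueError("byte offset is not a multiple of 512")
    return byte_offset >> SECTOR_SHIFT


# Largest value the stored field can hold, and the byte offset it maps to.
MAX_STORED = (1 << OFFSET_FIELD_BITS) - 1
MAX_BYTE_OFFSET = stored_to_bytes(MAX_STORED)
```

Under these assumptions the largest representable byte offset is just short of 2^72, so either way the field width is not a practical limit on vdev size.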

One potentially important consequence of this is that DVA offsets are independent of the sector size of the underlying disks in vdevs. Provided that your vdev ashift is large enough, it doesn't matter if you use disks with 512-byte logical sectors or the generally rarer disks with real 4k sectors (both physical and logical), and you can replace one with the other. Well, in theory, as there may be other bits of ZFS that choke on this (I don't know if ZFS's disk labels care, for example). But DVAs won't, which means that almost everything in the pool (metadata and data both) should be fine.
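
The reasoning here can be sketched in a few lines of Python. The assumption is that every block in a vdev starts at a byte offset that is a multiple of 2^ashift (ashift being the vdev's allocation shift), and that a disk can only do IO at multiples of its logical sector size. Both function names are invented for this illustration, and this says nothing about labels or other parts of ZFS.

```python
def offset_ok_for_disk(byte_offset: int, logical_sector_size: int) -> bool:
    """A disk can only address IO that starts on a logical sector boundary."""
    return byte_offset % logical_sector_size == 0


def disk_fits_ashift(ashift: int, logical_sector_size: int) -> bool:
    """True if every 2**ashift aligned offset is addressable on this disk.

    Both values are powers of two, so this holds exactly when the disk's
    logical sector size is no larger than 2**ashift.
    """
    return logical_sector_size <= (1 << ashift)
```

With ashift=12, a byte offset such as 3 * 4096 is usable on both 512-byte and 4k logical sector disks, which is why one can stand in for the other. With ashift=9, an offset like 512 is fine on a 512-byte sector disk but cannot be addressed on a 4k one.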

PS: There are additional complications for ZFS gang blocks and so on, but I'm omitting that in the interests of keeping this manageable.

ZFSDVAOffsetVdevDetails written at 01:49:19

2017-12-23

Our next generation of fileservers will not be based on Illumos

Our current generation of ZFS NFS fileservers are based on OmniOS. We've slowly been working on the design of our next generation for the past few months, and one of the decisions we've made is that unless something really unusual happens, we won't be using any form of Illumos as the base operating system. While we're going to continue using ZFS, we'll be basing our fileservers on either ZFS on Linux or FreeBSD (preferably ZoL, because we already run lots of Linux machines and we don't have any FreeBSD ones).

This is not directly because of uncertainties around OmniOS CE's future (or the then lack of an LTS release that I wrote about here, because it now has one). There is really no single cause that could change our minds if it was fixed or changed; instead there are multiple contributing factors. Ultimately we made our decision because we are not in love with OmniOS and we no longer think we need to run it in order to get what we really want, which is ZFS with solid NFS fileservice.

However, I feel I need to mention some major contributing factors. The largest single factor is our continued lack of confidence in Illumos's support for Intel 10G-T chipsets. As far as I can tell from the master Illumos source, nothing substantial has changed here since back in 2014, and certainly I don't consider it a good sign that the ixgbe driver still does kernel busy-waits for milliseconds at a time. We consider 10G-T absolutely essential for our next generation of fileservers and we don't want to take chances.

(If you want to see how those busy-waits happen, look at the definition of msec_delay in ixgbe_osdep.h. drv_usecwait is specifically defined to busy-wait; it's designed to be used for microsecond durations, not millisecond ones.)

Another significant contributing factor is our frustrations with OmniOS's KYSTY minimalism, which makes dealing with our OmniOS machines more painful than dealing with our Linux ones (even the Linux ones that aren't Ubuntu based). And yes, having differently named commands does matter. It's possible that another Illumos based distribution could do better here, but I don't think there's a better one for our needs and it would still leave us with our broad issues with Illumos.

It's undeniable that we have more confidence in Linux on the whole than we do in Illumos. Linux is far more widely and heavily used, generally supports more hardware (and does so more promptly), and we've already seen that Intel 10G-T cards work fine in it (we have them in a number of our existing Linux machines, where they run great). Basically the only risk area is ZFS on Linux, and we have FreeBSD as a fallback.

There are some aspects of OmniOS that I will definitely miss, most notably DTrace. Modern Linux may have more or less functional equivalents, but I don't think there's anything that's half as usable. However, I have no sentimental attachments to Solaris or Illumos; I don't hate it, but I won't miss it on the whole and an all-Linux environment will make my life simpler.

(This decision is only partly related to our decision not to use a SAN in the next generation of fileservers. While we could probably use OmniOS with the local disk setup that we want, not having to worry about Illumos's hardware support for various controller hardware does make our lives simpler.)

IllumosNoFutureHere written at 00:11:10

