How ZFS on Linux names disks in ZFS pools

August 18, 2017

Yesterday I covered how on Illumos and Solaris, disks in ZFS pools have three names: the filesystem path, the 'physical path' (a PCI device name, similar to the information that lspci gives), and a 'devid' that encodes the vendor, model name, and serial number of the disk. While these are Solaris concepts, Linux has similar things, and you could at least mock up equivalents of them in the kernel.

ZFS on Linux doesn't try to do this. Instead of having three names, it has only one:

# zdb -C vmware2
MOS Configuration:
[...]
  children[0]:
    type: 'disk'
    id: 0
    guid: 8206543908042244108
    path: '/dev/disk/by-id/ata-ST500DM002-1BC142_Z2AA6A4E-part1'
    whole_disk: 0
[...]

ZoL stores only the filesystem path to the device, using whatever path you told it to use. To get the equivalent of Solaris devids and physical paths, you need to use the right sort of filesystem path. Solaris devids roughly map to /dev/disk/by-id names and physical paths map to /dev/disk/by-path names (and there isn't really an equivalent of Solaris /dev/dsk names, which are more stable than Linux /dev/sd* names).
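
(As an aside, you can see what stable names udev provides for your disks just by listing the relevant directories. A hedged illustration, with the symlink listings abbreviated and the exact names partly made up; they depend entirely on your hardware:

# ls -l /dev/disk/by-id/
[...] ata-ST500DM002-1BC142_Z2AA6A4E -> ../../sda
# ls -l /dev/disk/by-path/
[...] pci-0000:00:1f.2-ata-1 -> ../../sda

Whichever of these names you hand to ZoL is the one it will remember.)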

The comment in vdev_disk_open in vdev_disk.c discusses this in some detail, and it's worth repeating in full:

Devices are always opened by the path provided at configuration time. This means that if the provided path is a udev by-id path then drives may be recabled without an issue. If the provided path is a udev by-path path, then the physical location information will be preserved. This can be critical for more complicated configurations where drives are located in specific physical locations to maximize the systems tolerance to component failure. Alternatively, you can provide your own udev rule to flexibly map the drives as you see fit. It is not advised that you use the /dev/[hd]d devices which may be reordered due to probing order. Devices in the wrong locations will be detected by the higher level vdev validation.

(It's a shame that this information exists only as a comment in a source file that most people will never look at. It should probably be in large type in the ZFS on Linux zpool manpage.)
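
To make this concrete, here's a hedged sketch of creating a mirrored pool with by-id names; the pool name here is made up, and while the first serial number is from my zdb output above, the second one is invented for illustration:

# zpool create tank mirror \
    /dev/disk/by-id/ata-ST500DM002-1BC142_Z2AA6A4E \
    /dev/disk/by-id/ata-ST500DM002-1BC142_Z2AA7B5F

Because these are the paths provided at configuration time, they're what the pool's configuration will record and what ZoL will open the disks by from then on.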

This means that with ZFS on Linux, you get only one try for the disk to be there; there's no fallback the way there is on Illumos for ordinary disks. If you've pulled an old disk and put in a new one and you use by-id names, ZoL will see the old disk as completely missing. If you use by-path names and you move a disk around, ZoL will not wind up finding the disk in its new location the way ZFS on Illumos probably would.

(The net effect of this is that with ZFS on Linux you should normally see a lot more 'missing device' errors and a lot fewer 'corrupt or missing disk label' errors than you would in the same circumstances on Illumos or Solaris.)

At this point, you might wonder how you change what sort of name ZFS on Linux is using for disks in your pool(s). Although I haven't done this myself, my understanding is that you export the pool and then import it again using the -d option to zpool import. With -d, the import process will find the disks for the pool using the type of names you want, and actually importing the pool will then rewrite the saved path data in the pool's configuration (and /etc/zfs/zpool.cache) to use these new names as a side effect.
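
A minimal sketch of this, assuming a pool called 'tank' that you want to switch over to by-path names (again, I haven't tested this myself, so treat it as a sketch rather than a verified recipe):

# zpool export tank
# zpool import -d /dev/disk/by-path tank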

(I'm not entirely sure how I feel about this with ZFS on Linux. I think I can see some relatively obscure failure modes where no form of disk naming works as well as things do in Illumos. On the other hand, in practice using /dev/disk/by-id names is probably at least as good an experience as Illumos provides, and the disk names are always clear and explicit. What you see is what you get, somewhat unlike Illumos.)


Comments on this page:

By sysAdmin&Cat Herder at JCVI at 2020-10-23 16:47:23:

Thanks for the informative blog. We also run ZFS on Linux at my work, and I've found a few challenges that are either poorly documented or barely documented at all.

How do you deal with multipath disk names? Multipathd creates its own alias for the two paths that are present; for instance, mpatha -> sdc and sdav. This is further complicated by the official documentation, which has instructions on creating /dev/disk/by-vdev names.
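
For reference, those by-vdev names come from /etc/zfs/vdev_id.conf. A rough sketch of what I mean, with the by-vdev names and WWNs made up:

multipath yes
#     by-vdev name   device link it maps to
alias A0             /dev/disk/by-id/wwn-0x5000c5002de3b9ca
alias A1             /dev/disk/by-id/wwn-0x5000c5002def789e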

Are there any specific tools/scripts you use to light up the LEDs in the JBOD to replace disks? I believe zed scripts are supposed to be able to handle this, but I've not been able to configure it properly in a multipathed system and have resorted to a messy bash script.

By cks at 2020-10-23 20:10:40:

Our ZFS on Linux based fileservers don't have any multipath devices, so we haven't had to deal with that particular set of problems (thankfully). We refer to disks by their /dev/disk/by-path identifier in our Linux pool setups, partly because that maps statically to a particular chassis slot on our hardware. We don't try to light up drive LEDs with special tools, although sometimes we will verify that we have the right inactive disk by observing that its activity LED is solidly off, then starting a 'dd' read and observing that its activity LED is now solidly on (there's a sketch of this below).

(This only works if the disk we want to replace is still alive enough to respond. But totally dead disks are generally pretty visible if you watch for a while, because normally all of our disks have regular activity. A single completely inactive disk is very suggestive.)
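
A rough sketch of that check; the by-path name here is made up, and reading from the disk should turn its activity LED solidly on:

# dd if=/dev/disk/by-path/pci-0000:03:00.0-sas-phy4-lun-0 of=/dev/null bs=1M

(You interrupt the dd with Ctrl-C once you've spotted the disk.)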

By sysAdmin&Cat Herder at JCVI at 2020-10-26 12:24:37:

Thanks for the prompt reply. While there is multipathing documentation for zfs, I've found documentation of actual sample deployments or deployments in the wild to be rare. When you're dealing with multiple JBODs with 60 or 90 drives, identifying drives becomes a pain point. I'm wondering if multipathing is worth the effort.

If you are not using multipathing I'd recommend https://github.com/damicon/zfswatcher It provides a nice simple webUI to light up drives.

Lastly, I'd be curious to see some of your zed scripts if you are willing to share them in a blog/github

VP
