Wandering Thoughts archives

2008-10-27

What we keep track of for ZFS pools

One of the attractions of ZFS pools is that they are pretty self-documenting. A ZFS pool automatically captures a great deal of the basic information that you need to deal with it, such as the filesystems it contains and NFS export settings. But this isn't quite all of the information that you need for long term management and for disaster recovery, and in our new fileserver environment we've opted to duplicate some of the information outside of ZFS as well.

(Keeping track of filesystems, export permissions, and so on is important in a SAN environment where you can expect to move filesystems between physical servers, for example for disaster recovery. Having a SAN that lets other machines get at the actual data doesn't help if the information needed to manage the data is locked up on a single machine.)

We have two primary additional sources of information on our ZFS pools. First, we have a master configuration file for all of our filesystems that lists their name, the pool they are part of, and any non-default options on them; for us this is quotas, reservations, and any special NFS export permissions. (We have used a syntax that lets us easily say 'add this netgroup to the exports', because this is our common case; writing 'rw+=cluster2' is much better than repeating the entire long share options with a minor change, even if it requires custom software.)
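
(As a purely hypothetical illustration of the idea, not our actual file format, entries in such a configuration file might look something like this, with made-up filesystem, pool, and netgroup names:

    # filesystem        pool        non-default options
    fs3/homes/h1        fs3-main    quota=400g
    fs3/data/bulk       fs3-extra   reservation=50g rw+=cluster2
    fs3/scratch         fs3-main    -

The point is that the common cases stay short and readable.)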

Second, we periodically harvest information about all pools on each fileserver and save it to a central location. The information includes:

  • pool size and pool quota, if any
  • the LUNs that the pool is using
  • the filesystems that it's currently configured with, and their actual quota and reservation settings.

(We also create a reverse index from LUNs to what they are being used for; among other uses, this is good for picking out unused LUNs to use when adding new pools or expanding existing ones.)
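
This is not our actual harvesting script, but a minimal sketch of the sort of commands it comes down to; treat the exact options and property names as things to check against your Solaris release:

    #!/bin/sh
    # Per-pool name, size, and usage.
    zpool list
    # Which LUNs (vdevs) each pool is built from, also used for the LUN index.
    zpool status
    # Per-filesystem quota and reservation settings.
    zfs list -H -o name,quota,reservation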

This pool information is partly for disaster recovery and partly to have all of the relatively current data about all pools and LUNs in one spot, instead of having to log in to a fileserver and poke around with sometimes obscure ZFS commands.

We don't directly save a mapping between ZFS pools and what virtual fileserver they belong to, because we have adopted a pool naming convention that puts the virtual fileserver in the pool name itself. Thus, in a disaster recovery situation all we have to do is scan the 'zpool import' list for pools with a particular prefix to know what pools belong where.
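
For illustration, with a made-up 'fs3-' pool name prefix, that scan is roughly the following (the exact 'zpool import' output format may vary between Solaris releases):

    # List importable pools and pick out those belonging to the fs3 logical fileserver.
    zpool import | awk '/^ *pool:/ && $2 ~ /^fs3-/ { print $2 }'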

(The fileserver information is indirectly contained in the pool information dump in normal circumstances, as each machine's data winds up in a separate file, and the filesystem configuration file does say what fileserver each filesystem is supposed to be on, so you can sort of reverse engineer it from there.)

TrackingZFSPools written at 02:15:50

2008-10-23

Another update to the ZFS excessive prefetching situation

This is a quick and, unfortunately, overdue update. The last time around, I wrote about how we had discovered the zfs_arc_min ZFS tuning parameter, which let us set a minimum ARC size so that the ARC would not shrink too much.

Here is an important update: don't do this.

Since we set a minimum ARC size, we have had mysterious but ultimately consistently reproducible system failures under load. The consistent way to kill our (then) lead NFS fileserver was that after at least a day and a half or so of uptime, doing a sequential write of enough data into a ZFS pool would either cause the system to be unable to fork() or, sometimes, lock it up entirely. (We did not get a crash dump; by the time I had worked out how to do this, we had identified the zfs_arc_min setting as the culprit.)

So apparently the ZFS ARC was shrinking and staying shrunk for a good reason after all, and stopping it from doing so can cause serious problems.

Our current decision is to run without ZFS tuning parameters at all, in the hopes that we won't see the excessive prefetching in real life. One reason we're willing to accept this risk is that ZFS prefetching can be turned off (and on) dynamically, so if we do run into the issue we can zap prefetching off until it goes away.

(If we start running into the issue routinely, we can even write a shell script to monitor the ARC size and enable or disable prefetching automatically.)
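
Should we need it, the dynamic toggle on Solaris 10 is the commonly circulated mdb incantation, and the ARC size can be watched through kstat; consider this a sketch to verify rather than a recipe:

    # Current ARC size in bytes.
    kstat -p zfs:0:arcstats:size
    # Turn ZFS prefetching off on the live kernel ...
    echo zfs_prefetch_disable/W0t1 | mdb -kw
    # ... and back on again later.
    echo zfs_prefetch_disable/W0t0 | mdb -kw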

ZFSOverPrefetchingUpdateII written at 00:17:12

2008-10-10

Some notes about iSCSI multipathing in Solaris

Our new fileservers have multiple network paths to the backend iSCSI storage, which means that we needed to set up some sort of multipathing. Theoretically, there are at least three different ways of doing this: bonding multiple network interfaces together, a native iSCSI feature for this called 'multiple connections per session' (MC/S), and high-level, cross-device multipathing. In practice, the complexity of getting network bonding working in a cross-vendor environment gave me hives and neither Solaris nor our Linux iSCSI targets support MC/S, so we're using MPxIO, Solaris's high-level multipathing system.

(My impression is that MC/S would be the best solution if both ends supported it for reasons that don't fit into this entry.)

MPxIO is a little peculiar, especially in its interactions with iSCSI; it seems to have really been built for a previous generation of technology and drivers, and not really adapted well to iSCSI. It works, but various things are a little bit awkward. The bits that spring particularly to mind are:

  • persuading MPxIO to actually multipath iSCSI things is a pain, as it requires you to stuff magic options into /kernel/drv/scsi_vhci.conf (see the sketch after this list). Fortunately our target software always advertises a constant vendor and product ID.

  • the mpathadm command is basically useless.

  • the combination of MPxIO and iSCSI leaves you with no way to explicitly control which network path is the active path. This makes it a good thing that MPxIO defaults to round-robin load balancing.

  • you have to remember to explicitly set the Solaris iSCSI initiator to make more than one connection per target.

  • under some circumstances, the Solaris iSCSI initiator will make two connections and multipathing will look perfectly happy, except that the two connections are over the same network path.
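
For what it's worth, the magic options are a device-type-scsi-options-list stanza in scsi_vhci.conf; the vendor and product strings below are placeholders for whatever your targets actually report, with the vendor field blank-padded to eight characters:

    # /kernel/drv/scsi_vhci.conf fragment: treat LUNs with these inquiry
    # strings as multipath-capable, symmetric (all paths equivalent) devices.
    device-type-scsi-options-list =
            "VENDOR  PRODUCT", "symmetric-option";
    symmetric-option = 0x1000000;

(Getting the eight-character vendor field padding right is part of what makes this a pain.)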

The last issue seems to happen if only one of the two networks is available when the iSCSI code is bringing up the target. One way to have this happen is with statically configured iSCSI targets (i.e. you have targetA,IP1 and targetA,IP2): the first target IP address added will get both connections and the second target IP address will see none. You can cure the situation by temporarily changing the initiator to only make one connection per target and then setting it back to the normal value, but this temporarily drops the second connection to all targets.
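
If I have the iscsiadm syntax right, that dance is roughly the following (treat the exact option letter as something to confirm against the iscsiadm manpage):

    # Temporarily drop to a single session per target ...
    iscsiadm modify initiator-node -c 1
    # ... then restore the normal setting so both network paths get used again.
    iscsiadm modify initiator-node -c 2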

(The only way I've found to notice this is to pay close attention to the IP addresses shown in 'iscsiadm list target -v' and spot the duplication.)

SolarisISCSIMultipathing written at 00:43:26

2008-10-08

How we set up our Solaris ZFS-based NFS fileservers

A SAN backend is fairly useless without a front end, so we have some of those too. First, a brief bit of background: our overall goal is to provide flexible NFS fileservice. The actual SAN front end servers do nothing except NFS service; all user work happens on other machines.

That said, the NFS fileservers are Solaris 10 x86 servers, specifically Solaris 10 Update 5. Hardware-wise, they are SunFire X2200s with 8 GB of RAM, although we may have to increase the amount of RAM later. They all have two system disks and use mirrored system partitions (through Solaris Volume Manager; even if S10U5 supported using ZFS for the root filesystem, I wouldn't trust it). They are mostly stock Solaris installs; we use Blastwave to get useful software like Postfix and tcpdump, and pca to manage patches (to the extent that we patch at all). Time synchronization is done by the Solaris NTP daemon, talking to our local set of NTP servers.

All data space is accessed through iSCSI from the iSCSI backends and managed through ZFS. Since the backends are exporting more or less raw disk space, we use ZFS to create our RAID redundancy by mirroring all storage chunks between two separate iSCSI target servers. This wastes half the disk space and may cost a certain amount of write performance, but it goes very fast on read IO (especially random read IO) and gives us significant redundancy, including against iSCSI target failures.

(Note that this setup makes it vital that all iSCSI LUNs are exactly the same size, which is one reason we carve the physical disks up into multiple LUNs.)

Rather than randomly pairing up LUNs on targets whenever we need a new mirrored ZFS vdev, we have a convention where two targets are always paired together (thus, each disk and LUN on target A will always be paired with the same disk and LUN on target B, and likewise for targets C and D). We use local software to avoid horrible accidents when managing ZFS pools, since we always want to create and grow ZFS pools with mirrored pairs instead of single disks.
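
Stripped of the safety checks, the underlying operations are just the ordinary zpool ones; here is a sketch with made-up pool and device names (real iSCSI device names are far longer):

    # Create a pool as a mirrored pair, one LUN from each of the paired targets.
    zpool create fs3-main mirror <targetA-lun0> <targetB-lun0>
    # Grow it later with another mirrored pair, never a bare disk.
    zpool add fs3-main mirror <targetA-lun1> <targetB-lun1>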

Although we are not doing failover, we have engineered the environment to support it in the future. All iSCSI storage is visible to all fileserver machines (we're not doing any sort of storage fencing), and we are using virtual names and IP aliases for the logical fileservers, so that we could move a logical fileserver to a different physical one if we needed to. We have adopted a ZFS pool naming convention that puts the logical fileserver name in the pool's name, so that in an emergency we can easily see which pools a given (logical) fileserver is supposed to have.

(We have actually tested by-hand failover without problems. It's not that difficult, just slow due to zpool import issues and potentially dangerous.)
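
For concreteness, by-hand failover is essentially just pool import on the new physical server; a sketch with a made-up pool name (the -f is needed because the pool will not have been cleanly exported by a dead fileserver, which is part of what makes this potentially dangerous):

    # See which pools are visible and importable over iSCSI.
    zpool import
    # Forcibly take over one of the dead logical fileserver's pools.
    zpool import -f fs3-main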

Because ZFS keeps track of things like filesystems, NFS exports, and so on in the ZFS pools themselves, the fileservers are effectively generic; there is no per-fileserver configuration necessary beyond their host names and IP addresses and associated data like ssh keys.

While the fileservers have all of our user accounts so that we can see file ownership properly and so on, users are not allowed to log in to them. System staff are, but we have remapped our home directories so that we have local, non-ZFS home directories on each (physical) fileserver instead of our normal real home directories. Fileservers do not NFS mount anything, because it's not necessary if users can't log in and it keeps them more independent.

(As you might guess, this implies that we do email delivery over NFS; so far this has not caused problems. The fileservers themselves run our standard 'null client' Postfix mail configuration that just forwards all locally generated email to our central mail submission point, so that they can send us administrative email and so on.)

ZFSFileserverSetup written at 01:07:02

