2008-10-27
What we keep track of for ZFS pools
One of the attractions of ZFS pools is that they are pretty self-documenting. A ZFS pool automatically captures a great deal of the basic information that you need to deal with it, such as the filesystems it contains and NFS export settings. But this isn't quite all of the information that you need for long term management and for disaster recovery, and in our new fileserver environment we've opted to duplicate some of the information outside of ZFS as well.
(Keeping track of filesystems, export permissions, and so on is important in a SAN environment where you can expect to move filesystems between physical servers, for example for disaster recovery. Having a SAN that lets other machines get at the actual data doesn't help if the information needed to manage the data is locked up on a single machine.)
We have two primary additional sources of information on our ZFS pools.
First, we have a master configuration file for all of our filesystems
that lists their name, the pool they are part of, and any non-default
options on them; for us this is quotas, reservations, and any special
NFS export permissions. (We have used a syntax that lets us easily say
'add this netgroup to the exports', because this is our common case;
writing 'rw+=cluster2' is much better than repeating the entire long
share options with a minor change, even if it requires custom software.)
Second, we periodically harvest information about all pools on each fileserver and save it to a central location. The information includes:
- pool size and pool quota, if any
- the LUNs that the pool is using
- the filesystems that it's currently configured with, and their actual quota and reservation settings.
(We also create a reverse index from LUNs to what they are being used for; among other uses, this is good for picking out unused LUNs to use when adding new pools or expanding existing ones.)
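Most of this falls out of standard ZFS commands, so as a rough sketch our harvesting boils down to something like the following (the exact zpool list columns vary between ZFS versions, so treat this as illustrative):

    #!/bin/sh
    # overall pool sizes and usage
    zpool list -H -o name,size,used,avail
    # which LUNs (vdevs) each pool is built from
    zpool status
    # every filesystem plus its quota and reservation settings
    zfs list -H -t filesystem -o name,quota,reservation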
This pool information is partly for disaster recovery and partly to have all of the relatively current data about all pools and LUNs in one spot, instead of having to log in to a fileserver and poke around with sometimes obscure ZFS commands.
We don't directly save a mapping between ZFS pools and what virtual
fileserver they belong to, because we have adopted a pool naming
convention that puts the virtual fileserver in the pool name itself.
Thus, in a disaster recovery situation all we have to do is scan the
'zpool import' list for pools with a particular prefix to know what
pools belong where.
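As a sketch, with a made-up virtual fileserver name of 'fs3', finding its pools is nothing more than:

    # list pools that are visible but not yet imported, picking out fs3's
    zpool import | grep 'pool: fs3-'

followed by 'zpool import' of each pool that turns up.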
(The fileserver information is indirectly contained in the pool information dump in normal circumstances, as each machine's data winds up in a separate file, and the filesystem configuration file does say what fileserver each filesystem is supposed to be on, so you can sort of reverse engineer it from there.)
2008-10-23
Another update to the ZFS excessive prefetching situation
This is a quick and, unfortunately, overdue update. The last time
around, I wrote about how we had discovered
the zfs_arc_min ZFS tuning parameter, which let us set a minimum
ARC size so that it would not shrink too much.
Here is an important update: don't do this.
Since we set a minimum ARC size, we have had mysterious but ultimately
consistently reproducible system failures under load. The consistent
way to kill our (then) lead NFS fileserver was that after at least a day
and a half or so of uptime, doing a sequential write of enough data into
a ZFS pool would either cause the system to be unable to fork() or,
sometimes, lock it up entirely. (We did not get a crash dump; by the time
I had worked out how to do this, we had identified the zfs_arc_min
setting as the culprit.)
So apparently the ZFS ARC was shrinking and staying shrunk for a good reason after all, and stopping it from doing so can cause serious problems.
Our current decision is to run without ZFS tuning parameters at all, in the hopes that we won't see the excessive prefetching in real life. One reason we're willing to accept this risk is that ZFS prefetching can be turned off (and on) dynamically, so if we do run into the issue we can zap prefetching off until it goes away.
(If we start running into the issue routinely, we can even write a shell script to monitor the ARC size and enable or disable prefetching automatically.)
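For what it's worth, a rough sketch of what such a script might look like; the 1 GB threshold is a number I just made up, and this assumes the usual Solaris 10 mechanisms of reading the ARC size from kstat and poking zfs_prefetch_disable with 'mdb -kw' to toggle prefetching:

    #!/bin/sh
    # illustrative only: if the ARC has shrunk below the threshold, turn
    # prefetching off; once it has recovered, turn prefetching back on.
    threshold=1073741824                # 1 GB, a made-up number
    while sleep 60; do
        arcsize=`kstat -p zfs:0:arcstats:size | awk '{print $2}'`
        if [ "$arcsize" -lt "$threshold" ]; then
            echo zfs_prefetch_disable/W0t1 | mdb -kw
        else
            echo zfs_prefetch_disable/W0t0 | mdb -kw
        fi
    done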
2008-10-10
Some notes about iSCSI multipathing in Solaris
Our new fileservers have multiple network paths to the backend iSCSI storage, which means that we needed to set up some sort of multipathing. Theoretically, there are at least three different ways of doing this: bonding multiple network interfaces together, a native iSCSI feature for this called 'multiple connections per session' (MC/S), and high level, cross device multipathing. In practice, the complexity of getting network bonding working in a cross-vendor environment gave me hives, and neither Solaris nor our Linux iSCSI targets support MC/S, so we're using MPxIO, Solaris's high level multipathing system.
(My impression is that MC/S would be the best solution if both ends supported it for reasons that don't fit into this entry.)
MPxIO is a little peculiar, especially in its interactions with iSCSI; it seems to have really been built for a previous generation of technology and drivers, and not really adapted well to iSCSI. It works, but various things are a little bit awkward. The bits that spring particularly to mind are:
- persuading MPxIO to actually multipath iSCSI things is a pain,
  as it requires you to stuff magic options into
  /kernel/drv/scsi_vhci.conf. Fortunately our target software always
  advertises a constant vendor and product ID. (There's a sketch of
  this near the end of this entry.)
- the mpathadm command is basically useless.
- the combination of MPxIO and iSCSI leaves you with no way to
  explicitly control which network path is the active path.
  This makes it a good thing that MPxIO defaults to round-robin
  load balancing.
- you have to remember to explicitly set the Solaris iSCSI initiator
  to make more than one connection per target.
- under some circumstances, the Solaris iSCSI initiator will make two
  connections and multipathing will look perfectly happy, except that
  the two connections are over the same network path.
The last issue seems to happen if only one of the two networks is available when the iSCSI code is bringing up the target. One way to have this happen is that if you add statically configured iSCSI targets (ie you have targetA,IP1 and targetA,IP2), the first target IP address added will get both connections and the second target IP address will see none. You can cure the situation by temporarily changing the initiator to only make one connection per target and then setting it back to the normal value, but this temporarily drops the second connection to all targets.
(The only way I've found to notice this is to pay close attention to
the IP addresses shown in 'iscsiadm list target -v' and spot the
duplication.)
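For the record, here is a sketch of the two bits of configuration involved in the first and fourth issues above. The vendor and product strings for scsi_vhci.conf are placeholders (they have to exactly match what your targets report, space-padded to the right widths), and I believe the iscsiadm option for multiple connections per target is -c (aka --configured-sessions), but check the man page on your system:

    # /kernel/drv/scsi_vhci.conf: make MPxIO treat our targets' disks as
    # symmetric (active/active) devices so that it will multipath them.
    # "IET     VIRTUAL-DISK" is a placeholder vendor/product pair.
    device-type-scsi-options-list =
        "IET     VIRTUAL-DISK", "symmetric-option";
    symmetric-option = 0x1000000;

    # tell the Solaris iSCSI initiator to make two connections per target:
    iscsiadm modify initiator-node -c 2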
2008-10-08
How we set up our Solaris ZFS-based NFS fileservers
A SAN backend is fairly useless without a front end, so we have some of those too. First, a brief bit of background: our overall goal is to provide flexible NFS fileservice. The actual SAN front end servers do nothing except NFS service; all user work happens on other machines.
That said, the NFS fileservers are Solaris 10 x86 servers, specifically
Solaris 10 Update 5. Hardware-wise, they are SunFire X2200s with 8 GB
of RAM, although we may have to increase the amount of RAM later. They
all have two system disks and use mirrored system partitions (through
Solaris Volume Manager; even if S10U5 supported using ZFS for the
root filesystem, I wouldn't trust it). They are mostly stock Solaris
installs; we use Blastwave to get useful software
like Postfix and tcpdump, and pca to manage
patches (to the extent that we patch at all). Time synchronization
is done by the Solaris NTP daemon, talking to our local set of NTP
servers.
All data space is accessed through iSCSI from the iSCSI backends and managed through ZFS. Since the backends are exporting more or less raw disk space, we use ZFS to create our RAID redundancy by mirroring all storage chunks between two separate iSCSI target servers. This wastes half the disk space and may cost a certain amount of write performance, but it goes very fast on read IO (especially random read IO) and gives us significant redundancy, including against iSCSI target failures.
(Note that this setup makes it vital that all iSCSI LUNs are exactly the same size, which is one reason we carve the physical disks up into multiple LUNs.)
Rather than randomly pair up LUNs on targets whenever we need a new mirrored ZFS vdev, we have a convention where two targets are always paired together (thus, each disk and LUN on target A will always be paired with the same disk and LUN on target B, and likewise for targets C and D). We use local software to avoid horrible accidents when managing ZFS pools, since we always want to create and grow ZFS pools with mirrored pairs instead of single disks.
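As a sketch of what this looks like at the ZFS level, with a made-up pool name and device names (real Solaris iSCSI device names are far longer, since they embed the target's IQN):

    # create a pool for virtual fileserver fs3 from a LUN on target A
    # mirrored against the matching LUN on target B:
    zpool create fs3-demo-01 mirror c4t1d0 c5t1d0
    # growing the pool later always adds another mirrored pair:
    zpool add fs3-demo-01 mirror c4t2d0 c5t2d0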
Although we are not doing failover, we have engineered the environment to support it in the future. All iSCSI storage is visible to all fileserver machines (we're not doing any sort of storage fencing), and we are using virtual names and IP aliases for the logical fileservers, so that we could move a logical fileserver to a different physical one if we needed to. We have adopted a ZFS pool naming convention that puts the logical fileserver name in the pool's name, so that in an emergency we can easily see which pools a given (logical) fileserver is supposed to have.
(We have actually tested by-hand failover without problems. It's not
that difficult, just slow due to zpool import issues and potentially
dangerous.)
Because ZFS keeps track of things like filesystems, NFS exports, and so on in the ZFS pools themselves, the fileservers are effectively generic; there is no per-fileserver configuration necessary beyond their host names and IP addresses and associated data like ssh keys.
While the fileservers have all of our user accounts so that we can see file ownership properly and so on, users are not allowed to log in to them. System staff are, but we have remapped our home directories so that we have local, non-ZFS home directories on each (physical) fileserver instead of our normal real home directories. Fileservers do not NFS mount anything, because it's not necessary if users can't log in and it keeps them more independent.
(As you might guess, this implies that we do email delivery over NFS; so far this has not caused problems. The fileservers themselves run our standard 'null client' Postfix mail configuration that just forwards all locally generated email to our central mail submission point, so that they can send us administrative email and so on.)