Wandering Thoughts archives

2007-11-25

Why we may not be able to use ZFS

We would like to use ZFS in a new fileserver design, but we may not be able to. This is because ZFS in current versions of Solaris has a problem in production environments, namely: if ZFS ever can't write to any pool, it panics your entire machine. It doesn't matter how many other pools are fine, and it doesn't matter what else is running on the machine; down it goes.

(It also doesn't matter if the pool is redundant or not; ZFS behaves the same if enough of a redundant pool is out of action.)

ZFS is unable to write to a pool if:

  • there is a write timeout, for example if an AoE target is unreachable for too long.
  • it loses a device, for example if the Solaris initiator temporarily can't talk to an iSCSI target or the target temporarily rejects a login (which causes all of the disks to be offlined).
  • it thinks that a device has been replaced by a different device, for example if a SAN logical disk's serial number changes for some reason.
  • the device simply rejects writes.

(An unwriteable device is real fun, because ZFS attempts to write to the pool when it brings the pool up; if the pool is one the machine tries to start when it boots, your machine will sit there in an endless loop of boot-time panics.)

In a SAN environment, this can happen whether you export raw disks and have ZFS do the RAID on top of them or export logical disks and have the SAN controller handle the RAID. You can design a raw disk SAN environment where any fileserver can survive losing any one RAID controller, but it is awfully interconnected.

(And if you are willing to lose half your disk space, you can mirror everything across controllers. We have been told by our users that they are not willing to pay for that level of redundancy.)
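
As a sketch of what that mirroring would look like with ZFS doing the work, with invented disk names where the c2 disks come from one RAID controller and the c3 disks from the other:

    zpool create fs0 mirror c2t1d0 c3t1d0 mirror c2t2d0 c3t2d0

Either controller can then die without taking the pool down, at the cost of half of the raw disk space.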

Sun's fix for this bug is in OpenSolaris Nevada build 77, but is apparently not going to be available in Solaris proper before Solaris 10 update 5, currently scheduled for April 2008. (I cannot say that I am pleased about this timeline, given that our own bug report is more than six months old and Sun has known about this issue for at least a year.)
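
(If, as I believe, the fix here is the new 'failmode' pool property, then once it is available choosing the per-pool behavior should look something like this, with an invented pool name:

    zpool set failmode=continue fs0

With 'continue', ZFS is supposed to return write errors instead of panicking the machine; there are also 'wait' and 'panic' settings.)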

If we build our fileserver environment only after S10U5 is released, the decision of whether or not to use ZFS is easy.

If we build our new environment before then (and we probably should), the decision is less easy. ZFS gets us significant user features, but using ZFS means we run some risk of catastrophic panics that take down one or several fileservers due to a single SAN backend controller having a temporary glitch.

(We're willing to say that losing the network or multiple controllers is a catastrophic failure, partly because the alternative is a DiskSuite based environment and DiskSuite has its own issues with failures.)

Right now we're mulling over how bad the risks are in practice, and how compelling the benefits of ZFS are in our contemplated design. I probably need to sketch out some sort of failure matrix to try to figure out all of the things that could go wrong and what the impact of each of them would be in a carefully set up environment as compared to an equivalent DiskSuite environment.

(Until very recently, this entry would have had a much stronger title, but Sun actually does have a fix now.)

ZFSWritePanic written at 23:00:47

2007-11-05

Some notes on Solaris 10 U4 x86 as an iSCSI target

The latest release of Solaris 10 (S10U4 or 8/07, depending on how you like to label it) has a built-in iSCSI target implementation, and ZFS even has integration with it so that it is easy to export space from ZFS pools. I've been poking at it a bit, resulting in some things I want to note down for my own later reference.

  • Solaris 10 iSCSI target stuff is handled with the iscsitadm command, which is helpfully only one letter away from the iscsiadm command to administer Solaris iSCSI initiator settings. The two commands are very similar and take options in the same way, except when they pick different vocabulary for the same command; iscsiadm uses 'add' and 'remove', but iscsitadm uses 'create' and 'delete'.

    (It is possible that this difference is deliberate, in order to prevent accidentally doing an operation with the wrong program. The fly in the ointment is that the command options are generally completely different, so I'm pretty sure that the attempted operation would fail anyway.)

  • the very first thing you need to do is use 'iscsitadm modify admin -d <dir>' to tell the iSCSI target code where to store its state information, such as what targets you've defined; there's a sketch of this after the list. If you do not do this, nothing will complain, but (among other things) all of the targets you've carefully defined by hand will disappear when you reboot.

    (I think ZFS targets created with the shareiscsi ZFS property might still persist.)

  • in targets with multiple LUNs, you should make LUN 0 (the first one created) a little dummy LUN that never actually gets used; a few megabytes of backing file will do (see the sketch after this list). This is because you cannot modify LUNs except by deleting and recreating them, and you cannot delete LUN 0 while any other LUN still exists.

  • setting the ZFS shareiscsi property creates a separate iSCSI target for every shared ZVOL, even when they inherit the property from a parent filesystem (there's an example after this list). If you need to bundle things into multiple LUNs on the same target, you will have to do it by hand.

  • there seems to be no way to manually set the iSCSI name of a target when you create it. This is unfortunate and limiting, since there are a number of situations where you need the iSCSI name to be something specific.

  • iscsitgtd periodically dumps core in /, sometimes leaving multi-gigabyte core files that will fill up your root filesystem (I have seen it leave a core file of over 5 gigabytes).
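
To make the state-directory point concrete, the setup looks something like this; the directory name is just an example, and I believe 'show admin' is the way to check the current setting:

    iscsitadm modify admin -d /etc/iscsitgt
    iscsitadm show admin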
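
The dummy LUN 0 trick looks something like this, as I understand the syntax; the target name, size, and backing store are all invented:

    # LUN 0: a tiny placeholder that never gets used
    iscsitadm create target -z 5m scratch
    # the real storage goes into LUN 1 and up
    iscsitadm create target -u 1 -b /dev/zvol/rdsk/tank/vol1 scratch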
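
And the per-ZVOL behavior of shareiscsi can be seen with something like this (the pool and ZVOL names are invented):

    zfs create -V 10g tank/vol1
    zfs create -V 10g tank/vol2
    zfs set shareiscsi=on tank
    # 'iscsitadm list target' now shows one target per ZVOL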

In general iscsitadm seems biased towards creating a new iSCSI target for every separate bit of target storage that you have. If you are using one of the dynamic discovery methods on your iSCSI initiators this is not too bad, but it is going to be horrible if you're using a static configuration; for static configurations you really want LUNs within a single iSCSI target.
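
For example, with static configuration every new target is another entry that has to be added by hand on every initiator, something like this (the target name and address are invented):

    iscsiadm add static-config iqn.1986-03.com.sun:02:example-target,192.0.2.10:3260
    iscsiadm modify discovery --static enable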

SolarisiSCSITarget written at 23:31:15

