Solaris Volume Manager and iSCSI: a problematic interaction

July 25, 2007

Solaris Volume Manager (which I still call DiskSuite) keeps information about the state of its logical volumes in what it calls a 'metadevice state database' (a metadb for short). You normally keep a number of replicas of this state database, scattered around the physical devices that DiskSuite is managing for you. When you are using metasets, all of a metaset's metadb replicas have to be on disks in that metaset. This is a logical consequence of the DiskSuite tools needing to update the metadata to reflect which machine owns a metaset; if there were metadb replicas on a disk outside the metaset, DiskSuite on another machine wouldn't necessarily be able to update them.
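
To make this concrete, here is roughly how the two kinds of replicas show up from the command line (the metaset name 'iscsiset' is invented for illustration):

    metadb                  # replicas in the machine's local (private) set
    metadb -s iscsiset      # replicas belonging to the shared metaset 'iscsiset'
    metaset -s iscsiset     # the hosts and disks that make up the metaset

The replicas that 'metadb -s' reports all live on slices of disks that are members of the metaset; metaset normally creates one on slice 7 of each disk as you add it.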

DiskSuite's approach to dealing with unavailable metadb replicas is simple: DiskSuite panics the system if it loses metadb quorum, where quorum is half of the metadb replicas plus one. This is actually spelled out explicitly in the metadb manpage, along with the reasoning.

(Technically it may survive with exactly half of the metadb replicas; I can't test right now.)
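
As a worked example (the numbers are invented): with six replicas, quorum is 6/2 + 1 = 4, so losing two replicas is survivable, losing three puts you at the exactly-half marginal case above, and losing four or more drops you below half and panics the machine. You can check where you stand with:

    metadb -s iscsiset -i   # -i also prints a legend explaining the status flags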

Now we get to the iSCSI side of the problem, namely that if the Solaris iSCSI initiator loses connectivity to an iSCSI target, it offlines all of the disks exported by that target, which in turn immediately tells DiskSuite that the metadb replicas on all of those disks are now unavailable. If this drops you below quorum in DiskSuite (for any metaset), your system promptly panics.

(This is different from the behavior of FibreChannel, where glitches in FC connectivity just produce IO errors for any ongoing IO and don't yank the metadb replicas out from under DiskSuite.)
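
To see which (logical) disks will disappear together when a single target drops out, something like this works with the Solaris iSCSI initiator (this is from memory, so treat it as a sketch rather than exact syntax):

    iscsiadm list target -S    # lists each target and, under it, its LUNs
                               # and their OS device names (/dev/rdsk/cXtYdZ)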

The net result is that if you are using Solaris Volume Manager to manage iSCSI-based storage in metasets, you need to build metasets that include disks (logical or otherwise) from at least three different iSCSI targets or the loss of connectivity to a single target will kill your entire machine.

(And you need to carefully balance the number of metadb replicas across all of your targets so that one target doesn't have too many replicas.)
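
A minimal sketch of building such a metaset (the host, set, and device names are all invented; real iSCSI device names are a lot longer):

    metaset -s iscsiset -a -h thishost            # create the set and add this host
    metaset -s iscsiset -a c2t1d0 c3t1d0 c4t1d0   # one disk from each of three targets
    metadb -s iscsiset                            # check where the replicas ended up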


Comments on this page:

By Dan.Astoorian at 2007-07-26 10:31:00:

(Technically it may survive with exactly half of the metadb replicas; I can't test right now.)

It survives (i.e., doesn't panic) with exactly half, but operator intervention is needed to bring the system back up if it goes down: the system boots single-user so that you can fix the problem.

Clearly, if your system had only two disks (or two sets of disks on redundant controllers) with all data mirrored between them, you wouldn't want the system to panic if one of them went down.

(There is an unsupported, undocumented /etc/system tunable to allow the system to boot with exactly 50% of its metadbs, but it's intended specifically to allow HA systems with mirrored root disks to still come up if one of the root disks fails; it's not intended to be used with anything except the root disk.)

The net result is that if you are using Solaris Volume Manager to manage iSCSI-based storage in metasets, you need to build metasets that include disks (logical or otherwise) from at least three different iSCSI targets or the loss of connectivity to a single target will kill your entire machine.

I think two should be sufficient, although three will improve the odds of the machine coming back up unattended if a panic or reboot does occur.

(And you need to carefully balance the number of metadb replicas across all of your targets so that one target doesn't have too many replicas.)

And, ideally, balance them with respect to other failure modes as well, where practical. For example, if 2/3 of your disks are accessed through one network path, and the other 1/3 are available through a different set of switches, you may wish to try to balance the metadbs so that quorum is not lost if either network path goes down.

--Dan

By cks at 2007-07-30 09:20:34:

One of the really tricky bits of balancing metadb replicas is that metaset seems to like putting one on each (logical) disk that you add to the metaset. Since you may add different numbers of logical disks from different targets, you'll need to remember to strip some of the metadb replicas off to maintain the balance between targets.
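
For example (the set and slice names are invented), if one target contributed three disks and the other targets only one each, you would trim the extras with something like:

    metadb -s iscsiset                           # see which slices hold replicas
    metadb -s iscsiset -d c2t2d0s7 c2t3d0s7      # drop the surplus replicas on that target's disks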

By Dan.Astoorian at 2007-07-30 10:05:31:

...you'll need to remember to strip some of the metadb replicas off to maintain the balance between targets.

Or, perhaps preferably, to put additional metadbs onto some drives to maintain the balance, rather than stripping them off. (This may require manually repartitioning those drives to make slice 7 larger so that it can hold the number of metadbs you want.)
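
A sketch of that direction (names invented, and as far as I know you have to clear a slice before re-adding it with a higher replica count):

    metadb -s iscsiset -d c3t1d0s7        # remove the single existing replica
    metadb -s iscsiset -a -c 2 c3t1d0s7   # put two replicas back on the same slice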

DiskSuite's policy of putting metadbs on every drive has a purpose: it ensures that as long as even one drive in the metaset is accessible, the system can determine the membership of the metaset. (Otherwise, I don't believe there would be anything to prevent an administrator from accidentally adding the drive to a different metaset, causing confusion when the other drives in the original metaset come back up.)

--Dan

By cks at 2007-07-30 12:18:25:

In our usage, all of the logical drives coming from a target are actually coming from the same underlying pool of RAID storage, so either all the logical drives are available or none of them are, and having metadb replicas on multiple logical drives on the same target doesn't really get us anything.

(Ideally I could tell DiskSuite what was really going on and it would balance metadb replicas based on the underlying storage, but I'm not going to hold my breath for that.)
