Wandering Thoughts archives

2007-02-21

The quick overview of DiskSuite failover

Solaris 8 DiskSuite does failover on disks (logical or otherwise), not filesystems or partitions. Each disk then gives you up to seven usable partitions (technically you get eight, but DiskSuite takes one for its metadata), and you use these partitions as the building blocks for mirrors, stripes, and filesystems.
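
As a rough sketch of what the building blocks look like in practice (with invented disk and metadevice names; the details will differ on your hardware), turning slices into a mirrored stripe goes something like this:

  # (disk and metadevice names here are invented for illustration)
  # a two-slice stripe across slice 0 of two disks
  metainit d21 1 2 c1t0d0s0 c1t1d0s0
  # a matching stripe on two other disks
  metainit d22 1 2 c1t2d0s0 c1t3d0s0
  # make d20 a mirror with d21 as its first submirror, then attach d22
  metainit d20 -m d21
  metattach d20 d22
  # and put a filesystem on the mirror
  newfs /dev/md/rdsk/d20

(Inside a metaset, which is what the next bit is about, all of these commands also take a '-s setname' argument.)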

Disks are grouped together into metasets, and one machine in your failover cluster owns each metaset at any given time and is the only machine allowed to do IO to any of its disks. As a consequence, all mirroring, striping, and so on has to be within a metaset. In our setup, each metaset is all of the disks in a virtual NFS server. A single physical system can be the owner of more than one metaset (and thus more than one virtual NFS server).

(Failover itself is done by changing the owner of a metaset, possibly forcibly.)
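
All of this is driven by the metaset command. A rough sketch, again with invented set, host, and disk names:

  # create a metaset shared between two hosts, then add disks to it
  metaset -s nfs1 -a -h fs1 fs2
  metaset -s nfs1 -a c2t0d0 c2t1d0
  # take ownership of the metaset on this host (add -f to force it
  # away from an owner that has died), or release it again
  metaset -s nfs1 -t
  metaset -s nfs1 -r
  # show all metasets and their current owners
  metaset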

In Solaris 8, all of the disks in a metaset have to appear as the same devices on all of the machines participating in the failover pool for that metaset (eg, c0t0d0 has to be c0t0d0 everywhere). This is apparently a limitation of the metadata that DiskSuite keeps, and I believe it has been relaxed in Solaris 10. As a practical matter, this means you want identical hardware configurations for all of your fileserver machines.

If you have a SAN and want any of your filesystems to have SAN RAID controller redundancy (so the filesystem keeps going even if one controller falls over), the filesystem's metaset must include disks from more than one controller. Unless you dedicate two entire controllers to a single (probably very big) metaset, you will probably wind up with a single SAN RAID controller having (different) disks in several different metasets. This unfortunately complicates load calculations and working out the consequences of taking a single controller down.

(In the extreme case the failure of a single controller could affect all of your virtual NFS servers.)
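
(To illustrate the cross-controller case: if c2 and c3 are disks seen through two different RAID controllers, a mirror inside a metaset that survives losing one controller might be put together roughly like this, again with made-up names:)

  # submirrors on different controllers, inside the nfs1 metaset
  metainit -s nfs1 d11 1 1 c2t0d0s0
  metainit -s nfs1 d12 1 1 c3t0d0s0
  metainit -s nfs1 d10 -m d11
  metattach -s nfs1 d10 d12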

DiskSuiteFailover written at 12:46:22

2007-02-06

What the Solaris 8 nfs3_max_threads parameter probably controls

We've recently been having a problem where one specific client experiences really slow IO to one specific NFS mounted filesystem, especially for things like file creation. (Other clients can use the filesystem at full speed, and this client can use other filesystems at full speed.)

We immediately thought 'per-filesystem concurrency limits' and soon turned up the Solaris 8 NFS kernel tunable nfs3_max_threads, which the fine documentation helpfully describes as:

This symbol controls the maximum number of async threads started per file system mounted using NFS version 3 software.

That's all very well and good, but it leaves one with an important question: just what are async threads necessary for in a Solaris 8 NFS client?

The Sun documentation is no help; the only mention of async threads is in the documentation for NFS kernel tunables. (Par for the course, Google has a more useful search engine for docs.sun.com than Sun does. Also, I have to say that docs.sun.com is achingly slow.)

The online O'Reilly Managing NFS and NIS book mentions the parameter, but is no help in clarifying things. On the one hand it says that async threads are only used for readahead and writeback, but on the other hand it says that the number of async threads limits how much outstanding IO a client can dump on a server per filesystem (which cannot be true unless all NFS IO goes through async threads).

Interestingly, the Solaris 9 documentation says (about the NFS v3 case):

The operations that can be executed asynchronously are read for read-ahead, readdir for readdir read-ahead, write for putpage and pageio requests, and commit.

I don't know how much has changed in this code between Solaris 8 and Solaris 9, but I think it's likely to be pretty close (and it's the best answer I'll probably get, short of a chance to spelunk the Solaris 8 kernel code). And points to Sun for writing useful documentation about it only slightly too late to be directly useful for our systems.

(Unfortunately, all of this winds up suggesting that the problem going away after we cranked nfs3_max_threads up to 32 from the default of 8 may just have been coincidence. Sometimes ignorance is bliss, or at least confidence.)
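
(For reference, Solaris NFS module tunables like this are normally set in /etc/system and picked up at the next reboot. A sketch; the mdb check is my assumption about a convenient way to see the live value, not something the documentation promises:)

  # in /etc/system, effective after a reboot:
  #   set nfs:nfs3_max_threads=32
  # peeking at the current value on a running system (assuming the
  # nfs module is loaded because something is NFS-mounted):
  echo nfs3_max_threads/D | mdb -k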

Nfs3MaxThreadsQuest written at 23:24:52

