Wandering Thoughts archives

2009-09-23

A brief overview of the Solaris 10 nvpair library

Solaris 10 code often uses a data structure called nvpairs (or nvlists), including throughout the ZFS code, even in the kernel. There doesn't seem to be a good overview of the libnvpair library, and since I've just spent the past couple of weeks up to my elbows in the OpenSolaris ZFS codebase, it seems like a good time to write down what I've learned about nvlists and nvpairs before I forget it.

The basic overview is that an nvlist is a list, possibly nested, of names (keys) and values, aka nvpairs. An nvpair has a name (a C string, although Solaris code always uses #defines for them and so hides this in header files), a value, and the type of the value. There are a lot of possible nvpair value types, but two important ones are nvlists and nvlist arrays. An nvlist value is just another nested nvlist; an nvlist array is, well, an array of them. As an example, in ZFS pool configurations a vdev is an nvlist and the list of devices in it is an nvlist array.
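To make this concrete, here's a minimal sketch in C of building and reading a small nested nvlist with the user-level libnvpair API (link with -lnvpair). The structure and the key names ("type", "children", "id") are only illustrative, not necessarily what ZFS itself uses, and error checking is omitted:

  #include <stdio.h>
  #include <libnvpair.h>

  int
  main(void)
  {
      nvlist_t *top, *disks[2];
      char *vtype;
      int i;

      /* NV_UNIQUE_NAME enforces the usual 'no duplicate keys' convention. */
      (void) nvlist_alloc(&top, NV_UNIQUE_NAME, 0);
      (void) nvlist_add_string(top, "type", "mirror");

      for (i = 0; i < 2; i++) {
          (void) nvlist_alloc(&disks[i], NV_UNIQUE_NAME, 0);
          (void) nvlist_add_string(disks[i], "type", "disk");
          (void) nvlist_add_uint64(disks[i], "id", i);
      }
      /* An nvlist array value: the children of this 'mirror' nvlist.
       * The library copies the child lists, so we can free ours below. */
      (void) nvlist_add_nvlist_array(top, "children", disks, 2);

      /* Convenient retrieval by name; the lookup checks the type for us. */
      if (nvlist_lookup_string(top, "type", &vtype) == 0)
          printf("top-level vdev type: %s\n", vtype);

      for (i = 0; i < 2; i++)
          nvlist_free(disks[i]);
      nvlist_free(top);
      return (0);
  }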

(In common practice there can't be duplicate keys in a given nvlist. While you can create nvlists where this is not the case, this causes a lot of the convenient 'get the element with name X' retrieval functions to stop working.)

The Solaris code I've seen has pretty much used nvlists as if they were some peculiar and awkward hybrid of C structs and Python dictionaries. (For example, the ZFS kernel code keeps a bunch of things in nvlists instead of unpacking them into structs.)

That the keys are strings and nvpairs carry type information makes an nvlist self-describing; if you want, you can print out an arbitrary nvlist without knowing anything about its structure beforehand (including descending into its children). However, nvlists are not self-identifying in the way that, say, XML files usually are, in that there is no particular label that will tell you what a random nvlist is for or represents.

Nvlists can be serialized to disk and then loaded back again. As you might guess from everything so far, a ZFS pool's configuration is stored as an nvlist that is then embedded in the pool. For that matter, ZFS device labels are also nvlists.
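Here's a hedged sketch of what serialization looks like with the documented user-level calls, continuing from the example above (add <stdlib.h> for free(); error handling again omitted). As I understand it, nvlist_pack() allocates the buffer for you when you hand it a NULL pointer, and nvlist_print() can then dump the unpacked result with no advance knowledge of its structure:

  /* Round-trip an nvlist through a byte buffer and dump the result. */
  static void
  roundtrip_and_dump(nvlist_t *nvl)
  {
      char *buf = NULL;
      size_t buflen = 0;
      nvlist_t *copy;

      /* Serialize; with buf == NULL the library allocates the buffer.
       * NV_ENCODE_XDR is the portable encoding used for on-disk data. */
      (void) nvlist_pack(nvl, &buf, &buflen, NV_ENCODE_XDR, 0);

      /* ... at this point buf/buflen could be written out to disk ... */

      /* Deserialize and print. nvlist_print() walks nested nvlists and
       * nvlist arrays on its own, since every nvpair carries its name
       * and its type. */
      (void) nvlist_unpack(buf, buflen, &copy, 0);
      nvlist_print(stdout, copy);

      free(buf);
      nvlist_free(copy);
  }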

(It is probably not quite true that everything that moves in Solaris 10 is an nvlist, but it sometimes feels like it.)

SolarisNvpairLibrary written at 01:50:11

2009-09-14

Listing file locks on Solaris 10

Suppose that you want to find out what files are locked on your Solaris 10 machine, perhaps because it is an NFS server and you really want a global view of what locks you have. On a Linux machine (or a sufficiently old Solaris one) you could use lslk, but it's never been ported to Solaris 10; instead you need to use 'mdb -k', the kernel debugger.

The 'simple' command to get a list of files with file locks is:

echo '::walk lock_graph | ::print lock_descriptor_t l_vnode | ::vnode2path' | mdb -k | sort -u

We need the sort -u because it's reasonably common for our servers to have multiple locks (presumably non-overlapping) against the same file. Your situation may differ.

The mdb ::lminfo command gives you somewhat more information about all of the locks, but it has the drawback that it truncates filenames. What I know about the various fields it prints:

  • the bits of the FLAG field come from flock_impl.h, specifically the section about the l_status field. Note that the FLAG field is printed in hex.

  • the PID and COMM fields are not necessarily meaningful for client locks on an NFS server. Locks from clients sometimes have a PID of 0 (which shows up as a COMM of <kernel>), sometimes have impossible PIDs (which show up as <defunct>, because there's no such PID on the NFS server), and if you're unlucky have the PID (and COMM) of an actual process on your NFS server.

    (See here for a possible discussion of this issue.)

  • I believe that you can spot file locks from NFS clients by a FLAG value that includes 0x2000 (aka LOCKMGR_LOCK).

Further digging (to, for example, find the name of the client that theoretically holds the lock) is hampered by the fact that the NFS lock manager code is unfortunately not part of the OpenSolaris code base, because it's not open source. (Allegedly Sun can't share it because it contains third-party code.)

Possibly, even ideally, there's a better way to get this sort of information. If so, I've been unable to find it.

Sidebar: what the mdb command does

This is a sufficiently complicated mdb command sequence that I feel like breaking it down (if only so that I'll remember how it works later).

  • ::walk lock_graph: iterate over the 'locking graph', which holds all of the file locks in the system. This yields a series of addresses of lock_descriptor_t structures.
  • ::print lock_descriptor_t l_vnode: print the l_vnode pointer from each lock descriptor. You can guess what this points to.
  • ::vnode2path: get the pathname given the vnode pointer.

If you're doing this inside an interactive mdb session, you can append '! sort -u' to sort it too.

ListingFileLocks written at 23:57:31

2009-09-12

My opinions on when you should let ZFS handle RAID stuff

Given the previous entry, here are my opinions so far on when you should let ZFS handle your redundancy and when you should have your storage backend do RAID. (This assumes that you have a storage backend; if not, well, you don't have much choice.)

  • if you are doing mirroring, you really want to let ZFS handle it because ZFS will do a much better job of handling small problems than your storage backend can.

  • if you have a SAN and need your frontends to survive temporary glitches in a single backend unit, you still need to use cross-backend mirroring with ZFS handling it.

    (This is our situation.)

  • if you have a heavy random read workload and can only afford the space overheads of RAID-5 or RAID-6, you pretty much have to let your backend handle the RAID stuff in order to keep your read rate up. You can play games with multiple smaller raidz or raidz2 vdevs, but you start losing more and more space to parity overhead and I'm not sure you get all that much for it.

(Thinking about it, with raidz you effectively add one disk's worth of IOPs for every extra disk that you lose to parity overhead: each vdev can do IO independently but only delivers roughly one disk's worth of random read IOPs, and each single-parity raidz vdev costs you one disk of parity. For instance, twelve disks as one big raidz give up one disk to parity for roughly one disk's IOPs, while the same twelve disks as three small raidzs give up three disks for roughly three disks' IOPs.)

If you are seriously worried about data integrity, you probably want to let ZFS handle the RAID stuff unless you have lots of spare disk bandwidth. Otherwise you will be scanning all of your disks twice, once in the storage backend to verify the RAID array and once in ZFS to guard against all of the other things that can go wrong.

In general (as the title of the previous entry says) I think that you are better off having ZFS handle the redundancy unless there are strong reasons otherwise. Roughly speaking, how much better off depends on how much you spent on your storage backend; the cheaper the backend, the less that you want it doing RAID or indeed much of anything at all except getting out of your way.

(This is true in general, of course.)

ZFSWhenRaid written at 01:23:19

2009-09-11

Why you should let ZFS handle the RAID stuff

In a comment on my last entry, Matt Simmons asked why you'd let ZFS handle RAID level issues instead of just handling them in your storage backend (SAN or otherwise). Having ZFS do this is the recommended practice, for a number of reasons (some of which the ZFS FAQs will tell you about).

Here are the reasons for letting ZFS handle the RAID stuff that I know about (or at least can think of right now):

  • ZFS can reliably repair corrupted blocks in mirrors, because it can tell which copy of a block is bad and which is good.

  • ZFS should be able to recover better from subtle RAID-5/6 corruption. A backend RAID setup can fix an outright read error just as well as ZFS can, but I believe that ZFS can also fix things when a sector silently returns corrupted data instead of an outright read error.

  • ZFS reconstructs things faster and better after a disk failure than a SAN backend can, partly because it only has to resilver the data that is actually in use instead of every block on the disk.

  • ZFS will probably run faster if it's scheduling IO relatively close to the physical disks. This especially applies if the physical disks are actually shared between pools behind ZFS's back (as is typical if you define a big RAID-5 or RAID-6 array and then carve it up into LUNs, each of which you turn into a separate pool).

  • some write loads on RAID-5 or RAID-6 may run faster.

  • without its own redundancy (across different backends), ZFS will panic your system if one of your backends is temporarily unavailable for more than a relatively short amount of time.

Having ZFS handle RAID level issues has some drawbacks, though. They include at least:

  • ZFS's versions of RAID-5 and RAID-6 (raidz and raidz2) have significantly lower random read rates.

  • You have to count on ZFS to get spares right.

  • various things hang if some of your disks go away in the wrong way, even though your pools are still working.

  • ZFS is much less prepared for long term storage management than serious (and seriously expensive) 'enterprise' SAN systems, many of which can do really quite sophisticated rearrangements of your SAN arrays.

I would hope that storage backends have better status monitoring and reporting tools than ZFS currently does (because ZFS's are fairly bad), but I am not at all sure that that is the case.

ZFSWhyOwnRaid written at 01:20:22

2009-09-10

What I know about how ZFS actually handles spare disks

Like many other RAID-oid systems, ZFS has a notion of spare disks; you can add one or more spare disks to a pool, and ZFS will use them as necessary in order to maintain pool redundancy in the face of disk problems. For details, you can see the zpool manpage.

Well, sort of. Actually, how ZFS handles spare disks is significantly different from how normal RAID systems handle them, and the pleasantly bland and normal description of spares in the zpool manpage elides a significant number of important things. The following is what I have been able to gather about the situation from various sources (since Sun doesn't seem to actually document it).

In a traditional RAID system with spares, spare handling is part of the main RAID code in the kernel, with spares activated automatically when needed. In Solaris this is not the case; the only thing that the kernel ZFS code does is keep track of the list of spares and some state information about them. Activating a spare is handled by user-level code, which issues the equivalent of 'zpool replace <pool> <old-dev> <spare-dev>' through a library call. Specifically, activating ZFS spares is the job of the zfs-retire agent of fmd, the Solaris fault manager daemon.

(Once zfs-retire activates the spare, the ZFS kernel code handles the rest of the process, including marking the spare in use and setting up the special 'this device is replaced with a spare' vdev. This means that you can duplicate a spare activation by doing a 'zpool replace' by hand if you ever want to.)

In theory, using fmd for this is equivalent to doing it all in the kernel. In practice, your ZFS spare handling is at the mercy of all of these pieces working right, and they don't always do so. For one prominent example, it is up to the zfs-retire module to decide what should cause it to activate a spare, and it has not always done so for everything that degrades a ZFS vdev.

My primary sources for all of this are this Eric Schrock entry and the archives of the zfs-discuss mailing list. Examination of the OpenSolaris codebase has also been useful (although if you are tempted to do this, beware; it does not necessarily correspond with Solaris 10).

Sidebar: what is required for spare activation

In order for a spare to be activated, a great many moving parts of your system have to all be working right. I feel like writing them down (at least the ones that I can think of):

  • fmd has to be running
  • fmd has to be getting (and generating) relevant events, which may require various fmd modules to be working correctly
  • the zfs-retire agent has to be working, and to have subscribed to those events
  • zfs-retire has to decide that the event is one that should cause it to activate a spare.
  • zfs-retire has to be able to query the kernel (I think) to get the problem pool's configuration in order to find out what spares are available. (This can fail.)
  • zfs-retire has to be able to issue the necessary 'replace disk' system call.

A further side note on events: in an ideal world, there would be a 'ZFS vdev <X> has been degraded because of device <Y>' event that zfs-retire would listen for. If you think that Solaris lives in this world, I have bad news for you.

ZFSSpareHandling written at 00:51:35

