2010-03-30
One possible future for Solaris
In light of recent developments, here is one pessimistic view of a future for Oracle's Solaris. I wrote before that I didn't think that Oracle had suddenly decided to get into the operating system business; instead, they might see Solaris as infrastructure for their actual products. Well, take that to its logical end point and what you get is Solaris as Oracle's captive operating system.
As their captive OS, Solaris exists to run Oracle products and not for much else. It gets improvements only when they help Oracle's products (because otherwise they're not cost justifiable; operating systems don't make much money). In particular, it only gets new driver support for hardware that Oracle is interested in Solaris running on, which is probably only hardware that Oracle makes and a few other high end things.
(My native optimism says that Oracle won't entirely give up on Solaris on x86 hardware. But then, I thought that it might be six months before Oracle made Solaris non-free and it took them a lot less than that.)
This is a terribly pessimistic view because it basically predicts the death of Solaris through irrelevance, due to people not being able to run Solaris or OpenSolaris on common inexpensive hardware because it doesn't have the necessary driver support. (This is already somewhat of an issue for Linux, of all operating systems, and it's only going to get worse for a less popular one. Yes, even on server hardware; Ethernet and (E)SATA drivers don't grow on trees, among other things, and Solaris already has problems on Sun's own hardware.)
(As Pete Zaitcev notes, the much more important question for the open source world is what Oracle does about Sun's important open source projects. I care about Solaris's future for entirely selfish reasons, namely that we run it as part of our production environment and there is no equivalent replacement for our ZFS setup.)
2010-03-29
More signs of Oracle's view of Solaris
Well, that was fast. Back in ReadingSolarisTeaLeaves I wrote:
Any free version of Solaris 10 is now basically a sampler, much like Oracle has done with a personal use version of their database, and I wouldn't be surprised if the Solaris license was revised to reflect that in a while.
When I wrote this, I was thinking 'in six months or so'. In fact it was less than a month; as Ben Rockwood wrote recently, Solaris is no longer free to use. You can evaluate it for 90 days, but after that you need a paid-up support contract. It seems that Oracle's view of Solaris's future is getting clearer and clearer.
On the one hand, this doesn't directly affect us; the university has a long-standing general support agreement with Sun, so we're covered. On the other hand, this agreement is renewed on a year to year basis and who knows how much Oracle is going to want for it when renewal time comes around. It would be quite easy for Oracle to price support out of our reach; we can't afford 'enterprise production' costs and even prices like a thousand dollars a year per system would be gulp-inducing.
(It's possible that Oracle has already announced general Solaris support prices, but as usual it's not easy to find this sort of thing on the revised Sun/Oracle website, so I have no idea. All I could find just now was a blurb on 'Oracle Premier Support', which doesn't exactly sound inexpensive.)
I'd like to think that Oracle wouldn't throw away easy money by pricing their support offerings out of our reach, but I'm not that naive. There are all sorts of reasons that Oracle might set quite high support prices, including discouraging small organizations from running Solaris because they cost too much to support in the long run. Also, various Oracle people have apparently said that they view Solaris as their high end offering and, well, high end offerings mean high end prices.
2010-03-24
One reason why 'zpool status' can hang
The 'zpool status' command is infamous for stalling
and hanging exactly when you need it the most, namely when something is
going wrong with your system. I've recently run down one reason why it
does this.
The culprit is my old friend ZFS GUIDs. Information
about disks involved in ZFS pools includes both their GUIDs and their
theoretical device paths. When 'zpool status' prints out the user
friendly shortened device names, it doesn't just take the theoretical
path and trim most things off to get the device name; as I've alluded
to before, it decides to be hyper-correct and check
that the device named in the configuration really is the right device.
In theory this is simple, as Solaris has some system calls for doing
pretty much all of the work. In practice these system calls require
you to open the disk device that you want to check, and under some
circumstances this open() will stall for significant amounts of time
(several minutes, for example). An iSCSI target that isn't responding is
one such circumstance.
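One way to convince yourself that the stall really is in the open() itself is to time it directly. Here is a minimal standalone sketch (my own, not anything from the zpool code) that just reports how long opening a disk device takes; the device path is a made-up placeholder, and you would substitute a device from one of your own pools, such as an iSCSI-backed disk (you'll generally need to be root for the open() to succeed at all):
/* Minimal sketch: time how long open()ing a disk device takes.
 * The device path below is a placeholder; substitute one of your
 * own pool's devices (an iSCSI-backed one, for instance).
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int
main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/rdsk/c0t0d0s0";
        time_t start = time(NULL);

        /* This is the call that can stall for minutes if the backing
         * storage (eg an iSCSI target) isn't answering. */
        int fd = open(dev, O_RDONLY);

        printf("open(%s) took %ld seconds (returned fd %d)\n",
            dev, (long)(time(NULL) - start), fd);
        if (fd >= 0)
                close(fd);
        return (0);
}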
If you've ever seen this happen, you might wonder why 'zpool status'
hangs completely and immediately, before printing any pool configuration
information, instead of getting to the point where it starts to print
device names for affected devices. The answer is that 'zpool status'
is helpfully extra-clever; before it prints out any pool configuration
stuff, it pauses to work out how wide it has to make the device name
column so that everything will line up nicely. This requires working
out the friendly name of all devices, which requires that hyper-correct
checking of the configuration, which stalls the entire process if any
disk is very slow to open().
(For extra fun, this 'calculate the needed width' step also looks at the
spare disks (if any), so a single bad spare disk, one that's not even in
use, can cause 'zpool status' to stall on you.)
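To make the order of operations concrete, here is a stripped-down model of that width calculation (my own sketch, not the real zpool code; resolve_friendly_name() and the device paths are made-up stand-ins for the real name resolution described in the sidebar below). The point is simply that every device's friendly name, spares included, has to be resolved before a single line of output can appear:
/* Simplified model (not the real zpool code) of the 'calculate the
 * needed width' step: every device's friendly name, spares included,
 * is resolved before anything at all is printed.
 * resolve_friendly_name() is a made-up stand-in; in the real code,
 * resolving a name can open() the device, and that open() is what
 * stalls.
 */
#include <stdio.h>
#include <string.h>

static const char *
resolve_friendly_name(const char *cfgpath)
{
        /* Stand-in: just trim the configured path down to its last
         * component.  The real code also verifies the device's devid,
         * which can mean a blocking open() of the device itself. */
        const char *p = strrchr(cfgpath, '/');
        return (p != NULL ? p + 1 : cfgpath);
}

int
main(void)
{
        /* Made-up example devices and a spare. */
        const char *devs[] = { "/dev/dsk/c1t0d0s0", "/dev/dsk/c2t0d0s0" };
        const char *spares[] = { "/dev/dsk/c3t0d0s0" };
        size_t i, w, width = 0;

        for (i = 0; i < sizeof (devs) / sizeof (devs[0]); i++)
                if ((w = strlen(resolve_friendly_name(devs[i]))) > width)
                        width = w;
        for (i = 0; i < sizeof (spares) / sizeof (spares[0]); i++)
                if ((w = strlen(resolve_friendly_name(spares[i]))) > width)
                        width = w;

        /* Only now, with every name resolved, can output start. */
        printf("NAME column width: %zu\n", width);
        return (0);
}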
The 'zpool iostat' command also does the same extra-clever step of
working out the maximum width of the device name column, so it will
stall for the same reason. For bonus points, 'zpool iostat' does
this every time it prints out a round of statistics. Yes, really.
No wonder plain iostat is acres better if anything bad is going
on.
By the way, this particular stall only happens if you have permissions
to open the device in the first place, ie it only happens if you are
root. So if you suspect ZFS problems, especially if you want 'zpool
iostat' results, run the commands as a non-root user.
(This is not the only way that zpool status can stall; I've seen it
stutter when it was trying to get the ZFS pool configuration from the
kernel.)
Sidebar: where this is in the code
The zpool source code is usr/src/cmd/zpool, and the whole width
calculation stuff is in zpool_main.c:max_width(). This calls
zpool_vdev_name(), in lib/libzfs/common/libzfs_pool.c, which
calls path_to_devid(), which actually open()'s the device. This
check is thoughtfully guarded to make sure that it doesn't open devices
that ZFS has gotten around to declaring are actually bad; sadly, ZFS
makes such declarations long after open()'s of iSCSI target disks have
started stalling for minutes at a time.
2010-03-17
How Solaris 10's mountd works
Due to security and complexity issues, Unix systems vary somewhat in exactly how they handle the server side of doing NFS mounts. I've recently been digging in this area, and this is what I've learned about how it works in Solaris 10.
The server side of mounting things from a Solaris 10 fileserver goes more or less like this:
- a client does the usual SUNRPC dance with portmapper and then sends
an RPC mount request to mountd
- if the filesystem is not exported at all or if the options in the
mount request are not acceptable at all, mountd denies the request.
- mountd checks to see if the client has appropriate permissions. This
will probably include resolving the client's IP address to a hostname
and may include netgroup lookups. This process looks only at ro= and
rw= permissions, and thus will only do 'is host in netgroup' lookups
for netgroups mentioned there.
- if the client passes, mountd looks up the NFS filehandle of the root
of what the client asked for and sends off an RPC reply, saying 'your
mount request is approved and here is the NFS filehandle of the root
of it'.
You'll notice that mountd has not told the kernel about the client
having access rights for the filesystem.
- at some time after the client kernel accepts the mount, it will
perform its first NFS request to the fileserver. (Often this
happens immediately.)
- if the fileserver kernel does not have information about whether
IP <X> is allowed to access filesystem <Y> in its authorization
cache, it upcalls to mountd to check.
- mountd goes through permissions checking again, with slightly
different code; this time it also looks at any root= option and thus
will do netgroup lookups for those netgroups too.
- mountd replies to the kernel's upcall (we hope) with the permissions
the client IP should have, which may be 'none'. The Solaris kernel
puts this information in its authorization cache.
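If it helps, here is a very small model of what the kernel side of that flow amounts to (entirely my own illustration; none of these names come from the actual nfs_auth.c code). The behaviour to notice is that a cache miss triggers the upcall, and that whatever answer comes back, including 'no access', gets cached:
/* A much-simplified model of the kernel's per-request check; every
 * name here is invented for illustration and none of it is the real
 * nfs_auth.c code.  On a cache miss the kernel asks mountd, and
 * whatever answer comes back, including 'no access', is cached.
 */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES     128

enum access { ACC_NONE, ACC_RO, ACC_RW };

struct auth_entry {
        char            client[64];     /* client IP address */
        char            fs[128];        /* exported filesystem */
        enum access     acc;            /* the cached answer */
};

static struct auth_entry cache[MAX_ENTRIES];
static int ncached;

/* Stand-in for the door upcall to mountd; in real life this is where
 * the hostname and netgroup checking happens, and it can be slow. */
static enum access
upcall_mountd(const char *client, const char *fs)
{
        (void) fs;
        return (strcmp(client, "10.0.0.5") == 0 ? ACC_RW : ACC_NONE);
}

static enum access
check_access(const char *client, const char *fs)
{
        int i;
        enum access acc;

        for (i = 0; i < ncached; i++)
                if (strcmp(cache[i].client, client) == 0 &&
                    strcmp(cache[i].fs, fs) == 0)
                        return (cache[i].acc);          /* cache hit */

        /* Cache miss: ask mountd and remember its answer of the
         * moment, even if that answer is 'none'. */
        acc = upcall_mountd(client, fs);
        if (ncached < MAX_ENTRIES) {
                strncpy(cache[ncached].client, client,
                    sizeof (cache[ncached].client) - 1);
                strncpy(cache[ncached].fs, fs,
                    sizeof (cache[ncached].fs) - 1);
                cache[ncached].acc = acc;
                ncached++;
        }
        return (acc);
}

int
main(void)
{
        /* Made-up client IPs and filesystem. */
        printf("10.0.0.5 -> %d\n", check_access("10.0.0.5", "/export/home"));
        printf("10.0.0.9 -> %d\n", check_access("10.0.0.9", "/export/home"));
        return (0);
}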
The mount daemon has a limit on how many simultaneous RPC mount requests it can be processing; this is 16 by default. There are also limits of some sort on kernel upcalls, I believe including a timeout on how long the kernel will wait for any given upcall to finish before giving up, but I don't know what they are or how to find them in the OpenSolaris code.
Because this process involves doing the permissions checks twice
(and checks multiple NFS export options), it may involve a bunch
of duplicate netgroup lookups. Since netgroup lookups may be
expensive, mountd caches the result of all 'is host <X> in netgroup
<Z>' checks for 60 seconds, including negative results. This
mountd cache is especially relevant for us given our custom NFS
mount authorization.
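For illustration, here is roughly what such a 60-second cache looks like (a minimal sketch of the idea, not mountd's actual netgroup.c code; the host and netgroup names are made up), wrapped around the standard innetgr() call. Note that both positive and negative answers get remembered, which matters below:
/* Minimal sketch of a 60-second 'is host in netgroup' cache (my own
 * illustration of the idea, not mountd's actual netgroup.c code),
 * wrapped around the standard innetgr() call.  Both positive and
 * negative answers are remembered for CACHE_TTL seconds.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <netdb.h>      /* innetgr() */

#define CACHE_TTL       60
#define CACHE_SIZE      64

struct ng_entry {
        char    host[256];
        char    group[64];
        int     member;         /* cached answer, positive or negative */
        time_t  when;
};

static struct ng_entry ngcache[CACHE_SIZE];

static int
host_in_netgroup(const char *host, const char *group)
{
        time_t now = time(NULL);
        int i, member, slot = 0;

        for (i = 0; i < CACHE_SIZE; i++) {
                if (ngcache[i].when != 0 &&
                    strcmp(ngcache[i].host, host) == 0 &&
                    strcmp(ngcache[i].group, group) == 0 &&
                    now - ngcache[i].when < CACHE_TTL)
                        return (ngcache[i].member);     /* still fresh */
                if (ngcache[i].when <= ngcache[slot].when)
                        slot = i;                       /* oldest slot */
        }

        /* No fresh cached answer: do the (possibly expensive) lookup
         * and cache whatever we get, including a 'no'. */
        member = innetgr(group, host, NULL, NULL);
        strncpy(ngcache[slot].host, host, sizeof (ngcache[slot].host) - 1);
        strncpy(ngcache[slot].group, group, sizeof (ngcache[slot].group) - 1);
        ngcache[slot].member = member;
        ngcache[slot].when = now;
        return (member);
}

int
main(void)
{
        /* Made-up host and netgroup names, purely for demonstration. */
        printf("member: %d\n", host_in_netgroup("apps0.cs", "nfs-clients"));
        printf("member: %d\n", host_in_netgroup("apps0.cs", "nfs-clients"));
        return (0);
}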
(The combination of the kernel authorization cache with no timeout and this mountd netgroup
lookup cache means that if you use netgroups for NFS access control,
a single lookup failure (for whatever reason) may have wide-ranging
effects if it happens at the wrong time. A glitch or two during a
revalidation storm could give you a whole lot of basically permanent
negative entries, as we've seen but not
previously fully understood.)
Where to find OpenSolaris code for all this
I'm going to quote paths relative to usr/src, which is the (relative) directory where OpenSolaris puts all code in its repository.
The mountd source is in cmd/fs.d/nfs/mountd. Inside mountd:
- the RPC mount handling code is in mountd.c:mount(). It checks NFS mount permissions as a side effect of calling the helpfully named getclientsflavors_new() or getclientsflavors_old() functions.
- the kernel upcalls are handled by nfsauth.c:nfsauth_access(), which calls mountd.c:check_client() to do the actual permission checking.
- the netgroup cache handling is done in netgroup.c:cache_check(), which is called from netgroup_check().
The kernel side of the upcall handling is in uts/common/fs/nfs, as
mentioned earlier. The actual upcalling
and cache management happens in nfs_auth.c:nfsauth_cache_get(),
using Solaris doors as the IPC mechanism between mountd and the
kernel.
2010-03-16
The Solaris 10 NFS server's caching of filesystem access permissions
Earlier, I mentioned that modern NFS
servers don't have a comprehensive list of NFS filesystem access
permissions stored in the kernel; instead they have a cache and
some sort of upcall mechanism where the kernel will ask mountd
if a given client has access to a given filesystem if necessary.
I've recently been investigating how Solaris 10 handles this, so
here's what I know of the kernel authorization cache:
First, the Solaris kernel does cache negative entries (this IP address is not allowed to access this filesystem at all). This turns out to be fairly dangerous, because the cache has no timeout. If a negative entry is ever checked and cached, it will stay there until you flush the filesystem's cache entirely.
(The same is true of positive entries that you want to get rid of, either because you've removed a client's authorization or because you want to change how the filesystem is exported to it; part of the cache entry is whether the client has read-write or read-only access, and whether root is remapped or not. Or just because a machine has changed IP address and you want to get rid of any permissions that the old IP address has.)
The overall cache has no size limit at all, beyond a general one set by kernel memory limits. It will get shrunk if the kernel needs to reclaim memory, but even then no entry less than 60 minutes old will be removed. In our environment, such cache reclaims appear to be vanishingly uncommon (ie, completely unseen), based on kernel stats.
There is a separate auth cache for each exported filesystem. As far as I can tell, a filesystem's auth cache is discarded entirely if it is unshared or reshared, including if it is reshared with the same sharing settings. It otherwise effectively never expires entries. Flushing a filesystem's auth cache causes every client to be revalidated the next time that they make an NFS request to that filesystem.
Because all of this is only in kernel memory, all auth caches are lost if the system reboots. Thus on fileserver reboot all clients are revalidated for all filesystems on a rolling basis, as each client tries to do NFS to each filesystem that they have mounted. This may provoke a storm of revalidations after the reboot of a popular fileserver with a bunch of clients.
The cache is populated by upcalling to mountd on hopefully infrequent
demand (through mechanisms that are beyond the scope of this entry). If
mountd answers properly, its answer of the moment, whatever that is,
gets cached. There are presumably timeout and load limits on these
upcalls, but I don't understand (Open)Solaris code well enough yet to
find them. (I hope that more than one upcall can be in progress at
once.)
Sidebar: Getting cache stats
This is for the benefit of people (such as me) poking around with
mdb -k. The internal NFS server auth cache stats are in
three variables: nfsauth_cache_hit, nfsauth_cache_miss, and
nfsauth_cache_reclaim, which counts how many times a reclaim
has been done (but not how many entries have been reclaimed).
To see them (in hex) one uses the mdb command:
nfsauth_cache_hit ::print
The code for most of this is in nfs_auth.c in
usr/src/uts/common/fs/nfs; see also nfs_export.c, which has the
overall NFS server export list.
2010-03-07
Why I don't expect third-party support for OpenSolaris
One of the common reactions to Oracle's potentially ambivalent attitude towards providing OpenSolaris support is that since OpenSolaris is open source, third parties can spring up to provide support for it even if Oracle doesn't. However, I'm fairly pessimistic about the chances of this; even if OpenSolaris itself becomes reasonably popular, I don't think that we'll ever see an OpenSolaris equivalent of Red Hat or Canonical.
There are two reasons for this. One of them is the difference between forking code and merely supporting it, which comes down to your ability to get your bugfixes accepted upstream. My impression to date is that in practice there are relatively few outside contributors to OpenSolaris and that it is hard to get changes accepted upstream. This pushes anyone attempting to do OpenSolaris support towards de facto forking OpenSolaris, which is expensive and thus makes you unprofitable.
(Some casual searching didn't turn up any information about the rate of outside contributions to OpenSolaris that's more recent than 2008, when the news wasn't good. Certainly the OpenSolaris repository shows very few signs of contributions from outside developers, and there is no sign that the practices described in 2008 have changed much. Note that pushing changes upstream is hard at the best of times; you can imagine how much worse this gets if the upstream is not really interested in the whole business of outside contributions, especially if something is going to require significant amounts of effort and time from upstream developers.)
The other reason is more subtle. In order to really support code, you must have good programmers who understand it. With Sun not really being very enthusiastic about outside contributions, there are not many people like that outside of Sun (or, well, outside of Sun before Oracle took over and people started leaving). In addition, your good OpenSolaris programmers are probably going to face the constant temptation of taking a job with Oracle where they can actually work directly on OpenSolaris; the better they are and the more passionate about OpenSolaris they are, the higher the temptation. The less expert your programmers are, the less attractive your support is, since you can't diagnose and fix people's problems as fast or as well.
(And if you can find good expert OpenSolaris programmers right now it's pretty likely that they're quite passionate, given the obstacles to acquiring that expertise.)