2007-01-23
How to stop DiskSuite resyncing a mirror on Solaris 8
One of our irritations with DiskSuite has been that it has no way to abort a mirror resync short of forcing drive errors, something that we actually had to resort to once and which is not always possible, for example when your drives are logical LUNs on FibreChannel RAID controllers.
As we found out recently, fortunately not quite the hard way, there
actually is a way to abort a mirror resync. If your resyncing mirror
is in a diskset, running 'metaset -s <set> -r' to release the diskset
(so you can fail it over to another machine) will first abort the
resync, and then the metaset command itself will fail.
(It fails with the helpful, ever so explanatory error message 'metaset:
<host>: Device busy'. If you repeat the metaset command, it
will work and actually release the diskset, although your higher-level
failover tools may be in a bit of a tizzy at this point. The machine
that picks up the diskset seems to automatically restart the resync,
although it starts again from scratch.)
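Since it is the second attempt that actually releases the diskset, a failover script could simply try the release twice. Here is a minimal sketch of that idea; run_twice_if_needed is a made-up helper name (nothing DiskSuite ships), and 'myset' in the usage comment is a placeholder diskset name.

```shell
# Hypothetical helper: run a command, and if it fails, run it exactly
# once more. The exit status is that of the last attempt.
run_twice_if_needed() {
    "$@" && return 0
    echo "first attempt failed, retrying: $*" >&2
    "$@"
}

# Usage (assumes a diskset called 'myset'; substitute your own):
#   run_twice_if_needed metaset -s myset -r
```

This does nothing about the resync restarting from scratch on the machine that takes over the diskset; it only papers over the 'Device busy' failure on the first release.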
I can sort of see why DiskSuite behaves this way, but its current behavior is annoyingly half-hearted. Regardless of what you really wanted to happen in this situation, you wind up with about half of it.
(My grumpiness is increased because I suspect the only reason
metaset aborts the resync is that otherwise you would have no
way of releasing the diskset until the resync had finished.
Which is all DiskSuite's fault, for not having a way of explicitly
aborting resyncs.)
2007-01-09
Solving an automounter timeout problem with brute force
Our central mail machine runs various cron jobs as part of its work. Starting recently, every now and then a cron job (or a command run out of an alias) would randomly die with an error like:
sh: /cs/foo/adm/script: cannot execute
(Where /cs/foo is NFS mounted through the automounter, and the cron
entry just runs that script.)
I am pretty sure that this is a gift from the Solaris 8 automounter.
Our central mail machine is pretty old and pokey, and we recently switched to a new method of authenticating NFS mounts that requires an ssh callback. So my operating theory is that this is the charmingly non-specific error you get when the NFS mount reply is too slow in coming and the automounter just gives up.
My current brute force solution is a little script I call 'keepmounted':
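# Park a long-sleeping process in each directory so the automounter
# sees the filesystem as busy and never unmounts it.
for i in "$@"; do
    nohup sh -c 'cd "$1" && (while :; do sleep 604800; done)' sh "$i" >/dev/null 2>&1 </dev/null &
done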
(The sleep value is more or less arbitrary.)
Then I just ran it for every automounted filesystem that we saw problems with and moved on to other fires. (Yes, at some point I need a better solution, but the machine is rebooted only rarely and we're working on replacing it anyways.)
(This sort of cheap hack is a surprisingly common occurrence in system administration. Sometimes a bandaid is really the best solution.)
2007-01-01
Solaris's impressive ABI compatibility
There are some things that Solaris is very good at; one of them is user-level ABI compatibility (at least for basic programs). As an illustration of just how good it is, I only recently noticed that on our Solaris 8 machines I am still routinely using some dynamically linked programs compiled in August of 1993 (which is probably when this group started using Solaris machines).
I hadn't noticed before now because the programs hadn't really changed since then (so I had no need to upgrade them), and because the compiled versions just kept on working so I didn't have to pay attention to them.
(Until recently, my Solaris version of rc was a statically linked
binary from March 1994. I can't claim this as an unqualified success,
because I was goaded into replacing it with a current version by an
obscure glitch in some circumstances. But it's pretty striking that
I didn't have any problems in normal use.)
My Solaris binaries that use X have fared less well, although this may be because X environments themselves have changed significantly since 1994. (For example, back then programs could get away with not dealing with TrueColor displays, especially 24-bit and 32-bit ones.)
(The program itself starts up, but fails with a BadMatch 'invalid parameter attributes' error on an X_PolyFillRectangle call. I wonder if I can dig up an 8-bit PseudoColor display somewhere around here to test it against; unfortunately Xnest can only force a PseudoColor visual as the default visual if the underlying X server has one to start with, and modern X servers and hardware don't seem to.)