Today's Solaris 10 irritation: the fault manager daemon
More and more, Solaris 10 strikes me as being much like Ubuntu 6.06: a
system with plenty of big ideas but only half-finished implementations.
Today's half-implemented idea is fmd, the new fault manager daemon.
One of the things I expect out of a fault monitoring system is that it
should not report things as faulted when they are now fine, especially
not with scary messages that get dumped on the console at every boot
(it's acceptable to report them as faulted and now better, provided that
you only do it once). As I discovered today, under some circumstances
involving ZFS pools and iSCSI, fmd falls down on this; I got verbose
error messages about missing pools (that were there and fine) dumped to
the console (and syslog) on every boot.
Unfortunately, I couldn't find any simple way to clear these errors.
There is probably a magic fmadm flush incantation, but I couldn't find
the right argument, and doing fmadm reset on the two ZFS modules that
fmadm config reported didn't do anything. I had to resort to picking
event UUIDs out of fmadm faulty output and running fmadm repair on
each of them.
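
The UUID-picking workaround is tedious by hand, so here is a minimal Python sketch of how it might be automated. It is not anything Sun ships: the names extract_uuids and repair_all are made up for illustration, and it assumes that event UUIDs appear in fmadm faulty output in the usual 8-4-4-4-12 hex form and that a reasonably modern Python 3 is available on the machine. By default it only prints the commands it would run.

```python
import re
import subprocess

# Assumption: fmd event UUIDs use the standard 8-4-4-4-12 hex layout.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"
)


def extract_uuids(text):
    """Return the unique UUIDs found in text, in order of first appearance."""
    seen = []
    for uuid in UUID_RE.findall(text):
        if uuid not in seen:
            seen.append(uuid)
    return seen


def repair_all(dry_run=True):
    """Run 'fmadm repair' on every UUID that 'fmadm faulty' mentions.

    With dry_run=True (the default) the repair commands are only printed.
    """
    faulty = subprocess.run(
        ["fmadm", "faulty"], capture_output=True, text=True, check=True
    ).stdout
    for uuid in extract_uuids(faulty):
        cmd = ["fmadm", "repair", uuid]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling repair_all() as root shows what would be repaired; repair_all(dry_run=False) actually does it. Since this marks every listed fault as repaired, it is only sensible when you already know that all of them are stale.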
(And why didn't Sun give the fault manager an option to send email to someone when faults happen? I'd have thought that that would be basic functionality, and it would make it actually useful for us.)
Sidebar: How I got fmd to choke this way
I ran a test overnight that hung the iSCSI target machine, which caused the Solaris machine to reboot and then hang during boot. In the process of straightening all of this out there was a time when the iSCSI machine was refusing connections, which caused the Solaris machine to finally boot but with none of the ZFS pools available. When I brought the iSCSI machine back up, the pools reappeared but the fault manager had somehow latched on to the original 'pool not present' events and kept repeating them.