Today's Solaris 10 irritation: the fault manager daemon

May 7, 2008

More and more, Solaris 10 strikes me as being much like Ubuntu 6.06: a system with plenty of big ideas but only half-finished implementations. Today's half-implemented idea is fmd, the new fault manager daemon.

One of the things I expect out of a fault monitoring system is that it should not report things as faulted when they are now fine, especially not with scary messages that get dumped on the console at every boot (it's acceptable to report them as faulted and now better, provided that you only do it once). As I discovered today, under some circumstances involving ZFS pools and iSCSI, fmd falls down on this; I got verbose error messages about missing pools (that were there and fine) dumped to the console (and syslog) on every boot.

Unfortunately, I couldn't find any simple way to clear these errors. There is probably a magic fmadm flush incantation, but I couldn't find the right argument, and doing fmadm reset on the two ZFS modules that fmadm config reported didn't do anything. I had to resort to picking event UUIDs out of fmadm faulty output and running fmadm repair on each one.
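For what it's worth, the 'pick out UUIDs and repair each one' chore can be scripted. What follows is only a rough sketch, not a blessed procedure: it assumes that the event UUIDs in fmadm faulty output are in the usual 8-4-4-4-12 hex form and that every UUID mentioned is one you actually want repaired. The function names (extract_uuids, repair_all) are made up for illustration, and it needs Python 3.7 or later.

```python
import re
import subprocess

# Matches the standard 8-4-4-4-12 hexadecimal UUID form. That fmadm's
# event UUIDs look like this is an assumption of this sketch.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def extract_uuids(text):
    """Return the distinct UUIDs found in text, in order of first appearance."""
    seen = []
    for uuid in UUID_RE.findall(text):
        if uuid not in seen:
            seen.append(uuid)
    return seen


def repair_all(dry_run=True):
    """Run 'fmadm repair <uuid>' for every UUID that 'fmadm faulty' mentions.

    With dry_run=True (the default) the commands are only printed, not run.
    Returns the list of commands that were (or would have been) run.
    """
    faulty = subprocess.run(
        ["fmadm", "faulty"], capture_output=True, text=True, check=True
    ).stdout
    commands = [["fmadm", "repair", uuid] for uuid in extract_uuids(faulty)]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling repair_all() as root prints what it would do; repair_all(dry_run=False) actually runs the repairs. The dry run is the default on purpose, since blindly marking everything as repaired is only sensible when you know the faults are stale.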

(And why didn't Sun give the fault manager an option to send email to someone when faults happen? I'd have thought that that would be basic functionality, and it would make it actually useful for us.)
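In the absence of built-in email, one workaround is to run something like the following from cron. This is a minimal sketch under some loud assumptions: that fmadm faulty prints nothing at all when there are no faults (if your version prints a header regardless, the emptiness check needs to be smarter), and that there is an SMTP server listening on localhost. The function names and the addresses you pass in are placeholders, not anything Sun provides.

```python
import smtplib
import socket
import subprocess
from email.message import EmailMessage


def build_fault_mail(report, sender, recipient, hostname):
    """Return an EmailMessage carrying the report, or None if it is empty."""
    if not report.strip():
        return None
    msg = EmailMessage()
    msg["Subject"] = f"fmd faults on {hostname}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(report)
    return msg


def mail_faults(sender, recipient, smtp_host="localhost"):
    """Run 'fmadm faulty' and mail its output if there is any.

    Assumes empty output means 'no faults'. Returns True if mail was sent.
    """
    report = subprocess.run(
        ["fmadm", "faulty"], capture_output=True, text=True, check=True
    ).stdout
    msg = build_fault_mail(report, sender, recipient, socket.gethostname())
    if msg is None:
        return False
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
    return True
```

Note that this mails the full report on every run for as long as a fault stays listed, which is the same 'repeat it forever' behaviour I was complaining about; a real version would want to remember what it has already reported.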

Sidebar: How I got fmd to choke this way

I ran a test overnight that hung the iSCSI target machine, which caused the Solaris machine to reboot and then hang during boot. In the process of straightening all of this out there was a time when the iSCSI machine was refusing connections, which caused the Solaris machine to finally boot but with none of the ZFS pools available. When I brought the iSCSI machine back up, the pools reappeared but the fault manager had somehow latched on to the original 'pool not present' events and kept repeating them.

Comments on this page:

From at 2008-06-30 12:34:40:

Absolutely! It seems like email support would have been easy to add, and it sure would make things a whole lot easier. Since I like to get email when faults occur, I wrapped `fmadm faulty' in a shell script that sends email. If you are interested in using it, you can retrieve it here:

Written on 07 May 2008.

Last modified: Wed May 7 23:23:14 2008