Wandering Thoughts archives


Today's Solaris 10 irritation: the fault manager daemon

More and more, Solaris 10 strikes me as being much like Ubuntu 6.06: a system with plenty of big ideas but only half-finished implementations. Today's half-implemented idea is fmd, the new fault manager daemon.

One of the things I expect out of a fault monitoring system is that it should not report things as faulted when they are now fine, especially not with scary messages that get dumped on the console at every boot (it's acceptable to report them as faulted and now better, provided that you only do it once). As I discovered today, under some circumstances involving ZFS pools and iSCSI, fmd falls down on this; I got verbose error messages about missing pools (that were there and fine) dumped to the console (and syslog) on every boot.

Unfortunately, I couldn't find any simple way to clear these errors. There is probably a magic fmadm flush incantation, but I couldn't find the right argument, and doing fmadm reset on the two ZFS modules that fmadm config reported didn't do anything. I had to resort to picking event UUIDs out of fmadm faulty output and running fmadm repair on each one.
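For what it's worth, the pick-out-the-UUIDs-and-repair-each-one chore can be scripted. What follows is only a minimal sketch, not a tested recipe: it assumes that event UUIDs show up in fmadm faulty output in the usual 8-4-4-4-12 hex form (the exact output layout varies between Solaris releases, so it just pattern-matches anything UUID-shaped), and it assumes a Python 3.7 or later interpreter, which a stock Solaris 10 install does not ship. The function names extract_uuids and repair_all are made up for illustration.

```python
import re
import subprocess

# Anything shaped like a UUID (8-4-4-4-12 hex digits). This is an assumption
# about how event UUIDs appear in 'fmadm faulty' output, not a documented format.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def extract_uuids(text):
    """Return the unique UUID-shaped strings in text, in first-seen order."""
    seen = []
    for match in UUID_RE.findall(text):
        uuid = match.lower()
        if uuid not in seen:
            seen.append(uuid)
    return seen


def repair_all(dry_run=True):
    """Run 'fmadm repair' on every UUID found in 'fmadm faulty' output.

    With dry_run=True (the default) the commands are only printed.
    """
    faulty = subprocess.run(
        ["fmadm", "faulty"], capture_output=True, text=True, check=True
    ).stdout
    for uuid in extract_uuids(faulty):
        cmd = ["fmadm", "repair", uuid]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling repair_all() prints the fmadm repair commands it would run; repair_all(dry_run=False), run as root, actually executes them. Since this marks every listed fault as repaired, it is only sensible when you already know all of the reported faults are stale.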

(And why didn't Sun give the fault manager an option to send email to someone when faults happen? I'd have thought that that would be basic functionality, and it would make it actually useful for us.)

Sidebar: How I got fmd to choke this way

I ran a test overnight that hung the iSCSI target machine, which caused the Solaris machine to reboot and then hang during boot. In the process of straightening all of this out there was a time when the iSCSI machine was refusing connections, which caused the Solaris machine to finally boot but with none of the ZFS pools available. When I brought the iSCSI machine back up, the pools reappeared but the fault manager had somehow latched on to the original 'pool not present' events and kept repeating them.

solaris/FaultManagerIrritation written at 23:23:14

By day for May 2008: 1 2 3 4 5 6 7 8 9 11 12 13 15 17 18 19 20 21 23 24 25 26 28 29 30 31; before May; after May.

