Wandering Thoughts archives

2008-06-27

Fault hierarchies and problem reports

Here is something that I have come to feel strongly about: things that report problems (as opposed to just logging them) should have some idea of root causes and a fault hierarchy. Then, when you report things, you should report the root cause you've found and mention the consequences only as a side note, instead of screaming about every consequence.

(As a not entirely hypothetical example, it does no good to spam me with notices about lots of ZFS pools being unavailable when the real problem is that the system can't find any network interfaces at boot time, so it can't make any iSCSI connections, so there are no pool devices.)
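To make the idea a bit more concrete, here is a minimal sketch in Python of what 'some idea of a fault hierarchy' could look like for the example above. Everything in it is an assumption for illustration: the component names, the DEPENDS_ON map, and the root_causes() and format_report() helpers are all hypothetical, and a real tool would have to discover dependencies instead of hardcoding them. The sketch also assumes the dependency map has no cycles.

```python
# Hypothetical dependency map: each component lists what it directly
# depends on. The names are invented for this example.
DEPENDS_ON = {
    "network-interfaces": [],
    "iscsi-connections": ["network-interfaces"],
    "pool-devices": ["iscsi-connections"],
    "zfs-pool-tank": ["pool-devices"],
    "zfs-pool-home": ["pool-devices"],
}


def root_causes(failed, depends_on):
    """Map each root-cause failure to the sorted failures it explains.

    A failed component counts as a root cause if none of its direct
    dependencies have failed. Assumes depends_on is acyclic.
    """
    failed = set(failed)

    def roots_of(component):
        failed_deps = [d for d in depends_on.get(component, []) if d in failed]
        if not failed_deps:
            return {component}
        roots = set()
        for dep in failed_deps:
            roots |= roots_of(dep)
        return roots

    report = {}
    for component in sorted(failed):
        for root in roots_of(component):
            consequences = report.setdefault(root, [])
            if root != component:
                consequences.append(component)
    return report


def format_report(report):
    """Lead with each root cause; mention consequences as a side note."""
    lines = []
    for root, consequences in sorted(report.items()):
        lines.append(f"PROBLEM: {root} failed")
        if consequences:
            lines.append(f"  (also affected as a result: {', '.join(consequences)})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Every check failed, but only one of them is the real problem.
    print(format_report(root_causes(DEPENDS_ON.keys(), DEPENDS_ON)))
```

With all five components failing, this produces one problem report about the network interfaces and a single side note listing everything downstream, instead of five separate alarms. The hard part in practice is not this traversal but knowing the dependencies in the first place.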

Yes, this is difficult and challenging. But it is the job you took on when you decided to write something that actively shoved reports of problems at people. If you cannot do a good job of it, you need to stick to just logging things; this is one of the areas where a tool that does only a half-hearted job can be worse than no tool, because it is trivial to generate an avalanche of surface errors from a single important root cause.

(By 'reporting' I mean aggressively forcing things in front of people through a variety of methods, from dumping messages on the system console to sending email. In short, anything that could interrupt people. Yes, dumping messages on the console is interrupting people; consider what happens to the poor sysadmin who is trying to get the system going again when you dump ten screens of error messages on their session.)

tech/ReportingAndFaultHierarchies written at 00:49:08

