Wandering Thoughts archives

2010-03-15

How to create pointless error reports (and how not to)

Linux's little love notes about software RAID consistency errors make a perfect example of something that system administrators run into all the time: pointless error reports.

It's worth noting that a pointless error report is something different from a useless error report. A useless error report tells you that something has gone wrong but doesn't identify what it is, what exactly has gone wrong, and so on; you have to hunt that down on your own. A pointless error report shouldn't even have been generated in the first place, at least not in the form that you get it in. Noise from monitoring systems is one form of pointless error reports.

So what makes a pointless error report? The aforementioned software RAID errors have at least three things wrong with them, namely that the error happens all the time, that the 'error' is actually (in theory) something that happens routinely, and that there's nothing you can do about the error in practice. Complaining about non-errors that happen all the time that you can't do anything about anyways is pretty much the jackpot in terms of pointless error reports.

We can turn this around to create a list of what makes a good error report for sysadmins:

  • it is complaining about a real error (not a routine and theoretically harmless event)
  • ... that does not happen all the time
  • ... that is actively dangerous
  • ... that you can (and should) do something about
  • it contains a clear description of what is wrong
  • it contains all of the details about the situation that are known, provided that those details are useful for resolving the problem (and not merely useful for debugging the code)

Things that fail some of these criteria may be useful to log and capture for historical purposes, but they do not rise to the level of useful error reports. Failing any of the first four points makes an error report pointless; failing the last two makes it more or less useless.
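As a rough illustration, the criteria above can be written down as a small classifier. This is only a sketch: the `Event` fields and the `classify` function are hypothetical names invented for this example, and in practice answering each yes/no question about an event is the hard part, not combining the answers.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical yes/no answers to the six criteria in the list above."""
    real_error: bool         # not a routine, theoretically harmless event
    rare: bool               # does not happen all the time
    dangerous: bool          # actively dangerous
    actionable: bool         # you can (and should) do something about it
    clear_description: bool  # says plainly what is wrong
    useful_details: bool     # carries the details needed to resolve it


def classify(event: Event) -> str:
    """Return 'pointless', 'useless', or 'good' for a would-be error report."""
    # Failing any of the first four criteria makes the report pointless.
    if not (event.real_error and event.rare
            and event.dangerous and event.actionable):
        return "pointless"
    # Failing either of the last two makes it more or less useless.
    if not (event.clear_description and event.useful_details):
        return "useless"
    return "good"


# The software RAID consistency complaint: routine, constant, and not
# something you can act on. (dangerous=False is an assumption made for
# this example; the result is 'pointless' either way.)
raid_mismatch = Event(
    real_error=False, rare=False, dangerous=False, actionable=False,
    clear_description=True, useful_details=True,
)
print(classify(raid_mismatch))
```

Anything that comes out as 'pointless' here may still be worth logging for historical purposes; it just should not be sent to a person as an error report.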

I include 'is actively dangerous' on the list of important points because there are always things happening on any system that might be worthy of note, for example people trying brute force attacks on your ssh port. What should create error reports is not merely something wrong, but something that is bad enough that it needs to be dealt with. Someone failing to get in to your system with ssh is not worthy of a report; someone ssh'ing in to root and getting the password right but being refused access because you have PermitRootLogin set to no in the sshd configuration, now that is worthy of an error report.

sysadmin/GoodErrorReports written at 01:45:05

