Wandering Thoughts archives

2011-04-25

An important note about multi-line log message formats

I was vaguely planning to write a blog entry using some current stats from our anti-spam system (either 'how many connections came from IP addresses listed in the Spamhaus Zen' or 'how correlated are the results of the Spamhaus Zen and our commercial filtering software'). Then I went and looked at the format of the logs from our mail system, which it turned out were not set up to make this at all easy.

In our configuration, Exim logs a warning message when a connection from a DNSBL-listed IP address first reaches the RCPT TO: phase of the SMTP conversation. It also logs a line when it accepts a message. Unfortunately, there is nothing to uniquely and simply tie these two lines together, such as a process ID.

Which leads me to the important note:

Any time you log multiple lines for a single thing, you should make it easy to associate all the lines together.

(Also, you should make this happen by default in your logging configuration.)

Generally this means that all of the log messages should include some unique key (for bonus points, have them all include it in the same spot in the line). What is a suitable unique key depends on how your system works; in many situations the process ID is the right key (and besides, sysadmins like to know it in general because other log messages often mention the PID). In some circumstances you may have to invent your own unique key somehow.
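As a minimal sketch of the idea (not how Exim or any particular mail system does it), here is one way to stamp every line for a single connection with the same key, in the same spot, using Python's standard `logging` module. The logger name `mailer-demo`, the `session` field, the key `conn-0001`, and the message texts are all made up for illustration.

```python
import io
import logging
import uuid


def make_session_logger(base, session_id=None):
    """Return (adapter, key); every line logged via the adapter carries the key."""
    # Invent a key if the caller has no natural one (such as a PID).
    key = session_id or uuid.uuid4().hex[:12]
    return logging.LoggerAdapter(base, {"session": key}), key


def demo():
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    # The key always sits in the same spot: right after the level.
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(session)s] %(message)s")
    )
    base = logging.getLogger("mailer-demo")  # hypothetical logger name
    base.setLevel(logging.INFO)
    base.addHandler(handler)
    base.propagate = False

    # Two separate log lines about the same connection share one key.
    log, _key = make_session_logger(base, "conn-0001")
    log.warning("DNSBL hit at RCPT TO")
    log.info("message accepted")

    base.removeHandler(handler)
    return stream.getvalue().splitlines()
```

Because the key is attached by the adapter rather than typed into each message, it is hard to forget it on any individual line, which is roughly what 'by default' should mean here.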

A related issue for multi-line logging is that you should also make it clear when a sequence of messages has ended. Sometimes this is already inherent in what you're logging, but sometimes this needs a new 'end of activity' message. This message may seem redundant and pointless, but it has an important purpose; it lets log processing software know that it can now discard all of the tracking information it was keeping for that 'session'. Without an end message, log processing software either has to resort to heuristics for when it can throw away tracking information or allow its memory usage to grow endlessly. (Often software effectively uses both at once by having very conservative heuristics.)

(Exim does have an 'end of processing' log message for individual mail messages.)
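To illustrate why the end message matters to log processing software, here is a rough sketch of a correlator that groups lines by key and drops its tracking state as soon as it sees the end marker. The `<key> <message>` line format and the literal `end` marker are assumptions invented for this example; they are not Exim's real log format.

```python
def correlate(lines, end_marker="end"):
    """Group log lines by key; release a session when its end message appears.

    Assumes a made-up format: '<key> <message>'.
    Returns (completed, still_open).
    """
    open_sessions = {}  # key -> list of messages seen so far
    completed = []
    for line in lines:
        key, _, message = line.partition(" ")
        open_sessions.setdefault(key, []).append(message)
        if message == end_marker:
            # The end message is what lets us free the tracking state.
            completed.append((key, open_sessions.pop(key)))
    return completed, open_sessions
```

Any session that never logs an end message stays in `still_open` indefinitely, which is exactly the point where real software has to fall back on timeouts or other heuristics.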

ImportantLogMessageNote written at 01:21:22

2011-04-03

Please don't alert based on percentages

One of the classic mistakes made by monitoring and alerting systems is to alert based on percentages; if something registers at 90% or 95% or whatever, it raises various sorts of alerts. This is a terrible mistake.

(The people who write these monitoring systems love percentage based alerts because they're so easy to do, which in my cynical view is why lots of monitoring systems ship with them.)

The easiest way to see the problem of percentage based alerts is to consider disk space monitoring. Suppose the system alerts based on a filesystem reaching 95% full. Does this give you useful information?

Well, no. First, it doesn't tell you how much disk space is left. 95% full on a 50 GByte filesystem is very much different than 95% full on a 1.5 TByte filesystem; in fact, at 95% used the 1.5 TB filesystem has more free space than the entire 50 GByte filesystem ever had. Filesystem space is one of those cases where you usually care more about absolute numbers than about percentages.
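The arithmetic is easy to check. The sketch below uses decimal units (1.5 TB taken as 1500 GB), and the 10 GB minimum-free threshold is an arbitrary number picked purely for illustration.

```python
def free_gb(size_gb, percent_used):
    """Free space in GB for a filesystem of size_gb at percent_used full."""
    return size_gb * (100 - percent_used) / 100


def low_on_space(size_gb, percent_used, min_free_gb):
    """Alert on an absolute amount of free space rather than a percentage."""
    return free_gb(size_gb, percent_used) < min_free_gb


small = free_gb(50, 95)    # 2.5 GB free
large = free_gb(1500, 95)  # 75 GB free, more than the whole 50 GB filesystem

# With a hypothetical 10 GB floor, only the small filesystem alerts,
# even though both are '95% full'.
small_alerts = low_on_space(50, 95, min_free_gb=10)
large_alerts = low_on_space(1500, 95, min_free_gb=10)
```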

Second, even simple space used doesn't actually tell you if you should panic. What generally matters is not that some quantity has reached an arbitrary value; what matters is whether or not you are going to run out of capacity at some point in the near future. To have some idea of that, you need to know not just the current capacity left but how fast capacity is being consumed. 50 Gbytes free at a space growth rate of 256 Mbytes a day is very different from 50 Gbytes free at a space growth rate of 10 Gbytes a day; you can ignore the former (unless you have a very long lead time on getting space), but you really want to pay attention to the latter because you only have a few days left to get more space.

(Similarly you care both about the long term trend rate and any short term deviations from it because both of them can cause you problems.)
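As a rough sketch of alerting on time remaining instead of on a raw level, the code below works through the two growth rates from the example. It assumes 1 Gbyte = 1024 Mbytes and a constant growth rate, and the 30-day lead time is a made-up figure; a real system would have to estimate the rate from history and watch for short-term deviations as well.

```python
def days_until_full(free_mb, growth_mb_per_day):
    """Days of headroom left at a constant growth rate."""
    if growth_mb_per_day <= 0:
        return float("inf")  # not growing, so never fills up
    return free_mb / growth_mb_per_day


def needs_attention(free_mb, growth_mb_per_day, lead_time_days):
    """True if space would run out before more could be obtained."""
    return days_until_full(free_mb, growth_mb_per_day) <= lead_time_days


FREE_MB = 50 * 1024  # 50 Gbytes free

slow = days_until_full(FREE_MB, 256)        # 200 days of headroom
fast = days_until_full(FREE_MB, 10 * 1024)  # 5 days of headroom

# Hypothetical 30-day lead time on getting more disk space.
slow_alerts = needs_attention(FREE_MB, 256, lead_time_days=30)
fast_alerts = needs_attention(FREE_MB, 10 * 1024, lead_time_days=30)
```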

Similar issues apply to pretty much any other metric you may be monitoring. Doing useful alerts about capacity problems is just not amenable to simple percentage based solutions, because such solutions are not answering a useful question. If you want to make useful alerts, they should generally at least be based on intelligently chosen absolute numbers.

NoAlertOnPercentages written at 01:55:20

