Monitoring your logs is mostly a tarpit

August 6, 2023

One of the reactions to my entry on how systemd's auto-restarting of units can hide problems was a suggestion that we should monitor our logs to detect things like this. As it happens, one of my potentially unpopular views is that monitoring your logs is generally a tarpit that isn't worth it. Much of the time you'll put in a great deal of effort and get very little of worth out of it.

The fundamental problem with general log monitoring is that logs are functionally unstructured. Anything can appear in them, and in a sufficiently complex environment, anything eventually will. This unstructured randomness means that sorting general signal from general noise is a large and often never-ending job, and if you don't do it well, you wind up with a lot of noise (which makes it almost impossible in practice to spot the signal).

One thing you can, in theory, sensibly monitor your logs for is specific, narrow signals of things of interest; for example, you might monitor Linux kernel logs for machine check error messages. The first problem with monitoring this way is that there's no guarantee that the message you're monitoring for won't change. Maybe someday the Linux kernel developers will decide to put more information in their MCE messages and change the format. One reason this happens is that almost no one considers log messages to be an API, and so they feel free to change log messages at whim.
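To make the fragility concrete, here is a minimal sketch of this sort of narrow-signal matching. The pattern and the sample log lines are my illustrative assumptions about what one kernel version's MCE output might look like, not a verified format; a kernel upgrade that rewords the message silently breaks the match, which is the point.

```python
import re

# Assumed message shape, based on one plausible kernel MCE format.
# Nothing guarantees this stays stable across kernel versions.
MCE_PATTERN = re.compile(r"mce: \[Hardware Error\]")

def is_mce_line(line: str) -> bool:
    """Return True if a kernel log line looks like a machine check error."""
    return bool(MCE_PATTERN.search(line))

# Hypothetical sample lines for illustration.
sample = "mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5"
noise = "usb 1-1: new high-speed USB device number 4 using xhci_hcd"
```

If the kernel developers ever reword "[Hardware Error]", this monitor stops firing and nothing tells you so.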

(But in the meantime, maybe you'll derive enough value or reassurance from looking for the current MCE messages in your kernel logs. It's a tradeoff, which I'll get to.)

The second problem with monitoring for specific narrow signals of interest in your logs is that you have to know what they look like. It's easy to say that we'll monitor for the Prometheus host agent crashing and systemd restarting it, but it's much harder to be sure that we have properly identified the complete collection of log messages that signal this happening. Remember, log messages are unstructured, which means it's hard to get a complete inventory of what to look for short of things like reading the (current) program source code to see what it logs.
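The inventory problem can be sketched the same way. The patterns below are plausible shapes for systemd's restart-related messages, invented for illustration rather than taken from a verified, complete list; the question you can never quite answer is whether the list covers every way the event shows up in the logs.

```python
import re

# Illustrative patterns for "systemd restarted a unit". This is a guess at
# an inventory, not a complete one; messages I haven't seen (or that a
# future systemd emits) simply fall through unmatched.
RESTART_PATTERNS = [
    re.compile(r"Main process exited, code=\w+"),
    re.compile(r"Scheduled restart job, restart counter is at \d+"),
    re.compile(r"Failed with result "),
]

def looks_like_restart(line: str) -> bool:
    """Return True if a log line matches any known restart message shape."""
    return any(p.search(line) for p in RESTART_PATTERNS)

# Hypothetical journal lines for illustration.
crash = "node_exporter.service: Main process exited, code=killed"
quiet = "Reached target Multi-User System."
```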

Finally, all of this potential effort only matters if identifiable problems appear in your logs on a sufficiently regular basis and it's useful to know about them. In other words, problems that happen, that you care about, and probably that you can do something about. If a problem was probably a one-time occurrence or occurs infrequently, the payoff from automated log monitoring for it is likely to be quite low (you can see this as an aspect of how alerts and monitoring can never be comprehensive).

But monitoring your logs looks productive and certainly sounds good and proper. You can write some positive matches to find known problems, you can write some negative matches to discard noise, you can 'look for anomalies' and then refine your filtering, and so on. That's what makes it a tarpit; it's quite easy to thoroughly mire yourself in progressively more complex log monitoring.
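The filtering loop described above can be sketched in a few lines, which is part of what makes it so tempting. All of the patterns here are invented examples; in real use, both lists grow without end as new problems and new noise keep turning up.

```python
import re

# Positive matches flag known problems; negative matches discard known
# noise; everything left over is an "anomaly" someone has to look at.
# Both lists only ever get longer, which is the tarpit.
KNOWN_PROBLEMS = [
    re.compile(r"(?i)machine check"),
    re.compile(r"segfault"),
]
KNOWN_NOISE = [
    re.compile(r"(?i)session opened for user"),
    re.compile(r"DHCPACK"),
]

def classify(line: str) -> str:
    """Sort a log line into 'problem', 'noise', or leftover 'anomaly'."""
    if any(p.search(line) for p in KNOWN_PROBLEMS):
        return "problem"
    if any(p.search(line) for p in KNOWN_NOISE):
        return "noise"
    return "anomaly"
```

Every anomaly you investigate turns into either a new problem pattern or a new noise pattern, and the refinement cycle never actually terminates.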
