Monitoring your logs is mostly a tarpit

August 6, 2023

One of the reactions to my entry on how systemd's auto-restarting of units can hide problems was a suggestion that we should monitor our logs to detect things like this. As it happens, one of my potentially unpopular views is that monitoring your logs is generally a tarpit that isn't worth it. Much of the time you'll spend a great deal of effort to get very little of worth.

The fundamental problem with general log monitoring is that logs are functionally unstructured. Anything can appear in them and in a sufficiently complex environment, anything eventually will. This unstructured randomness means that sorting general signal from general noise is a large and often never-ending job, and if you don't do it well, you wind up with a lot of noise (which makes it almost impossible in practice to spot the signal).

One thing you can theoretically sensibly monitor your logs for is specific, narrow signals of things of interest; for example, you might monitor Linux kernel logs for machine check error messages. The first problem with monitoring this way is that there's no guarantee that the message you're monitoring for won't change. Maybe someday the Linux kernel developers will decide to put more information in their MCE messages and change the format. One reason this happens is that almost no one considers log messages to be an API, and so they feel free to change log messages at whim.

(But in the meantime, maybe you'll derive enough value or reassurance from looking for the current MCE messages in your kernel logs. It's a tradeoff, which I'll get to.)
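
As a concrete illustration of what such a narrow check amounts to, a minimal sketch in Python could look like the following; the MCE wordings in it are assumptions on my part and are exactly the kind of detail that can quietly change:

    #!/usr/bin/env python3
    """Sketch: report kernel log lines that look like machine check errors."""
    import re
    import subprocess

    # Assumed wordings for MCE messages; not guaranteed to stay stable.
    MCE_PATTERNS = [
        re.compile(r"mce: \[Hardware Error\]"),
        re.compile(r"Machine check events logged"),
    ]

    def mce_lines():
        # 'journalctl -k' prints the kernel's messages from the journal.
        out = subprocess.run(["journalctl", "-k", "--no-pager"],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines()
                if any(p.search(line) for p in MCE_PATTERNS)]

    if __name__ == "__main__":
        for line in mce_lines():
            print(line)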

The second problem with monitoring for specific narrow signals of interest in your logs is that you have to know what they look like. It's easy to say that we'll monitor for the Prometheus host agent crashing and systemd restarting it, but it's much harder to be sure that we have properly identified the complete collection of log messages that signal this happening. Remember, log messages are unstructured, which means it's hard to get a complete inventory of what to look for short of things like reading the (current) program source code to see what it logs.
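
To make that concrete, here is the sort of (almost certainly incomplete) pattern list you would wind up maintaining. The unit name is a stand-in for whatever your host agent runs as, and the message wordings are assumptions that differ between systemd versions, which is exactly the difficulty:

    import re

    # Hypothetical unit name for the Prometheus host agent.
    UNIT = "node_exporter.service"

    # Journal messages that may signal the unit dying and being restarted.
    # The wordings are assumed, vary between systemd versions, and there is
    # no guarantee this list is complete.
    RESTART_PATTERNS = [
        re.compile(re.escape(UNIT) + r": Main process exited, code="),
        re.compile(re.escape(UNIT) + r": Failed with result "),
        re.compile(re.escape(UNIT) + r": Scheduled restart job"),
    ]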

Finally, all of this potential effort only matters if identifiable problems appear in your logs on a sufficiently regular basis and it's useful to know about them. In other words, problems that happen, that you care about, and probably that you can do something about. If a problem was probably a one-time occurrence or occurs only infrequently, the payoff from automated log monitoring for it can be quite low (you can see this as an aspect of how alerts and monitoring can never be comprehensive).

But monitoring your logs looks productive and certainly sounds good and proper. You can write some positive matches to find known problems, you can write some negative matches to discard noise, you can 'look for anomalies' and then refine your filtering, and so on. That's what makes it a tarpit; it's quite easy to thoroughly mire yourself in progressively more complex log monitoring.


Comments on this page:

As it happens, one of my potentially unpopular views is that monitoring your logs is generally a tarpit that isn't worth it.

Under UNIX, it certainly is. For some reason, IBM has no issues with usable logs, and their machines even call them for repairs.

The fundamental problem with general log monitoring is that logs are functionally unstructured.

I would like an elaboration on this part. Does this merely refer to the line-by-line nature of UNIX logs, or to the wide space of possible values? There certainly can be issues with this, but I see not why it should happen with a properly-written program. If a program produces useless logs, then it is worthless.

A proper way to handle it, and the only proper way to handle it I think, is to have a log be a machine-readable and ordered set of values, real values, and not lines of supposed text. There then is the required structure and a real constraint on nonsense.
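
(As a tiny sketch of the difference, with field names invented purely for illustration: a record of named values instead of a free-form line.)

    import json
    import time

    # A log record as named, typed values rather than a free-form text line.
    # The field names here are purely illustrative.
    record = {
        "ts": time.time(),
        "unit": "node_exporter.service",
        "event": "restart",
        "restart_count": 3,
    }
    print(json.dumps(record))   # machine-readable; no guessing at its shape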

It's easy to say that we'll monitor for the Prometheus host agent crashing and systemd restarting it, but it's much harder to be sure that we have properly identified the complete collection of log messages that signal this happening.

I find the very thought of such a poorly-documented program to be ridiculous.

Remember, log messages are unstructured, which means it's hard to get a complete inventory of what to look for short of things like reading the (current) program source code to see what it logs.

This is why a log should be entirely composed of records, which should be documented comprehensively.

From 109.37.131.94 at 2023-08-08 06:12:58:

I like to take the opposite approach: all logs are compared against a list of known patterns, and lines that do not match are reported.

I've used logcheck in the past, and github.com/fd0/erpel nowadays.
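
The core of that approach is only a few lines; a minimal sketch (the "known boring" patterns here are placeholders, and tools like logcheck ship much larger curated pattern sets):

    #!/usr/bin/env python3
    """Sketch: drop lines matching known-boring patterns, report the rest."""
    import re
    import sys

    # Placeholder patterns; a real pattern list grows much longer over time.
    KNOWN_BORING = [
        re.compile(r"sshd\[\d+\]: Accepted publickey for "),
        re.compile(r"CRON\[\d+\]: \(root\) CMD "),
    ]

    for line in sys.stdin:
        if not any(p.search(line) for p in KNOWN_BORING):
            sys.stdout.write(line)    # unmatched, therefore reported

Fed from something like 'journalctl -f' or a syslog file, everything not covered by a pattern gets surfaced.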

By nolist_policy at 2023-08-08 07:48:14:

The first problem with monitoring this way is that there's no guarantee that the message you're monitoring for won't change.

This is a solved problem (at least for the Linux kernel) with printk indexing: you can query the running kernel for the log messages it can output. With that, your monitoring tool can check when an expected message has changed, and you get the messages in printf format so you can parse them if you want.
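
A sketch of what checking against the printk index might look like; this assumes a kernel built with CONFIG_PRINTK_INDEX and debugfs mounted at /sys/kernel/debug, and it only does a crude substring search instead of parsing the per-line format:

    #!/usr/bin/env python3
    """Sketch: ask the printk index whether an expected message still exists.
    Assumes CONFIG_PRINTK_INDEX, a mounted debugfs, and root access."""
    from pathlib import Path

    # Assumed location of the index; one file per module (plus vmlinux).
    INDEX_DIR = Path("/sys/kernel/debug/printk/index")

    def format_still_present(snippet):
        """Crudely search every indexed printk format string for a snippet."""
        for index_file in INDEX_DIR.iterdir():
            try:
                if snippet in index_file.read_text(errors="replace"):
                    return True
            except OSError:
                continue        # skip unreadable entries
        return False

    if __name__ == "__main__":
        # Hypothetical message snippet our log monitoring depends on.
        print(format_still_present("Machine check events logged"))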

By phein4242 at 2023-08-08 14:20:25:

Hi Chris,

While in general I agree that logs can be a tarpit, they are also one of the most valuable sources of information during troubleshooting (topped by things like debug/verbose and call tracing/debuggers).

Thing is, though, theory aside, systems do need to be maintained, and that sometimes requires hacks. In this specific case, since you know beforehand that you roll out your services with autorestart, having a restart check is a crucial piece of insight into your application.

And given that this is about systemd units and logging, which have a well-defined API, implementing this check should not be much more than a log tailer/parser, one to three regexps, some control logic, and something to send alerts (or reworked into a dedicated exporter).
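
Roughly, as a sketch (the restart message wording is an assumption, per the post's point about messages changing, and send_alert() is a stand-in for whatever notification path you already have):

    #!/usr/bin/env python3
    """Sketch: follow a unit's journal and alert when a restart is seen."""
    import re
    import subprocess

    UNIT = "node_exporter.service"   # whichever auto-restarted unit you care about
    # Assumed wording; differs between systemd versions.
    RESTART_RE = re.compile(r"Scheduled restart job, restart counter is at (\d+)")

    def send_alert(text):
        # Stand-in: hook up mail, a pager, or an exporter/pushgateway here.
        print("ALERT:", text)

    def watch(unit):
        # Follow new journal entries for the unit, message text only.
        proc = subprocess.Popen(
            ["journalctl", "-f", "-n", "0", "-u", unit, "-o", "cat"],
            stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            m = RESTART_RE.search(line)
            if m:
                send_alert("%s restarted (counter %s)" % (unit, m.group(1)))

    if __name__ == "__main__":
        watch(UNIT)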

On the more general level, yes, directly parsing all logging is useless (unless you do it programmatically). Knowing what to look for, and being able to transform that into a richer context about your applications, is very valuable knowledge nevertheless. This is even mandatory if there are no exporters for the app you want to run.

Will parsing logs ever become painless? I highly doubt it, but structured logs do make the pain a bit less :)

I've been using Marcus J. Ranum's "Artificial Ignorance" method ( https://www.ranum.com/security/computer_security/papers/ai/ ) on a small scale for a long time (wow! decades now! :-) ).

The idea is to write regexes for everything you're not interested in, leaving, by definition, only interesting things.

It's sometimes a pain to keep the regexes up-to-date as it requires a little discipline but perhaps no more than you have to do to set up any kind of log monitoring system?

One of the most painful (but most useful) aspects is that the regexes tend to under-match rather than over-match, which means silencing uninteresting things can take a couple of attempts. ...but perhaps that's a problem of matching logs with regexes, and it's better to have it fail in a way that under-matches something uninteresting rather than over-matches (and hides) something interesting?

Have you had any experience with anything like this and do you think it'd work at the kinds of scales you operate at (both in terms of team members and machines)?
