Logs are invisible (at least most of the time and by default)

January 18, 2022

Suppose, hypothetically, that you have a program that does something (perhaps it's a metric collection agent), and you know that it's possible for it to encounter a problem while in operation. So you decide that if there's a problem, your program will emit a log message. Now, as they say, you have two problems, because logs are invisible. Well, more specifically, things reported (only) in logs are invisible in almost all environments.

The reality of logs is that almost all of the time, nothing is looking at them. Nobody can read them all; there are too many things being logged and they're too variegated. In most environments people only look at logs as part of troubleshooting, or maybe once in a blue moon to see if anything jumps out at them. The rest of the time, logs are written and then they sit there in case they're needed later.

If you want to actually surface problems instead of just recording them, you need something else in addition to the log messages. Perhaps you need a special log that's only written to with problems (and then something to alert about the log having contents). Perhaps you can use a metric (if you expose metrics). Perhaps you need to signal something. But you need to do something.
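As a minimal sketch of the 'use a metric' option, here is one way a Python program could count its own problem-level log messages so that something else can watch the count. The names `ProblemCounter` and `attach_problem_counter` are made up for illustration, and how you expose the counts (a metrics endpoint, a status file, or something else) is left open because it depends on your environment.

```python
import logging


class ProblemCounter(logging.Handler):
    """Count log records at or above a level, keyed by level name.

    Hypothetical helper: the counts are meant to be exported through
    whatever metrics mechanism your environment already has.
    """

    def __init__(self, level=logging.WARNING):
        super().__init__(level=level)
        self.counts = {}

    def emit(self, record):
        # The logging module only calls emit() for records that are at
        # or above this handler's level, and it holds the handler's
        # lock while doing so.
        name = record.levelname
        self.counts[name] = self.counts.get(name, 0) + 1


def attach_problem_counter(logger, level=logging.WARNING):
    """Add a ProblemCounter to an existing logger and return it."""
    counter = ProblemCounter(level)
    logger.addHandler(counter)
    return counter
```

The program keeps logging exactly as before; the only change is that `counter.counts` now holds numbers that something other than a human reading logs can look at and alert on.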

Speaking from plenty of personal experience, it's very tempting to ignore this. Logging a message is generally quite easy, while every other reasonable way of attracting attention is much harder (and often specific to your environment as it is today, while logging is close to universal). But if all you do is log a message on problems, you're almost certainly going to find out about them some other way (hopefully not by something exploding).

(A corollary of this is that if log messages are primarily read during troubleshooting, you should make them as useful for that as possible.)

PS: One way around this is to monitor your logs for messages that you know your programs log when they hit problems, or that you've otherwise found out indicate problems. This requires extra work to set up and often extra work to maintain. Also, now you get to watch out because your messages (or parts of them) have become an API between your programs and your generalized monitoring. Worse, it's a decoupled API that's not actually checked, so one side can drift out of sync with the other without anything noticing.
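A rough sketch of this sort of log monitoring might look like the following. The patterns and the sample log lines are invented for illustration and aren't taken from any real program; in real life the pattern list is exactly the unchecked API, since it has to be kept in sync by hand with what your programs actually log.

```python
import re

# Hypothetical patterns for messages that are known to indicate
# problems. Nothing verifies that programs still log these strings.
PROBLEM_PATTERNS = [
    re.compile(r"collector failed"),
    re.compile(r"error gathering metrics"),
]


def find_problem_lines(lines, patterns=PROBLEM_PATTERNS):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits
```

If a program's message is reworded to, say, 'collector has failed', this scanner quietly stops matching it and nothing complains, which is the drift problem in miniature.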

(This thought is brought to you by me discovering that one of the Prometheus metrics agents we run had discovered a problem on one host, and the only sign of it was log messages that I only noticed in passing. In theory we could have spotted this problem from some side effects on the exposed metrics; in practice we didn't know what to look for until it happened and I could observe the side effects.)

Written on 18 January 2022.