How convenience in Prometheus labels for alerts led me into a quiet mistake

February 24, 2021

In our Prometheus setup, we have a system of alerts that are in testing, not in production. As I described recently, this is implemented by attaching a special label with a special value to each alert, in our case a 'send' label with the value of 'testing'; this is set up in our Prometheus alert rules. This is perfectly sensible.
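As a minimal sketch of what such a rule might look like (the group name, alert name, expression, and threshold here are hypothetical, not our actual rules, and it assumes the host agent is node_exporter), the 'send' label goes in the rule's labels section:

```yaml
groups:
  - name: testing-alerts                # hypothetical group name
    rules:
      - alert: RootFilesystemLowSpace   # hypothetical alert name
        # node_filesystem_avail_bytes is a node_exporter metric;
        # the 1 GB threshold is only an illustration.
        expr: node_filesystem_avail_bytes{mountpoint="/"} < 1e9
        for: 10m
        labels:
          send: testing                 # marks this alert as a testing alert
```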

In addition to alerts that are in testing, we also have some machines that aren't in production or that I'm only monitoring on a test basis. Because these aren't production machines, I want any alerts about these machines to be 'testing' alerts, even though the alerts themselves are production alerts. When I started thinking about it, I realized that there was a convenient way to do this because alert labels are inherited from metric labels and I can attach additional labels to specific scrape targets. This means that to turn every alert based on a machine's host agent metrics into a testing alert, all I need to do is the following:

- targets:
    - production:9100

- labels:
    send: testing
  targets:
    - someday:9100

I can do the same for any other checks, such as Blackbox checks. This is quite convenient, which encourages me to actually set up testing monitoring for these machines instead of letting them go unmonitored. But there's a hidden downside to it.

When we promote a machine to production, obviously we have to make alerts about it be regular alerts instead of testing alerts. Mechanically this is easy to do; I move the 'someday:9100' target up to the main section of the scrape configuration, which means it no longer gets the 'send="testing"' label on its metrics. Which is exactly the problem, because in Prometheus a time series is identified by its labels (and their values). If you drop a label or change the value of one, you get a different time series. This means that the moment we promote a machine to production, it's as if we dropped the old pre-production version of it and added a completely different machine (that coincidentally has the same name, OS version, and so on).
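To make the change in identity concrete, here is roughly what the promotion looks like. This is a sketch that assumes the targets live in a static_configs list for a job called 'node'; the job name and the node_load1 metric are only for illustration:

```yaml
scrape_configs:
  - job_name: node                  # hypothetical job name
    static_configs:
      - targets:
          - production:9100
          - someday:9100            # moved up; no longer gets send="testing"

# Series before promotion:
#   node_load1{instance="someday:9100", job="node", send="testing"}
# Series after promotion:
#   node_load1{instance="someday:9100", job="node"}
# The label sets differ, so Prometheus treats these as two separate time series.
```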

Some PromQL expressions will allow us to awkwardly overcome this if we remember to use 'ignoring(send)' or 'without(send)' in the appropriate place. Other expressions, such as anything using 'rate()' or 'delta()', can't be fixed up this way. A 'rate()' across the transition boundary sees two partial time series, not one complete one.
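For gauge-style metrics, one way to apply 'without(send)' once instead of remembering it everywhere might be a recording rule along these lines. This is only a sketch: the group and rule names are made up, and a recording rule only covers data from the time it was added onward.

```yaml
groups:
  - name: drop-send-label           # hypothetical group name
    rules:
      # Aggregating away 'send' gives the testing-era and production-era
      # series the same output labels, so the recorded series continues
      # across the promotion. max is used so that a brief overlap of the
      # two input series would not add their values together.
      - record: instance:node_load1:max_without_send   # hypothetical rule name
        expr: max without (send) (node_load1)
```

This does nothing for 'rate()' over the raw counters, which still sees two partial series at the transition.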

What this has made me realize is that I want to think carefully before putting temporary things in Prometheus metric labels. If possible, all labels (and label values) on metrics should be durable. Whether or not a machine is an external one is a durable property, and so is fine to embed in a metric label; whether or not it's in testing is not.

Of course this is not a simple binary decision. Sometimes it may be right to effectively start metrics for a machine from scratch when it goes into production (or otherwise changes state in some significant way). Sometimes its configuration may be changed around in production, and beyond that what it's experiencing may be different enough that you want a clear break in metrics.

(And if you want to compare the metrics in testing to the metrics in production, you can always do that by hand. The data isn't gone; it's merely in a different time series, just as if you'd renamed the machine when you put it into production.)

Written on 24 February 2021.

Last modified: Wed Feb 24 23:01:31 2021