How convenience in Prometheus labels for alerts led me into a quiet mistake
In our Prometheus setup, we have a system of alerts that are in testing, not in production. As I described recently, this is implemented by attaching a special label with a special value to each alert, in our case a 'send' label with the value of 'testing'. This is set up in our Prometheus alert rules. This is perfectly sensible.
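A minimal sketch of what such a rule could look like follows. The group name, alert name, expression, and threshold are all hypothetical and chosen only for illustration; the only parts taken from the description above are the 'send' label and its 'testing' value.

```yaml
groups:
  - name: testing-alerts          # hypothetical group name
    rules:
      - alert: HostLoadHigh       # hypothetical alert name
        expr: node_load1 > 10     # illustrative expression and threshold
        for: 10m
        labels:
          send: testing           # marks this alert as a testing alert
```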
In addition to alerts that are in testing, we also have some machines that aren't in production or that I'm only monitoring on a test basis. Because these aren't production machines, I want any alerts about these machines to be 'testing' alerts, even though the alerts themselves are production alerts. When I started thinking about it, I realized that there was a convenient way to do this because alert labels are inherited from metric labels and I can attach additional labels to specific scrape targets. This means that all I need to do to make all alerts for a machine that are based on the host agent's metrics into testing alerts is the following:
    - targets:
        - production:9100
        [...]

    - labels:
        send: testing
      targets:
        - someday:9100
I can do the same for any other checks, such as Blackbox checks. This is quite convenient, which encourages me to actually set up testing monitoring for these machines instead of letting them go unmonitored. But there's a hidden downside to it.
When we promote a machine to production, obviously we have to make alerts about it be regular alerts instead of testing alerts. Mechanically this is easy to do; I move the 'someday:9100' target up to the main section of the scrape configuration, which means it no longer gets the 'send="testing"' label on its metrics. Which is exactly the problem, because in Prometheus a time series is identified by its labels (and their values). If you drop a label or change the value of one, you get a different time series. This means that the moment we promote a machine to production, it's as if we dropped the old pre-production version of it and added a completely different machine (that coincidentally has the same name, OS version, and so on).
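To make the label change concrete, here is a sketch of the scrape configuration after promotion, with the resulting series identities shown in comments. The job name 'node' and the use of 'node_load1' as the example metric are assumptions for illustration, not details of the real configuration.

```yaml
scrape_configs:
  - job_name: node                # hypothetical job name
    static_configs:
      - targets:
          - production:9100
          - someday:9100          # promoted: moved up from the labelled group
      # The 'send: testing' group no longer lists this machine.

# Series scraped before the move:
#   node_load1{instance="someday:9100", job="node", send="testing"}
# Series scraped after the move:
#   node_load1{instance="someday:9100", job="node"}
# The label sets differ, so Prometheus stores these as two separate series.
```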
Some PromQL expressions will allow us to awkwardly overcome this if we remember to use 'ignoring(send)' or 'without(send)' in the appropriate place. Other expressions can't be fixed up this way; anything using 'rate()' or 'delta()', for example. A 'rate()' across the transition boundary sees two partial time series, not one complete one.
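As a sketch of the kind of fix-up that does work, here are two rule expressions that tolerate the label change. The group name, the recorded rule names, the 'node_load1' metric, and the time ranges are all hypothetical choices for illustration.

```yaml
groups:
  - name: send-label-workarounds            # hypothetical group name
    rules:
      # max_over_time() runs on each series separately; aggregating
      # without (send) then folds the testing-era and production-era
      # results into a single label set.
      - record: instance:node_load1:max7d   # hypothetical rule name
        expr: max without (send) (max_over_time(node_load1[7d]))

      # Comparing now against a day ago: across the boundary the two
      # sides differ only in 'send', so vector matching must ignore it.
      - record: instance:node_load1:ratio_1d  # hypothetical rule name
        expr: node_load1 / ignoring (send) (node_load1 offset 1d)
```

The first works because taking the maximum of per-series maximums gives the same answer as the maximum over the combined data. The same approach does not rescue 'rate()', since each 'rate()' is still computed from only its own series' part of the window.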
What this has made me realize is that I want to think carefully before putting temporary things in Prometheus metric labels. If possible, all labels (and label values) on metrics should be durable. Whether or not a machine is an external one is a durable property, and so is fine to embed in a metric label; whether or not it's in testing is not.
Of course this is not a simple binary decision. Sometimes it may be right to effectively start metrics for a machine from scratch when it goes into production (or otherwise changes state in some significant way). Sometimes its configuration may be changed around in production, and beyond that what it's experiencing may be different enough that you want a clear break in metrics.
(And if you want to compare the metrics in testing to the metrics in production, you can always do that by hand. The data isn't gone; it's merely in a different time series, just as if you'd renamed the machine when you put it into production.)