How convenience in Prometheus labels for alerts led me into a quiet mistake
In our Prometheus setup, we have a
system of alerts that are in testing, not in production. As I
described recently, this is implemented
by attaching a special label with a special value to each alert,
in our case a 'send' label with the value of 'testing', which
is set up in our Prometheus alert rules. This is perfectly sensible.
In addition to alerts that are in testing, we also have some machines that aren't in production or that I'm only monitoring on a test basis. Because these aren't production machines, I want any alerts about these machines to be 'testing' alerts, even though the alerts themselves are production alerts. When I started thinking about it, I realized that there was a convenient way to do this because alert labels are inherited from metric labels and I can attach additional labels to specific scrape targets. This means that all I need to do to make all alerts for a machine that are based on the host agent's metrics into testing alerts is the following:
  - targets:
      - production:9100
    [...]

  - labels:
      send: testing
    targets:
      - someday:9100
I can do the same for any other checks, such as Blackbox checks. This is quite convenient, which encourages me to actually set up testing monitoring for these machines instead of letting them go unmonitored. But there's a hidden downside to it.
When we promote a machine to production, obviously we have to make
alerts about it be regular alerts instead of testing alerts.
Mechanically this is easy to do; I move the machine's scrape target
up to the main section of the scrape configuration, which means it
no longer gets the 'send="testing"' label on its metrics. Which
is exactly the problem, because in Prometheus a time series is
identified by its labels (and their values). If you drop a label
or change the value of one, you get a different time series. This
means that the moment we promote a machine to production, it's as
if we dropped the old pre-production version of it and added a
completely different machine (that coincidentally has the same name,
OS version, and so on).
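
Concretely, the promotion turns one time series into two. A sketch
using the host agent's 'node_load1' metric (the instance value
follows the illustrative scrape configuration above):

  # before promotion: history accumulates under this series
  node_load1{instance="someday:9100", send="testing"}

  # after promotion: a brand new series with no past data
  node_load1{instance="someday:9100"}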
Some PromQL expressions will allow us to awkwardly overcome this
if we remember to use 'ignoring(send)' or 'without(send)' in the
appropriate place. Other expressions can't be fixed up this way;
anything using 'rate()' or 'delta()', for example. A 'rate()'
across the transition boundary sees two partial time series, not
one.
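
As a sketch of the awkward fixup, a hypothetical query comparing a
machine's current load to its load a week earlier, from before the
promotion, has to discard the 'send' label on one side:

  # without 'ignoring(send)', the two sides have different label
  # sets and the division matches nothing
  node_load1 / ignoring(send) (node_load1 offset 1w)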
What this has made me realize is that I want to think carefully before putting temporary things in Prometheus metric labels. If possible, all labels (and label values) on metrics should be durable. Whether or not a machine is an external one is a durable property, and so is fine to embed in a metric label; whether or not it's in testing is not.
Of course this is not a simple binary decision. Sometimes it may be right to effectively start metrics for a machine from scratch when it goes into production (or otherwise changes state in some significant way). Sometimes its configuration may be changed around in production, and beyond that what it's experiencing may be different enough that you want a clear break in metrics.
(And if you want to compare the metrics in testing to the metrics in production, you can always do that by hand. The data isn't gone; it's merely in a different time series, just as if you'd renamed the machine when you put it into production.)
How (and where) Prometheus alerts get their labels
In Prometheus, you can and usually do
have alerting rules
that evaluate expressions to create alerts. These alerts are usually
passed to Alertmanager and they
are visible in Prometheus itself as a couple of metrics, ALERTS
and ALERTS_FOR_STATE. These metrics can be used to do things
like find out the start time of alerts
or just display a count of currently active alerts on your dashboard. Alerts almost always have labels
(and values for those labels), which tend to be used in Alertmanager
templates to provide additional information alongside annotations,
which are subtly but crucially different.
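
For instance, a dashboard count of currently active alerts can be
had straight from the ALERTS metric (a minimal sketch):

  # ALERTS also covers pending alerts, so restrict the count to
  # alerts that are actually firing
  count(ALERTS{alertstate="firing"})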
All of this is standard Prometheus knowledge and is well documented, but what doesn't seem to be well documented is where alert labels come from (or at least I couldn't find it said explicitly in any of the obvious spots in the documentation). Within Prometheus, the labels on an alert come from two places. First, you can explicitly add labels to the alert in the alert rule, which can be used for things like setting up testing alerts. Second, the basic labels for an alert are whatever labels come out of the alert expression. This can have some important consequences.
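
Putting the two sources together, a hypothetical alerting rule
might look like this; the 'send: testing' label is added explicitly
by the rule, while every label on the 'node_load1' series comes
through from the expression:

  groups:
    - name: example
      rules:
        - alert: HighLoad
          expr: node_load1 > 10.0
          labels:
            send: testing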
If your alert expression is a simple one that just involves basic
metric operations, for example 'node_load1 > 10.0', then the basic
labels on the alert are the same labels that the metric itself has;
all of them will be passed through. However, if your alert expression
narrows down or throws away some labels, then those labels will be
missing from the end result. One of the ways to lose labels in
alert expressions is to aggregate results with 'by (whatever)',
because this discards all labels other than the 'by (whatever)' label
or labels. You can also deliberately pull in labels from additional
metrics, perhaps as a form of database
lookup (and then you can use these additional
labels in your Alertmanager setup).
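
Both directions can be sketched in PromQL (the metric names here
are hypothetical):

  # aggregation: the alert's labels shrink to just 'instance'
  sum(rate(node_network_receive_bytes_total[5m])) by (instance) > 1e+08

  # label lookup: pull an 'owner' label in from an info-style metric
  (node_load1 > 10.0) * on(instance) group_left(owner) machine_owner_info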
Prometheus itself also adds an 'alertname' label, with the name of
the alert as its value. The ALERTS metric in Prometheus also has
an 'alertstate' label, but this is not passed on to the version
of the alert that Alertmanager sees. Additionally, as part of
sending alerts to Alertmanager, Prometheus can relabel
alerts in general to do things like canonicalize some labels. This
can be done either for all Alertmanager destinations or only for a
particular one, if you have more than one of them set up. This
only affects alerts as seen by Alertmanager; the version in the
ALERTS metric is unaffected.
(This can be slightly annoying if you're building Grafana dashboards that display alert information using labels that your alert relabeling changes.)
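
A minimal sketch of such relabeling in the Prometheus
configuration, assuming a hypothetical 'host' label that we want
canonicalized into 'instance' before alerts reach Alertmanager:

  alerting:
    alert_relabel_configs:
      # copy the 'host' label's value into the 'instance' label
      - source_labels: [host]
        target_label: instance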
PS: In practice, people who use Prometheus work out where alert labels come from almost immediately. It's both intuitive (alert rules use expressions, expression results have labels, and so on) and obvious once you have some actual alerts to look at. But if you're trying to decode Prometheus on your first attempt, neither this nor its consequences are obvious.