Make sure to keep useful labels in your Prometheus alert rules
Suppose, not entirely hypothetically, that you have some metrics that
are broken out across categories but what you care about is the total
number of things across all of them. For example, you're monitoring some OpenBSD
firewalls and you care about the total number of PF states, but your
metrics break them down by protocol (this information is available
in 'pfctl -ss' output). So your
alert rule is going to be something like:
- alert: TooManyStates
  expr: sum( pfctl_protocol_entries ) by (server) > 80000
  ....
Congratulations, you may have just aimed a gun at your own foot.
If you have additional labels on that
metric that you may want to use in the alert that will result from
this (perhaps the datacenter or some other metadata), you've just
lost them. When you said 'sum(...) by (server)', Prometheus
faithfully did what you said; it summed everything by the server
and as part of that threw away all other labels, because you told
it all that mattered was the 'server' label.
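To make the label loss concrete, here is a small sketch using made-up series. The 'fw1' and 'dc1' values and the sample counts are purely illustrative assumptions, not real data:

```
# Input series (hypothetical values):
pfctl_protocol_entries{server="fw1", datacenter="dc1", proto="tcp"}  60000
pfctl_protocol_entries{server="fw1", datacenter="dc1", proto="udp"}  25000

# Result of: sum( pfctl_protocol_entries ) by (server)
{server="fw1"}  85000
```

The 'datacenter' label is gone from the result, so nothing downstream of the expression (alert labels, annotations, Alertmanager routing) can see it.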
There are two ways around this. The obvious, simple way that you
may reach for in your haste to fix this issue is to add the additional
metadata label or labels that you care about to the 'by (...)'
expression, so you have, eg, 'sum(...) by (server, datacenter)'.
The problem with this is that you're playing whack-a-mole, having
to add each additional label to the list of labels as you remember
them (or discover problems because they're missing). The better
way is to be explicit about what you want to ignore:
sum( pfctl_protocol_entries ) without (proto)
This will automatically pass through all other labels, including ones that you add six months from now as part of a metrics reorganization (long after you've forgotten that 'sum(...) by (...)' special case in one of your alert rules).
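As a rough sketch of how this might look in a complete rule file, assuming the metric carries a 'datacenter' label (the group name and the annotation text here are made up for illustration):

```yaml
groups:
  - name: firewall-states   # hypothetical group name
    rules:
      - alert: TooManyStates
        # Aggregate away only the per-protocol breakdown; every other
        # label on the metric passes through to the alert.
        expr: sum( pfctl_protocol_entries ) without (proto) > 80000
        annotations:
          # 'datacenter' is an assumed label. It can only be used here
          # because the aggregation did not discard it.
          summary: "{{ $labels.server }} in {{ $labels.datacenter }} has {{ $value }} PF states"
```

With the 'by (server)' version of the expression, the '{{ $labels.datacenter }}' reference would expand to an empty string instead of the datacenter name.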
After this experience, I've come to think that doing aggregation using 'by (...)' in your alert rules (or recording rules) is potentially dangerous and ought to at least be scrutinized carefully and probably commented. Sometimes there are good reasons for it where you want to narrow down to a known set of common labels or the like, but otherwise it is a potential trap even if it works for your setup today.
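If you do decide that 'by (...)' is what you want, a comment along these lines may save your future self some confusion. This is only a sketch; the reason given in the comment is an invented example, not a recommendation:

```yaml
- alert: TooManyStates
  # Deliberately using 'by' rather than 'without': we only want the
  # labels listed here on the resulting alert (hypothetical reason).
  # Any label added to this metric later will be dropped here.
  expr: sum( pfctl_protocol_entries ) by (server, datacenter) > 80000
```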