Make sure to keep useful labels in your Prometheus alert rules

March 19, 2020

Suppose, not entirely hypothetically, that you have some metrics that are broken out across categories, but what you care about is the total across all of those categories. For example, you're monitoring some OpenBSD firewalls and you care about the total number of PF states, but your metrics break them down by protocol (this information is available in 'pfctl -ss' output). So your alert rule is going to be something like:

- alert: TooManyStates
  expr: sum( pfctl_protocol_entries ) by (server) > 80000
  ....

Congratulations, you may have just aimed a gun at your own foot. If you have additional labels on that pfctl_protocol_entries metric that you may want to use in the alert that will result from this (perhaps the datacenter or some other metadata), you've just lost them. When you said 'sum(...) by (server)', Prometheus faithfully did what you said; it summed everything by the server and as part of that threw away all other labels, because you told it all that mattered was the 'server' label.
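To make the failure concrete, here is a minimal sketch of such a rule with an annotation that tries to use a label that was aggregated away. The group name, the 'for' duration, the annotation text, and the 'datacenter' label are all assumptions made up for illustration, not taken from any real configuration.

```yaml
groups:
  - name: firewall-example        # hypothetical group name
    rules:
      - alert: TooManyStates
        # 'by (server)' keeps only the 'server' label on the result.
        expr: sum( pfctl_protocol_entries ) by (server) > 80000
        for: 5m                   # assumed duration
        annotations:
          # 'server' survives the aggregation, so that part is filled in.
          # 'datacenter' (an assumed label) was summed away, so that part
          # is expected to come out empty in the resulting alert.
          summary: "{{ $labels.server }} in {{ $labels.datacenter }} has {{ $value }} PF states"
```

Nothing in this rule is a syntax error, which is part of why the problem is easy to miss until an alert actually fires.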

There are two ways around this. The obvious, simple way that you may reach for in your haste to fix this issue is to add the additional metadata label or labels that you care about to the 'by()' expression, so you have, e.g., 'sum(...) by (server, datacenter)'. The problem with this is that you're playing whack-a-mole, having to add each additional label to the list as you remember it (or as you discover problems because it's missing). The better way is to be explicit about what you want to ignore:

sum( pfctl_protocol_entries ) without (proto)

This will automatically pass through all other labels, including ones that you add six months from now as part of a metrics reorganization (long after you've forgotten that 'sum(...) by (...)' special case in one of your alert rules).
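Put into a complete rule, this might look like the following sketch. As before, the group name, the 'for' duration, the annotation text, and the 'datacenter' label are assumptions for illustration only.

```yaml
groups:
  - name: firewall-example        # hypothetical group name
    rules:
      - alert: TooManyStates
        # 'without (proto)' drops only 'proto'; every other label on
        # pfctl_protocol_entries is carried through to the alert.
        expr: sum( pfctl_protocol_entries ) without (proto) > 80000
        for: 5m                   # assumed duration
        annotations:
          # Both labels are still present on the result, provided the
          # underlying metric has them (the 'datacenter' label is assumed).
          summary: "{{ $labels.server }} in {{ $labels.datacenter }} has {{ $value }} PF states"
```

One side effect to be aware of is that every surviving label becomes part of the alert's identity, so this works best when the only label that varies within a single server's metrics is the one you listed in 'without()'.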

After this experience, I've come to think that doing aggregation using 'by (...)' in your alert rules (or recording rules) is potentially dangerous and ought to at least be scrutinized carefully and probably commented. Sometimes there are good reasons for it where you want to narrow down to a known set of common labels or the like, but otherwise it is a potential trap even if it works for your setup today.
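A deliberate and commented use of 'by (...)' might look something like this sketch, where the alert name, the 'datacenter' label, and the threshold are all hypothetical:

```yaml
- alert: TooManyStatesInDatacenter   # hypothetical alert name
  # Deliberately 'by (datacenter)': we want one total per datacenter
  # and intentionally discard per-server and per-protocol labels.
  # If new labels are added to the metric, they will NOT appear here.
  expr: sum( pfctl_protocol_entries ) by (datacenter) > 500000
```

The comment costs almost nothing, and it tells whoever reads the rule later that the narrowing was a choice and not an accident.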

Written on 19 March 2020.

Last modified: Thu Mar 19 23:51:33 2020