A potential Prometheus issue for labeled metrics for infrequent events

October 17, 2020

One of the things you often get to do with Prometheus is to design your own custom metrics for things (which may be generated and exposed in a number of ways, for example using mtail to extract them from log files). One piece of advice for designing metrics that I've seen is to group closely related measurements together under one metric name using label values to differentiate between them. The classical example here is counting HTTP requests with different return codes using one metric name with a label for the return code, instead of something like one metric name for 200-series responses, a second one for 400-series responses, and so on.

(One of the advantages of using a single metric with label values instead of multiple metrics is that it's easier to do operations across all of the different versions of the metric that way. If you want the count of all hits, you can just do 'sum(http_requests) without(status)', instead of having to manually add several metrics together.)

However, using labels this way can create an issue if your metric name is for events that occur only rarely. When something happens that resets your group of metrics (such as the machine rebooting), you can wind up in a situation where you've seen no events, so you have no time series at all for the metric name, never mind having time series for all of the labels that you expect to eventually see. If there are no time series at all, operations like 'sum()', 'rate()', and so on will fail (well, give no answers), which can potentially make it awkward to do dashboards and graphs that use this metric.
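To make the "no answers" case concrete, here is a small Python model of what an aggregation like 'sum(http_requests) without(status)' does. This is only a sketch of the semantics, not how Prometheus is implemented; the 'instance' label and the sample values are made up for illustration.

```python
from collections import defaultdict


def sum_without(series, drop):
    """Rough model of 'sum(...) without(<labels>)'.

    Each series is a (labels dict, value) pair. Series are grouped by
    whatever labels remain after removing the ones named in 'drop'.
    """
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


# Hypothetical samples once the exporter has seen some traffic.
populated = [
    ({"instance": "web1", "status": "200"}, 120.0),
    ({"instance": "web1", "status": "404"}, 3.0),
]

# Right after a restart with no events yet: no time series exist at all.
after_restart = []

print(sum_without(populated, {"status"}))      # one group for web1
print(sum_without(after_restart, {"status"}))  # empty result, not a zero
```

The second call returns an empty result rather than a group with the value 0, which is the situation that leaves a dashboard panel with nothing to draw.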

(Your dashboards will usually work, once enough time and events have gone by so that the metric name is populated with labels for everything that you expect.)

The advantage of separate metric names without labels is that most exporter implementations will naturally give them 0 values when the exporter restarts. These 0 values ensure that the metric is always present, so you can easily build reliable graphs that use it and so on. There are various standard solutions for potentially missing data in Prometheus queries (see here), but they all make the PromQL expression less pleasant.
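One commonly used workaround of this sort is to append 'or vector(0)' to the expression. The Python sketch below models roughly what that does to a query result, under the assumption that a result is a mapping from a label set to a value; it is an illustration of the semantics, not Prometheus code, and the 'instance' label is hypothetical.

```python
def or_vector(result, value=0.0):
    """Rough model of '<expr> or vector(0)'.

    'result' maps a label set (a tuple of (name, value) pairs) to a
    sample value. vector(0) contributes a single sample with no labels,
    which is kept only if the left side has no sample with that same
    (empty) label set.
    """
    merged = dict(result)
    merged.setdefault((), value)
    return merged


# An empty result gets a usable 0.
print(or_vector({}))

# A fully aggregated result (no labels left) is passed through unchanged.
print(or_vector({(): 5.0}))

# A result that still carries labels picks up an extra unlabeled 0 sample.
print(or_vector({(("instance", "web1"),): 3.0}))
```

The last case is part of why these fixes are unpleasant: the fallback only lines up cleanly when the left side has been aggregated down to no labels.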

As covered in Brian Brazil's Existential issues with metrics, if you know the labels that will eventually be present and you can, it's ideal to pre-create them ahead of time so that they always exist with a value of 0. Unfortunately not all metrics creation environments provide obvious support for this. Even if you can coerce the metrics creation environment into doing it, it's easy to miss the issue when you're setting things up to start with.
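As a sketch of what pre-creating labels looks like, here is a toy labeled counter in Python. It is not a real Prometheus client library; the class, the metric name 'spam_events_total', the 'kind' label, and its values are all hypothetical and exist only to show the difference between creating series lazily and creating them up front.

```python
class LabeledCounter:
    """Toy counter with a single label; not a real client library."""

    def __init__(self, name, label, expected_values=()):
        self.name = name
        self.label = label
        # Pre-create every expected label value at 0 so that each series
        # exists from the first scrape after a restart.
        self.values = {v: 0 for v in expected_values}

    def inc(self, label_value, amount=1):
        # Unexpected label values are still created on first use.
        self.values[label_value] = self.values.get(label_value, 0) + amount

    def expose(self):
        """Render samples in the style of the Prometheus text format."""
        return [
            f'{self.name}{{{self.label}="{v}"}} {n}'
            for v, n in sorted(self.values.items())
        ]


# Series created only when an event happens: nothing to scrape yet.
lazy = LabeledCounter("spam_events_total", "kind")
print(lazy.expose())

# Series pre-created for the label values we expect to see eventually.
eager = LabeledCounter("spam_events_total", "kind", ["greylisted", "rejected"])
print(eager.expose())
```

With the pre-created version, a freshly restarted exporter already exposes one 0-valued series per expected label value, so 'sum()' and 'rate()' have something to work on from the start.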

(This came up when I was augmenting our mtail Exim logfile parsing to count some infrequent spam-related events. Mtail can probably be pushed to do this, but it's not straightforward.)

Written on 17 October 2020.