A potential Prometheus issue for labeled metrics for infrequent events
One of the things you often get to do with Prometheus is to design your own custom metrics for things (which may be generated and exposed in a number of ways, for example using mtail to extract them from log files). One piece of advice for designing metrics that I've seen is to group closely related measurements together under one metric name using label values to differentiate between them. The classical example here is counting HTTP requests with different return codes using one metric name with a label for the return code, instead of something like one metric name for 200-series responses, a second one for 400-series responses, and so on.
(One of the advantages of using a single metric with label values instead of multiple metrics is that it's easier to do operations across all of the different versions of the metric that way. If you want the count of all hits, you can just do 'sum(http_requests) without(status)', instead of having to manually add several metrics together.)
However, using labels this way can create an issue if your metric name is for events that occur only rarely. When something happens that resets your group of metrics (such as a machine reboot), you can wind up in a situation where you've seen no events, so you have no time series at all for the metric name, never mind having time series for all of the label values that you expect to eventually see. If there are no time series at all, operations like 'sum()', 'rate()', and so on will fail (well, give no answers), which can potentially make it awkward to do dashboards and graphs that use this metric.
(Your dashboards will usually work once enough time and events have gone by that the metric name is populated with all of the label values that you expect.)
The advantage of separate metric names without labels is that most exporter implementations will naturally give them 0 values when the exporter restarts. These 0 values ensure that the metric is always present, so you can easily build reliable graphs that use it and so on. There are various standard solutions for potentially missing data in Prometheus queries (see here), but they all make the PromQL expression less pleasant.
As covered in Brian Brazil's Existential issues with metrics, if you know the label values that will eventually be present and your environment lets you, it's ideal to pre-create them ahead of time so that they always exist with a value of 0. Unfortunately not all metrics creation environments provide obvious support for this. Even if you can coerce the metrics creation environment into doing it, it's easy to miss the issue when you're first setting things up.
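To make the pre-creation idea concrete, here is a minimal sketch in Python of what it might look like in a hand-rolled exporter that writes the Prometheus text exposition format itself. The metric name 'spam_events_total', the 'kind' label, and its values are all hypothetical, and real client libraries have their own ways of doing this; the only point is the difference between starting with an empty set of counts and starting with an explicit 0 for every label value you expect.

```python
# Hypothetical label values that we expect to see eventually.
EXPECTED_KINDS = ["greylisted", "quarantined", "rejected"]


def new_counts(expected=()):
    """Return per-label-value counts, pre-created at 0 for each expected value."""
    return {value: 0 for value in expected}


def render(name, label, counts):
    """Render the counts as lines in the Prometheus text exposition format."""
    lines = [f"# TYPE {name} counter"]
    for value, count in sorted(counts.items()):
        lines.append(f'{name}{{{label}="{value}"}} {count}')
    return "\n".join(lines)


# Right after a restart, before any events have been seen:
lazy = new_counts()                 # nothing to expose, so no time series exist
eager = new_counts(EXPECTED_KINDS)  # one zero-valued series per expected value

print(render("spam_events_total", "kind", eager))
```

With the pre-created version, every expected label value shows up as a 0-valued sample from the first scrape onward, so there is always something for 'sum()' and 'rate()' to work on. With the lazy version, only the TYPE line is emitted until the first event arrives.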
(This came up when I was augmenting our mtail Exim logfile parsing to count some infrequent spam-related events. Mtail can probably be pushed to do this, but it's not straightforward.)