In Prometheus, it's hard to work with when metric points happened

December 15, 2020

I generally like Prometheus and PromQL, its query language (which you have to do basically everything through). But there is one frustrating exception that I routinely run into, and that is pretty much anything to do with when a metric point happened.

PromQL is great about letting you deal with the value of metrics. But the value is only half of a metric, because each sample (what I usually call a metric point) are a value plus a timestamp (cf). Prometheus stores the timestamp as well as the value, but it has relatively little support for accessing and manipulating the timestamps, much less than it does for the values. However there are plenty of times when the timestamp is at least as interesting as the value. One example is finding out the most recent time that a metric existed, which is useful for all sorts of reasons.

When I started writing this entry (and when I wrote those tweets), I was going to say that Prometheus makes it unnecessarily hard and should fix the problem. But the more I think about it, that's only true for simple cases. In more complex cases you need to conditionally select metric points by their value, and at that point Prometheus has no simple solution. In the current state of Prometheus, any 'evaluation over time' that needs something more sophisticated than matching labels must use Prometheus subqueries, and that holds whether it's using the value of the metric point or the time it happened at.

Prometheus makes the simple case slightly harder because timestamp() is a function, which means that anything using it must be written as a subquery. But that only hurts you with relatively simple queries, like finding the most recent time a metric existed:

max_over_time( timestamp( vpn_users{user="fred"} )[4w:] )

Even without this you'd have to put the '[4w]' somewhere (although having it as a subquery may be slightly less efficient in some cases).

Now suppose that you want to know the most recent time that pings failed for each of your Blackbox ICMP probes. This requires a conditional check, but the query can be almost the same as before through the magic of Prometheus operators being set operations:

max_over_time( timestamp( probe_success{probe="icmp"} == 0 )[24h:] )

What makes some cases of dealing with values over time in PromQL simpler (for example working how frequently pings fail) isn't any feature in PromQL itself, it's that those metric values are carefully arranged so that you can do useful things with aggregation over time. Timestamps are fundamentally not amenable to this sort of manipulation, but they're far from alone; many metrics have values that you need to filter or process.

(You can get the earliest time instead of the latest time with min_over_time. Other aggregation over time operators are probably not all that useful with timestamps.)

Written on 15 December 2020.
« Chrome is getting its own set of Certificate Authority roots
How to make Grafana properly display a Unix timestamp »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Dec 15 00:12:22 2020
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.