I wish Prometheus had some features to deal with 'missing' metrics

March 14, 2021

Prometheus has a reasonable number of features to let you determine things about changes in ongoing metrics. For example, if you want to know how many separate times your Blackbox ICMP pings have started to fail over a time range (as opposed to how frequently they failed), a starting point would be:

changes( probe_success{ probe="icmp" } [1d] )

(The changes() function is not ideal for this; what you would really like are separate changes_down() and changes_up() functions.)
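
(For a metric that is only ever 0 or 1, like probe_success, I believe resets() can stand in for the missing changes_down(), since it counts every decrease between successive samples. So something like:

resets( probe_success{ probe="icmp" } [1d] )

should count how many separate times a ping target went from succeeding to failing over the past day, assuming the time series was present the whole time. Subtracting that from the changes() version should then give you the upward transitions.)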

But this and similar things only work for metrics (more exactly, time series) that are always present and only have their values change. Many metrics come and go, and right now in Prometheus you can't do changes-like things with them as a result. You can probably get averages over time, but it's at least pretty difficult to get something as simple as a count of how many times an alert fired within a given time interval. As with timestamps for samples, the information necessary is in Prometheus' underlying time series database, but it's not exposed to us.
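
(To make this concrete, alerts do show up in Prometheus' metrics storage as the ALERTS metric, which has alertname and alertstate labels. But a straightforward query like:

count_over_time( ALERTS{ alertname="HostDown", alertstate="firing" } [1d] )

(with a made up alert name) only tells you how many samples of that time series exist over the day, which mostly reflects your rule evaluation interval and how long the alert stayed firing, not how many separate times it fired.)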

One starting point would be to expose information that Prometheus already has about time series going stale. As covered in the official documentation on staleness, Prometheus detects most cases of metrics disappearing and puts an explicit marker in the TSDB (although this doesn't handle all cases). But then it doesn't do anything with this marker except stop returning the now-stale time series in query results. Perhaps it would be possible within the existing interfaces to the TSDB to add a count_stale() function that would return a count of how many times a time series for a metric had gone stale within the range.
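
(To illustrate, if such a function existed you could imagine writing something like:

count_stale( ALERTS{ alertname="HostDown", alertstate="firing" } [1d] )

to get how many times that alert stopped firing over the past day. This is purely hypothetical; there is no count_stale() in PromQL today, and the alert name is made up.)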

The flipside is counting or detecting when time series appear. I think this is harder in the current TSDB model, because I don't think there's an explicit marker when a previously not-there time series appears. This means that to know if a time series was new at time X, Prometheus would have to look back up to five minutes (by default) to check for staleness markers and to see if the time series was there. This is possible but would involve more work.

However, I think it's worth finding a solution. It feels frankly embarrassing that Prometheus currently cannot answer basic questions like 'how many times did this alert fire over this time interval'.

(Possibly you can use very clever Prometheus queries with subqueries to get an answer. Subqueries allow you to do a lot of brute force things if you try hard enough, so I can imagine detecting some indirect sign of a just-appeared ALERTS metric with a subquery.)
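
(For example, I can imagine something along these lines, which uses 'unless ... offset' inside a subquery to pick out the moments where an alert's time series exists but didn't exist a step earlier, and then counts them:

count_over_time( (ALERTS{ alertname="HostDown", alertstate="firing" } unless ALERTS{ alertname="HostDown", alertstate="firing" } offset 1m)[1d:1m] )

I haven't verified this, the alert name is made up, and it's at the mercy of the subquery step; with a one minute step it will miss alerts that fire and resolve entirely within a minute.)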


Comments on this page:

By dozzie at 2021-03-14 13:52:20:

You can probably get averages over time, but it's at least pretty difficult to get something as simple as a count of how many times an alert fired within a given time interval.

Because it's a different type of data. Alerts firing are not metrics, but events (just like logs), and Prometheus is a metrics storage. It's really not a surprise it's ill-suited for other types of data.

By cks at 2021-03-14 18:10:45:

I picked alerts because they show up in Prometheus' metrics storage, primarily as the ALERTS metric. In addition many of the questions you want to ask about them are what I consider metrics-like instead of logs-like.

(This is true of a lot of logs in general; we ask a lot of 'how many times' and 'what different <X>s do we see in' and so on questions about many different sorts of logs.)

By dozzie at 2021-03-15 11:48:07:

Of course, logs and metrics can be quite similar in a bunch of aspects, otherwise it wouldn't be as easy to shoehorn one into storage for the other. Yet, they are still separate data types, and unless the storage was prepared specifically for the given type, it will only work with it so-so, as you noticed yourself.

By roidelapluie at 2021-03-31 10:28:52:

Hello,

However, I think it's worth finding a solution. It feels frankly embarrassing that Prometheus currently cannot answer basic questions like 'how many times did this alert fire over this time interval'.

You can answer this question with the following query:

changes(ALERTS_FOR_STATE[1h])+1

By cks at 2021-03-31 10:54:59:

This is a better attempt than I expected, but unfortunately it doesn't work either. This will tell you how many alerts started to trigger over a time interval, but it won't tell you how many actually fired because the ALERTS_FOR_STATE metric doesn't have a label for whether the alert is pending or firing.

I hadn't realized that changes() worked quite this way for time series with gaps in them, and it's a useful thing to know. It's possible this will change what sort of metrics I generate for some things, since it's clearly useful to have different values for anything I want to count.

By roidelapluie at 2021-03-31 18:42:54:

Yeah in this case it seems the "correct" answer is with subqueries as you expected.

changes((ALERTS_FOR_STATE and ignoring (alertstate) ALERTS{alertstate="firing"})[1h:])+1

By roidelapluie at 2021-04-01 01:58:49:

Regarding your request about changes_up/changes_down: while we do not have changes_up, resets() is the equivalent of changes_down().
