I wish Prometheus had some features to deal with 'missing' metrics

March 14, 2021

Prometheus has a reasonable number of features to let you determine things about changes in ongoing metrics. For example, if you want to know how many separate times your Blackbox ICMP pings have started to fail over a time range (as opposed to how frequently they failed), a starting point would be:

changes( probe_success{ probe="icmp" } [1d] )

(The changes() function is not ideal for this; what you would really like is changes_down and changes_up functions.)

But this and similar things only work for metrics (more exactly, time series) that are always present and only have their values change. Many metrics come and go, and right now in Prometheus you can't do changes-like things with them as a result. You can probably get averages over time, but it's at least pretty difficult to get something as simple as a count of how many times an alert fired within a given time interval. As with timestamps for samples, the information necessary is in Prometheus' underlying time series database, but it's not exposed to us.

One starting point would be to expose information that Prometheus already has about time series going stale. As covered in the official documentation on staleness, Prometheus detects most cases of metrics disappearing and puts an explicit marker in the TSDB (although this doesn't handle all cases). But then it doesn't do anything with this marker except not answer queries. Perhaps it would be possible within the existing interfaces to the TSDB to add a count_stale() function that would return a count of how many times a time series for a metric had gone stale within the range.

The flipside is counting or detecting when time series appear. I think this is harder in the current TSDB model, because I don't think there's an explicit marker when a previously not-there time series appears. This means that to know if a time series was new at time X, Prometheus would have to look back up to five minutes (by default) to check for staleness markers and to see if the time series was there. This is possible but would involve more work.

However, I think it's worth finding a solution. It feels frankly embarrassing that Prometheus currently cannot answer basic questions like 'how many times did this alert fire over this time interval'.

(Possibly you can use very clever Prometheus queries with subqueries to get an answer. Subqueries allow you to do a lot of brute force things if you try hard enough, so I can imagine detecting some indirect sign of a just appeared ALERT metric with a subquery.)

Written on 14 March 2021.
« Prometheus and the case of the stuck metrics
Different views of what are basic and advanced Vim features »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Mar 14 00:48:51 2021
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.