Understanding Prometheus' changes() function and what it can do for me

March 31, 2021

Recently, roidelapluie left an interesting comment on my entry wishing that Prometheus had features to deal with missing metrics. The comment suggested answering my question about how many times alerts fired over a time interval with a clever (or perhaps obvious) use of the changes() function:

changes(ALERTS_FOR_STATE[1h])+1

When I tried this out, I had one of those 'how does this work' moments until I thought about it more. To understand why this works as well as it does, I'll start with the documentation for changes():

For each input time series, changes(v range-vector) returns the number of times its value has changed within the provided time range as an instant vector.

If you have a continuous time series, one that has always existed within the time range, this gives you the number of times that its value has changed (which is not the same as the number of different values it's had across that time range). If this is a time series like the Blackbox's probe_success, which is either 0 or 1 depending on whether it succeeded, this will tell you how many times the probe has changed states between succeeding and failing.
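To make the difference between 'number of changes' and 'number of different values' concrete, here is a minimal Python sketch of how I understand changes() to behave. This is a simulation, not Prometheus code, and the probe_success samples are made up for illustration.

```python
def changes(samples):
    """Count adjacent pairs of samples whose values differ.

    This mimics my understanding of PromQL's changes(); it is a
    simulation, not the actual Prometheus implementation.
    """
    values = [value for _timestamp, value in samples]
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


# Hypothetical probe_success samples as (timestamp, value) pairs,
# scraped every 30 seconds.
probe_success = [(0, 1), (30, 1), (60, 0), (90, 0), (120, 1), (150, 0)]

state_changes = changes(probe_success)
distinct_values = len({value for _timestamp, value in probe_success})

print("changes:", state_changes)
print("distinct values:", distinct_values)
```

The probe flips between succeeding and failing several times, but it only ever has two different values.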

(To work out how many times the probe has started to fail, it's not enough to divide changes() by two; you also need to know what the probe's state was at the start and the end of the time range.)
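A small Python sketch shows why dividing by two is not enough. The two made-up sample sequences here have the same number of changes but a different number of failure starts, purely because of the state they begin and end in. The failure_starts() helper is my own illustration, not something PromQL provides.

```python
def changes(values):
    """Count adjacent pairs of values that differ (a simulation of changes())."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def failure_starts(values):
    """Count transitions from 1 (success) to 0 (failure).

    Hypothetical helper for illustration only.
    """
    return sum(1 for prev, cur in zip(values, values[1:]) if prev == 1 and cur == 0)


# Two hypothetical probe_success histories over the same time range.
starts_up = [1, 0, 1, 0]    # begins succeeding, ends failing
starts_down = [0, 1, 0, 1]  # begins failing, ends succeeding

for name, values in (("starts_up", starts_up), ("starts_down", starts_down)):
    print(name, "changes:", changes(values), "failure starts:", failure_starts(values))
```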

If you apply changes() to a continuous metric where the values reset every so often, you will get a count of how many times the values changed and thus how many times there was a value reset. For instance, if you make DNS SOA queries through Blackbox, you will get the zone's current serial number back as a probe_dns_serial metric, and changes(probe_dns_serial[1w]) will tell you how many times you (or someone else) did zone updates over the past week (well, more or less; this is really only valid for your own authoritative DNS servers). Similarly, if you want to know how many times a host rebooted over the past week you can ask for:

changes(node_boot_time_seconds[1w])

(Well, more or less. There are qualifications if your clocks are changing.)

What this example points out is the value of having a metric with a value that's fixed when some underlying thing changes (such as the system booting), instead of changing all of the time. What the Linux kernel really provides is 'seconds since boot', but if node_exporter directly exposed that it would change on every scrape and we could not use changes() this way.
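Here is a rough Python simulation of that difference, assuming a made-up host that is scraped every 15 seconds and reboots once in the middle of the range. The timestamps and the changes() helper are illustrative assumptions, not real node_exporter data or Prometheus code.

```python
def changes(values):
    """Count adjacent pairs of values that differ (a simulation of changes())."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


# Hypothetical scrape times, 15 seconds apart.
scrape_times = [1000, 1015, 1030, 1045, 1060, 1075, 1090, 1105]

# The host originally booted at t=400 and rebooted at t=1050.
first_boot, second_boot = 400, 1050

# A fixed boot timestamp, in the style of node_boot_time_seconds.
boot_time = [first_boot if t < second_boot else second_boot for t in scrape_times]

# 'Seconds since boot', which is different at every scrape.
uptime = [t - boot for t, boot in zip(scrape_times, boot_time)]

print("changes(boot_time):", changes(boot_time))
print("changes(uptime):", changes(uptime))
```

The fixed timestamp changes exactly once, at the reboot, while the uptime value changes between every pair of scrapes and so tells you nothing about reboots.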

If you apply changes() to a metric that's sometimes missing, such as ALERTS, the missing sections are ignored (the actual code is literally unaware of them as far as I can tell); what matters is the sequence of values for time series points that actually exist. When the time series always has a fixed value when it exists, such as the fixed ALERTS value of '1', changes() will always tell you that there are 0 changes over the time range for every time series with points within it. This is because the values of the time series points are always the same, and changes() is sadly blind to the time series appearing and disappearing.
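As a sketch of this blindness, here is a Python simulation of an ALERTS-like series that disappears for a while and then comes back. The sample data, the 60 second interval, and the appearances() helper are all assumptions of mine for illustration; the helper looks at timestamps, which is exactly the information that changes() does not use.

```python
def changes(samples):
    """Count adjacent samples whose values differ, ignoring timestamps."""
    values = [value for _timestamp, value in samples]
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def appearances(samples, max_gap):
    """Count runs of samples separated by gaps longer than max_gap seconds.

    Hypothetical helper; PromQL's changes() does nothing like this.
    """
    timestamps = [timestamp for timestamp, _value in samples]
    if not timestamps:
        return 0
    gaps = sum(1 for prev, cur in zip(timestamps, timestamps[1:]) if cur - prev > max_gap)
    return gaps + 1


# Hypothetical ALERTS samples: the alert fires, goes away, and fires again.
# The value is always 1 whenever the series exists.
alerts = [(0, 1), (60, 1), (120, 1), (600, 1), (660, 1)]

print("changes:", changes(alerts))
print("appearances:", appearances(alerts, max_gap=60))
```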

If you apply changes() to a non-continuous metric where the value is reset when the time series reappears, you'll get a count that is one less than the number of times that the time series appears. This is the situation for ALERTS_FOR_STATE, where its value is the starting time of an alert. If a given alert was triggered only once, there's only one timestamp value and changes() will tell you it never changed. If a given alert was triggered twice, there are two timestamp values and changes() will tell you it changed once. And so on.
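The 'one less' arithmetic can be sketched in Python with a made-up ALERTS_FOR_STATE-like series for an alert that was triggered three times. The timestamps are invented and changes() is again my simulation rather than Prometheus code.

```python
def changes(samples):
    """Count adjacent samples whose values differ, ignoring timestamps."""
    values = [value for _timestamp, value in samples]
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


# Hypothetical ALERTS_FOR_STATE samples as (timestamp, value) pairs.
# The value is the time that particular firing of the alert started,
# and the series is absent between firings.
alerts_for_state = [
    (1000, 1000), (1060, 1000), (1120, 1000),  # first firing
    (2000, 2000), (2060, 2000),                # second firing
    (5000, 5000),                              # third firing
]

change_count = changes(alerts_for_state)
times_fired = change_count + 1  # the '+1' from the original query

print("changes:", change_count)
print("times fired:", times_fired)
```

As far as I know, the '+1' is safe because a time series with no points at all in the range does not show up in the result of changes(), so an alert that never fired is not counted as having fired once.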

What all of this biases me towards is exposing some form of fixed timestamp in any situation where I may want to count the number of times something happens. This is probably so even if the underlying data is in the form of a duration ('X seconds ago'), as we saw with host boot times. If I don't have a timestamp, maybe I can come up with some other fixed number instead of just using a '1'. Of course this can be taken too far, since using a fixed '1' value has its own conveniences.

Last modified: Wed Mar 31 23:09:51 2021