Understanding Prometheus' changes() function and what it can do for me
Recently, roidelapluie wrote an interesting comment on my entry
wishing that Prometheus had features to deal with missing metrics.
The comment suggested answering my question about how many times
alerts fired over a time interval with a clever (or perhaps obvious)
use of the changes() function:
changes(ALERTS_FOR_STATE[1h])+1
When I tried this out, I had one of those 'how does this work' moments
until I thought about it more. To understand why this works as well as
it does, I'll start with the documentation for changes():
For each input time series, changes(v range-vector) returns the number of times its value has changed within the provided time range as an instant vector.
If you have a continuous time series, one that has always existed
within the time range, this gives you the number of times that its
value has changed (which is not the same as the number of different
values it's had across that time range). If this is a time series
like Blackbox's probe_success, which is either 0 or 1 depending on
whether it succeeded, this will tell you how many times the probe has
changed states between succeeding and failing.
(To work out how many times the probe has started to fail, it's not
enough to divide changes() by two; you also need to know what the
probe's state was at the start and the end of the time range.)
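To see why dividing by two isn't enough, here is a minimal Python sketch that models changes() on a plain list of sample values. The changes() and failure_starts() helpers and the sample lists are my own illustration, not Prometheus code, and they assume the probe result is always exactly 0 or 1.

```python
def changes(values):
    """Count how often consecutive samples differ, in the spirit of PromQL changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def failure_starts(values):
    """Count 1 -> 0 transitions, i.e. how many times the probe started to fail."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev == 1 and cur == 0)


# Two hypothetical probe_success histories with the same number of changes.
began_succeeding = [1, 0, 1, 0]  # starts up, ends down
began_failing = [0, 1, 0, 1]     # starts down, ends up

same_changes = changes(began_succeeding) == changes(began_failing)
starts_a = failure_starts(began_succeeding)
starts_b = failure_starts(began_failing)
```

Both histories have the same changes() count, but the first one has one more 'started to fail' event than the second, purely because of the state it began in. A failure that was already in progress at the start of the range is not counted as a start in this model.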
If you apply changes() to a continuous metric where the values
reset every so often, you will get a count of how many times the
values changed and thus how many times there was a value reset. For
instance, if you make DNS SOA queries through Blackbox, you
will get the zone's current serial number back as a probe_dns_serial
metric, and changes(probe_dns_serial[1w]) will tell you how
many times you (or someone else) did zone updates over the past
week (well, more or less; this is really only valid for your own
authoritative DNS servers). Similarly, if you want to know how
many times a host rebooted over the past week you can ask for:
changes(node_boot_time_seconds[1w])
(Well, more or less. There are qualifications if your clocks are changing.)
What this example points out is the value of having a metric with a
value that's fixed when some underlying thing changes (such as the
system booting), instead of changing all of the time. What the Linux
kernel really provides is 'seconds since boot', but if node_exporter
directly exposed that, it would change on every scrape and we could
not use changes() this way.
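The difference between the two representations can be sketched in Python. This is only a model under assumptions of mine: the scrape times, the boot times, and the changes() helper are all made up for illustration, and clocks are assumed to be perfectly stable.

```python
def changes(values):
    """Count how often consecutive samples differ, in the spirit of PromQL changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


# Hypothetical scrape times in seconds; the host boots at 990 and reboots at 1050.
scrape_times = [1000, 1015, 1030, 1045, 1060, 1075]
FIRST_BOOT = 990
REBOOT = 1050


def boot_time_at(t):
    """The fixed boot timestamp that would be exposed at scrape time t."""
    return REBOOT if t >= REBOOT else FIRST_BOOT


boot_time_series = [boot_time_at(t) for t in scrape_times]
uptime_series = [t - boot_time_at(t) for t in scrape_times]

reboots_seen = changes(boot_time_series)  # changes only at the reboot
uptime_changes = changes(uptime_series)   # changes on every single scrape
```

With the fixed boot timestamp, the count is exactly the one reboot. With 'seconds since boot', every scrape yields a new value, so the count only tells you how many scrapes there were.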
If you apply changes() to a metric that's sometimes missing, such
as ALERTS, the missing sections are ignored (the actual code is
literally unaware of them as far as I can tell); what matters is
the sequence of values for time series points that actually exist.
When the time series always has a fixed value when it exists, such
as the fixed ALERTS value of '1', changes() will always tell
you that there are 0 changes over the time range for every time
series with points within it. This is because the values of the
time series points are always the same, and changes() is sadly
blind to the time series appearing and disappearing.
If you apply changes() to a non-continuous metric where the value
is reset when the time series reappears, you'll get a count that
is one less than the number of times that the time series appears.
This is the situation for ALERTS_FOR_STATE, where its value
is the starting time of an alert.
If a given alert was triggered only once, there's only one timestamp
value and changes() will tell you it never changed. If a given
alert was triggered twice, there are two timestamp values and
changes() will tell you it changed once. And so on.
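Here is a minimal Python sketch of that reasoning. It models a time series as a list of (timestamp, value) points that actually exist, so a gap is simply a stretch with no points. The sample data and the changes() helper are hypothetical, and the model assumes that each firing of the alert gets a different start time as its value.

```python
def changes(samples):
    """Count value changes between consecutive existing samples.

    `samples` is a list of (timestamp, value) pairs. Gaps are just absent
    points, so this model never sees a series appear or disappear.
    """
    values = [value for _, value in samples]
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


# The alert fires twice in the range, with nothing between t=30 and t=90.
alerts = [(0, 1), (15, 1), (30, 1), (90, 1), (105, 1)]
alerts_for_state = [(0, 1000), (15, 1000), (30, 1000), (90, 1090), (105, 1090)]

constant_changes = changes(alerts)       # a fixed '1' never changes
firings = changes(alerts_for_state) + 1  # one change, plus one

# A single firing: one start time, no changes, so the +1 gives one firing.
single_firing = changes([(0, 1000), (15, 1000)]) + 1
```

The '+1' only makes sense for a series that has at least one point in the range. In this model an empty list would still yield 1, so a caller would need to skip series with no points.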
What all of this biases me towards is exposing some form of fixed timestamp in any situation where I may want to count the number of times something happens. This is probably so even if the underlying data is in the form of a duration ('X seconds ago'), as we saw with host boot times. If I don't have a timestamp, maybe I can come up with some other fixed number instead of just using a '1'. Of course this can be taken too far, since using a fixed '1' value has its own conveniences.