Prometheus: using gauge-like things as if they were counters
I recently wrote about how I wish ZFS pools kept a persistent count of errors, instead of more or less a current count that you're expected to clear to verify that you've dealt with your problems. In a comment, Daniel Ebdrup Jensen said that this (and other things) would be better collected and tracked in an event monitoring system. My first reaction was that you couldn't really do this, because a persistent, always accumulating count of errors (or whatever) is what Prometheus really wants to see; in other words, a Prometheus counter. The 'current count of uncleared errors' is what I think of as a Prometheus gauge.
Then I realized that under some circumstances, you can treat gauge-like things as if they were counters, although weird counters. The important thing that makes this work is that whenever it manipulates a counter (well, a range vector), Prometheus looks for counter resets. Normally, counter resets are infrequent; they happen when you reboot systems or restart daemons or whatever. But a ZFS count of current errors that you always deal with by clearing completely (so that it drops to zero) can be properly handled by this process of checking for counter resets. When your current error count of six checksum errors goes to zero checksum errors, Prometheus will conclude that the counter has reset, and then when it goes up again, Prometheus functions such as increase() will properly add the two values together to work out your total accumulated checksum errors over some time range.
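To make this concrete, here is a small sketch in Python of the core of Prometheus's counter-reset handling (simplified; the real increase() also extrapolates to the edges of the time range, which I'm ignoring here):

```python
def increase(samples):
    # Simplified model of Prometheus's counter-reset handling:
    # when a sample is lower than the previous one, Prometheus
    # assumes the counter reset to zero and then grew back to
    # the current value, so the whole current value counts as new.
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter reset: count everything since zero
    return total

# Six checksum errors, a full clear back to zero, then three more:
print(increase([0, 2, 6, 0, 3]))  # 9 total accumulated errors
```

The drop from six to zero contributes nothing by itself, but because the six was already accumulated before the drop, the final total comes out to nine, which is the true number of errors seen.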
Of course this comes with a bunch of qualifications and limitations. The most obvious qualification is that you always have to use Prometheus functions in order to work out the current cumulative value even over short time ranges; you can never just look at the current value for a 'cumulative recent errors' value. In turn this means that if you want the cumulative value over a long time range, you're going to be loading a lot of samples and the query will likely be slow.
The second issue is that this only works if you always fully clear ZFS errors (or whatever sort of pseudo-counter you have). If the value drops but doesn't go to zero, Prometheus's counter reset code will assume that it was reset to zero and then grew from there. Only Prometheus gauges are allowed to drop; things you want to treat as counters can only be completely reset to zero.

The third issue is that this can't handle situations where the accumulated count of whatever should be reset to zero, for example because you replaced a disk that had checksum errors with a new one (that got the same name), and so the new disk's count of total checksum errors should start from zero, not have the previous disk's count added to it.
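The partial-clear problem is easy to see with a small Python model of the counter-reset rule (a drop in value is treated as a reset to zero followed by regrowth; this ignores increase()'s range extrapolation):

```python
def increase(samples):
    # Simplified counter-reset rule: any drop in value is treated
    # as a reset to zero, so the whole new value counts as growth.
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# You had six errors and cleared four of them, leaving two.
# No new errors occurred, but the drop from 6 to 2 looks like
# a reset-and-regrow, so two phantom errors get counted:
print(increase([6, 2]))  # 2, when the true increase is 0
```

A partial clear thus silently inflates your accumulated totals, which is why the value has to go all the way to zero for this trick to be trustworthy.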
All in all, using a 'current count of uncleared errors' as a Prometheus counter is a hack. It can work and it's better than nothing, but you're better off having a true counter, one that behaves more like Prometheus expects.
(This is also giving me thoughts about using increase() on selected gauge metrics, much as you can use changes(). For that matter, perhaps I should be using changes() on some counters to provide interesting information, like how many times particular SMART disk attributes went up.)
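As a sketch of that last idea: changes() just counts how many times the value differed between consecutive samples in the range, and for a counter that only ever goes up (SMART attributes rarely reset), that is effectively a count of how many times it increased. A rough Python model, using a made-up SMART attribute series:

```python
def changes(samples):
    # Rough model of PromQL's changes(): count how many times the
    # value differed between consecutive samples in the range.
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

# A hypothetical SMART attribute (say, reallocated sectors) that
# went up twice during the period we're looking at:
print(changes([5, 5, 6, 6, 6, 9, 9]))  # 2
```

Note that changes() doesn't care how big each jump was, only that one happened, which is exactly the 'how many times did this go up' question.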