One situation where you absolutely can't use irate() in Prometheus
This is a story about me shooting myself in the foot repeatedly, until I finally noticed.
We have Linux NFS client machines, and we would like to know NFS client performance information about them so we can see things like which filesystems they use the most. The Linux kernel provides copious per-mount information on NFS mounts and the Prometheus node exporter can turn all of it into metrics, but for us the raw node exporter stats are just too overwhelming in our environment; a typical machine generates over 68,000 different node_mountstats metrics values. So we wrote our own program to digest the raw mountstats metrics and do things like sum all of the per NFS operation information together for a given mount. We run the program from cron once a minute and publish its information through the node exporter's extremely handy textfile collector.
(The node exporter's current mountstats collector also has the same
misunderstanding about the
xprt: NFS RPC information that I did, and so reports it on
a per-mount basis.)
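(To make the mechanics concrete, though not with our actual metrics: the textfile collector reads Prometheus text exposition format files from a directory on the machine, so a digest program like ours just writes something like the following to a .prom file there once a minute. The metric names, labels, and values here are hypothetical.)

    # Hypothetical digested NFS metrics, written by cron once a minute
    # into the node exporter's textfile collector directory.
    nfs_digest_operations_total{mount="/homes"} 194024
    nfs_digest_rpc_seconds_total{mount="/homes"} 823.45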
For a while I've been trying to use the cumulative total RPC time
with the total RPC count to generate an average time per RPC graph.
Over and over I've plugged in some variation on 'cumulative seconds
/ cumulative operations' for various sources of both numbers, put
irate() around both, graphed this, and gotten either nothing
at all or maybe a few points. Today I really dug into this and the
penny finally dropped while I was brute-forcing things. The problem
was that I was reflexively using
irate() instead of rate().
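Concretely, my queries were all variations on the following shape (with hypothetical metric names standing in for our real ones):

    irate(nfs_digest_rpc_seconds_total[3m]) / irate(nfs_digest_operations_total[3m])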
The one situation where you absolutely can't use irate() is with oversampled metrics, metrics that Prometheus is scraping more often than they're being generated.
We're generating NFS metrics once a minute but scraping the node
exporter every fifteen seconds, so three out of four of our recorded
NFS metrics are the same.
irate() very specifically uses the last two data points in your sample
interval, and when you're oversampling, those two data points are
very often going to be the exact same actual metric and so have no change between them.
In other words, a great deal of my graph was missing because it was all 'divide by zero' NaNs, since the irate() of the 'cumulative operations' count often came out as zero.
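To sketch it with made-up numbers, the digest program updates a counter once a minute while Prometheus scrapes every fifteen seconds, so the stored samples look something like this (using the hypothetical counter from before):

    t=0s    nfs_digest_operations_total  1000
    t=15s   nfs_digest_operations_total  1000   <- same generated value
    t=30s   nfs_digest_operations_total  1000   <- same generated value
    t=45s   nfs_digest_operations_total  1000   <- same generated value
    t=60s   nfs_digest_operations_total  1040   <- next generated value

Three times out of four, the last two points that irate() sees are identical, so it reports 0. Since the 'cumulative seconds' counter is oversampled in exactly the same way, the division is usually 0/0, which Prometheus evaluates as NaN and so doesn't plot anything.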
(Oversampled metric is my term. There may be an official term for 'scraped more often than it's generated' that I should be using instead.)
Looking back, I specifically mentioned this issue in my entry on
irate(), but apparently
it didn't sink in thoroughly enough. In particular, I didn't think
through the consequences of dividing by the
irate() of an oversampled
metric, which is where things really go off the rails.
(If you merely directly graph the
irate() of an oversampled metric,
you just get a (very) spiky graph.)
Part of what happened is that when I was writing my exploratory
PromQL I was thinking in terms of how often the metric was generated,
not how often it was scraped, so I didn't even consciously consider
that I had an oversampled metric. For example, I was careful to use
'irate(...[3m])', so I would be sure to have at least two valid metric
points for a once a minute metric. This is probably a natural mindset
to have, but that just means it's something I'm going to have to try to
keep in mind for the future for node exporter textfile metrics and anything else that's generated less often than it's scraped.
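What works instead is the same division with rate(), using a range that covers at least two generations of the metric; a sketch with the same hypothetical metric names:

    rate(nfs_digest_rpc_seconds_total[3m]) / rate(nfs_digest_operations_total[3m])

Since rate() effectively looks at the change from the first point to the last point in the range (rather than just the last two points), the duplicated samples in between don't zero it out.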
Sidebar: Why I reflexively reach for irate()
Picking irate() over rate() is mostly a choice of convenience and of what time range and query step you're
working on. Since the default Prometheus graph is one hour, I'm
almost always exploring with a small enough query step that
irate() is the easier way to go. Even
when I'm dealing with an oversampled metric, I can sort of get
away with this if I'm directly graphing the
irate() result; it
will just be (very) spiky; it won't be empty or clearly bogus.