2019-03-24
Prometheus's delta() function can be inferior to subtraction with offset
The PromQL delta() function is used on gauges to, well, let's quote its help text:

delta(v range-vector) calculates the difference between the first and last value of each time series element in a range vector v, returning an instant vector with the given deltas and equivalent labels. The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.
Given this description, you would expect that 'delta(yourmetric[24h])' is preferable to the essentially functionally equivalent but more verbose version using offset:

yourmetric - yourmetric offset 24h
(Ignoring some hand waving about any delta extrapolation and so on.)
Unfortunately it is not. In some situations, the offset-based version can work when the delta() version fails.
The fundamental problem is unsurprisingly related to Prometheus's lack of label-based optimization, and it is that using delta() attempts to load all samples in the entire range into memory, even though most of them will be ignored and discarded. If your metric has a lot of metric points, for example because it has relatively high metric cardinality (many different label values), attempting to load all of the samples into memory can trip Prometheus limits and cause the delta()-based version to fail.
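To put some rough, entirely made-up numbers on the scale involved: with a 15 second scrape interval, a 24 hour range holds about 5,760 samples per time series, so a metric with 10,000 label combinations means that delta(yourmetric[24h]) has to load something on the order of 57 million samples, almost all of which it then throws away.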
The offset-based version only ever loads metric points from two times, so it will almost always work.
On the one hand, it's easy to see how Prometheus's implementation of PromQL could wind up doing this. It is natural to write general code that loads range vectors and then have delta() just call it generically and ignore most of the result, especially since there are various special cases. On the other hand, this is a very unfortunate artificial limit that's probably eventually going to affect any delta() query that's made over a sufficiently large timescale.
(This issue doesn't affect rate() and friends, at least in one sense. Because rate() and company have to check for resets over the entire time range, they need to load and use all of the sample points. You can't replace an increase() with an offset unless you're willing to ignore any errors caused by counter resets. If you're doing ad-hoc queries, you probably need to narrow down the number of metric points you're trying to load by using labels and so on. And if you really want to know, say, the average interface bandwidth for a specific network interface over an entire year, you may be plain out of luck until you put more RAM in your Prometheus server and increase its query limits.)
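As an illustration of narrowing things down with labels (the metric and label names here are made up for the example), a query along these lines only makes delta() load samples for the time series that match the label matchers:

delta(yourmetric{host="fred", disk="sda"}[24h])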
Link: What has your microcode done for you lately?
What has your microcode done for you lately? (via) starts out being about the low-level performance of scattered writes on x86 machines but develops into a story where, well, I'll just quote from the summary:
Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.
I found the whole thing fascinating and I feel it deserves a wider audience. It's a bit challenging to follow if you don't already know some of the details of low-level CPU and memory access operation (it casually throws around terms like RFO), but working to understand it was interesting and taught me things, and I quite enjoyed the coverage of the issues involved in scattered write performance.
(Of course one has to speculate that the slowdown on recent microcode is due to either deliberate changes due to all of the speculative execution issues or side effects from those changes.)
A bit more on ZFS's per-pool performance statistics
In my entry on ZFS's per-pool stats, I said:
In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse or wuse fields, ie no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there's ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
The discussion of the *lentime stats in the manpage and the relevant header file is very complicated and abstruse. I am sure it makes sense to people for whom the phrase 'a Riemann sum' is perfectly natural, but I am not such a person.
Having ground through a certain amount of arguing with myself and some experimentation, I now believe that the ZFS *lentime stats are functionally equivalent to the Linux ruse and wuse fields. They are not quite identical, but you can use them to make the same sorts of calculations that you can for Linux. In particular, I believe that an almost completely accurate value for the average service time for ZFS pool IOs is:
avgtime = (rlentime + wlentime) / (reads + writes)
The important difference between the ZFS *lentime metrics and Linux's ruse and wuse is that Linux's times include only completed IOs, while the ZFS numbers also include the running time for currently outstanding IOs (which are not counted in reads and writes). However, much of the time this is only going to be a small difference and so the 'average service time' you calculate will be almost completely right. This is especially true if you're doing this over a relatively long time span compared to the actual typical service time, and if there's been lots of IO over that time.
When there is an error, you're going to get an average service time that is higher than it really should be. This is not a terribly bad problem; it's at least not hiding issues by appearing too low.
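Since I believe all of these kstats are cumulative counters, if you want the average service time over a specific interval instead of over the whole life of the stats, the same calculation should work with the change in each counter over that interval (this is my own extrapolation, not something taken from the ZFS documentation):

avgtime = (Δrlentime + Δwlentime) / (Δreads + Δwrites)

where each Δ is the counter's value at the end of the interval minus its value at the start.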