Wandering Thoughts archives


Prometheus's delta() function can be inferior to subtraction with offset

The PromQL delta() function is used on gauges to, well, let's quote its help text:

delta(v range-vector) calculates the difference between the first and last value of each time series element in a range vector v, returning an instant vector with the given deltas and equivalent labels. The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.

Given this description, you would expect that 'delta(yourmetric[24h])' is preferable to the essentially functionally equivalent but more verbose version using offset:

yourmetric - yourmetric offset 24h

(Ignoring some hand waving about any delta extrapolation and so on.)

Unfortunately it is not. In some situations, the offset based version can work when the delta() version fails.

The fundamental problem is unsurprisingly related to Prometheus's lack of label based optimization, and it is that using delta() attempts to load all samples in the entire range into memory, even though most of them will be ignored and discarded. If your metric has a lot of metric points, for example because it has relatively high metric cardinality (many different label values), attempting to load all of the samples into memory can trip Prometheus limits and cause the delta()-based version to fail. The offset based version only ever loads metric points from two times, so it will almost always work.

On the one hand, it's easy to see how Prometheus's implementation of PromQL could wind up doing this. It is natural to write general code that loads range vectors and then have delta() just call it generically and ignore most of the result, especially since there are various special cases. On the other hand, this is a very unfortunate artificial limit that's probably eventually going to affect any delta() query that's made over a sufficiently large timescale.

(This issue doesn't affect rate() and friends, at least in one sense. Because rate() and company have to check for resets over the entire time range, they need to load and use all of the sample points. You can't replace an increase() with an offset unless you're willing to ignore any errors caused by counter resets. If you're doing ad-hoc queries, you probably need to narrow down the number of metric points you're trying to load by using labels and so on. And if you really want to know, say, the average interface bandwidth for a specific network interface over an entire year, you may be plain out of luck until you put more RAM in your Prometheus server and increase its query limits.)
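As an illustration of narrowing things down with labels, here is a sketch of what the two forms of the query might look like with label matchers added. The metric name and label values are hypothetical examples, not anything from a real configuration:

```promql
# Restricting the range vector with label matchers cuts down how many
# samples delta() has to load into memory at once.
delta(yourmetric{instance="myhost:9100", device="sda"}[24h])

# The offset-based equivalent only ever loads samples from two points
# in time, so it stays cheap even without the label matchers.
yourmetric{instance="myhost:9100", device="sda"}
  - yourmetric{instance="myhost:9100", device="sda"} offset 24h
```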

sysadmin/PrometheusDeltaVsOffset written at 18:58:22

Link: What has your microcode done for you lately?

What has your microcode done for you lately? (via) starts out being about the low-level performance of scattered writes on x86 machines but develops into a story where, well, I'll just quote from the summary:

Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.

I found the whole thing fascinating and I feel it deserves a wider audience. It's a bit challenging to follow if you don't already know some of the details of low-level CPU and memory access operation (it casually throws around terms like RFO), but working to understand it was interesting and taught me things, and I quite enjoyed the coverage of the issues involved in scattered write performance.

(Of course one has to speculate that the slowdown on recent microcode is due to either deliberate changes due to all of the speculative execution issues or side effects from those changes.)

links/WhatHasMicrocodeDone written at 13:21:11

A bit more on ZFS's per-pool performance statistics

In my entry on ZFS's per-pool stats, I said:

In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse or wuse fields, ie no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there's ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.

The discussion of the *lentime stats in the manpage and the relevant header file is very complicated and abstruse. I am sure it makes sense to people for whom the phrase 'a Riemann sum' is perfectly natural, but I am not such a person.

Having ground through a certain amount of arguing with myself and experimentation, I now believe that the ZFS *lentime stats are functionally equivalent to the Linux ruse and wuse fields. They are not quite identical, but you can use them to make the same sorts of calculations that you can for Linux. In particular, I believe that an almost completely accurate value for the average service time for ZFS pool IOs is:

avgtime = (rlentime + wlentime) / (reads + writes)

The important difference between the ZFS *lentime metrics and Linux's ruse and wuse is that Linux's times include only completed IOs, while the ZFS numbers also include the running time for currently outstanding IOs (which are not counted in reads and writes). However, much of the time this is only going to be a small difference and so the 'average service time' you calculate will be almost completely right. This is especially true if you're doing this over a relatively long time span compared to the actual typical service time, and if there's been lots of IO over that time.

When there is an error, you're going to get an average service time that is higher than it really should be. This is not a terribly bad problem; it's at least not hiding issues by appearing too low.
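To make the arithmetic concrete, here is a minimal sketch of this calculation in Python. It applies the formula above to the deltas between two snapshots of a pool's kstat counters, which is what you'd do to get the average service time over an interval; the field names follow the ZFS per-pool IO kstats, and the snapshot values in the usage example are made up:

```python
def avg_service_time(prev, cur):
    """Average service time for pool IOs completed between two kstat
    snapshots (dicts of cumulative counter values), computed as
    (rlentime + wlentime) deltas divided by (reads + writes) deltas.
    The result is in whatever time units the kstats use."""
    lentime = ((cur["rlentime"] - prev["rlentime"])
               + (cur["wlentime"] - prev["wlentime"]))
    ios = ((cur["reads"] - prev["reads"])
           + (cur["writes"] - prev["writes"]))
    if ios == 0:
        # No completed IOs in the interval; avoid dividing by zero.
        return 0.0
    return lentime / ios

# Made-up snapshot values purely for illustration:
prev = {"rlentime": 0, "wlentime": 0, "reads": 0, "writes": 0}
cur = {"rlentime": 1000, "wlentime": 500, "reads": 10, "writes": 5}
print(avg_service_time(prev, cur))
```

As the entry notes, any IOs still outstanding at snapshot time inflate the *lentime numerator without being counted in the denominator, so this average errs slightly high; over a long enough interval with enough IO, the error is small.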

solaris/ZFSPerPoolStatsII written at 00:49:53
