Wandering Thoughts


A pattern for dealing with missing metrics in Prometheus in simple cases

Previously, I mentioned that Prometheus expressions are filters, which is part of Prometheus having a generally set-oriented view of the world. One of the consequences of this view is that you can quite often have expressions that give you a null result when you really want the result to be 0.

For example, let's suppose that you want a Grafana dashboard that includes a box that tells you how many Prometheus alerts are currently firing. When alerts are firing, Prometheus exposes an ALERTS metric for each active alert, so on the surface you would count these up with:

count( ALERTS{alertstate="firing"} )

Then one day you don't have any firing alerts and your dashboard's box says 'N/A' or 'null' instead of the '0' that you want. This happens because 'ALERTS{alertstate="firing"}' matches nothing, so the result is a null set, and count() of a null set is a null result (or, technically, a null set).

The official recommended practice is not to have metrics or metric label values that come and go; all of your metrics and label sets should be as constant as possible. As you can tell from the official Prometheus ALERTS metric, not even Prometheus itself fully follows this, so we need a way to deal with it.

My preferred way of dealing with this is to use 'or vector(0)' to make sure that I'm never dealing with a null set. The easiest thing to use this with is sum():

sum( ALERTS{alertstate="firing"} or vector(0) )

Using sum() has the useful property that the extra vector(0) element has no effect on the result. You can often use sum() instead of count() because many sporadic metrics have the value of '1' when they're present; it's the accepted way of creating what is essentially a boolean 'I am here' metric such as ALERTS.

If you're filtering for a specific value or value range, you can still use sum() instead of count() by using bool on the comparison:

sum( node_load1 > bool 10 or vector(0) )

If you're counting a value within a range, be careful where you put the bool; it needs to go on the last comparison. Eg:

sum( node_load1 > 5 < bool 10 or vector(0) )

If you have to use count() for more complicated reasons, the obvious approach is to subtract 1 from the result.
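For instance, a count()-based version of the original alert count would look something like this (when there are no firing alerts, count() sees only the lone vector(0) element and returns 1, so subtracting 1 gives the 0 you want):

```promql
count( ALERTS{alertstate="firing"} or vector(0) ) - 1
```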

Unfortunately this approach starts breaking down rapidly when you want to do something more complicated. It's possible to compute a bare average over time using a subquery:

avg_over_time( (sum( ALERTS{alertstate="firing"} or vector(0) ))[6h:] )

(Averages over time of metrics that are 0 or 1, like up, are the classical way of figuring out things like 'what percentage of the time is my service down'.)
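As a sketch of that classical pattern (the job label here is just an assumed example of how your targets might be named), the fraction of the last day a target was up would be roughly:

```promql
avg_over_time( up{job="node"}[1d] )
```

One minus this is roughly the fraction of time the target was down, ignoring scrape gaps.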

However I don't know how to do this if you want something like an average over time by alert name or by hostname. In both cases, even alerts that were present some of the time were not present all of the time, and they can't be filled in with 'vector(0)' because the labels don't match (and can't be made to match). Nor do I know of a good way to get the divisor for a manual averaging. Perhaps you would want to do an unnecessary subquery so you can exactly control the step and thus the divisor. This would be something like:

sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[6h:1m] ) / (6*60)

Experimentation suggests that this provides plausible results, at least. Hopefully it's not too inefficient. In Grafana, you need to write the subquery as '[$__range:1m]' but the division as '($__range_s / 60)', because the Grafana template variable $__range includes the time units.
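Put together, the Grafana version of this query would look something like the following ($__range and $__range_s are Grafana's built-in template variables):

```promql
sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[$__range:1m] ) / ($__range_s / 60)
```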

(See also Existential issues with metrics.)

PrometheusMissingMetricsPattern written at 00:39:58


Remembering that Prometheus expressions act as filters

In conventional languages, comparisons like '>' and other boolean operations like 'and' give you implicit or explicit boolean results. Sometimes this is a pseudo-boolean result; in Python if you say 'A and B', you famously get either the value of A (if it's false-y) or the value of B as the end result, instead of a plain True or False. However, PromQL doesn't work this way. As I keep having to remember over and over, in Prometheus, comparisons and other boolean operators are filters.

In PromQL, when you write 'some_metric > 10', what happens is that first Prometheus generates a full instant vector for some_metric, with all of the metric points and their labels and their values, and then it filters out any metric point in the instant vector where the value isn't larger than 10. What you have left is a smaller instant vector, but all of the values of the metric points in it are their original ones.

The same thing happens with 'and'. When you write 'some_metric and other_metric', the other_metric is used only as a filter; metric points from some_metric are only included in the result set if there is the same set of labels in the other_metric instant vector. This means that the values of other_metric are irrelevant and do not propagate into the result.

The large scale effect of this is that the values that tend to propagate through your rule expression are whatever started out as the first metric you looked at (or whatever arithmetic you perform on them). Sometimes, especially in alert rules, this can bias you toward putting one condition in front of the other. For instance, suppose that you want to trigger an alert when the one-minute load average is above 20 and the five-minute load average is above 5, and you write the alert rule as:

expr: (node_load5 > 5) and (node_load1 > 20)

The value available in the alert rule and your alert messages is the value of node_load5, not node_load1, because node_load5 is what you started out the rule with. If you find the value of node_load1 more useful in your alert messages, you'll want to flip the order of these two clauses around.
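So if node_load1's value is the one you want in your alert messages, the flipped version of the rule above would be:

```yaml
expr: (node_load1 > 20) and (node_load5 > 5)
```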

As the PromQL documentation covers, you can turn comparison operations from filters into pseudo-booleans by using 'bool', as in 'some_metric > bool 10'. As far as I know, there is no way to do this with 'and', which always functions as a filter, although you can at least select what labels have to match (or what labels to ignore).

PS: For some reason I keep forgetting that 'and', 'or', and 'unless' can use 'on' and 'ignoring' to select what labels you care about. What you can't do with them, though, is propagate some labels from the right side into the result; if you need that, you have to use 'group_left' or 'group_right' and figure out how to re-frame your operation so that it involves a comparison, since 'and' and company don't work with grouping.

(I was going to confidently write an entry echoing something that I said on the Prometheus users mailing list recently, but when I checked the documentation and performed some tests, it turned out I was wrong about an important aspect of it. So this entry is rather smaller in scope, and is written mostly to get this straight in my head since I keep forgetting the details of it.)

PrometheusExpressionsFilter written at 23:59:31


Why selecting times is still useful even for dashboards that are about right now

In the aftermath of our power outage, one of the things that I did was put together a Grafana dashboard that was specifically focused on dealing with large scale issues, meaning a lot of machines being down or having problems. In this sort of situation, we don't need to see elaborate status displays and state information; basically we want a list of down machines and a list of other alerts, and very little else to get in the way.

(We have an existing overview dashboard, but it's designed with the tacit assumption that only a few or no machines are down and we want to see a lot of other state information. This is true in our normal situation, but not if we're going through a power shutdown or other large scale event.)

This dashboard will likely only ever be used in production displaying the current time, because 'what is (still) wrong right now' is its entire purpose. Yet when I built it, I found that I not only wanted to leave in the normal Grafana time setting options but specifically build in a panel that would let me easily narrow in on a specific (end) time. This is because setting the time to a specific point is extremely useful for development, testing, and demos of your dashboard. In my case, I could set my in-development dashboard back to a point during our large scale power outage issues and ask myself whether what I was seeing was useful and complete, or whether it was annoying and missing things we'd want to know.

(And also test that the queries and Grafana panel configurations and so on were producing the results that I expected and needed.)

This is obviously especially useful for dashboards that are only interesting in exceptional conditions, conditions that you hopefully don't see all the time and can't find on demand. We don't have large scale issues all that often, so if I want to see and test my dashboard during one before the next issue happens I need to rewind time and set it at a point where the last crisis was happening.

(Now that I've written this down it all feels obvious, but it initially wasn't when I was staring at my dashboard at the current time, showing nothing because nothing was down, and wondering how I was going to test it.)

Sidebar: My best time-selection option in Grafana

In my experience, the best way to select a time range or a time endpoint in Grafana is through a graph panel that shows something over time. What you show doesn't matter, although you might as well try to make it useful; what you really care about is the time scale at the bottom that lets you swipe and drag to pick the end and start points of the time range. The Grafana time selector at the top right is good for the times that it gives fast access to, but it is slow and annoying if you want, say, '8:30 am yesterday'. It is much faster to use the time selector to get your graph so that it includes the time point you care about, then select it off the graph.

DashboardSetTimeUseful written at 22:45:30


It's always DNS (a story of our circular dependency)

Our building and in fact much of the University of Toronto downtown campus had a major power failure tonight. When power came back on I wasn't really expecting our Ubuntu servers to come back online, but to my surprise they started pinging (which meant not just that the actual servers were booting but that the routers, the firewall, the switches, and so on had come back). However when I started ssh'ing in, our servers were not in a good state. For a start, I didn't have a home directory, and in fact none of our NFS filesystems were mounted and the machines were only part-way through boot, stalled trying to NFS mount our central administrative filesystem.

My first thought was that our fileservers had failed to boot up, either our new Linux ones or our old faithful OmniOS ones, but when I checked they were mostly up. Well, that's getting ahead of things, because when I started to check, what actually happened is that the system I was logged in to reported something like 'cannot resolve host <X>'. That would be a serious problem.

(I could resolve our hostnames from an outside machine, which turned out to be very handy since I needed some way to get their IPs so I could log into them.)

We have a pair of recursive OpenBSD-based resolvers; they had booted and could resolve external names, but they couldn't resolve any of our own names. Our configuration uses Unbound backed by NSD, where the NSD on each resolver is supposed to hold a cached copy of our local zones that is refreshed from our private master. In past power shutdowns, this has allowed the resolvers to boot and serve DNS data from our zones even without the private master being up, but this time around it didn't; both NSDs returned SERVFAIL when queried, and 'nsd-control zonestatus' reported things like:

zone: <our-zone>
      state: refreshing
      served-serial: none
      commit-serial: none

Our private master was up, but like everything else it was stalled trying to NFS mount our central administrative filesystem. Since this central filesystem is where our nameserver data lives, this was a hard dependency. This NFS mount turned out to be stalled for two reasons. The obvious and easy to deal with one was that the private master couldn't resolve the hostname of the NFS fileserver. When I tried to mount by IP address, I found the second one; the fileserver itself was refusing mounts because, without working DNS, it couldn't map IP addresses to names to verify NFS mount permission.

(To break this dependency I wound up adding NFS export permission for the IP address of the private master, then manually mounting the filesystem from the fileserver's IP on the private master. This let the boot continue, our private master's nameserver started, our local resolvers could refresh their zones from it, and suddenly internal DNS resolution started working for everyone. Shortly afterward, everyone could at least get the central administrative filesystem mounted.)

So, apparently it really always is DNS, even when you think it won't be and you've tried to engineer things so that your DNS will always work (and when it's worked right in the past).

OurDNSCircularDependency written at 01:42:40


Our likely ZFS fileserver upgrade plans (as of March 2019)

Our third generation of ZFS fileservers are now in full production, although we're less than half way through migrating all of our filesystems from our second generation fileservers. As peculiar as it sounds, this makes me think ahead to what our likely upgrade plans are.

Our current generation ZFS fileservers are running Ubuntu 18.04 LTS with the Ubuntu version of ZFS (with a frozen kernel version). Given our past habits, it's unlikely that we'll want to upgrade them to Ubuntu 20.04 LTS when that comes out in a year or so, unless there's some important ZFS bugfix or feature that's present in 20.04 (which is possible, cf, although serious bugs will hopefully be fixed in the 18.04 version of ZFS). Instead, we'll only start looking at upgrades when 18.04 goes on its end of life countdown when Ubuntu 22.04 LTS comes out, which historically will be in April of 2022, three years from now.

In 2022, our current server hardware and 2TB data SSDs will be about four years old; based on our past habits, this will not be old enough that we consider them in urgent need of replacement. I hope that we'll turn over the SSDs for new ones with larger capacity (and without four years of write wear), but we might not do it in 2022 at the same time as we execute an upgrade to 22.04. If we have money, we might refresh the servers with new hardware, but if so I think we'd mostly be doing it to have hardware that hadn't been used for four years, instead of more powerful hardware, and in general our SuperMicro servers have been very reliable; our OmniOS generation are now somewhere around five years old and show no signs of problems anywhere. The one exception is that maybe RAM prices will finally have gone down substantially by 2022 so we can afford to put a lot more memory in a new generation of servers.

(We will definitely be upgrading from Ubuntu 18.04 when it starts going out of support, and it's probable that it will be to the current Ubuntu LTS instead of to, say, CentOS. Hardware upgrades are much more uncertain.)

Frankly, next time around I would like us not to have to move our ZFS pools and filesystems over to new fileservers; it takes a lot of work and a lot of time. An 'in place' upgrade for the ZFS pools is now at least possible and I hope that we do it, either by reusing the current servers and swapping in new system disks set up with Ubuntu 22.04, or by moving the data SSDs from one physical server to another and then re-importing the pools and so on.

(We did a 'swap the system disks' upgrade on our OmniOS fileservers when we moved from r151010 to r151014 and it went okay. It turns out that we also did this for a Solaris 10 upgrade many years ago.)

ZFSFileserverUpgradePlans written at 21:47:49


Our current approach for significantly upgrading or modifying servers

Every so often we need to make some significant upgrade or change to one of our servers, for instance to upgrade from Ubuntu version to Ubuntu version. When we do this, we do two things. The first is that we reinstall from scratch rather than try to upgrade the machine's current OS and setup in place. There are a whole bunch of reasons for this (for any OS, not just Linux), including that it gets you as close as possible to ensuring that the current state of the machine isn't dependent on its history.

(A machine that has been through major upgrades inevitably and invariably carries at least some traces of its past, traces that will not be there on a new version that was reinstalled from scratch.)

The second is that we almost always install the new instance of the server on new hardware and swap it into place, rather than reinstalling on the same hardware that is currently the live server. There are exceptions, usually for our generic compute servers, but for anything important we prefer new hardware (this is somewhat of a change from the past). One part of this is that using a new set of hardware makes it easy to refresh the hardware, change the RAM or SSD setup, and so on (and also to put the new server in a different place in your racks). Another part is that when you have two servers, rolling back an upgrade that turns out to have problems is much easier and faster than if you have destroyed the old server in the process of installing the new one. A third reason is more prosaic; there's always less downtime involved in a machine swap than in a reinstall from scratch, and among other things this leads to less or no pressure when you're installing the machine.

One consequence of our approach is that we always have a certain amount of 'not in production' replaced servers that are still in our racks but powered off and disconnected. We don't pull replaced servers immediately, in case we have to roll back to them, so after a while we have to remember that probably we should pull the old version of an upgraded server. We don't always, so every so often we basically wind up weeding our racks, pulling old servers that don't need to be there. One trigger for this weeding is when we need room in a specific rack and it happens to be cluttered up with obsolete servers. Another is when we run short on spare server hardware to turn into more new servers.

(Certain sorts of servers are recycled almost immediately in order to reclaim desirable bits of hardware in them. For example, right now anything with a 10G-T card is probably going to be pulled shortly after an upgrade in order to extract the card, because we don't have too many of them. There was a time when SSDs would have prompted recycling, but not any more.)

PS: We basically never throw out (still) working servers, even very old ones, but they do get less and less desirable over time and so sit deeper and deeper in the depths of our spare hardware storage. The current fate of really old servers is mostly to be loaned or passed on to other people here who need them and who don't mind getting decade old hardware (often with very little RAM by modern standards, which is another reason they get less desirable over time).

PPS: I'm not joking about decade old servers. We recently passed some Dell 1950s on to someone who needed scratch machines.

ServerUpgradeApproach written at 21:48:13


Prometheus's delta() function can be inferior to subtraction with offset

The PromQL delta() function is used on gauges to, well, let's quote its help text:

delta(v range-vector) calculates the difference between the first and last value of each time series element in a range vector v, returning an instant vector with the given deltas and equivalent labels. The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.

Given this description, you would expect that 'delta(yourmetric[24h])' is preferable to the essentially functionally equivalent but more verbose version using offset:

yourmetric - yourmetric offset 24h

(Ignoring some hand waving about any delta extrapolation and so on.)

Unfortunately it is not. In some situations, the offset based version can work when the delta() version fails.

The fundamental problem is unsurprisingly related to Prometheus's lack of label based optimization, and it is that using delta() attempts to load all samples in the entire range into memory, even though most of them will be ignored and discarded. If your metric has a lot of metric points, for example because it has relatively high metric cardinality (many different label values), attempting to load all of the samples into memory can trip Prometheus limits and cause the delta()-based version to fail. The offset based version only ever loads metric points from two times, so it will almost always work.

On the one hand, it's easy to see how Prometheus's implementation of PromQL could wind up doing this. It is natural to write general code that loads range vectors and then have delta() just call it generically and ignore most of the result, especially since there are various special cases. On the other hand, this is a very unfortunate artificial limit that's probably eventually going to affect any delta() query that's made over a sufficiently large timescale.

(This issue doesn't affect rate() and friends, at least in one sense. Because rate() and company have to check for resets over the entire time range, they need to load and use all of the sample points. You can't replace an increase() with an offset unless you're willing to ignore any errors caused by counter resets. If you're doing ad-hoc queries, you probably need to narrow down the number of metric points you're trying to load by using labels and so on. And if you really want to know, say, the average interface bandwidth for a specific network interface over an entire year, you may be plain out of luck until you put more RAM in your Prometheus server and increase its query limits.)

PrometheusDeltaVsOffset written at 18:58:22


Sometimes the simplest version of a graph is a text table

In the past, I've written about learning that sometimes the best way to show information is in a simple graph, for example a basic bar graph of total change instead of a line graph over time. As a co-worker gently encouraged me recently, we can take this further; sometimes the simplest and most accessible form of information is in the form of a plain table of text or something similar to it (eg, as a list).

(The specific situation that prompted this was wanting a simple, easy to read dashboard of ZFS filesystems and pools that are currently more or less full up to their quota limit, especially ones that have recently filled up, because sometimes this causes our fileservers to become upset. We sort of had this information in our existing Grafana dashboards, but a lot of it was in line and bar graphs and so was not the easiest thing to glance at in the heat of the moment.)

I won't say that text is always the simplest and best version of information, because I think it depends on what you want out of it. If you want to clearly read what is essentially textual information, such as the names of full filesystems, then the text format is going to win; even if the information is there in a graph, it's there in labels or things you have to hover over, not the primary visual elements (the lines or bars or points). On the other hand, I think that our bar graphs make it easier to compare the magnitude of things than seeing the same values in text. It's very easy to eyeball a bar graph and see 'that is much bigger than that'; doing the same thing with numbers requires reading the numbers and perhaps interpreting the units (if, for example, we are being helpful by using 'Mbytes' for small numbers and 'Gbytes' for large ones, and so on).

(But if you want to know relatively precisely how much bigger, text is likely to be better. Human beings are good at telling 'smaller' and 'larger', but we are relatively bad at precise measurements of how much. For that matter, there are optical illusions that can fool us on smaller and larger, but hopefully you aren't putting optical illusions in your dashboards.)

The corollary of this is that at some point, I should think about what my dashboards want to say and what information people will want to get from them. I'm not going to say that I should design all of this up front, because right now I don't know enough about what sort of information is even going to be useful to us to do that, but at some point in the design of any particular dashboard I should switch from exploring possibilities to boiling it down to a focused version that's intended for other people.

(If there are elements of an experimental dashboard that are pulling in different directions about what they want to be, I can always make two (or more) production dashboards. Unfortunately Grafana doesn't make this very easy; it's hard to clone dashboards or copy dashboard elements from dashboard to dashboard.)

SimpleTextVsGraphs written at 21:26:30


Prometheus subqueries pick time points in a surprising way

Up until today, I would have confidently told you that I understood how Prometheus subqueries picked the time points that they evaluated your expression at; it was the obvious combination of a range vector with a query step. Given a subquery range such as '[5h:1m]' and assuming an instant query evaluated at 'now', Prometheus would first go back exactly five hours in seconds, as it would for a range vector of '[5h]', and then step forward from that starting time every minute (in seconds), as it would in a Prometheus graphing query (what the HTTP API calls a range query). In fact I did assert this, and then Brian Brazil corrected me. Subqueries do not work this simply and straightforwardly, and how they actually work may have implications for how you want to use them.

It turns out that subqueries use times that are evenly divisible by the step interval, which the blog post and other people describe as 'being aligned with' the step. As part of this, subqueries will start (and finish) earlier than specified in order to generate as many metric points as they intuitively should. This even division is in Unix time, which is in UTC, not in your local timezone.

This is all very abstract, so let's use the example of a subquery range of '[5d:1d]'. This should intuitively yield five metric points as the result, and since the step is one day, the aligned times Prometheus picks will be at midnight UTC (ie, when '<timestamp> % 1d' is 0). As I write this, the time is March 19th 01:40 UTC or so (and 9:40 pm March 18th in local time), and if I execute this query now I will get the following Unix timestamps of metric points, shown here with their translation to UTC time:

@1552608000        March 15 00:00 UTC
@1552694400        March 16 00:00 UTC
@1552780800        March 17 00:00 UTC
@1552867200        March 18 00:00 UTC
@1552953600        March 19 00:00 UTC

Notice that the oldest timestamp is earlier than now minus exactly five days of seconds, which would be March 14th at 01:40 UTC, and the most recent timestamp is not 'now' but back at midnight UTC (which was at 8pm local time).

(Note that this is not when the metric points themselves come from; it is when the subquery expression is evaluated. If I use, say, 'timestamp(node_load1)[5d:1d]' to extract both the metric point timestamp and the evaluation timestamp, I get results that differ, as you'd expect. You can see all of this in the Prometheus web interface by making 'console' queries for subquery range expressions; the web interface will show you all of the returned metric points and their timestamps.)

At small subquery steps, like :1m or :10m or even perhaps :1h, this alignment probably doesn't matter a lot. At large time steps this may well be important to you, because Prometheus always aligns your subqueries to UTC, not to local time. There is no way to make a step of :6h or :12h or :1d align to local midnight, local midnight and local noon, and so on, or to be relative to 'now' instead of being aligned with absolute time.

(Apparently the Prometheus people have a reason for doing it this way; I believe it boils down to helping cache things during query evaluation.)

Sidebar: The exact Prometheus code involved

The code involved here is found in prometheus/promql/engine.go, in the eval() method, and reads:

// Start with the first timestamp after (ev.startTimestamp - offset - range)
// that is aligned with the step (multiple of 'newEv.interval').
newEv.startTimestamp = newEv.interval * ((ev.startTimestamp - offsetMillis - rangeMillis) / newEv.interval)
if newEv.startTimestamp < (ev.startTimestamp - offsetMillis - rangeMillis) {
  newEv.startTimestamp += newEv.interval
}
I believe that ev.startTimestamp is usually the same as the ending timestamp, and if it's not I don't understand what this code is going to wind up doing. The division here is integer division.
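As a cross-check, the alignment arithmetic can be sketched in Python. This is my own translation of the Go above (for positive timestamps, Go's integer division matches Python's floor division), using the '[5d:1d]' example and its evaluation time from earlier in the entry:

```python
def subquery_start(end_ts, rng, step, offset=0):
    # First timestamp at or after (end_ts - offset - rng) that is
    # a multiple of the step, mirroring promql/engine.go.
    earliest = end_ts - offset - rng
    start = step * (earliest // step)
    if start < earliest:
        start += step
    return start

DAY = 86400
now = 1552959600  # 2019-03-19 01:40 UTC
start = subquery_start(now, 5 * DAY, DAY)
points = list(range(start, now + 1, DAY))
print(points[0], points[-1], len(points))
# -> 1552608000 1552953600 5 (March 15 through March 19, at 00:00 UTC)
```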

PrometheusSubqueriesPointTime written at 22:25:40


An easy optimization for restricted multi-metric queries in Prometheus

Here's something that I've repeatedly had to learn and remember the hard way when I was building Prometheus queries for our dashboards, especially for graphs. Suppose that you have a PromQL query that involves multiple metrics, for instance you want to know the number of inodes used on a filesystem:

node_filesystem_files - node_filesystem_files_free

(The reason that the Prometheus host agent doesn't directly export a 'number of inodes used' metric is that statfs(2) doesn't provide that information on common Unixes; it provides only 'total inodes' and 'inodes free'. The host agent is being honest here. I could say a lot of things about statfs(2), but this is not the entry for it.)

Now, suppose that you have a lot of servers and a lot of filesystems and that you're actually only interested in the number of inodes used for a few of them. For example, you have a Grafana dashboard that displays the information for the root filesystem on a single host. A perfectly sensible way to write the query for this dashboard is:

node_filesystem_files{ host="$host", mountpoint="/" } - node_filesystem_files_free

Unfortunately, current versions of Prometheus (2.8.0 as I write this entry) miss the obvious way of optimizing this query when they execute it. Instead of propagating the label restrictions from the left hand side query to the right hand side as well, the PromQL engine will get all of the metrics for node_filesystem_files_free, across all of your servers and filesystems, and then throw out all but the single one that matches the left hand side.

As a result, any time you have a multi-metric query that is matching labels across the metrics and one or more of the metrics is restricted with label matches, you can usually improve things by replicating the restriction into the other metrics. This goes for arithmetic operators, boolean operators, and 'and', but obviously doesn't apply for 'or' or 'unless'. This improvement doesn't just boost performance; under some circumstances, it can make the difference between a query that gets a Prometheus error about loading too many metrics points and a query that works.
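For the inode usage query from earlier, the hand-optimized version simply repeats the label matchers on the right hand side:

```promql
node_filesystem_files{ host="$host", mountpoint="/" } - node_filesystem_files_free{ host="$host", mountpoint="/" }
```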

(I find this a bit unfortunate, in that the natural and more maintainable way to write the query is not always the workable way. The performance impact of the less efficient version I can usually live with, but I really don't want my graphs and queries falling over with 'too many metrics points' when I extend the time ranges far enough.)

PrometheusLabelNonOptimization written at 23:47:24
