Wandering Thoughts archives

2019-04-28

A gotcha with stale metrics and *_over_time() in Prometheus

We monitor the available disk space on our fileservers through custom metrics, and we allow people to set alerts on various levels of low disk space, which are evaluated through a Prometheus alert rule. Because several people are allowed to each set alerts on the same filesystem, we have to use group_right in the alert expression. A simplified version of this expression is:

cslab_avail_bytes <= on (filesystem)
  group_right(fileserver, pool) cslab_alert_minfree

Every so often we migrate a filesystem between fileservers (and ZFS pools). For a long time now, for a relatively short while after each migration, we have been getting a mysterious error from Prometheus about this alert rule:

err="found duplicate series for the match group {filesystem=\"/h/103\"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side

The two duplicate series Prometheus would report were old fileserver and new fileserver versions of the filesystem. What made this mysterious to me was that these two time series never existed at the same time. In every migration, the old time series would end and then a few minutes later the new time series would appear (after the filesystem was migrated, remounted on the machine that generates these metrics, and the metrics generation ran). What it looked like to me was that Prometheus was improperly using a stale metric in alert rule evaluation for some reason.

The explanation ultimately turned out to be very simple. You see, I have simplified the alert expression a little too much, omitting something that I didn't realize made a crucial difference. The real version is more like this:

avg_over_time(cslab_avail_bytes[15m]) <= on (filesystem)
  group_right(fileserver, pool) cslab_alert_minfree

The avg_over_time() and its range expression make all the difference. Even though the old fileserver's time series is stale now, it was non-stale at some time a few minutes in the past, and so the range expression dutifully sweeps it up, avg_over_time() produces a result for it, creating a duplicate, and then Prometheus quite correctly errors out.

This applies to all *_over_time() things, of course, and in general to any range vector expression; they will sweep up metrics that were valid at any point over the time range, not just metrics that are valid now (well, at the end of the period). If you want to restrict your range vector or *_over_time() to only metrics that are non-stale at the end of the time period, you need to say so explicitly, by using an 'and' operator:

(avg_over_time(cslab_avail_bytes[15m]) and cslab_avail_bytes) [...]

Depending on your particular circumstances, this could have some subtle effects on such 'migrating' metrics. Early on, the new, non-stale metric will have only a few valid data points within the time range, and I believe that the *_over_time() will happily use only these available data points to compute your average or whatever. That could mean that you are actually averaging over a much smaller time period than you think.

For our purposes, the easier and more straightforward fix is to remove the avg_over_time() and use a 'for: 15m' on the alert rule instead. This doesn't have quite the same effect, but the effects it does have are probably easier to explain to people who are setting and getting low disk space alerts.
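As a sketch of this (with a made-up rule group and alert name, and the simplified expression from above), the alert rule would look something like:

groups:
  - name: diskspace
    rules:
      - alert: LowDiskSpace
        # the alert only fires once the condition has held continuously for 15 minutes
        expr: cslab_avail_bytes <= on (filesystem) group_right(fileserver, pool) cslab_alert_minfree
        for: 15m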

Once I thought about it, it was intuitive that range vectors and things that use them will return values for (currently) stale time series, but it simply hadn't occurred to me before. In fact I was so oblivious to the possibility that when I asked the Prometheus mailing list about this, I provided the first, over-simplified version of the alert rule, because I didn't think the avg_over_time() mattered. As we've seen, it very much does.

PS: I think this means that if you have an alert rule using a range vector (in a *_over_time() or otherwise), and the alert is currently firing, and you deal with it by removing the underlying metric (perhaps it's coming from an outdated source), the alert may continue firing for a while because the range vector will keep picking up the old values of the now-stale metric until they fall outside of its time range. But probably not very many people deal with spurious alerts by stopping monitoring the metric source.

(With us, it can happen if a machine is converted from something we're monitoring into an experimental machine that may have wild stuff done to it. We tend not to monitor experimental machines at all, so our natural reaction is to remove the existing monitoring when we get reminded of the machine's change in purpose.)

PrometheusStaleMetricsOverTime written at 22:52:03

2019-04-26

Brief notes on making Prometheus instant queries with curl

Every so often, I wind up wanting to directly extract some information from Prometheus without going through a Grafana dashboard or even the Prometheus web UI. At this point I have stubbed my toes on the same issues enough times (ie, more than once) that I'm going to write down some how-to information for my future reference.

In general, the Prometheus HTTP API itself is tersely documented in the obvious spot in the documentation, and see also Extracting raw samples from Prometheus. Since it returns JSON, you'll probably want to have some JSON processing program for command line usage; I'm fond of jq, which at this point should be in everyone's toolbox.

Making a query requires at least a PromQL expression. In simple cases where all you want is all metric points for a bare metric, you can give this directly on the command line for curl, eg:

curl 'localhost:9090/api/v1/query?query=up'

In more complex cases you'll have a PromQL query that requires encoding to be properly interpreted, and perhaps extra parameters on the query. Fortunately Prometheus accepts POST requests just like GET requests, and curl will do the work for us:

curl --data-urlencode 'query=up{job="alertmanager"}' --data-urlencode 'time=1556330052' localhost:9090/api/v1/query

Generally I want to use the -s flag with curl, to make it not spit out status information when I'm feeding its output to jq. I may or may not want to use -f to avoid server error pages. In scripts I want some sort of timeout, such as '-m 10'.
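Putting those flags together, a typical interactive query looks something like this; the jq expression is just one illustrative way to pull a label out of the result:

curl -s -f -m 10 --data-urlencode 'query=up' localhost:9090/api/v1/query | jq -r '.data.result[].metric.instance'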

Some sources will provide curl command lines that also use -k. You don't want to use that flag unless you know what it means and you know that you need it and it's safe to use it.

(It turns out that curl can URL-encode your query parameters in GET requests if you combine -G with --data-urlencode, although I missed that in a casual skim of the very large curl manpage. I don't think it matters for querying Prometheus, since Prometheus accepts either GET or POST requests here.)

PS: Extracting raw samples from Prometheus says that 'the remote read endpoint at api/v1/read [...] is more difficult to use [...]'. That would be a polite understatement. It turns out that Prometheus remote read requests and replies are snappy-compressed protobufs, following schemas that you can find here. You can apparently work with protobufs from the command line with curl, per this gist example, but I don't know how you'd handle the snappy compression side of things.

Sidebar: Sending data to Pushgateway with curl

Since I do this too, I might as well add my usual curl command for it here:

<generate metrics as text> |
   curl -X PUT -f -s --data-binary @- localhost:9091/metrics/job/...

I almost always use PUT here instead of the default of POST because of when and what metrics disappear on updates in Pushgateway.
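As a concrete hand test of this (with a made-up metric, job, and instance name), pushing a single value looks something like:

echo 'some_test_metric 42' | curl -X PUT -f -s --data-binary @- localhost:9091/metrics/job/testjob/instance/testhost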

PrometheusQueryWithCurl written at 22:31:04

2019-04-21

My view on upgrading Prometheus (and Grafana) on an ongoing basis

Our normal habit is to use the distribution versions of programs we want to run, on whatever version of whatever OS we're running (usually Ubuntu Linux, but we follow this on CentOS, OpenBSD, and so on). If we can't do that for some reason, we generally freeze the version of the program we're using for the lifetime of the system, unless there are either security issues or strongly compelling upgrades (or, sometimes, unless the version we're using is falling out of support and so won't get security updates any more).

We initially started with the Ubuntu 18.04 versions of various Prometheus components, but we pretty rapidly determined that they were too old and lacked crucial bug fixes, so we switched over to the then-current upstream prometheus.io versions. This would normally leave us freezing the versions of what we were running once we had reached a functional and sufficiently bug-free state. While I considered that initially, I changed my mind and now believe the right thing for us to do is to actively keep up with the upstreams for both Prometheus and Grafana.

The problem with freezing our versions of either Prometheus or Grafana is that both do regular, frequent releases; both projects put out significant new versions every few months at most, and often more frequently. Freezing our versions for even a year would put us multiple versions behind, and normally we'd freeze things until 2022, when we had to start planning an upgrade of the metrics server's Ubuntu version (it's currently running 18.04).

There are at least two problems with having to upgrade from a lot of versions behind. The first is that that's a lot of changes to absorb and deal with all at once, simply because a lot of releases means a lot of changes. Not all of them are significant or matter, but some of them do, and we'd have to sort them out, read through tons of release notes, and so on. Spreading this out over time by keeping up to date lowers the burden in practice and makes it less likely that something important will slip through the cracks.

The second is that it may not be easy or even possible to cleanly jump from a version that is so far behind to the current one. In the easier case, we will have missed the deprecation window for some API, configuration file change, or practice, and we will have to upgrade a bunch of things in synchronization so they all jump from the old way to the new way at once. If we had kept up to date, we'd have had time to migrate while both an old way and a new way were still supported. In the hard case, there's been a database or storage format migration except that we're skipping straight from a version that only had the old format to a version that only supports the new one. Probably the only solution here would be to temporarily bring up an intermediate version purely to do the migration, then skip forward further.

(Another advantage of upgrading frequently is that you get to spot undocumented regressions and changes right away, and perhaps file bugs to get developers to fix them. Even if you can't do anything, you have a relatively confined set of changes going on at once, which helps to troubleshoot problems. Turning over the entire world has the drawback that, well, you've turned over the entire world.)

This is really nothing new. It's always been pretty clear that a slow stream of generally minor changes is easier to deal with than a periodic big bang of a lot of them at once. It's just that I relatively rarely get faced with this choice, especially in a situation where version to version upgrades are relatively painless.

(When version to version upgrades are not painless, there is a very strong temptation to put the pain off for as long as possible rather than keep going through it over and over. This is how we are behind on Django versions for our web app.)

PrometheusAndUpgrades written at 01:29:32

2019-04-18

A pattern for dealing with missing metrics in Prometheus in simple cases

Previously, I mentioned that Prometheus expressions are filters, which is part of Prometheus having a generally set-oriented view of the world. One of the consequences of this view is that you can quite often have expressions that give you a null result when you really want the result to be 0.

For example, let's suppose that you want a Grafana dashboard that includes a box that tells you how many Prometheus alerts are currently firing. When this happens, Prometheus exposes an ALERTS metric for each active alert, so on the surface you would count these up with:

count( ALERTS{alertstate="firing"} )

Then one day you don't have any firing alerts and your dashboard's box says 'N/A' or 'null' instead of the '0' that you want. This happens because 'ALERTS{alertstate="firing"}' matches nothing, so the result is a null set, and count() of a null set is a null result (or, technically, a null set).

The official recommended practice is to not have any metrics and metric label values that come and go; all of your metrics and label sets should be as constant as possible. As you can tell with the official Prometheus ALERTS metric, not even Prometheus itself actually fully follows this, so we need a way to deal with it.

My preferred way of dealing with this is to use 'or vector(0)' to make sure that I'm never dealing with a null set. The easiest thing to use this with is sum():

sum( ALERTS{alertstate="firing"} or vector(0) )

Using sum() has the useful property that the extra vector(0) element has no effect on the result. You can often use sum() instead of count() because many sporadic metrics have the value of '1' when they're present; it's the accepted way of creating what is essentially a boolean 'I am here' metric such as ALERTS.

If you're filtering for a specific value or value range, you can still use sum() instead of count() by using bool on the comparison:

sum( node_load1 > bool 10 or vector(0) )

If you're counting a value within a range, be careful where you put the bool; it needs to go on the last comparison. Eg:

sum( node_load1 > 5 < bool 10 or vector(0) )

If you have to use count() for more complicated reasons, the obvious approach is to subtract 1 from the result.
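For example, the count() based version of the firing alerts expression becomes:

count( ALERTS{alertstate="firing"} or vector(0) ) - 1

The extra vector(0) element is always present in the 'or' result, so subtracting one removes its contribution and you get 0 when nothing is firing.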

Unfortunately this approach starts breaking down rapidly when you want to do something more complicated. It's possible to compute a bare average over time using a subquery:

avg_over_time( (sum( ALERTS{alertstate="firing"} or vector(0) ))[6h:] )

(Averages over time of metrics that are 0 or 1, like up, are the classical way of figuring out things like 'what percentage of the time is my service down'.)

However I don't know how to do this if you want something like an average over time by alert name or by hostname. In both cases, even alerts that were present some of the time were not present all of the time, and they can't be filled in with 'vector(0)' because the labels don't match (and can't be made to match). Getting the divisor for a manual averaging is also awkward; perhaps you would want to do an otherwise unnecessary subquery so that you can exactly control the step and thus know the divisor. This would be something like:

sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[6h:1m] ) / (6*60)

Experimentation suggests that this provides plausible results, at least. Hopefully it's not too inefficient. In Grafana, you need to write the subquery as '[$__range:1m]' but the division as '($__range_s / 60)', because the Grafana template variable $__range includes the time units.
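Put together, the Grafana version of the expression is something like:

sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[$__range:1m] ) / ($__range_s / 60)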

(See also Existential issues with metrics.)

PrometheusMissingMetricsPattern written at 00:39:58

2019-04-13

Remembering that Prometheus expressions act as filters

In conventional languages, comparisons like '>' and other boolean operations like 'and' give you implicit or explicit boolean results. Sometimes this is a pseudo-boolean result; in Python if you say 'A and B', you famously get either the value of A (if it's falsy) or the value of B as the end result, instead of a plain True or False. However, PromQL doesn't work this way. As I keep having to remember over and over, in Prometheus, comparisons and other boolean operators are filters.

In PromQL, when you write 'some_metric > 10', what happens is that first Prometheus generates a full instant vector for some_metric, with all of the metric points and their labels and their values, and then it filters out any metric point in the instant vector where the value isn't larger than 10. What you have left is a smaller instant vector, but all of the values of the metric points in it are their original ones.

The same thing happens with 'and'. When you write 'some_metric and other_metric', the other_metric is used only as a filter; metric points from some_metric are only included in the result set if there is the same set of labels in the other_metric instant vector. This means that the values of other_metric are irrelevant and do not propagate into the result.

The large scale effect of this is that the values that tend to propagate through your rule expression are whatever started out as the first metric you looked at (or whatever arithmetic you perform on them). Sometimes, especially in alert rules, this can bias you toward putting one condition in front of the other. For instance, suppose that you want to trigger an alert when the one-minute load average is above 20 and the five-minute load average is above 5, and you write the alert rule as:

expr: (node_load5 > 5) and (node_load1 > 20)

The value available in the alert rule and your alert messages is the value of node_load5, not node_load1, because node_load5 is what you started out the rule with. If you find the value of node_load1 more useful in your alert messages, you'll want to flip the order of these two clauses around.
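That is, the flipped version is:

expr: (node_load1 > 20) and (node_load5 > 5)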

As the PromQL documentation covers, you can turn comparison operations from filters into pseudo-booleans by using 'bool', as in 'some_metric > bool 10'. As far as I know, there is no way to do this with 'and', which always functions as a filter, although you can at least select what labels have to match (or what labels to ignore).

PS: For some reason I keep forgetting that 'and', 'or', and 'unless' can use 'on' and 'ignoring' to select what labels you care about. What you can't do with them, though, is propagate some labels from the right side into the result; if you need that, you have to use 'group_left' or 'group_right' and figure out how to re-frame your operation so that it involves a comparison, since 'and' and company don't work with grouping.
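For example, I believe the earlier load average expression could restrict the 'and' matching to just the instance label (assuming the standard node_exporter labels), with node_load5 still being the value that propagates:

(node_load5 > 5) and on (instance) (node_load1 > 20)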

(I was going to confidently write an entry echoing something that I said on the Prometheus users mailing list recently, but when I checked the documentation and performed some tests, it turned out I was wrong about an important aspect of it. So this entry is rather smaller in scope, and is written mostly to get this straight in my head, since I keep forgetting the details of it.)

PrometheusExpressionsFilter written at 23:59:31

2019-04-07

Why selecting times is still useful even for dashboards that are about right now

In the aftermath of our power outage, one of the things that I did was put together a Grafana dashboard that was specifically focused on dealing with large scale issues, namely a lot of machines being down or having problems. In this sort of situation, we don't need to see elaborate status displays and state information; basically we want a list of down machines and a list of other alerts, and very little else to get in the way.

(We have an existing overview dashboard, but it's designed with the tacit assumption that only a few or no machines are down and we want to see a lot of other state information. This is true in our normal situation, but not if we're going through a power shutdown or other large scale event.)

This dashboard will likely only ever be used in production displaying the current time, because 'what is (still) wrong right now' is its entire purpose. Yet when I built it, I found that I wanted not only to leave in the normal Grafana time setting options but also to build in a panel that would let me easily narrow in on a specific (end) time. This is because setting the time to a specific point is extremely useful for development, testing, and demos of your dashboard. In my case, I could set my in-development dashboard back to a point during our large scale power outage issues and ask myself whether what I was seeing was useful and complete, or whether it was annoying and missing things we'd want to know.

(And also test that the queries and Grafana panel configurations and so on were producing the results that I expected and needed.)

This is obviously especially useful for dashboards that are only interesting in exceptional conditions, conditions that you hopefully don't see all the time and can't find on demand. We don't have large scale issues all that often, so if I want to see and test my dashboard during one before the next issue happens I need to rewind time and set it at a point where the last crisis was happening.

(Now that I've written this down it all feels obvious, but it initially wasn't when I was staring at my dashboard at the current time, showing nothing because nothing was down, and wondering how I was going to test it.)

Sidebar: My best time-selection option in Grafana

In my experience, the best way to select a time range or a time endpoint in Grafana is through a graph panel that shows something over time. What you show doesn't matter, although you might as well try to make it useful; what you really care about is the time scale at the bottom that lets you swipe and drag to pick the end and start points of the time range. The Grafana time selector at the top right is good for the times that it gives fast access to, but it is slow and annoying if you want, say, '8:30 am yesterday'. It is much faster to use the time selector to get your graph so that it includes the time point you care about, then select it off the graph.

DashboardSetTimeUseful written at 22:45:30

2019-04-05

It's always DNS (a story of our circular dependency)

Our building and in fact much of the University of Toronto downtown campus had a major power failure tonight. When power came back on I wasn't really expecting our Ubuntu servers to come back online, but to my surprise they started pinging (which meant not just that the actual servers were booting but that the routers, the firewall, the switches, and so on had come back). However when I started ssh'ing in, our servers were not in a good state. For a start, I didn't have a home directory, and in fact none of our NFS filesystems were mounted and the machines were only part-way through boot, stalled trying to NFS mount our central administrative filesystem.

My first thought was that our fileservers had failed to boot up, either our new Linux ones or our old faithful OmniOS ones, but when I checked they were mostly up. Well, that's getting ahead of things; what actually happened when I started to check is that the system I was logged in to reported something like 'cannot resolve host <X>'. That would be a serious problem.

(I could resolve our hostnames from an outside machine, which turned out to be very handy since I needed some way to get their IPs so I could log into them.)

We have a pair of recursive OpenBSD-based resolvers; they had booted and could resolve external names, but they couldn't resolve any of our own names. Our configuration uses Unbound backed by NSD, where the NSD on each resolver is supposed to hold a cached copy of our local zones that is refreshed from our private master. In past power shutdowns, this has allowed the resolvers to boot and serve DNS data from our zones even without the private master being up, but this time around it didn't; both NSDs returned SERVFAIL when queried and in 'nsd-control zonestatus' reported things like:

zone: <our-zone>
      state: refreshing
      served-serial: none
      commit-serial: none

Our private master was up, but like everything else it was stalled trying to NFS mount our central administrative filesystem. Since this central filesystem is where our nameserver data lives, this was a hard dependency. This NFS mount turned out to be stalled for two reasons. The obvious and easy to deal with one was that the private master couldn't resolve the hostname of the NFS fileserver. When I tried to mount by IP address, I found the second one; the fileserver itself was refusing mounts because, without working DNS, it couldn't map IP addresses to names to verify NFS mount permission.

(To break this dependency I wound up adding NFS export permission for the IP address of the private master, then manually mounting the filesystem from the fileserver's IP on the private master. This let the boot continue, our private master's nameserver started, our local resolvers could refresh their zones from it, and suddenly internal DNS resolution started working for everyone. Shortly afterward, everyone could at least get the central administrative filesystem mounted.)
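(For illustration only, with entirely made-up paths and documentation addresses, and assuming a generic Linux NFS fileserver; our actual setup differs in the details. The temporary IP-based export and the manual mount look roughly like this:

# on the fileserver: temporarily export the filesystem to the private master's IP
exportfs -o rw,no_root_squash 192.0.2.10:/export/adminfs
# on the private master: mount it from the fileserver's IP instead of its unresolvable hostname
mount -t nfs 192.0.2.20:/export/adminfs /mnt
)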

So, apparently it really always is DNS, even when you think it won't be and you've tried to engineer things so that your DNS will always work (and when it's worked right in the past).

OurDNSCircularDependency written at 01:42:40

