Wandering Thoughts archives

2020-12-30

Some ways to do a Prometheus query as of a given time

Yesterday, I wrote about my wish for easier ways to evaluate a Prometheus query at a given time. Today I want to write down a few ways to do this for my future use, partly since I discovered one more way than I'd known when I went poking around after writing that entry.

Prometheus has an ad-hoc query view, which gives you answers either in a table or in a graph. The graph view has always supported adjusting the end time of the graph, in the same area above the graph where you change the time range and so on. In the table view you have the same option in the small 'Evaluation time' section. As of 2.23.0, this has a bug where if you directly edit the date or time (instead of picking from the calendar), your query may not be properly re-evaluated until you click outside the calendar, click on the time again to get the calendar to appear, and then click outside it again. If the evaluation time is wrong, clicking 'Execute' doesn't seem to fix it, which is irritating. But all of this is a bug and will probably go away at some point.

You can manually do instant queries with curl and as part of that you can specify the query time, as I wrote up in my notes on making Prometheus queries with curl. The normal form of this query is:

curl -s --data-urlencode 'query=up{job="alertmanager"}' --data-urlencode 'time=1556330052' http://localhost:9090/api/v1/query

The time is specified as a Unix timestamp. The most convenient way to get this is with some useful features of GNU date:

date -d 'Fri Apr 26 21:54:12 EDT 2019' +%s

It also turns out that promtool can be used to do this as part of 'promtool query instant', using the '--time' option (as of at least Prometheus 2.23.0):

promtool query instant --time 1609301510 http://localhost:9090/ node_load1

Unlike with direct manual queries with curl, you don't directly supply the URL of the query API; instead you supply the top level Prometheus URL, and promtool adds the rest of the URL components as needed.

As documented in 'promtool query instant --help', --time takes either a Unix timestamp or an RFC 3339 time string. Such a time string can be produced by GNU date, but to produce an RFC 3339 time string that promtool will accept, you want to use the '--iso-8601=seconds' command line option, not the option that GNU date incorrectly claims is for RFC 3339 format. You may or may not want to use --utc.
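
For instance, the two can be combined roughly like this, reusing the example date from above (a sketch, not the only way to do it):

date --iso-8601=seconds -d 'Fri Apr 26 21:54:12 EDT 2019'
promtool query instant --time "$(date --iso-8601=seconds -d 'Fri Apr 26 21:54:12 EDT 2019')" http://localhost:9090/ node_load1

The first command prints something like '2019-04-26T21:54:12-04:00', which is the form of time string that promtool is happy with.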

(You can get JSON output by using 'promtool query instant -o json'.)

My current view is that using curl is better if you're going to process the result as JSON, while 'promtool query instant' is better if you just want to read the results. If I needed to use either for this regularly I would probably write a cover script, and I would probably have the cover script use 'date -d ...' on whatever time string I handed it for maximum convenience.
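
A minimal sketch of such a cover script might look like this; the script name, the hard-coded Prometheus URL, and the choice of promtool over curl are all just illustrative assumptions:

#!/bin/sh
# promq-at: run an instant query as of a given (human readable) time.
# usage: promq-at 'TIME STRING' 'PROMQL QUERY'
PROMURL=http://localhost:9090/
when=$(date -d "$1" +%s) || exit 1
shift
exec promtool query instant --time "$when" "$PROMURL" "$@"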

(Both curl and promtool have the annoyance for ordinary shell usage that you have to supply a probably gigantic URL to your Prometheus server. This URL is almost always constant but there's no way to provide a default value. promql-cli lets you provide a default Prometheus URL, but currently has no option for making an instant query as of a given time.)

PrometheusQueryAtTimeHow written at 23:42:11

A Prometheus wish: easy ways to evaluate a PromQL query at a given time

To do most things in Prometheus, you use PromQL to make queries against Prometheus' internal time series database that holds all of the metrics that your system has collected over however long your retention time is. PromQL queries are always evaluated at a given time, with the default for simple queries being 'right now' (this is true even for queries that ask for, say, the average of something over the past five minutes; those past five minutes start at the query's evaluation time).

PromQL has support for evaluating an expression at a time relative to the current time (well, the query's time), in the form of the offset modifier. This is handy for comparing the state of things now to the state of things some time ago, and using offset can make your queries more efficient (although there can be gotchas with using offset instead of delta()).
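
For instance (a made-up query, not one of ours), comparing the current value of a metric against its value an hour ago looks like:

node_load1 - node_load1 offset 1h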

What Prometheus and PromQL have limited support for is evaluating a PromQL query at a specific (absolute) time. This is partly a PromQL language issue and partly a tool and user interface issue. The underlying instant query HTTP API allows you to specify the time that the query is evaluated at (what is considered 'now' for it), so it's possible to make whole queries at non-default times. The current Prometheus UI for ad-hoc queries actually does have support for manually setting the query time, but it's not very obvious and it has some bugs (which will probably get fixed in the future).

There are some tools for making PromQL queries from the command line, such as promql-cli, but I don't know of any current one that lets you specify an instant query time in a human readable format. Making raw queries with curl will let you do this, but you have to work out the Unix timestamp you need from the date (which can be done with GNU date, by combining some features covered in some useful features of GNU date).

PromQL itself has no syntax for making queries at absolute times, unlike for relative times with offset. You might think that you can press offset into service anyway by computing the offset on the fly based on the difference between the current time (available via time()) and the target time, but unfortunately offset specifically requires a duration and you can't give it an expression. There are likely good reasons to limit offset this way, but it does make me wish for an 'at' modifier, even one that only took a Unix timestamp.

(What is frustrating about this is that of course the information is in there and the capacity to access it is intrinsic in Prometheus. It's just not well exposed or problem-free.)

PrometheusQueryAtTimeWish written at 00:44:14

2020-12-27

Our alerts are quiet most of the time (as they should be)

It's the middle of the University of Toronto's winter break right now so officially we're off work. Unofficially we're checking in on email and we'll probably notice if something explodes. During this time, one of the things that has made this not much of a burden is that we basically haven't gotten any alerts from our Prometheus system. This is by design.

One part of the design is that our overall environment is not supposed to have (significant) problems, and generally doesn't. The other part of the design is that a major criterion for setting up our alerts has been how noisy they are. I've been deliberately fanatical about this (both for new alerts we're considering adding and for current alerts), pretty much always choosing to err on the side of silence. This has meant that there are some situations that we just won't catch because I haven't been able to figure out how to create a non-noisy alert for them and eventually gave up.

(One of these is the flaky SMART errors from our Crucial MX500 SSDs.)

There are some technical things that have helped in this, of course. One of these is that we have a long Prometheus metrics retention period, which means that I can almost always evaluate a potential alert condition to see how frequently it would have triggered in the past (both the near past and the far past). If an alert rule looks like it would have triggered too often, at times when we didn't really have a problem, I have to revise it or drop the idea. Another is some deliberate infrastructure for 'testing' alerts, which are sent to a special destination that only goes to me instead of to all of my co-workers. I use this both for testing new alert messages and the like, and for evaluating alerts overall when I'm not certain about them. Not all testing alerts get promoted to production ones that notify everyone.
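
As an illustration of the sort of back-testing query I mean (the condition itself is made up here), something like this counts how many times an expression would have been true over the past four weeks:

count_over_time( (node_load1 > 10)[4w:] )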

On a process level, I look at alerts we receive and continually ask if they're really meaningful and either actionable or important to know. If they don't seem to be any more (sometimes because conditions have changed), I will usually remove them after talking with my co-workers (who sometimes see more value in some alerts than I do). And if something is useful but not at the level of an alert, maybe I'll surface it on a dashboard.

One of the things that makes this work is that I'm both the (de-facto) alert maintainer and one of the sysadmins who receives and acts on alerts. This gives me both the ability to see if alerts are useful and the capability to immediately act on that knowledge; there's no need to go back and forth between two different groups or people, and no friction to it.

(Just in talking to people casually about alerting I've learned that my experiences here are far from universal.)

OurQuietAlertsAsDesigned written at 00:46:24

2020-12-15

How to make Grafana properly display a Unix timestamp

There are many things in Prometheus that generate a 'time' value that is in seconds since the Unix epoch of January 1st 1970 00:00 UTC. One such example, relevant to yesterday's entry on the time when Prometheus metrics happened, is the result of the Prometheus timestamp() function. When you're working with such times, it's common to want to display them in a Grafana table (perhaps a multi-value table), and one natural way to display them is simply as, well, the date and time they represent.

Sadly, Grafana makes this unnecessarily annoying, to the extent that every time I've needed to do this I've had to do Internet searches to find out how (again). If you edit a Grafana table, you can go to Column Styles and set the Type and Unit; the Type offers a 'Date', and for the Number type the Units has a 'Date & Time' section that offers options like 'Datetime local'. If you set either of these options on your Unix epoch timestamp, you will get a displayed result that is generally some time in 1970, and you will be unhappy and angry at Grafana.

The secret is that Grafana expects Unix timestamps in milliseconds, not seconds. Prometheus and many other things generate timestamps in seconds, not milliseconds; if you just pass these timestamps straight to Grafana and have it display them as-is, they wind up being interpreted as millisecond-based times that are way too small. To make things work right, you must multiply all Unix timestamps by 1000 before having Grafana display them. When you're using data from Prometheus, this has to be done in the Prometheus query itself in the Grafana dashboard, which will leave you with mysterious '* 1000' bits in your queries.
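
As a sketch of what this winds up looking like (the metric is just a stand-in), a panel's Prometheus query becomes something like:

timestamp( node_load1 ) * 1000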

Once you're turning your Unix timestamps from seconds to milliseconds, either the 'Date' column type or one of the 'Date & Time' units will display your timestamp. Right now I'm using the 'Datetime local' option of the latter because it seems less magical than assuming that Grafana's 'Date' type will always do what I want it to.

Sadly, this particular peculiar decision is now baked into Grafana in order to retain backward compatibility with current dashboards and panels. Grafana would have to introduce a new 'Date (seconds)' or 'Datetime local (seconds)' to do it right, and I hope that they do someday. Or introduce unit conversions in panel transformations.

(This is one of those entries that I write for myself so that I can find this information the next time I need it, because I'm sure I'm going to need it again. Grafana's behavior here is so counterintuitive and so opaque it's impressive. They don't even hint at the underlying units in the names, where they could do things like 'Datetime local (ms)'.)

GrafanaUnixEpochTime written at 21:07:42

In Prometheus, it's hard to work with when metric points happened

I generally like Prometheus and PromQL, its query language (which you have to do basically everything through). But there is one frustrating exception that I routinely run into, and that is pretty much anything to do with when a metric point happened.

PromQL is great about letting you deal with the value of metrics. But the value is only half of a metric, because each sample (what I usually call a metric point) is a value plus a timestamp (cf). Prometheus stores the timestamp as well as the value, but it has relatively little support for accessing and manipulating the timestamps, much less than it does for the values. However, there are plenty of times when the timestamp is at least as interesting as the value. One example is finding out the most recent time that a metric existed, which is useful for all sorts of reasons.

When I started writing this entry (and when I wrote those tweets), I was going to say that Prometheus makes it unnecessarily hard and should fix the problem. But the more I think about it, that's only true for simple cases. In more complex cases you need to conditionally select metric points by their value, and at that point Prometheus has no simple solution. In the current state of Prometheus, any 'evaluation over time' that needs something more sophisticated than matching labels must use Prometheus subqueries, and that holds whether it's using the value of the metric point or the time it happened at.

Prometheus makes the simple case slightly harder because timestamp() is a function, which means that anything using it must be written as a subquery. But that only hurts you with relatively simple queries, like finding the most recent time a metric existed:

max_over_time( timestamp( vpn_users{user="fred"} )[4w:] )

Even without this you'd have to put the '[4w]' somewhere (although having it as a subquery may be slightly less efficient in some cases).

Now suppose that you want to know the most recent time that pings failed for each of your Blackbox ICMP probes. This requires a conditional check, but the query can be almost the same as before through the magic of Prometheus operators being set operations:

max_over_time( timestamp( probe_success{probe="icmp"} == 0 )[24h:] )

What makes some cases of dealing with values over time in PromQL simpler (for example working out how frequently pings fail) isn't any feature in PromQL itself; it's that those metric values are carefully arranged so that you can do useful things with aggregation over time. Timestamps are fundamentally not amenable to this sort of manipulation, but they're far from alone; many metrics have values that you need to filter or process.
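
For instance (a sketch of the kind of query I mean, not one of our dashboard queries), the fraction of the time that pings failed over the past day is just:

1 - avg_over_time( probe_success{probe="icmp"}[24h] )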

(You can get the earliest time instead of the latest time with min_over_time. Other aggregation over time operators are probably not all that useful with timestamps.)

PrometheusMetricTimeHard written at 00:12:22

2020-12-12

Sometimes a problem really is just a coincidence

This past Wednesday, we did some maintenance, including precautionary reboots of some of our OpenBSD firewalls. All of our firewalls actually are pairs of machines, one of which is the active firewall and the other of which is the inactive spare (which is at most running pfsync, and is not on the live networks or doing anything). One of the first firewalls we rebooted was what was supposed to be the inactive spare of the bridging firewall that sits between our networks and the rest of the university. Less than a minute after the reboot was initiated, our monitoring system was screaming that we had basically lost all connectivity to the outside world.

Naturally people went digging to try to understand what had happened. We had not accidentally rebooted the live firewall instead of the inactive spare (an easier mistake to make with a bridging firewall than with a routing one), the reboot didn't seem to have somehow influenced the live firewall, our core router had not seen the interface status change, and so on and so forth. Later, we examined our reachability metrics in more detail (including data from an outside perspective) and became even more confused, especially since the reachability data from outside showed that we'd had problems accessing some things not even behind our bridging firewall.

I'll jump to the punchline: it was a coincidence. The overall university network had had some problems that happened to start only very shortly before the reboot of the inactive spare firewall (and by 'only very shortly' I mean less than 60 seconds before the reboot started). There may also have been a small power fluctuation in the building at around the same time, too. If the overall networking problems had dragged on the coincidence would have been more obvious, but instead they faded out within about six minutes of the inactive spare firewall being back up, which was well within the time period where the co-worker actually in the office was poking around at things and trying to figure out what was going on.

It wasn't necessarily wrong of us to immediately assume that the reboot of a firewall was the cause and to look into things around it; the sysadmin's version of Occam's Razor is that if you just did something and a problem shows up, your action is the most likely cause. Often it really is the cause. But not always, as we saw this time, so if things don't seem to make sense maybe we should also start thinking about possible alternate explanations (and where we'd find evidence for or against them).

(In this case, there was nothing we could do to fix the problem since it was outside of our network, so the time spent poking around didn't delay resolving the issue.)

(This elaborates on a tweet of mine.)

SometimesCoincidence written at 00:17:09

2020-12-07

Exploring when the network is up on a machine

It's a common sysadmin desire to do things during (Unix) system boot 'once the network is up'. But what do we actually mean by that, in relatively concrete terms? The many possible meanings of this, and the varied environments we can ask the question in, are part of why this can be a tangled issue on Linuxes that use systemd.

To simplify life and because it's the main thing I care about, I'm going to consider only servers with a static set of network devices and fixed networking (at least to the level of DHCP on a specific interface, although we statically configure IP addresses). Even here, I think that there are actually two broad views of when the network is up, which I will call the server view and the client view.

For the server view, networking is up when the firewall rules have been loaded and all of the system's important IP addresses have been configured. This is the point where you can start daemons (or services) that bind to specific IPs and have them safely serve only the traffic you want them to see. You don't need interfaces to actually be up or for routes to be set, because if one of those isn't in place (or isn't working), to the outside world it's as if the machine hasn't booted yet.

For the client view, networking is up when (some) IP addresses have been set, interfaces are up with link signal, and routes are in place (on some systems you need firewall rules loaded too). This is when programs on the machine can reasonably talk to the outside world to make DNS queries, send log messages, check authorization status, fetch resources off the network, or whatever other important interactions they have. Often you don't want programs and services to run before they can do these important things because they'll just fail, sometimes in annoying ways.
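
As a very rough sketch of checking the client view from a script on Linux (the specific tests here are my own assumptions about what matters, not a complete definition of 'up'):

#!/bin/sh
# Crude 'is the network up for clients' check: do we have a default
# route, and does DNS resolution work?
if [ -z "$(ip route show default)" ]; then
    echo "no default route yet" >&2
    exit 1
fi
if ! getent hosts example.com >/dev/null; then
    echo "DNS resolution not working yet" >&2
    exit 1
fi
exit 0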

As hinted here, some programs that provide network services are also network clients. For these programs, you often want both versions of 'the network is up' before they start. Even if a service works with just the server view, you probably want to have the system's time set correctly (from an NTP query) and to have logs go to your log host, not to have it running with the wrong time and logs going into the void because it can't connect to the log host.

Not all interfaces, IP addresses, and routes are necessarily required for either the server or client views on a particular machine (firewall rules are usually loaded as a unified whole). In some situations, this can even include the interface with the default route; this may be a public facing IP while all of your client-view things use other networks. Of course your service itself won't be 'up' to its clients until the public interface is alive, but you can start programs on the machine before then instead of having to stall.

Given all of this, it's not really surprising that Unix systems have struggled to provide useful versions of 'the network is up'. Any single definition of what that means is going to disappoint some people some of the time. Life is especially hard for people who have some interfaces and IPs that are critical and some that aren't; it's unlikely that any general purpose system will be able to automatically do what we want in that situation. Probably the best we can hope for is for a way to tell the system what we mean by 'the network is up' for this particular machine from among all of the components and options.

(This set of thoughts was sparked by reading Starting services only when the network is ready on Debian/systemd (via).)

PS: If your machines get their IP addresses via DHCP instead of static configuration, the server and client views start to collapse under normal circumstances. You can't have IP addresses for the server view unless you have link connectivity to at least the DHCP server, which often gets you most of the way to the client view too. But the price of this is that if you don't have link connectivity, you don't even have the server view of the network being up; instead, nothing gets anywhere. And on a multi-interface machine you get new and exciting modes of partial failure, where you can talk to one interface's DHCP server but not another's. Whether this results in a system with the server or client view of the network being up depends on what you need the interfaces for.

NetworkUpWhatWeMean written at 23:06:33

2020-12-02

Prometheus 2.23.0 now lets you display graphs in local time

Over in the Fediverse, I said:

A quietly great thing about Prometheus 2.23.0 is that it's finally easy and natural to display graphs for quick ad-hoc queries in local time instead of UTC.

(No, we don't run our servers or do our work in UTC. There are many reasons for this, including that our users don't work in UTC or around the clock.)

All of our serious, full time graphs and other information are in Grafana dashboards, which normally show all of their times in local time (I believe the local time of your browser, but perhaps it's the server local time; in my case they're one and the same). However, for poking around and exploring things on an ad hoc basis I generally find Prometheus's own query and graph options to be better than Grafana's Explore feature, and I default to using it.

Prometheus has always been very firm that it works in UTC, not in whatever the local time zone is. As part of this, its ad hoc query and graph display always showed UTC time. Since we run our servers in our local time zone, not UTC, this made it rather annoying to correlate between problem reports and logs (in local time) and the Prometheus ad-hoc graphs (in UTC). This sometimes pulled me back to Grafana's Explore, despite liking it less and finding it more work to deal with.

The 2.23.0 release notes don't come out and say that you can now get Prometheus's ad-hoc graphs and queries to display in your local time zone. Instead they just have, as one of their changes:

  • [CHANGE] UI: Make the React UI default. #8142

However, I had already tried out the React version of the UI when it started to show up as an experimental feature in earlier versions of Prometheus, so I knew that one of the options it had was a little tick box for 'Use local time', a tick box (and option) that was not present in the old UI.

(Of course I could always switch to the React UI in previous versions of Prometheus, but it was labeled as experimental and potentially incomplete, and it was always a little bit of extra work to use it. I did turn to it if I really needed a graph in local time, but usually I lived with the friction of having things in UTC.)

PrometheusNewLocalTime written at 23:40:45

