Wandering Thoughts archives

2019-10-11

A YAML syntax surprise and trick in Prometheus Alertmanager configuration

In a comment on my entry on doing reboot notifications with Prometheus, Simon noted:

Just a note to say that since Alertmanager v0.16.0, it is possible to group alerts by all labels using "group_by: [...]".

When I saw this syntax in the comment, my eyebrows went up, because I'd never seen any sort of ... syntax in YAML before; I had no idea it was even a thing you could do in YAML, and I didn't know what it really meant. Was it some special syntax that flagged what would normally be a YAML array for special processing, for example? So I scurried off to the Wikipedia YAML entry, then the official YAML site and the specification, and finally the Alertmanager source code (because sometimes I'm a systems programmer).

As it turns out, this is explained (more or less) in the current Alertmanager documentation, if you read all of the words. Let me quote them:

To aggregate by all possible labels use the special value '...' as the sole label name, for example:
group_by: ['...']

However, the other part of this documentation is less clear, since it lists things as:

[ group_by: '[' <labelname>, ... ']' ]

What is actually going on here is that although the ... looks like YAML syntax, it's actually just a YAML string. The group_by setting is an array of (YAML) strings, which are normally the Prometheus labels to group by, but if you use the string value '...' all by itself, Alertmanager behaves specially. This can be written in a way that looks like syntax instead of a string because YAML allows a lot of unquoted things to be taken as strings (what YAML calls scalars).
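
To make this concrete, here's a small sketch of the equivalence as I understand it (illustrative only; I haven't checked every YAML parser against it):

group_by: [...]        # an unquoted plain scalar, which is still just the string "..."
group_by: ['...']      # the same thing, with the string quoted explicitly
group_by:
  - '...'              # block style; again a one-element list holding the string "..."

All of these should hand Alertmanager a one-element array whose sole element is the string '...', which is what triggers the special 'group by all labels' behaviour.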

(I'm honestly not sure when you have to quote a YAML string.)

The way that Alertmanager documents this makes it reasonably clear that the '...' is an unusual label, not a bit of YAML syntax, since the documentation both explicitly says so and shows it in quoted form (except in a place where the quotes sort of have a different meaning). However, writing it without the explicit quotes makes things much more confusing unless you're already in tune enough with YAML to get what's going on.

My suspicion is that a lot of people aren't going to be that in tune with YAML, partly because YAML is complex, which makes it easy to believe that there's some aspect of YAML syntax you don't know or don't remember. Certainly this experience has reinforced my view that I should be as explicit as possible in our Prometheus YAML usage, even if it's not necessary under the rules. I should also use a consistent style for what gets quoted and what doesn't, instead of varying it from rule to rule and configuration bit to bit.

(Also I should generally avoid any clever YAML things unless I absolutely have to use them.)

YamlSyntaxSurprise written at 21:46:57

2019-10-08

How we implement reboot notifications for our machines in Prometheus

I wrote yesterday about why we generate alert notifications when our machines reboot, but not about how we do it. It turns out that there are a few little tricks to doing this in Prometheus, especially in an environment where you're using physical servers.

The big issue is that Prometheus isn't actually designed to send notifications; it's designed to have alerts. The difference between a notification and an alert is that you send a notification once and then you're done, while an alert is raised, potentially triggers various sorts of notifications after some delay, and then goes away. To abuse some terms, a notification is edge triggered while an alert is level triggered. To create a notification in a system that's designed for alerts, we basically need to turn the event we want to notify about into a level-triggering condition that we can alert on. This condition needs to be true for a while, so the alert is reliably triggered and sent (even in the face of delays or failure to immediately scrape the server's host agent), but it has to go away again sooner or later (otherwise we will basically have a constantly asserted alert that clutters things up).

So the first thing we need is a condition (ie, a Prometheus expression) that is reliably true if a server has rebooted recently. For Linux machines, what you want to use looks like this:

(node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)

This condition is true from five minutes after the server reboots until 19 minutes after, and its value is how long the server has been up (in seconds), which is handy for putting in the actual notification we get. We delay sending the alert until the server has been up for a bit so that if we're repeatedly rebooting the server while working on it, we won't get a deluge of reboot notifications; you could make this delay shorter if you wanted.

(We turn the alert off after the odd 19 minutes because our alert suppression for large scale issues lingers for 20 minutes after the large scale situation seems to have stopped. By cutting off 'recent reboot' notifications just before that, we avoid getting a flood of 'X recently rebooted' notifications when a bunch of machines come back up in such a situation.)
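
For concreteness, here's a rough sketch of what the full alert rule could look like as a Prometheus rule file. Take it as illustrative; the alert name and the annotation wording are made up for this entry, not copied from our real configuration:

groups:
  - name: notifications
    rules:
      - alert: RecentReboot
        expr: (node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
        labels:
          # our special routing label for notification-style alerts,
          # which comes up again later in this entry
          cstype: notify
        annotations:
          # $value is the expression's value, ie seconds of uptime
          summary: '{{ $labels.instance }} rebooted; it has been up for {{ $value | humanizeDuration }}'

There's deliberately no 'for:' clause here; the delay before the alert fires is built into the expression itself.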

The obvious way to write this condition is to use 'time()' instead of 'node_time_seconds'. The problem with this is that what the Linux kernel actually exposes is how long the system has been up (in /proc/uptime), not the absolute time of system boot. The Prometheus host agent turns this relative time into an absolute time, using the server's local time. If we use some other source of (absolute) time to try to re-create the time since reboot (such as Prometheus's idea of the current time), we run into problems if and when the server's clock changes after boot. As they say, ask me how I know; our first version used 'time()' and we had all sorts of delayed reboot notifications and so on when servers rebooted or powered on with bad time.
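
Concretely, the time() based version of this condition would be:

(time() - node_boot_time_seconds) < (19*60) >= (5*60)

It looks equivalent, but the subtraction now mixes Prometheus's idea of the current time with the server's idea of when it booted, which is exactly where the clock skew problems come from.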

(This is likely to be less of an issue in virtualized environments because your VMs probably boot up with something close to accurate time.)

The other side of the puzzle is in Alertmanager, and comes in two parts. The first part is simply that we want our alert destination (the receiver) for this type of 'alerts' to not set send_resolved to true, the way our other receivers do; we only want to get email at the start of the 'alert', not when it quietly goes away. The second part is defeating grouping, because Alertmanager is normally very determined to group alerts together while we pretty much want to get one email per 'notification'. Unfortunately you can't tell Alertmanager to group by nothing ('[]'), so instead we have a long list of labels to 'group by', which in practice makes each alert unique. The result looks like this:

- match:
    cstype: 'notify'
  group_by: ['alertname', 'cstype', 'host', 'instance', 'job', 'probe', 'sendto']
  receiver: notify-receiver
  group_wait: 0s
  group_interval: 5m

We put the special 'cstype' label on all of our notification type alerts in order to route them to this receiver. Since we don't want to group things together and we do want notifications to be immediate, there's no point in a non-zero group_wait (it would only delay the email). The group_interval is there to reduce how much email we'd get if a notification started flapping for some reason.

(The group interval interacts with how soon you trigger notifications, since it will effectively suppress genuine repeated notifications within that time window. This can affect how you want to write the notification alert expressions.)

Our Alertmanager templates have special handling for these notifications. Because they aren't alerts, they generate different email Subject: lines and have message bodies that talk about notifications instead of alerts (and know that there will never be 'resolved' notifications that they need to tell us about).
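
Put together, the receiver side could look something like the following sketch. The email address is a placeholder and the Subject template is just an illustration; our real receiver and templates are rather more involved:

receivers:
  - name: notify-receiver
    email_configs:
      - to: 'someone@example.org'     # placeholder address
        send_resolved: false          # never email when the 'alert' quietly goes away
        headers:
          # illustrative Subject line; the real templates do more work than this
          Subject: 'notification: {{ .CommonAnnotations.summary }}'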

All in all, using Prometheus and Alertmanager for this is a bit of a hack, but it works (and works well) and doing it this way saves us from having to build a second system for it. And, as I've mentioned before, this way Prometheus handles the state for us (including the state of 'there is some sort of large scale issue going on, we don't need to be deluged with notes about machines booting up').

PrometheusDoingRebootAlerts written at 21:11:12

2019-10-07

Why we generate alert notifications about our machines having rebooted

One of our Prometheus alerts triggers whenever a machine has recently rebooted. My impression is that having such an alert these days is unusual, so today I'm writing up the two reasons why we have it.

(This is an 'alert' in the sense that all of the output from our Prometheus and Alertmanager is an 'alert', but it is not an alert in the sense of bothering someone outside of working hours. All of our alerts go only to email, and we only pay attention to email during working hours.)

The first reason is that our machines aren't normally supposed to reboot (even most of the ones that are effectively cattle instead of pets, although there are some exceptions). Any unexpected reboot is an anomaly that we want to investigate to try to figure out what's going on. Did we have a power glitch in the middle of the night? Did something run into a kernel panic? And so on. Our mechanism for getting notified about these anomalies is email and the easiest way to send that email is as an 'alert'.

But that's only part of the story, because we don't just monitor these machines to see if they reboot, we also monitor them to see if they go down and trigger alerts if they do. Our machines don't take forever to reboot, but with all of the twiddling around that modern BIOSes perform, they do take long enough that our regular 'the machine is down' alerts would normally fire. So the second reason that we have a specific reboot alert is that we delay the regular 'machine is down' alerts for long enough that they won't actually fire if the machine is just rebooting; without an additional specific alert, we wouldn't get anything at all. We do this because we'd rather get one email message when a machine reboots instead of two (a 'down machine' alert email and then an 'it cleared up' resolved alert email).

(We consider some machines sufficiently critical that we don't do this, triggering immediate 'down machine' alerts without waiting to see if it's because of a reboot. But not very many.)
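
As an illustration of the delayed 'machine is down' alert (not our exact rule; the alert name, the job label, and the delay here are all stand-ins), such a rule looks something like this:

- alert: HostDown
  expr: up{job="node"} == 0
  # the 'for' delay is longer than a normal reboot takes, so a plain
  # reboot comes and goes without this alert ever firing
  for: 10m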

There's an additional reason that I like reboot notifications, which is that I feel they're useful as a diagnostic to explain why a machine suddenly dropped off the network for a while. Whether or not we triggered an explicit alert about the machine disappearing, it did disappear, and that may have effects that show up elsewhere (in logs, in user reports, or whatever). With a reboot notification, we immediately know why without having to dig into the situation by hand.

WhyRebootAlerts written at 23:44:12

2019-10-06

Automating our 'bookable' compute servers with SLURM has created generic 'cattle' machines

I'll lead with the thing I realized. Several years ago I wrote about how all of our important machines were 'pets' instead of 'cattle'. One of the reasons for that was that people logged in to specific machines by name in order to use them, and so they cared if a particular machine went down (which is my view of the difference between pets and cattle). Due to recent changes in how we run a bunch of our compute servers, we've more or less transformed those compute servers into cattle machines. So here's the story.

We have some general use compute servers, but one of the traditional problems with them has been exactly that they were general use. You couldn't get one to yourself, and worse, your work on the machine could be affected by whatever other people decided to run on it (fair share scheduling helps with this somewhat, but not completely). So for years we also had what we called 'bookable' compute servers, where you could reserve a machine for yourself for a while. At first this started small, with only a few machines, but then it started growing (and we also started adding machines with GPUs).

This created a steadily increasing problem for us, because we maintained these bookings mostly manually. There was some automation to send us email when a machine's booking status had to change, but we had to enter all of the bookings by hand and do the updates by hand. At the start of everything, with only a few machines, there were decent reasons for this; we didn't want to put together a complicated system with a bunch of local software, and it's always dangerous to set up a situation where somewhat fuzzy policies about fairness and so on are enforced through software. By the time we had a bunch of machines, both the actual work and dealing with various policy issues were increasingly a significant burden.

Our eventual solution was to adopt SLURM, configured so that it didn't try to share SLURM nodes (ie compute servers) between people. This isn't how SLURM wants to operate (it'd rather be a fine-grained scheduler), but it's the best approach for us. We moved all of our previous bookable compute servers into SLURM, wrote some documentation on how to use SLURM to basically log in to the nodes, and told everyone they had to switch over to using SLURM whether they liked it or not. Once pushed, people did move and they're probably now using our compute servers more than ever before (partly because they can now get a bunch of them at once for a few days, on the spot).

(We had previously operated a SLURM cluster with a number of nodes and tried to get people to move over from bookable compute servers to the SLURM cluster, without much success. Given a choice, most people would understandably prefer to use the setup they're already familiar with.)

This switch to allocating and managing access to compute servers through SLURM is only part of what has created genuine cattle; automated allocation of our bookable compute servers wouldn't really have had the same effects. Part of it is how SLURM operates: you don't book a machine and then get to log in to it; normally you run a SLURM command and you (or your script) are dumped onto the machine you've been assigned. When you quit or your script exits, your allocation is gone (and you may not be able to get the particular machine back again, if someone else is in the queue). And I feel the final bit of it is that we only let each allocation last for a few days, so no matter what, you're getting interrupted before too long.

You can insist on treating SLURM nodes as pets, picking a specific one out to use and caring about it. But SLURM and our entire setup pushes people towards not caring what they get and using nodes only on a transient basis, which means that if one node goes away it's not a big deal.

(This is a good thing because it turns out that some of the donated compute server hardware we're using is a bit flaky and locks up every so often, especially under load. In the days of explicitly booked servers, this would have caused all sorts of problems; now people just have to re-submit jobs or whatever, although it's still not great to have their job abruptly die part-way through.)

SlurmHasCreatedCattle written at 23:05:23

2019-10-04

Vim, its defaults, and the problem this presents sysadmins

One of Vim's many options is 'hidden', which may be off or on. The real thing that it does, behind the thicket of technical description, is that hidden controls whether or not you can casually move away from a modified Vim buffer to another one. In most editors this isn't even an option and you always can (you'll get prompted if you try to exit with unsaved changes). In Vim, for historical reasons, this is an option and for further historical reasons it defaults to 'off'.

(The historical reasons are that it wasn't an option in the original BSD vi, which behaved as if hidden was always off. Vim cares a fair bit about compatibility back to historical vi.)

The default of hidden being off gets in the way of doing certain sorts of things in Vim, like making changes to multiple files at once, and it's also at odds with what I want and how I like to work in my editors. So the obvious thing for me to do would be to add 'set hidden' to my .vimrc and move on. However, there is a problem with that, or rather two problems, because I use Vim partly as a sysadmin's editor. By that I mean that I use vi(m) from several different accounts (including the root account) and on many different machines, not all of which have a shared home directory even for my own account (and root always has a local home directory).

In order for 'set hidden' to be useful to me, it needs to be quite pervasive; it needs to work pretty much everywhere I use vim. Otherwise I will periodically trip over situations where it doesn't work, which means that I'll always have to remember the workarounds (and ideally practice them). As a non-default setting, this is at least difficult (although not completely impossible, since we already have an install framework that puts various things into place on all standard machines).

This is why programs' defaults matter a lot to sysadmins, in a way that they don't to people who only use one or a few environments on a regular basis. Defaults are all that we can count on everywhere, and our lives are easier if we work within them (we have less to remember, less to customize on as many systems as possible as early as possible, and so on). My life would be a bit easier if Vim had decided that its default was to have hidden on.

PS: The other thing about defaults is that going with the defaults is the course of least discussion in the case of setups used by multiple people, which is an extremely common case for the root account.

Sidebar: The practical flies in my nice theoretical entry

My entry is the theory, but once I actually looked at things it turns out to be not so neat in practice. First off, my own personal .vimrc turns out to already turn on hidden, due to me following the setup guide from Aristotle Pagaltzis' vim-buftabline package. Second, we already install a customized .vimrc in the root account in our standard Ubuntu installs, and reading the comments in it makes it clear that I wrote it. I could probably add 'set hidden' to this and re-deploy it without any objections from my co-workers, and this would cover almost all of the cases that matter to me in practice.

VimDefaultsSysadminProblem written at 22:34:56

2019-10-02

It's useful to record changes that you tried and failed to do

Today, for reasons beyond the scope of this entry, I decided to try out enabling HTTP/2 on our support site. We already have HTTP/2 enabled on another internal Apache server, and both servers run Ubuntu 18.04, so I expected no problems. While I could enable everything fine and restart Apache, to my surprise I didn't get HTTP/2 on the site. Inspecting the Apache error log showed the answer:

[http2:warn] [pid 10400] AH10034: The mpm module (prefork.c) is not supported by mod_http2. The mpm determines how things are processed in your server. HTTP/2 has more demands in this regard and the currently selected mpm will just not do. This is an advisory warning. Your server will continue to work, but the HTTP/2 protocol will be inactive.

We're still using the prefork MPM on this server because when we tried to use the event MPM, we ran into a problem that is probably this Apache bug (we suspect that Ubuntu doesn't have the fix for it in their 18.04 Apache version). After I found all of this out, I reverted my Apache configuration changes; we'll have to try this again later, in 20.04.

We have a 'worklog' system where we record the changes we make and the work we do in email (that gets archived and so on). Since I didn't succeed here and reverted everything involved, there was no change to record, so at first I was going to just move on to the next bit of work. Then I rethought that and wrote a worklog message anyway to record my failure and why. Sure, I didn't make a change, but our worklog is our knowledge base (and one way we communicate with each other, including people who are on vacation), and now it contains an explanation of why we don't and can't have HTTP/2 on those of our web servers that are using prefork. If or when we come back to deal with HTTP/2 again, we'll have some additional information and context for how things stand with it and us.

This is similar to documenting why you didn't do attractive things, but I think of it as somewhat separate. For us, HTTP/2 isn't particularly that sort of an attractive thing; it's just there and it might be nice to turn it on.

(At one level this issue doesn't come up too often because we don't usually fail at changes this way. At another level, perhaps it should come up more often, because we do periodically investigate things, determine that they won't work for some reason, and then quietly move on. I suspect that I wouldn't have thought to write a worklog at all if I had read up on Apache HTTP/2 beforehand and discovered that it didn't work with the prefork MPM. I was biased toward writing a worklog here because I was making an actual change (that I expected to work), which implies a worklog about it.)

RecordingNegativeResults written at 20:49:49


