Wandering Thoughts archives


Using alerts as tests that guard against future errors

On Twitter, I said:

These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).

So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.

(The RCS file alert is a real one; I mentioned it here.)

In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests throughout modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.

As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.

(If anything, I tend to think that traditional style sysadmins are more prone to re-breaking things than programmers are because we routinely rebuild our 'programs', ie our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or realize.)

In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.
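To make the 'alert as regression test' idea concrete, here is a minimal sketch of the sort of check that could sit behind a 'can we log in with POP3' alert. It is not our actual check; the function name, the use of POP3 over TLS, and the injectable connect parameter (there so the logic can be exercised without a real server) are all assumptions made for illustration.

```python
import poplib


def pop3_login_works(host, user, password, connect=poplib.POP3_SSL, timeout=10):
    """Return True if a POP3 login succeeds, False on any failure.

    'connect' defaults to poplib.POP3_SSL (an assumption; a plain POP3
    server would use poplib.POP3). It is a parameter only so that a
    fake connection class can be substituted when testing.
    """
    try:
        conn = connect(host, timeout=timeout)
    except (OSError, poplib.error_proto):
        # Could not connect at all, or the server greeting was bad.
        return False
    try:
        conn.user(user)
        conn.pass_(password)
        return True
    except (OSError, poplib.error_proto):
        # The server rejected the login or the connection broke.
        return False
    finally:
        try:
            conn.quit()
        except (OSError, poplib.error_proto):
            pass
```

The boolean result would then be turned into whatever your alerting system consumes, for example a 0 or 1 metric that an alert rule watches.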

This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's in the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, telling us about it definitely isn't just annoying; it's something we need to fix.

(In an environment with sophisticated alert handling, you might want to not route these sorts of alerts to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)

AlertsAsTestsAndGuards written at 21:35:11


A file permissions and general deployment annoyance with Certbot

The more we use Certbot, the more I become convinced that it isn't written by people who actually operate it in anything like the kind of environment that we do (and perhaps not at all, although I hope that the EFF uses it for their own web serving). I say this because while Certbot works, there are all sorts of little awkward bits around the edges in practical operation. Today's particular issue is a two-part issue concerning file permissions on TLS certificates and keys (and this can turn into a general deployment issue).

Certbot stores all of your TLS certificate information under /etc/letsencrypt/live, which is normally owned by root and is root-only (Unix mode 0700). Well, actually, that's false, because normally the contents of that directory hierarchy are only symlinks to /etc/letsencrypt/archive, which is also owned by root and root-only. This works fine for daemons that read TLS certificate material as root, but not all daemons do; in particular, Exim reads them as the Exim user and group.

The first issue is that Certbot adds an extra level of permissions to TLS private keys. As covered by Certbot's documentation, from Certbot version 0.29.0, private keys for certificates are specifically root-only. This means that you can't give Exim access to the TLS keys it needs just by chgrp'ing /etc/letsencrypt/live and /etc/letsencrypt/archive to the Exim group and then making them mode 0750; you must also specifically chgrp and chmod the private key files. This can be automated with a deploy hook script, which will be run when certificates are renewed.

(Documentation for deploy hooks is hidden away in the discussion of renewing certificates.)

The second issue is that deploy hooks do exactly and only what they're documented to do, which means that deploy hooks do not run the first time you get a certificate. After all, the first time is not a renewal, and Certbot said specifically that deploy hooks run on renewal, not 'any time a certificate is issued'. This means that all of your deployment automation, including changing TLS private key permissions so that your daemons can access the keys, won't happen when you get your initial certificate. You get to do it all by hand.

(You can't easily do it by running your deployment script by hand, because your deployment script is probably counting on various environment variables that Certbot sets.)
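One way around the environment variable problem is to write the hook so that it falls back to a command line argument when Certbot's variables aren't set. The following is only a sketch under assumptions, not our real hook: it relies on the RENEWED_LINEAGE variable that Certbot sets for deploy hooks, the function names are made up, and the 'Debian-exim' group name is an assumption about an Ubuntu Exim setup.

```python
import grp
import os
import stat
import sys

# Assumption: the group that Exim runs as. Adjust for your system.
EXIM_GROUP = "Debian-exim"


def open_key_to_group(lineage_dir, gid):
    """Make privkey.pem in a Certbot lineage directory group-readable.

    Both chown and chmod follow the symlink from live/ into archive/,
    so the real key file is what gets changed.
    """
    key = os.path.join(lineage_dir, "privkey.pem")
    os.chown(key, -1, gid)  # -1 leaves the owner unchanged
    os.chmod(key, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)  # mode 0640
    return key


def main(argv, environ):
    """Use RENEWED_LINEAGE if Certbot set it, else the first argument."""
    lineage = environ.get("RENEWED_LINEAGE")
    if not lineage and len(argv) > 1:
        lineage = argv[1]
    if not lineage:
        print("usage: hook /etc/letsencrypt/live/NAME", file=sys.stderr)
        return 2
    gid = grp.getgrnam(EXIM_GROUP).gr_gid
    open_key_to_group(lineage, gid)
    return 0
```

A script built around this would finish with sys.exit(main(sys.argv, os.environ)). Certbot would run it on renewals as usual, and after the initial issuance you could run it by hand with the live directory as its argument.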

We currently get out of this by doing the chgrp and chmod by hand when we get our initial TLS certificates; this adds an extra manual step to initial host setup and conversions to Certbot, which is annoying. If we had more intricate deployment, I think we would have to force an immediate renewal after the TLS certificate had been issued, and to avoid potentially running into rate limits we might want to make our first TLS certificate be a test certificate. Conveniently, there are already other reasons to do this.

CertbotPermissionsAnnoyance written at 00:31:18


Finding metrics that are missing labels in Prometheus (for alert metrics)

One of the things you can abuse metrics for in Prometheus is to configure different alert levels, alert destinations, and so on for different labels within the same metric, as I wrote about back in my entry on using group_* vector matching for database lookups. The example in that entry used two metrics for filesystems, our_zfs_avail_gb and our_zfs_minfree_gb, the former showing the current available space and the latter describing the alert levels and so on we want. Once we're using metrics this way, one of the interesting questions we could ask is what filesystems don't have a space alert set. As it turns out, we can answer this relatively easily.

The first step is to be precise about what we want. Here, we want to know what 'fs' labels are missing from our_zfs_minfree_gb. A fs label is missing if it's not present in our_zfs_minfree_gb but is present in our_zfs_avail_gb. Since we're talking about sets of labels, answering this requires some sort of set operation.

If our_zfs_minfree_gb only has unique values for the fs label (ie, we only ever set one alert per filesystem), then this is relatively straightforward:

our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb

The our_zfs_avail_gb metric generates our initial set of known fs labels. Then we use UNLESS to subtract the set of all fs labels that are present in our_zfs_minfree_gb. We have to use 'ON(fs)' because the only label we want to match on between the two metrics is the fs label itself.

However, this only works if our_zfs_minfree_gb has no duplicate fs labels. If it does (eg if different people can set their own alerts for the same filesystem), we'd get a 'duplicate series' error from this expression. The usual fix is to use a one to many match, but those can't be combined with set operators like 'unless'. Instead we must get creative. Since all we care about is the labels and not the values, we can use an aggregation operation to give us a single series for each label on the right side of the expression:

our_zfs_avail_gb UNLESS ON(fs)
   count(our_zfs_minfree_gb) by (fs)

As a side effect of what they do, all aggregation operators condense multiple instances of a label value this way. It's very convenient if you just want one instance of it; if you care about the resulting value being one that exists in your underlying metrics you can use max() or min().

You can obviously invert this operation to determine 'phantom' alerts, alerts that have fs labels that don't exist in your underlying metric. That expression is:

count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs)
   our_zfs_avail_gb

(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels; if it doesn't, you get a simpler expression.)

Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.
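If the set logic is easier to see outside of PromQL, here is a small Python model of it. This is only an illustration of the idea, not how Prometheus evaluates things; each series is represented as a plain dict of labels, and the function names and filesystem names are invented.

```python
def label_values(series, label):
    """Collect the distinct values of one label across a list of series.

    Building a set collapses duplicates, which is the same job that
    count(...) by (fs) does in the PromQL version.
    """
    return {labels[label] for labels in series if label in labels}


def missing_alerts(avail, minfree, label="fs"):
    """Label values that have a space metric but no alert level set."""
    return sorted(label_values(avail, label) - label_values(minfree, label))


def phantom_alerts(avail, minfree, label="fs"):
    """Label values that have an alert level but no underlying metric."""
    return sorted(label_values(minfree, label) - label_values(avail, label))


# Hypothetical data: two people set alerts on /h/281, nobody set one
# on /h/282, and /h/999 has an alert but no longer exists.
avail = [{"fs": "/h/281"}, {"fs": "/h/282"}]
minfree = [
    {"fs": "/h/281", "user": "a"},
    {"fs": "/h/281", "user": "b"},
    {"fs": "/h/999", "user": "a"},
]
```

With this data, missing_alerts(avail, minfree) gives ['/h/282'] and phantom_alerts(avail, minfree) gives ['/h/999'].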

This general approach can be applied to any two metrics where some label ought to be paired up across both. For instance, you could cross-check that every node_info_uname metric is matched by one or more custom per-host informational metrics that your own software is supposed to generate and expose through the node exporter's textfile collector.

(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)

PrometheusFindUnpairedMetrics written at 00:12:27


Bidirectional NAT and split horizon DNS in our networking setup

Like many other places, we have far too many machines to give them all public IPs (or at least public IPv4 IPs), especially since they're spread across multiple groups and each group should get its own isolated subnet. Our solution is the traditional one; we use RFC 1918 IPv4 address space behind firewalls, give groups subnets within it (these days generally /16s), and put each group in what we call a sandbox. Outgoing traffic from each sandbox subnet is NAT'd so that it comes out from a gateway IP for that sandbox, or sometimes a small range of them.

However, sometimes people quite reasonably want to have some of their sandbox machines reachable from the outside world for various reasons, and also sometimes they need their machines to have unique and stable public IPs for outgoing traffic. To handle both of these cases, we use OpenBSD's support for bidirectional NAT. We have a 'BINAT subnet' in our public IP address space and each BINAT'd machine gets assigned an IP on it; as external traffic goes through our perimeter firewall, it does the necessary translation between internal addresses and external ones. Although all public BINAT IPs are on a single subnet, the internal IPs are scattered all over all of our sandbox subnets. All of this is pretty standard.

(The public BINAT subnet is mostly virtual, although not entirely so; for various peculiar reasons there are a few real machines on it.)

However, this leaves us with a DNS problem for internal machines (machines behind our perimeter firewall) and internal traffic to these BINAT'd machines. People and machines on our networks want to be able to talk to these machines using their public DNS names, but the way our networks are set up, they must use the internal IP addresses to do so; the public BINAT IP addresses don't work. Fortunately we already have a split-horizon DNS setup, because we long ago made the decision to have a private top level domain for all of our sandbox networks, so we use our existing DNS infrastructure to give BINAT'd machines different IP addresses in the internal and external views. The external view gives you the public IP, which works (only) if you come in through our perimeter firewall; the internal view gives you the internal RFC 1918 IP address, which works only inside our networks.
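As a toy model of what the two views do, consider the following Python sketch. Everything in it is hypothetical: the host name, the addresses (the public one is from a documentation range), and especially the rule that 'internal' simply means 'the client has an RFC 1918 address', which is a simplification of how a real split-horizon setup decides which view to serve.

```python
from ipaddress import ip_address, ip_network

# Assumption: treat any RFC 1918 client address as being inside.
INTERNAL_NETS = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
]

# Hypothetical BINAT table: name -> (internal IP, public BINAT IP).
BINAT = {
    "www.sandbox.example.org": ("10.30.4.17", "192.0.2.17"),
}


def resolve(name, client_ip):
    """Return the address a client should get for a BINAT'd name.

    Internal clients get the RFC 1918 address, because the public
    BINAT address does not work from inside; everyone else gets the
    public address, which works only through the perimeter firewall.
    """
    internal, public = BINAT[name]
    client = ip_address(client_ip)
    inside = any(client in net for net in INTERNAL_NETS)
    return internal if inside else public
```

Here resolve("www.sandbox.example.org", "10.1.2.3") returns the internal address, while the same query from 198.51.100.9 returns the public one.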

(In a world where new gTLDs are created like popcorn, having our own top level domain isn't necessarily a great idea, but we set this up many years before the profusion of gTLDs started. And I can hope that it will stop before someone decides to grab the one we use. Even if they do grab it, the available evidence suggests that we may not care if we can't resolve public names in it.)

Using split-horizon DNS this way does leave people (including us) with some additional problems. The first one is cached DNS answers, or in general not talking to the right DNS servers. If your machine moves between internal and external networks, it needs to somehow flush and re-resolve these names. Also, if you're on one of our internal networks and you do DNS queries to someone else's DNS server, you'll wind up with the public IPs and things won't work. This is a periodic source of problems for users, especially since one of the ways to move on or off our internal networks is to connect to our VPN or disconnect from it.

The other problem is that we need to have internal DNS for any public name that your BINAT'd machine has. This is no problem if you give your BINAT machine a name inside our subdomain, since we already run DNS for that, but if you go off to register your own domain for it (for instance, for a web site), things can get sticky, especially if you want your public DNS to be handled by someone else. We don't have any particularly great solutions for this, although there are decent ones that work in some situations.

(Also, you have to tell us what names your BINAT'd machine has. People don't always do this, probably partly because the need for it isn't necessarily obvious to them. We understand the implications of our BINAT system, but we can't expect that our users do.)

(There's both an obvious reason and a subtle reason why we can't apply BINAT translation to all internal traffic, but that's for another entry because the subtle reason is somewhat complicated.)

BinatAndSplitHorizonDNS written at 22:22:40


Using Wireshark's Statistics menu to get per-host traffic volume

As part of my casual Internet browsing, I recently read 6 Lessons we learned when debugging a scaling problem on GitLab.com. As sort of an aside (although listed as a lesson), the article mentioned Wireshark's Statistics menu and how it can show you per-conversation information (and thus let you find specific sorts of conversations, such as short ones). I didn't think about it much at the time, but this mention stuck in the back of my mind (as such things often do, at least for a while).

Today I had a situation where we had a saturated OpenBSD firewall and I very much wanted to find out roughly what hosts were responsible for the traffic. OpenBSD has per-interface statistics (which let me see that the firewall's interface was saturated with incoming traffic), but it doesn't have anything more granular by default and we didn't have any traffic accounting stuff set up in our PF rules. I tried a plain tcpdump, but this firewall sits in front of enough hosts that the output was overwhelming. As I was thinking unhappy thoughts about trying to write some awk on the fly, a little light went on; perhaps Wireshark could help. So I used tcpdump to capture a minute or two of traffic to a file, copied the capture file over to my Linux machine, and fired up Wireshark.

(Since I only cared about packet sizes, not packet contents, I was able to let tcpdump truncate packets to keep the file size down.)

The answer is yes, Wireshark absolutely had something that could help; the 'Endpoints' option on the Statistics menu gives you a breakdown of the traffic by various endpoint categories, including IPv4 hosts (it will also do it by host+port combination). This immediately pointed me to the high-volume hosts at work.
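If you can't get the capture file to a machine with Wireshark, the core of what the Endpoints view computes for IPv4 can be approximated with a short script. This is a rough sketch under stated assumptions, not a replacement for Wireshark: it handles only the classic pcap format (not pcapng), only Ethernet captures, ignores VLAN tags, and the function name is my own invention. It counts each packet's original on-the-wire length, so truncated captures still give sensible volumes.

```python
import struct
from collections import Counter
from ipaddress import IPv4Address

PCAP_MAGICS = {0xA1B2C3D4, 0xA1B23C4D}  # microsecond and nanosecond variants
LINKTYPE_ETHERNET = 1
ETHERTYPE_IPV4 = 0x0800


def bytes_per_ipv4_host(stream):
    """Sum original packet lengths per IPv4 address in a pcap stream.

    Each packet is credited to both its source and its destination,
    so a host's total is the traffic it sent plus what it received.
    """
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("not a pcap file: short global header")
    # The magic number tells us the byte order of the file.
    for endian in ("<", ">"):
        if struct.unpack(endian + "I", header[:4])[0] in PCAP_MAGICS:
            break
    else:
        raise ValueError("not a classic pcap file")
    (linktype,) = struct.unpack(endian + "I", header[20:24])
    if linktype != LINKTYPE_ETHERNET:
        raise ValueError("only Ethernet captures are handled")

    totals = Counter()
    while True:
        record = stream.read(16)
        if len(record) < 16:
            break
        _sec, _frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        frame = stream.read(incl_len)
        if len(frame) < incl_len:
            break  # the capture file was cut off mid-packet
        if len(frame) < 34:
            continue  # too short for Ethernet plus IPv4 addresses
        (ethertype,) = struct.unpack("!H", frame[12:14])
        if ethertype != ETHERTYPE_IPV4:
            continue
        # 14 byte Ethernet header, then source at IPv4 offset 12
        # and destination at IPv4 offset 16.
        totals[str(IPv4Address(frame[26:30]))] += orig_len
        totals[str(IPv4Address(frame[30:34]))] += orig_len
    return totals
```

You would open the capture file in binary mode and pass it in, then look at totals.most_common(10) to see the busiest hosts.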

Using packet captures for this isn't necessarily as useful and precise as real traffic volume information that is measured directly and reliably by the host in some way, and it likely has more overhead. But it has the large virtue that we can use it in any situation where we can run tcpdump for a while, and almost everything has tcpdump. I can use it with our OpenBSD firewalls to find traffic sources, I can use it with our Linux fileservers to figure out which NFS clients are doing a high volume of read or write IO, and I'm sure I can use it in plenty of other situations too.

(One that just occurred to me is trying to find out who is doing an unusually large number of DNS queries to our DNS servers. We don't have query logging, but we can capture a couple of minutes of traffic to port 53.)

Although I wish we hadn't had this problem today, I'm glad that I now have another tool for troubleshooting problems. And I'm glad that I read that article and its mention of Wireshark stuck in my mind. I really do never know when this stuff will come in handy.

WiresharkTrafficVolume written at 00:48:43


Another way to do easy configuration for lots of Prometheus Blackbox checks

Early on in our use of Prometheus, I wrote up a scheme for easy configuration of lots of Blackbox checks where I encoded the name of the Blackbox module to use in the names of the targets you configured, and then extracted them with relabeling. The result gave you target names that looked like:

 - ssh_banner,somehost:22
 - http_2xx,http://somewhere/url

This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).
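In the real configuration the split is done by Prometheus relabeling, but the logic is simple enough to model in a few lines of Python. This is only an illustration; the function name is hypothetical, and splitting on the first separator only is my assumption about the sensible behavior, since URLs can contain commas of their own.

```python
def split_check(entry, sep=","):
    """Split 'module,target' into a (module, target) pair.

    Only the first separator is significant, so a target URL that
    itself contains the separator is left intact.
    """
    module, found, target = entry.partition(sep)
    if not found or not module or not target:
        raise ValueError("expected MODULE%sTARGET, got %r" % (sep, entry))
    return module, target
```

For example, split_check("ssh_banner,somehost:22") returns ("ssh_banner", "somehost:22").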

This works, but I've learned that there is another approach that is more natural and perhaps clearer, namely adding explicit additional labels to your targets and then using those labels in relabeling to determine things like the Blackbox module or even the target to check.

Let's start with the basics (since I didn't know this for a while), which is that a Prometheus 'targets' section of statically configured targets can have additional labels specified. The ostensible purpose of this (covered in the documentation) is to attach additional labels to all metrics scraped from the targets:

- targets:
    [...]
  labels:
    type: external

(My initial use of this was to explicitly label some of the hosts we check as off-network hosts, because check failure for them is different from failure for our local machines.)

However, as covered in this prometheus-users message from Ben Kochie, these additional labels are available at the start of the scrape, and so you can use relabeling to turn them into things like what Blackbox module to use. For example, suppose you add a 'module: ssh_banner' label to a set of targets that you want checked with that Blackbox module, and then have a relabeling configuration like the following:

# Set the target from the address,
# as usual
- source_labels: [__address__]
  target_label: __param_target

# Set the Blackbox module from
# the 'module:' label
- source_labels: [module]
  target_label: __param_module

# And now point the address to a
# local Blackbox as usual (this
# assumes Blackbox's default port).
- target_label: __address__
  replacement: 127.0.0.1:9115

(As a disclaimer, I haven't actually tested this snippet.)

I see advantages and disadvantages to this approach. One advantage is that it's likely to be clearer and more conventional. People are (or should be) used to attaching extra labels to static targets, and it's clearly documented, so the only magic and mystery is how your additional module label takes effect. While I like my original syntax, it's clearly more magical and unusual; you're going to have to read the relabeling configuration to understand what's going on and how to write additional things.

One drawback is that it pretty much forces you to group checks by module instead of by target. With my scheme, you can list several checks for a host together:

- targets:
  - ssh_banner,host:22
  - smtp_banner,host:25
  - http_2xx,http://host/url

With an explicit label-based approach to selecting the module, each of these has to be in a separate static configuration section because they each need a different module label. On the other hand, this pushes you toward listing all of your checks for a given Blackbox module in one spot.

A place where this can be an active drawback is if you need to vary additional labels for groups of targets, especially across modules. For instance, if you want to attach a 'dc' label to all Blackbox metrics from a group of hosts, you now need to split up those per module sections (with a 'module' label) into multiple sections, one for each combination of module and dc. This could easily get pretty verbose (although it might not matter if you're automatically generating this from external configuration information).
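If you do generate the configuration automatically, the grouping itself is mechanical. Here is a sketch of it in Python, assuming a hypothetical input of (target, labels) pairs; it produces one section per distinct label combination, in the list-of-targets-plus-labels shape that Prometheus's file-based service discovery reads as JSON. The function name and the sample data are made up.

```python
import json
from collections import defaultdict


def group_checks(checks):
    """Group (target, labels) pairs into one section per label set.

    Returns a list of {"targets": [...], "labels": {...}} dicts,
    ordered by label set so that the output is stable between runs.
    """
    groups = defaultdict(list)
    for target, labels in checks:
        # A dict can't be a dict key, so use its sorted items instead.
        groups[tuple(sorted(labels.items()))].append(target)
    return [
        {"targets": targets, "labels": dict(key)}
        for key, targets in sorted(groups.items())
    ]


# Hypothetical checks, listed per host the way a person would write them.
checks = [
    ("host1:22", {"module": "ssh_banner", "dc": "east"}),
    ("host1:25", {"module": "smtp_banner", "dc": "east"}),
    ("host2:22", {"module": "ssh_banner", "dc": "west"}),
    ("host3:22", {"module": "ssh_banner", "dc": "east"}),
]
```

Writing json.dumps(group_checks(checks), indent=2) to a file gives three sections here, one for each combination of module and dc that actually occurs, which lets people keep listing checks by host while Prometheus sees them grouped by labels.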

I probably won't be changing our configuration from my current trick to this more straightforward approach, but I'm going to bear it in mind for future use. Partly this is because our setup already exists and works, and partly it's because we use some additional labels now and I want to preserve our freedom to easily use more in the future.

PrometheusBlackboxBulkChecksII written at 22:41:03

