Using alerts as tests that guard against future errors
On Twitter, I said:
These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).
So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.
(The RCS file alert is a real one; I mentioned it here.)
In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests all through modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.
As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.
(If anything, I tend to think that traditional style sysadmins are more prone to re-breaking things than programmers are because we routinely rebuild our 'programs', ie our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or realize.)
In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.
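As a rough illustration of what the check behind such an alert might look like, here is a minimal Python sketch of a 'can we log in with POP3' test that reports its result as a metric. It is not the check we actually run; the metric name pop3_login_success, the port, and the timeout are all assumptions made up for the example, and the alert rule that would watch the metric is not shown.

```python
import poplib


def pop3_login_ok(host, user, password, port=995, timeout=10.0,
                  connect=poplib.POP3_SSL):
    """Return 1 if a full POP3 login succeeds, 0 otherwise.

    'connect' is only a parameter so that the check can be exercised
    without a real server; by default it is poplib.POP3_SSL.
    """
    try:
        conn = connect(host, port, timeout=timeout)
    except (OSError, poplib.error_proto):
        return 0  # could not connect or did not get a valid greeting
    try:
        conn.user(user)
        conn.pass_(password)
        return 1
    except (OSError, poplib.error_proto):
        return 0  # connected, but the login itself failed
    finally:
        try:
            conn.quit()
        except (OSError, poplib.error_proto):
            pass


def as_metric(host, ok):
    """Format the result in the Prometheus text exposition format.

    The metric name is hypothetical.
    """
    return 'pop3_login_success{host="%s"} %d\n' % (host, ok)
```

The output of as_metric() could be written somewhere that your metrics system picks up, and the alert then fires when the value is 0 for a while. Doing a real login rather than just checking that the port answers is the point, since the past breakage was in logging in, not in listening.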
This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's in the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, telling us about it definitely isn't just annoying; it's something we need to fix.
(In an environment with sophisticated alert handling, you might want to not route these sort of alerts to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)
A file permissions and general deployment annoyance with Certbot
The more we use Certbot, the more I become convinced that it isn't written by people who actually operate it in anything like the kind of environment that we do (and perhaps not at all, although I hope that the EFF uses it for their own web serving). I say this because while Certbot works, there are all sorts of little awkward bits around the edges in practical operation. Today's particular issue is a two-part issue concerning file permissions on TLS certificates and keys (and this can turn into a general deployment issue).
Certbot stores all of your TLS certificate information under
/etc/letsencrypt/live, which is normally owned by root and is
root-only (Unix mode 0700). Well, actually, that's false, because
normally the contents of that directory hierarchy are only symlinks
to /etc/letsencrypt/archive, which is also owned by root and
root-only. This works fine for daemons that read TLS certificate
material as root, but not all daemons do; in particular, Exim reads
them as the Exim user and group.
The first issue is that Certbot adds an extra level of permissions
to TLS private keys. As covered by Certbot's documentation, from
Certbot version 0.29.0, private keys for certificates are specifically
root-only. This means that you can't give Exim access to the TLS
keys it needs just by chgrp'ing /etc/letsencrypt/live and
/etc/letsencrypt/archive to the Exim group and then making them
mode 0750; you must also specifically chgrp and chmod the private
key files. This can be automated with a deploy hook script, which
will be run when certificates are renewed.
(Documentation for deploy hooks is hidden away in the discussion of renewing certificates.)
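As a sketch of what such a deploy hook could look like, here is a small Python version that makes the private key of the renewed certificate readable by a group. It relies on Certbot setting the RENEWED_LINEAGE environment variable for deploy hooks; the group name 'Debian-exim' and the 0640 mode are assumptions for the example, not necessarily what we use.

```python
import os

# Assumption: the group that your Exim runs as. Adjust for your system.
EXIM_GROUP = "Debian-exim"


def open_key_to_group(lineage, gid, mode=0o640):
    """Make the private key of one certificate lineage group-readable.

    'lineage' is a directory such as /etc/letsencrypt/live/<name>. Its
    privkey.pem is a symlink into the archive tree, so we resolve it
    and change the real file.
    """
    key = os.path.realpath(os.path.join(lineage, "privkey.pem"))
    os.chown(key, -1, gid)  # -1 leaves the owner (root) unchanged
    os.chmod(key, mode)
    return key


# Only act when Certbot has run us as a deploy hook.
if __name__ == "__main__" and "RENEWED_LINEAGE" in os.environ:
    import grp
    open_key_to_group(os.environ["RENEWED_LINEAGE"],
                      grp.getgrnam(EXIM_GROUP).gr_gid)
```

This only deals with the key file itself; the directories above it still need to be made accessible to the group separately.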
The second issue is that deploy hooks do exactly and only what they're documented to do, which means that deploy hooks do not run the first time you get a certificate. After all, the first time is not a renewal, and Certbot said specifically that deploy hooks run on renewal, not 'any time a certificate is issued'. This means that all of your deployment automation, including changing TLS private key permissions so that your daemons can access the keys, won't happen when you get your initial certificate. You get to do it all by hand.
(You can't easily do it by running your deployment script by hand, because your deployment script is probably counting on various environment variables that Certbot sets.)
We currently get out of this by doing the chgrp and chmod by hand when we get our initial TLS certificates; this adds an extra manual step to initial host setup and conversions to Certbot, which is annoying. If we had more intricate deployment, I think we would have to force an immediate renewal after the TLS certificate had been issued, and to avoid potentially running into rate limits we might want to make our first TLS certificate be a test certificate. Conveniently, there are already other reasons to do this.
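If your hooks happen to depend only on the two environment variables that Certbot documents for deploy hooks, RENEWED_LINEAGE and RENEWED_DOMAINS (a space-separated list), one possible way to cover the initial issuance is a small wrapper that runs a hook with those variables set by hand. This is only a sketch under that assumption; the function names are made up, and a hook that relies on anything else Certbot provides won't work this way.

```python
import os
import subprocess


def hook_environment(lineage, domains, base=None):
    """Build the environment a deploy hook would see after a renewal.

    Only the two documented variables are set; everything else is
    copied from 'base' (by default, our own environment).
    """
    env = dict(os.environ if base is None else base)
    env["RENEWED_LINEAGE"] = lineage
    env["RENEWED_DOMAINS"] = " ".join(domains)
    return env


def run_deploy_hook(hook, lineage, domains):
    """Run one hook script as if the certificate had just been renewed."""
    return subprocess.run([hook], env=hook_environment(lineage, domains),
                          check=True)
```

You would call run_deploy_hook() once per hook script right after getting the first certificate, passing the certificate's directory under /etc/letsencrypt/live and the names on the certificate.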
Finding metrics that are missing labels in Prometheus (for alert metrics)
One of the things you can abuse metrics for in Prometheus is to
configure different alert levels, alert destinations, and so on for
different labels within the same metric, as I wrote about back in
my entry on using group_* vector matching for database lookups. The example in that entry used two metrics,
our_zfs_avail_gb and our_zfs_minfree_gb,
the former showing the current available space and the latter
describing the alert levels and so on we want. Once we're using
metrics this way, one of the interesting questions we could ask is
what filesystems don't have a space alert set. As it turns out, we
can answer this relatively easily.
The first step is to be precise about what we want. Here, we want to know what 'fs' labels are missing from our_zfs_minfree_gb. A fs label is missing if it's not present in our_zfs_minfree_gb but is present in our_zfs_avail_gb. Since we're talking about sets of labels, answering this requires some sort of set operation. If our_zfs_minfree_gb only has unique values for the fs label (ie, we only ever set one alert per filesystem), then this is straightforward:
    our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb
The our_zfs_avail_gb metric generates our initial set of known fs labels. Then we use UNLESS to subtract the set of all fs labels that are present in our_zfs_minfree_gb. We have to use 'ON(fs)' because the only label we want to match on between the two metrics is the fs label itself.
However, this only works if our_zfs_minfree_gb has no duplicate fs labels. If it does (eg if different people can set their own alerts for the same filesystem), we'd get a 'duplicate series' error from this expression. The usual fix is to use a one to many match, but those can't be combined with set operators like 'unless'. Instead we must get creative. Since all we care about is the labels and not the values, we can use an aggregation to give us a single series for each label on the right side of the expression:
    our_zfs_avail_gb UNLESS ON(fs) count(our_zfs_minfree_gb) by (fs)
As a side effect of what they do, all aggregation operators condense multiple instances of a label value this way. It's very convenient if you just want one instance of it; if you care about the resulting value being one that exists in your underlying metrics you can use min or max instead of count.
You can obviously invert this operation to determine 'phantom' alerts, alerts that have fs labels that don't exist in your underlying metric. That expression is:
    count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs) our_zfs_avail_gb
(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels; if it doesn't, you get a simpler expression.)
Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.
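If it helps to see the set logic outside of PromQL, here is a small Python model of both queries. It is only an illustration of the idea, not how Prometheus evaluates anything; each series is reduced to a dict of its labels, and the 'user' label and the filesystem names are invented for the example.

```python
def fs_labels(series):
    """Collapse a list of label dicts to the set of distinct 'fs' values.

    This plays the role of 'count(...) by (fs)': duplicate fs labels
    vanish because only the label value is kept.
    """
    return {labels["fs"] for labels in series}


def missing_alerts(avail, minfree):
    """fs labels that have a space metric but no alert level set."""
    return fs_labels(avail) - fs_labels(minfree)


def phantom_alerts(avail, minfree):
    """fs labels that have an alert level but no space metric."""
    return fs_labels(minfree) - fs_labels(avail)


# Hypothetical data standing in for the two metrics.
avail = [{"fs": "/h/100"}, {"fs": "/h/101"}, {"fs": "/h/102"}]
minfree = [
    {"fs": "/h/100", "user": "alice"},
    {"fs": "/h/100", "user": "bob"},    # duplicate fs label
    {"fs": "/h/999", "user": "alice"},  # typo or removed filesystem
]

print(sorted(missing_alerts(avail, minfree)))
print(sorted(phantom_alerts(avail, minfree)))
```

The set difference is the 'UNLESS ON(fs)' step, and swapping its operands is all it takes to go from missing alerts to phantom ones.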
This general approach can be applied to any two metrics where some label ought to be paired up across both. For instance, you could cross-check that every node_uname_info metric is matched by one or more custom per-host informational metrics that your own software is supposed to generate and expose through the node exporter's textfile collector.
(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)
Bidirectional NAT and split horizon DNS in our networking setup
Like many other places, we have far too many machines to give them all public IPs (or at least public IPv4 IPs), especially since they're spread across multiple groups and each group should get its own isolated subnet. Our solution is the traditional one; we use RFC 1918 IPv4 address space behind firewalls, give groups subnets within it (these days generally /16s), and put each group in what we call a sandbox. Outgoing traffic from each sandbox subnet is NAT'd so that it comes out from a gateway IP for that sandbox, or sometimes a small range of them.
However, sometimes people quite reasonably want to have some of their sandbox machines reachable from the outside world for various reasons, and also sometimes they need their machines to have unique and stable public IPs for outgoing traffic. To handle both of these cases, we use OpenBSD's support for bidirectional NAT. We have a 'BINAT subnet' in our public IP address space and each BINAT'd machine gets assigned an IP on it; as external traffic goes through our perimeter firewall, it does the necessary translation between internal addresses and external ones. Although all public BINAT IPs are on a single subnet, the internal IPs are scattered all over all of our sandbox subnets. All of this is pretty standard.
(The public BINAT subnet is mostly virtual, although not entirely so; for various peculiar reasons there are a few real machines on it.)
However, this leaves us with a DNS problem for internal machines (machines behind our perimeter firewall) and internal traffic to these BINAT'd machines. People and machines on our networks want to be able to talk to these machines using their public DNS names, but the way our networks are set up, they must use the internal IP addresses to do so; the public BINAT IP addresses don't work. Fortunately we already have a split-horizon DNS setup, because we long ago made the decision to have a private top level domain for all of our sandbox networks, so we use our existing DNS infrastructure to give BINAT'd machines different IP addresses in the internal and external views. The external view gives you the public IP, which works (only) if you come in through our perimeter firewall; the internal view gives you the internal RFC 1918 IP address, which works only inside our networks.
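To make the two views concrete, here is a toy Python model of what the split-horizon answer looks like for a BINAT'd machine. Everything in it is hypothetical: the host name, both addresses, and especially the rule that the view is chosen purely by whether the client has an RFC 1918 address, which is a simplification of however a real DNS server picks its view.

```python
import ipaddress

# RFC 1918 private address space.
INTERNAL_NETS = [ipaddress.ip_network(n)
                 for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# name -> (internal sandbox IP, public BINAT IP); made-up values.
BINAT_HOSTS = {
    "www.sandbox.example.org": ("10.20.30.40", "192.0.2.40"),
}


def resolve(name, client_ip):
    """Return the address a client should get for a BINAT'd name."""
    internal_ip, public_ip = BINAT_HOSTS[name]
    client = ipaddress.ip_address(client_ip)
    if any(client in net for net in INTERNAL_NETS):
        return internal_ip  # internal view
    return public_ip        # external view


print(resolve("www.sandbox.example.org", "10.1.2.3"))
print(resolve("www.sandbox.example.org", "198.51.100.7"))
```

The same name maps to two different addresses, and which one you get depends only on where you are asking from.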
(In a world where new gTLDs are created like popcorn, having our own top level domain isn't necessarily a great idea, but we set this up many years before the profusion of gTLDs started. And I can hope that it will stop before someone decides to grab the one we use. Even if they do grab it, the available evidence suggests that we may not care if we can't resolve public names in it.)
Using split-horizon DNS this way does leave people (including us) with some additional problems. The first one is cached DNS answers, or in general not talking to the right DNS servers. If your machine moves between internal and external networks, it needs to somehow flush and re-resolve these names. Also, if you're on one of our internal networks and you do DNS queries to someone else's DNS server, you'll wind up with the public IPs and things won't work. This is a periodic source of problems for users, especially since one of the ways to move on or off our internal networks is to connect to our VPN or disconnect from it.
The other problem is that we need to have internal DNS for any public name that your BINAT'd machine has. This is no problem if you give your BINAT machine a name inside our subdomain, since we already run DNS for that, but if you go off to register your own domain for it (for instance, for a web site), things can get sticky, especially if you want your public DNS to be handled by someone else. We don't have any particularly great solutions for this, although there are decent ones that work in some situations.
(Also, you have to tell us what names your BINAT'd machine has. People don't always do this, probably partly because the need for it isn't necessarily obvious to them. We understand the implications of our BINAT system, but we can't expect that our users do.)
(There's both an obvious reason and a subtle reason why we can't apply BINAT translation to all internal traffic, but that's for another entry because the subtle reason is somewhat complicated.)
Using Wireshark's Statistics menu to get per-host traffic volume
As part of my casual Internet browsing, I recently read 6 Lessons we learned when debugging a scaling problem on GitLab.com. As sort of an aside (although listed as a lesson), the article mentioned Wireshark's Statistics menu and how it can show you per-conversation information (and thus let you find specific sorts of conversations, such as short ones). I didn't think about it much at the time, but this mention stuck in the back of my mind (as such things often do, at least for a while).
Today I had a situation where we had a saturated OpenBSD firewall
and I very much wanted to find out roughly what hosts were responsible
for the traffic. OpenBSD has per-interface statistics (which let
me see that the firewall's interface was saturated with incoming
traffic), but it doesn't have anything more granular by default and
we didn't have any traffic accounting stuff set up in our PF rules.
I tried a plain
tcpdump, but this firewall sits in front of enough
hosts that the output was overwhelming. As I was thinking unhappy
thoughts about trying to write some awk on the fly, a little light
went on; perhaps Wireshark could help. So I used tcpdump to capture
a minute or two of traffic to a file, copied the capture file over
to my Linux machine, and fired up Wireshark.
(Since I only cared about packet sizes, not packet contents, I was able to let tcpdump truncate packets to keep the file size down.)
The answer is yes, Wireshark absolutely had something that could help; the 'Endpoints' option on the Statistics menu gives you a breakdown of the traffic by various endpoint categories, including IPv4 hosts (it will also do it by host+port combination). This immediately pointed me to the high-volume hosts at work.
Using packet captures for this isn't necessarily as useful and precise as real traffic volume information that is measured directly and reliably by the host in some way, and it likely has more overhead. But it has the large virtue that we can use it in any situation where we can run tcpdump for a while, and almost everything has tcpdump. I can use it with our OpenBSD firewalls to find traffic sources, I can use it with our Linux fileservers to figure out which NFS clients are doing a high volume of read or write IO, and I'm sure I can use it in plenty of other situations too.
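If you'd rather not move the capture file to a machine with Wireshark, the same sort of per-host totals can be approximated with a short script. The following Python sketch is not what Wireshark does internally; it assumes a classic pcap file (not pcapng) of plain Ethernet frames, only looks at IPv4, skips VLAN-tagged frames, and needs the capture to keep at least the first 34 bytes of each packet. It counts the original packet length, so truncated captures still give real volumes.

```python
import collections
import socket
import struct

MIN_FRAME = 14 + 20       # Ethernet header plus minimal IPv4 header
ETHERTYPE_IPV4 = 0x0800
LINKTYPE_ETHERNET = 1


def bytes_per_host(pcap):
    """Sum on-the-wire bytes per IPv4 address, as source or destination.

    'pcap' is a binary file object positioned at the start of the file.
    """
    header = pcap.read(24)
    magic = header[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    linktype = struct.unpack(endian + "I", header[20:24])[0]
    if linktype != LINKTYPE_ETHERNET:
        raise ValueError("only Ethernet captures are handled")

    totals = collections.Counter()
    while True:
        record = pcap.read(16)
        if len(record) < 16:
            break
        _sec, _frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        data = pcap.read(incl_len)
        if len(data) < MIN_FRAME:
            continue
        if struct.unpack("!H", data[12:14])[0] != ETHERTYPE_IPV4:
            continue
        # IPv4 source and destination sit at offsets 12 and 16 of the
        # IP header, which itself starts after the 14-byte Ethernet header.
        totals[socket.inet_ntoa(data[26:30])] += orig_len
        totals[socket.inet_ntoa(data[30:34])] += orig_len
    return totals
```

With a capture file opened in binary mode, bytes_per_host(f).most_common(10) gives the ten busiest addresses. Each packet is counted for both its source and its destination.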
(One that just occurred to me is trying to find out who is doing an unusually large number of DNS queries to our DNS servers. We don't have query logging, but we can capture a couple of minutes of traffic to port 53.)
Although I wish we hadn't had this problem today, I'm glad that I now have another tool for troubleshooting problems. And I'm glad that I read that article and its mention of Wireshark stuck in my mind. I really do never know when this stuff will come in handy.
Another way to do easy configuration for lots of Prometheus Blackbox checks
Early on in our use of Prometheus, I wrote up a scheme for easy configuration of lots of Blackbox checks where I encoded the name of the Blackbox module to use in the names of the targets you configured, and then extracted them with relabeling. The result gave you target names that looked like:
    - ssh_banner,somehost:22
    - http_2xx,http://somewhere/url
This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).
This works, but I've learned that there is another approach that is more natural and perhaps clearer, namely adding explicit additional labels to your targets and then using those labels in relabeling to determine things like the Blackbox module or even the target to check.
Let's start with the basics (since I didn't know this for a while), which is that a Prometheus 'targets' section of statically configured targets can have additional labels specified. The ostensible purpose of this (covered in the documentation) is to attach additional labels to all metrics scraped from the targets:
    - targets:
        - 188.8.131.52:53
        - 184.108.40.206:53
      labels:
        type: external
(My initial use of this was to explicitly label some of the hosts we check as off-network hosts, because check failure for them is different from failure for our local machines.)
However, as covered in this prometheus-users message from Ben, these additional labels are available at the start of the scrape, and so you can use relabeling to turn them into things like what Blackbox module to use. For example, suppose you add a 'module: ssh_banner' label to a set of targets that you want checked with that Blackbox module, and then have a relabeling configuration like this:
    # Set the target from the address,
    # as usual
    - source_labels: [__address__]
      target_label: __param_target
    # Set the Blackbox module from
    # the 'module:' label
    - source_labels: [module]
      target_label: __param_module
    # And now point the address to a
    # local Blackbox as usual.
    - target_label: __address__
      replacement: 127.0.0.1:9115
(As a disclaimer, I haven't actually tested this snippet.)
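To see what those three steps do to a single target, here is a tiny Python model of them. It is deliberately simplified and is not how Prometheus implements relabeling (it ignores regexes and separators and just copies label values), but it shows which labels end up where.

```python
def apply_blackbox_relabeling(labels, blackbox="127.0.0.1:9115"):
    """Apply the three relabeling steps to one target's label set."""
    out = dict(labels)
    # source_labels: [__address__]  ->  target_label: __param_target
    out["__param_target"] = out["__address__"]
    # source_labels: [module]  ->  target_label: __param_module
    out["__param_module"] = out["module"]
    # target_label: __address__, replacement: 127.0.0.1:9115
    out["__address__"] = blackbox
    return out


# A statically configured target with an extra 'module' label.
target = {"__address__": "somehost:22", "module": "ssh_banner"}
for name, value in sorted(apply_blackbox_relabeling(target).items()):
    print(name, "=", value)
```

Prometheus then scrapes the local Blackbox exporter with target=somehost:22 and module=ssh_banner as URL parameters. Since 'module' doesn't start with two underscores, it also stays on as an ordinary label on the resulting metrics.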
I see advantages and disadvantages to this approach. One advantage is that it's likely to be clearer and more normal. People are (or should be) used to attaching extra labels to static targets, and it's clearly documented, so the only magic and mystery is how your module label takes effect. While I like my original 'module,target' syntax, it's clearly more magical and unusual; you're going to have to read the relabeling configuration to understand what's going on and how to write additional things.
One drawback is that it pretty much forces you to group checks by module instead of by target. With my scheme, you can list several checks for a host together:
    - targets:
      [...]
      - ssh_banner,host:22
      - smtp_banner,host:25
      - http_2xx,http://host/url
With an explicit label-based approach to selecting the module, each
of these has to be in a separate static configuration section because
they each need a different
module label. On the other hand, this
pushes you toward listing all of your checks for a given Blackbox
module in one spot.
A place where this can be an active drawback is if you need to vary additional labels for groups of targets, especially across modules. For instance, if you want to attach a 'dc' label to all Blackbox metrics from a group of hosts, you now need to split up those per module sections (with a 'module' label) into multiple sections, one for each combination of module and dc. This could easily get pretty verbose (although it might not matter if you're automatically generating this from external configuration information).
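As a sketch of what that automatic generation could look like, here is a small Python function that takes a flat list of checks and groups them into one section per combination of module and dc. The input format and the dc values are invented for the example; the output has the 'targets' plus 'labels' shape that Prometheus accepts for static targets and, as JSON, for file-based service discovery.

```python
import collections
import json


def static_sections(checks):
    """Group (module, dc, target) checks into one section per label combination."""
    groups = collections.defaultdict(list)
    for module, dc, target in checks:
        groups[(module, dc)].append(target)
    return [
        {"targets": targets, "labels": {"module": module, "dc": dc}}
        for (module, dc), targets in sorted(groups.items())
    ]


# Hypothetical checks, listed per host the way you'd naturally write them.
checks = [
    ("ssh_banner", "east", "host1:22"),
    ("http_2xx", "east", "http://host1/url"),
    ("ssh_banner", "west", "host2:22"),
    ("ssh_banner", "east", "host3:22"),
]

print(json.dumps(static_sections(checks), indent=2))
```

This lets you keep writing checks grouped by host while the generated configuration is grouped by labels, so the verbosity lands in a file nobody edits by hand.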
I probably won't be changing our configuration from my current trick to this more straightforward approach, but I'm going to bear it in mind for future use. Partly this is because our setup already exists and works, and partly it's because we use some additional labels now and I want to preserve our freedom to easily use more in the future.