Wandering Thoughts

2019-10-11

A YAML syntax surprise and trick in Prometheus Alertmanager configuration

In a comment on my entry on doing reboot notifications with Prometheus, Simon noted:

Just a note to say that since Alertmanager v0.16.0, it is possible to group alerts by all labels using "group_by: [...]".

When I saw this syntax in the comment, my eyebrows went up, because I'd never seen any sort of '...' syntax in YAML before; I had no idea it was even a thing you could do in YAML, and I didn't know what it really meant. Was it some special syntax that flagged what would normally be a YAML array for special processing, for example? So I scurried off to the Wikipedia YAML entry, then the official YAML site and the specification, and finally the Alertmanager source code (because sometimes I'm a systems programmer).

As it turns out, this is explained (more or less) in the current Alertmanager documentation, if you read all of the words. Let me quote them:

To aggregate by all possible labels use the special value '...' as the sole label name, for example:
group_by: ['...']

However, the other part of this documentation is less clear, since it lists things as:

[ group_by: '[' <labelname>, ... ']' ]

What is actually going on here is that although the ... looks like YAML syntax, it's actually just a YAML string. The group_by setting is an array of (YAML) strings, which are normally the Prometheus labels to group by, but if you use the string value '...' all by itself, Alertmanager behaves specially. This can be written in a way that looks like syntax instead of a string because YAML allows a lot of unquoted things to be taken as strings (what YAML calls scalars).
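
As a quick illustration of this (my own sketch, not something from the Alertmanager documentation), all three of these forms should parse to the same thing, a one-element list containing the string '...':

group_by: [...]
group_by: ['...']
group_by:
  - '...'

The first form is what makes it look like syntax; the other two make it more obvious that it's really just a string.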

(I'm honestly not sure when you have to quote a YAML string.)

The way that Alertmanager documents this makes it reasonably clear that the '...' is an unusual label, not a bit of YAML syntax, since the documentation both explicitly says so and shows it in quoted form (except in a place where the quotes sort of have a different meaning). However, writing it without the explicit quotes makes things much more confusing unless you're already in tune enough with YAML to get what's going on.

My suspicion is that a lot of people aren't going to be that in tune with YAML, partly because YAML is complex, which makes it easy to believe that there's some aspect of YAML syntax you don't know or don't remember. Certainly this experience has reinforced my view that I should be as explicit as possible in our Prometheus YAML usage, even if it's not necessary under the rules. I should also use a consistent style about whether some things are always quoted or not, instead of varying it around for individual rules, configuration bits, and so on.

(Also I should generally avoid any clever YAML things unless I absolutely have to use them.)

YamlSyntaxSurprise written at 21:46:57

2019-10-08

How we implement reboot notifications for our machines in Prometheus

I wrote yesterday about why we generate alerts that our machines have rebooted, but not about how we do it. It turns out that there are a few little tricks about doing this in Prometheus, especially in an environment where you're using physical servers.

The big issue is that Prometheus isn't actually designed to send notifications; it's designed to have alerts. The difference between a notification and an alert is that you send a notification once and then you're done, while an alert is raised, potentially triggers various sorts of notifications after some delay, and then goes away. To abuse some terms, a notification is edge triggered while an alert is level triggered. To create a notification in a system that's designed for alerts, we basically need to turn the event we want to notify about into a level-triggering condition that we can alert on. This condition needs to be true for a while, so the alert is reliably triggered and sent (even in the face of delays or failure to immediately scrape the server's host agent), but it has to go away again sooner or later (otherwise we will basically have a constantly asserted alert that clutters things up).

So the first thing we need is a condition (ie, a Prometheus expression) that is reliably true if a server has rebooted recently. For Linux machines, what you want to use looks like this:

(node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)

This condition is true from five minutes after the server reboots until 19 minutes afterward, and its value is how long the server has been up (in seconds), which is handy for putting in the actual notification we get. We delay sending the alert until the server has been up for a bit so that if we're repeatedly rebooting the server while working on it, we won't get a deluge of reboot notifications; you could make this shorter if you wanted.

(We turn the alert off after the odd 19 minutes because our alert suppression for large scale issues lingers for 20 minutes after the large scale situation seems to have stopped. By cutting off 'recent reboot' notifications just before that, we avoid getting a bunch of 'X recently rebooted' when a bunch of machines come back up in such a situation.)
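
For illustration, here is a sketch of what this looks like as a full Prometheus alerting rule; the group name, alert name, and annotation are my own invention, but the expression is the one above and the cstype label is what routes it in Alertmanager (more on that below):

groups:
  - name: notifications
    rules:
      - alert: HostRebooted
        expr: (node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
        labels:
          cstype: 'notify'
        annotations:
          summary: '{{ $labels.instance }} has been up for only {{ $value | humanizeDuration }}'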

The obvious way to write this condition is to use 'time()' instead of 'node_time_seconds'. The problem with this is that what the Linux kernel actually exposes is how long the system has been up (in /proc/uptime), not the absolute time of system boot. The Prometheus host agent turns this relative time into an absolute time, using the server's local time. If we use some other source of (absolute) time to try to re-create the time since reboot (such as Prometheus's idea of the current time), we run into problems if and when the server's clock changes after boot. As they say, ask me how I know; our first version used 'time()' and we had all sorts of delayed reboot notifications and so on when servers rebooted or powered on with bad time.

(This is likely to be less of an issue in virtualized environments because your VMs probably boot up with something close to accurate time.)

The other side of the puzzle is in Alertmanager, and comes in two parts. The first part is simply that we want our alert destination (the receiver) for this type of 'alert' not to set send_resolved to true, the way our other receivers do; we only want to get email at the start of the 'alert', not when it quietly goes away. The second part is defeating grouping, because Alertmanager is normally very determined to group alerts together while we pretty much want to get one email per 'notification'. Unfortunately you can't tell Alertmanager to group by nothing ('[]'), so instead we have a long list of labels to 'group by' which in practice makes each alert unique. The result looks like this:

- match:
    cstype: 'notify'
  group_by: ['alertname', 'cstype', 'host', 'instance', 'job', 'probe', 'sendto']
  receiver: notify-receiver
  group_wait: 0s
  group_interval: 5m

We put the special 'cstype' label on all of our notification type alerts in order to route them to this receiver. Since we don't want to group things together and we do want notifications to be immediate, there's no point in a non-zero group_wait (it would only delay the email). The group_interval is there to reduce how much email we'd get if a notification started flapping for some reason.

(The group interval interacts with how soon you trigger notifications, since it will effectively suppress genuine repeated notifications within that time window. This can affect how you want to write the notification alert expressions.)
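
For completeness, the receiver side is unremarkable. Here is a minimal sketch (the email address is invented, and I believe send_resolved defaults to off for email anyway, but being explicit doesn't hurt):

receivers:
  - name: notify-receiver
    email_configs:
      - to: 'sysadmins@example.com'
        send_resolved: false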

Our Alertmanager templates have special handling for these notifications. Because they aren't alerts, they generate different email Subject: lines and have message bodies that talk about notifications instead of alerts (and know that there will never be 'resolved' notifications that they need to tell us about).

All in all, using Prometheus and Alertmanager for this is a bit of a hack, but it works (and works well), and doing it this way saves us from having to build a second system for it. And, as I've mentioned before, this way Prometheus handles dealing with state for us (including the state of 'there is some sort of large scale issue going on, we don't need to be deluged with notes about machines booting up').

PrometheusDoingRebootAlerts written at 21:11:12

2019-10-07

Why we generate alert notifications about our machines having rebooted

One of our Prometheus alerts triggers whenever a machine has recently rebooted. My impression is that having such alerts these days is unusual, so today I'm writing up the two reasons why we have this alert.

(This is an 'alert' in the sense that all of the output from our Prometheus and Alertmanager is an 'alert', but it is not an alert in the sense of bothering someone outside of working hours. All of our alerts go only to email, and we only pay attention to email during working hours.)

The first reason is that our machines aren't normally supposed to reboot (even most of the ones that are effectively cattle instead of pets, although there are some exceptions). Any unexpected reboot is an anomaly that we want to investigate to try to figure out what's going on. Did we have a power glitch in the middle of the night? Did something run into a kernel panic? And so on. Our mechanism for getting notified about these anomalies is email and the easiest way to send that email is as an 'alert'.

But that's only part of the story, because we don't just monitor these machines to see if they reboot, we also monitor them to see if they go down and trigger alerts if they do. Our machines don't take forever to reboot, but with all of the twiddling around that modern BIOSes perform, they do take long enough that our regular 'the machine is down' alerts would normally fire. So the second reason that we have a specific reboot alert is that we delay the regular 'machine is down' alerts for long enough that they won't actually fire if the machine is just rebooting promptly; without an additional specific alert, we wouldn't get anything at all. We do this because we'd rather get one email message if a machine reboots instead of two (a 'down machine' alert email and then an 'it cleared up' resolved alert email).

(We consider some machines sufficiently critical that we don't do this, triggering immediate 'down machine' alerts without waiting to see if it's because of a reboot. But not very many.)

There's an additional reason that I like reboot notifications, which is that I feel they're useful as a diagnostic to explain why a machine suddenly dropped off the network for a while. Whether or not we triggered an explicit alert about the machine disappearing, it did and that may have effects that show up elsewhere (in logs, in user reports, or whatever). With a reboot notification, we immediately know why without having to dig into the situation by hand.

WhyRebootAlerts written at 23:44:12

2019-10-06

Automating our 'bookable' compute servers with SLURM has created generic 'cattle' machines

I'll lead with the thing I realized. Several years ago I wrote about how all of our important machines were 'pets' instead of 'cattle'. One of the reasons for that was that people logged in to specific machines by name in order to use them, and so they cared if a particular machine went down (which is my view of the difference between pets and cattle). Due to recent changes in how we run a bunch of our compute servers, we've more or less transformed these compute servers into cattle machines. So here's the story.

We have some general use compute servers, but one of the traditional problems with them has been exactly that they were general use. You couldn't get one to yourself, and worse, your work on the machine could be affected by whatever else other people decided to run on it too (fair share scheduling helps with this somewhat, but not completely). So for years we also had what we called 'bookable' compute servers, where you could reserve a machine for yourself for a while. At first this started small, with only a few machines, but then it started growing (and we also started adding machines with GPUs).

This created a steadily increasing problem for us, because we maintained these bookings mostly manually. There was some automation to send us email when a machine's booking status had to change, but we had to enter all of the bookings by hand and do the updates by hand. At the start of everything, with only a few machines, there were decent reasons for this; we didn't want to put together a complicated system with a bunch of local software, and it's always dangerous to set up a situation where somewhat fuzzy policies about fairness and so on are enforced through software. By the time we had a bunch of machines, both the actual work and dealing with various policy issues were increasingly a significant burden.

Our eventual solution was to adopt SLURM, configured so that it didn't try to share SLURM nodes (ie compute servers) between people. This isn't how SLURM wants to operate (it'd rather be a fine-grained scheduler), but it's the best approach for us. We moved all of our previous bookable compute servers into SLURM, wrote some documentation on how to use SLURM to basically log in to the nodes, and told everyone they had to switch over to using SLURM whether they liked it or not. Once pushed, people did move and they're probably now using our compute servers more than ever before (partly because they can now get a bunch of them at once for a few days, on the spot).

(We had previously operated a SLURM cluster with a number of nodes and tried to get people to move over from bookable compute servers to the SLURM cluster, without much success. Given a choice, most people would understandably prefer to use the setup they're already familiar with.)

This switch to allocating and managing access to compute servers through SLURM is only part of what has created genuine cattle; automated allocation of our bookable compute servers wouldn't really have had the same effects. Part of it is how SLURM operates: you don't book a machine and then get to log in to it; normally you run a SLURM command and you (or your script) are dumped onto the machine you've been assigned. When you quit or your script exits, your allocation is gone (and you may not be able to get the particular machine back again, if someone else is in the queue). And I feel the final bit of it is that we only let each allocation last for a few days, so no matter what, you're getting interrupted before too long.
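
As a hypothetical illustration (the partition name and time limit here are invented), a session looks roughly like:

$ srun -p bookable --time=2-00:00:00 --pty /bin/bash
somenode$ ./run-my-experiment
somenode$ exit

When you exit the shell, your allocation is released and the node can go to someone else in the queue.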

You can insist on treating SLURM nodes as pets, picking a specific one out to use and caring about it. But SLURM and our entire setup pushes people towards not caring what they get and using nodes only on a transient basis, which means that if one node goes away it's not a big deal.

(This is a good thing because it turns out that some of the donated compute server hardware we're using is a bit flaky and locks up every so often, especially under load. In the days of explicitly booked servers, this would have caused all sorts of problems; now people just have to re-submit jobs or whatever, although it's still not great to have their job abruptly die part-way through.)

SlurmHasCreatedCattle written at 23:05:23

2019-10-04

Vim, its defaults, and the problem this presents sysadmins

One of Vim's many options is 'hidden', which may be off or on. The real thing that it does, behind the thicket of technical description, is that hidden controls whether or not you can casually move away from a modified Vim buffer to another one. In most editors this isn't even an option and you always can (you'll get prompted if you try to exit with unsaved changes). In Vim, for historical reasons, this is an option and for further historical reasons it defaults to 'off'.

(The historical reasons are that it wasn't an option in the original BSD vi, which behaved as if hidden was always off. Vim cares a fair bit about compatibility back to historical vi.)

The default of hidden being off gets in the way of doing certain sorts of things in Vim, like making changes to multiple files at once, and it's also at odds with what I want and how I like to work in my editors. So the obvious thing for me to do would be to add 'set hidden' to my .vimrc and move on. However, there is a problem with that, or rather two problems, because I use Vim partly as a sysadmin's editor. By that I mean that I use vi(m) from several different accounts (including the root account) and on many different machines, not all of which have a shared home directory even for my own account (and root always has a local home directory).

In order for 'set hidden' to be useful to me, it needs to be quite pervasive; it needs to work pretty much everywhere I use vim. Otherwise I will periodically trip over situations where it doesn't work, which means that I'll always have to remember the workarounds (and ideally practice them). As a non-default setting, this is at least difficult (although not completely impossible, since we already have an install framework that puts various things into place on all standard machines).

This is why what programs have as defaults matters a lot to sysadmins, in a way that it doesn't to people who only use one or a few environments on a regular basis. Defaults are all that we can count on everywhere, and our lives are easier if we work within them (we have less to remember, less that we have to customize on as many systems as possible as early as possible, and so on). My life would be a bit easier if Vim had decided that its default was to have hidden on.

PS: The other thing about defaults is that going with the defaults is the course of least discussion in the case of setups used by multiple people, which is an extremely common case for the root account.

Sidebar: The practical flies in my nice theoretical entry

My entry is the theory, but once I actually looked at things it turns out to be not so neat in practice. First off, my own personal .vimrc turns out to already turn on hidden, due to me following the setup guide from Aristotle Pagaltzis' vim-buftabline package. Second, we already install a customized .vimrc in the root account in our standard Ubuntu installs, and reading the comments in it makes it clear that I wrote it. I could probably add 'set hidden' to this and re-deploy it without any objections from my co-workers, and this would cover almost all of the cases that matter to me in practice.

VimDefaultsSysadminProblem written at 22:34:56

2019-10-02

It's useful to record changes that you tried and failed to do

Today, for reasons beyond the scope of this entry, I decided to try out enabling HTTP/2 on our support site. We already have HTTP/2 enabled on another internal Apache server, and both servers run Ubuntu 18.04, so I expected no problems. While I could enable everything fine and restart Apache, to my surprise I didn't get HTTP/2 on the site. Inspecting the Apache error log showed the answer:

[http2:warn] [pid 10400] AH10034: The mpm module (prefork.c) is not supported by mod_http2. The mpm determines how things are processed in your server. HTTP/2 has more demands in this regard and the currently selected mpm will just not do. This is an advisory warning. Your server will continue to work, but the HTTP/2 protocol will be inactive.

We're still using the prefork MPM on this server because when we tried to use the event MPM, we ran into a problem that is probably this Apache bug (we suspect that Ubuntu doesn't have the fix in their 18.04 Apache version). After I found all of this out, I reverted my Apache configuration changes; we'll have to try this again later, in Ubuntu 20.04.
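
(For reference, turning on HTTP/2 in Ubuntu 18.04's Apache generally amounts to something like the following; this is a generic sketch rather than our exact change:

a2enmod http2
# then add to the relevant virtual host or global configuration:
Protocols h2 http/1.1
# and restart Apache:
systemctl restart apache2

The warning quoted above is Apache telling you that none of this will take effect while the prefork MPM is in use.)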

We have a 'worklog' system where we record the changes we make and the work we do in email (that gets archived and so on). Since I didn't succeed here and reverted everything involved, there was no change to record, so at first I was going to just move on to the next bit of work. Then I rethought that and wrote a worklog message anyway to record my failure and why. Sure, I didn't make a change, but our worklog is our knowledge base (and one way we communicate with each other, including people who are on vacation), and now it contains an explanation of why we don't and can't have HTTP/2 on those of our web servers that are using prefork. If or when we come back to deal with HTTP/2 again, we'll have some additional information and context for how things are with it and us.

This is similar to documenting why you didn't do attractive things, but I think of it as somewhat separate. For us, HTTP/2 isn't particularly that sort of an attractive thing; it's just there and it might be nice to turn it on.

(At one level this issue doesn't come up too often because we don't usually fail at changes this way. At another level, perhaps it should come up more often, because we do periodically investigate things, determine that they won't work for some reason, and then quietly move on. I suspect that I wouldn't have thought to write a worklog at all if I had read up on Apache HTTP/2 beforehand and discovered that it didn't work with the prefork MPM. I was biased toward writing a worklog here because I was making an actual change (that I expected to work), which implies a worklog about it.)

RecordingNegativeResults written at 20:49:49

2019-09-30

Using alerts as tests that guard against future errors

On Twitter, I said:

These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).

So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.

(The RCS file alert is a real one; I mentioned it here.)
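
As a sketch of what the POP3 alert might look like: if you have something like the Blackbox exporter doing a POP3 login check and exposing its usual probe_success metric, the alert rule is the obvious one (the job label and the 'for' delay here are hypothetical, not our actual setup):

- alert: POP3LoginBroken
  expr: probe_success{job="pop3-login"} == 0
  for: 5m
  annotations:
    summary: 'POP3 logins to {{ $labels.instance }} are failing'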

In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests over modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.

As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.

(If anything, I tend to think that traditional style sysadmins are more prone to re-breaking things than programmers are because we routinely rebuild our 'programs', ie our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or realize.)

In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.

This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's in the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, telling us about it definitely isn't just annoying; it's something we need to fix.

(In an environment with sophisticated alert handling, you might want to not route these sort of alerts to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)

AlertsAsTestsAndGuards written at 21:35:11

2019-09-27

A file permissions and general deployment annoyance with Certbot

The more we use Certbot, the more I become convinced that it isn't written by people who actually operate it in anything like the kind of environment that we do (and perhaps not at all, although I hope that the EFF uses it for their own web serving). I say this because while Certbot works, there are all sorts of little awkward bits around the edges in practical operation (eg). Today's particular issue is a two part issue concerning file permissions on TLS certificates and keys (and this can turn into a general deployment issue).

Certbot stores all of your TLS certificate information under /etc/letsencrypt/live, which is normally owned by root and is root-only (Unix mode 0700). Well, actually, that's false, because normally the contents of that directory hierarchy are only symlinks to /etc/letsencrypt/archive, which is also owned by root and root-only. This works fine for daemons that read TLS certificate material as root, but not all daemons do; in particular, Exim reads them as the Exim user and group.

The first issue is that Certbot adds an extra level of permissions to TLS private keys. As covered by Certbot's documentation, from Certbot version 0.29.0, private keys for certificates are specifically root-only. This means that you can't give Exim access to the TLS keys it needs just by chgrp'ing /etc/letsencrypt/live and /etc/letsencrypt/archive to the Exim group and then making them mode 0750; you must also specifically chgrp and chmod the private key files. This can be automated with a deploy hook script, which will be run when certificates are renewed.

(Documentation for deploy hooks is hidden away in the discussion of renewing certificates.)
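
A deploy hook for this can be quite short. Here is a sketch, assuming that Exim's group is called 'exim' and that a group-readable private key is acceptable to you; Certbot runs executables from /etc/letsencrypt/renewal-hooks/deploy/ with $RENEWED_LINEAGE set to the certificate's live directory:

#!/bin/sh
# make the newly renewed private key readable by Exim
chgrp exim "$RENEWED_LINEAGE/privkey.pem"
chmod 640 "$RENEWED_LINEAGE/privkey.pem"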

The second issue is that deploy hooks do exactly and only what they're documented to do, which means that deploy hooks do not run the first time you get a certificate. After all, the first time is not a renewal, and Certbot said specifically that deploy hooks run on renewal, not 'any time a certificate is issued'. This means that all of your deployment automation, including changing TLS private key permissions so that your daemons can access the keys, won't happen when you get your initial certificate. You get to do it all by hand.

(You can't easily do it by running your deployment script by hand, because your deployment script is probably counting on various environment variables that Certbot sets.)

We currently get out of this by doing the chgrp and chmod by hand when we get our initial TLS certificates; this adds an extra manual step to initial host setup and conversions to Certbot, which is annoying. If we had more intricate deployment, I think we would have to force an immediate renewal after the TLS certificate had been issued, and to avoid potentially running into rate limits we might want to make our first TLS certificate be a test certificate. Conveniently, there are already other reasons to do this.

CertbotPermissionsAnnoyance written at 00:31:18

2019-09-17

Finding metrics that are missing labels in Prometheus (for alert metrics)

One of the things you can abuse metrics for in Prometheus is to configure different alert levels, alert destinations, and so on for different labels within the same metric, as I wrote about back in my entry on using group_* vector matching for database lookups. The example in that entry used two metrics for filesystems, our_zfs_avail_gb and our_zfs_minfree_gb, the former showing the current available space and the latter describing the alert levels and so on we want. Once we're using metrics this way, one of the interesting questions we could ask is what filesystems don't have a space alert set. As it turns out, we can answer this relatively easily.

The first step is to be precise about what we want. Here, we want to know what 'fs' labels are missing from our_zfs_minfree_gb. An fs label is missing if it's not present in our_zfs_minfree_gb but is present in our_zfs_avail_gb. Since we're talking about sets of labels, answering this requires some sort of set operation.

If our_zfs_minfree_gb only has unique values for the fs label (ie, we only ever set one alert per filesystem), then this is relatively straightforward:

our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb

The our_zfs_avail_gb metric generates our initial set of known fs labels. Then we use UNLESS to subtract the set of all fs labels that are present in our_zfs_minfree_gb. We have to use 'ON(fs)' because the only label we want to match on between the two metrics is the fs label itself.

However, this only works if our_zfs_minfree_gb has no duplicate fs labels. If it does (eg if different people can set their own alerts for the same filesystem), we'd get a 'duplicate series' error from this expression. The usual fix is to use a one-to-many match, but those can't be combined with set operators like 'unless'. Instead we must get creative. Since all we care about is the labels and not the values, we can use an aggregation operation to give us a single series for each label on the right side of the expression:

our_zfs_avail_gb UNLESS ON(fs)
   count(our_zfs_minfree_gb) by (fs)

As a side effect of what they do, all aggregation operators condense multiple instances of a label value this way. It's very convenient if you just want one instance of it; if you care about the resulting value being one that exists in your underlying metrics you can use max() or min().

You can obviously invert this operation to determine 'phantom' alerts, alerts that have fs labels that don't exist in your underlying metric. That expression is:

count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs)
   our_zfs_avail_gb

(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels; if it doesn't, you get a simpler expression.)

Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.

This general approach can be applied to any two metrics where some label ought to be paired up across both. For instance, you could cross-check that every node_uname_info metric is matched by one or more custom per-host informational metrics that your own software is supposed to generate and expose through the node exporter's textfile collector.
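
For example, with a hypothetical our_host_info as the custom metric, finding hosts that are missing it would look like:

node_uname_info UNLESS ON(instance)
   count(our_host_info) by (instance)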

(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)

PrometheusFindUnpairedMetrics written at 00:12:27

2019-09-13

Bidirectional NAT and split horizon DNS in our networking setup

Like many other places, we have far too many machines to give them all public IPs (or at least public IPv4 IPs), especially since they're spread across multiple groups and each group should get its own isolated subnet. Our solution is the traditional one; we use RFC 1918 IPv4 address space behind firewalls, give groups subnets within it (these days generally /16s), and put each group in what we call a sandbox. Outgoing traffic from each sandbox subnet is NAT'd so that it comes out from a gateway IP for that sandbox, or sometimes a small range of them.

However, sometimes people quite reasonably want to have some of their sandbox machines reachable from the outside world for various reasons, and also sometimes they need their machines to have unique and stable public IPs for outgoing traffic. To handle both of these cases, we use OpenBSD's support for bidirectional NAT. We have a 'BINAT subnet' in our public IP address space and each BINAT'd machine gets assigned an IP on it; as external traffic goes through our perimeter firewall, it does the necessary translation between internal addresses and external ones. Although all public BINAT IPs are on a single subnet, the internal IPs are scattered all over all of our sandbox subnets. All of this is pretty standard.

(The public BINAT subnet is mostly virtual, although not entirely so; for various peculiar reasons there are a few real machines on it.)
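
In OpenBSD pf terms, each BINAT mapping is essentially a single 'binat-to' rule. A sketch with invented addresses, assuming $ext_if is the firewall's external interface:

match on $ext_if from 10.27.1.40 binat-to 203.0.113.40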

However, this leaves us with a DNS problem for internal machines (machines behind our perimeter firewall) and internal traffic to these BINAT'd machines. People and machines on our networks want to be able to talk to these machines using their public DNS names, but the way our networks are set up, they must use the internal IP addresses to do so; the public BINAT IP addresses don't work. Fortunately we already have a split-horizon DNS setup, because we long ago made the decision to have a private top level domain for all of our sandbox networks, so we use our existing DNS infrastructure to give BINAT'd machines different IP addresses in the internal and external views. The external view gives you the public IP, which works (only) if you come in through our perimeter firewall; the internal view gives you the internal RFC 1918 IP address, which works only inside our networks.

(In a world where new gTLDs are created like popcorn, having our own top level domain isn't necessarily a great idea, but we set this up many years before the profusion of gTLDs started. And I can hope that it will stop before someone decides to grab the one we use. Even if they do grab it, the available evidence suggests that we may not care if we can't resolve public names in it.)

Using split-horizon DNS this way does leave people (including us) with some additional problems. The first one is cached DNS answers, or in general not talking to the right DNS servers. If your machine moves between internal and external networks, it needs to somehow flush and re-resolve these names. Also, if you're on one of our internal networks and you do DNS queries to someone else's DNS server, you'll wind up with the public IPs and things won't work. This is a periodic source of problems for users, especially since one of the ways to move on or off our internal networks is to connect to our VPN or disconnect from it.

The other problem is that we need to have internal DNS for any public name that your BINAT'd machine has. This is no problem if you give your BINAT machine a name inside our subdomain, since we already run DNS for that, but if you go off to register your own domain for it (for instance, for a web site), things can get sticky, especially if you want your public DNS to be handled by someone else. We don't have any particularly great solutions for this, although there are decent ones that work in some situations.

(Also, you have to tell us what names your BINAT'd machine has. People don't always do this, probably partly because the need for it isn't necessarily obvious to them. We understand the implications of our BINAT system, but we can't expect that our users do.)

(There's both an obvious reason and a subtle reason why we can't apply BINAT translation to all internal traffic, but that's for another entry because the subtle reason is somewhat complicated.)

BinatAndSplitHorizonDNS written at 22:22:40
