A YAML syntax surprise and trick in Prometheus Alertmanager configuration
In a comment on my entry on doing reboot notifications with Prometheus, Simon noted:
Just a note to say that since Alertmanager v0.16.0, it is possible to group alerts by all labels using "group_by: [...]".
When I saw this syntax in the comment, my eyebrows went up, because
I'd never seen any sort of '[...]' syntax in YAML before; I had no
idea it was even a thing you could do in YAML, and I didn't know
what it really meant. Was it some special syntax that flagged what
would normally be a YAML array for special processing, for example?
So I scurried off to the Wikipedia YAML entry, then the official YAML site
and the specification, and finally the Alertmanager
source code (because sometimes I'm a systems programmer).
As it turns out this is explained (more or less) in the current Alertmanager documentation, if you read all of the words. Let me quote them:
To aggregate by all possible labels use the special value '...' as the sole label name, for example:
group_by: ['...']
However, the other part of this documentation is less clear, since it lists things as:
[ group_by: '[' <labelname>, ... ']' ]
What is actually going on here is that although the '...' looks
like YAML syntax, it's actually just a YAML string. The group_by
setting is an array of (YAML) strings, which are normally the
Prometheus labels to group by, but if you use the string value '...'
all by itself, Alertmanager behaves specially. This can be written
in a way that looks like syntax instead of a string because YAML
allows a lot of unquoted things to be taken as strings (what the YAML
specification calls plain scalars).
(I'm honestly not sure when you have to quote a YAML string.)
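To make the equivalence concrete, here is a small sketch of three spellings that a YAML parser should read as the same thing, a one-element list containing the string '...'. The child routes and their 'cstype' values are made up purely so that each spelling has somewhere to live; they aren't from our real configuration.

```yaml
# Hypothetical child routes; only the group_by lines matter here.
routes:
  - match:
      cstype: 'example-a'    # made-up label value
    group_by: [...]          # unquoted: a plain scalar inside a flow sequence
  - match:
      cstype: 'example-b'    # made-up label value
    group_by: ['...']        # single-quoted, as the Alertmanager docs show it
  - match:
      cstype: 'example-c'    # made-up label value
    group_by: ["..."]        # double-quoted; the same string again
```

The quoted forms have the advantage of looking like a string to anyone reading the file. (A line that consists only of '...' starting in the first column is something else again in YAML, a document end marker, which is one more argument for quoting.)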
The way that Alertmanager documents this makes it reasonably clear that the '...' is an unusual label, not a bit of YAML syntax, since the documentation both explicitly says so and shows it in quoted form (except in a place where the quotes sort of have a different meaning). However, writing it without the explicit quotes makes things much more confusing unless you're already in tune enough with YAML to get what's going on.
My suspicion is that a lot of people aren't going to be that in tune with YAML, partly because YAML is complex, which makes it easy to believe that there's some aspect of YAML syntax you don't know or don't remember. Certainly this experience has reinforced my view that I should be as explicit as possible in our Prometheus YAML usage, even if it's not necessary under the rules. I should also use a consistent style about whether some things are always quoted or not, instead of varying it around for individual rules, configuration bits, and so on.
(Also I should generally avoid any clever YAML things unless I absolutely have to use them.)
How we implement reboot notifications when our machines reboot in Prometheus
I wrote yesterday about why we generate alerts that our machines have rebooted, but not about how we do it. It turns out that there are a few little tricks about doing this in Prometheus, especially in an environment where you're using physical servers.
The big issue is that Prometheus isn't actually designed to send notifications; it's designed to have alerts. The difference between a notification and an alert is that you send a notification once and then you're done, while an alert is raised, potentially triggers various sorts of notifications after some delay, and then goes away. To abuse some terms, a notification is edge triggered while an alert is level triggered. To create a notification in a system that's designed for alerts, we basically need to turn the event we want to notify about into a level-triggering condition that we can alert on. This condition needs to be true for a while, so the alert is reliably triggered and sent (even in the face of delays or failure to immediately scrape the server's host agent), but it has to go away again sooner or later (otherwise we will basically have a constantly asserted alert that clutters things up).
So the first thing we need is a condition (ie, a Prometheus expression) that is reliably true if a server has rebooted recently. For Linux machines, what you want to use looks like this:
(node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
This condition is true between five minutes after the server rebooting and 19 minutes, and its value is how long the server has been up (in seconds), which is handy for putting in the actual notification we get. We delay sending the alert until the server has been up for a bit so that if we're repeatedly rebooting the server while working on it, we won't get a deluge of reboot notifications; you could make this shorter if you wanted.
(We turn the alert off after the odd 19 minutes because our alert suppression for large scale issues lingers for 20 minutes after the large scale situation seems to have stopped. By cutting off 'recent reboot' notifications just before that, we avoid getting a bunch of 'X recently rebooted' when a bunch of machines come back up in such a situation.)
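As a rough sketch, the expression above could sit in a Prometheus rule file along the following lines. The group name, alert name and annotation text are invented for illustration; only the expression and the 'cstype' label come from our setup.

```yaml
groups:
  - name: reboot-notifications          # hypothetical group name
    rules:
      - alert: RecentlyRebooted         # hypothetical alert name
        # True from 5 minutes up to (but not including) 19 minutes after
        # boot; the value of the expression is the uptime in seconds.
        expr: (node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
        labels:
          cstype: 'notify'              # marks this as a notification type alert
        annotations:
          # humanizeDuration turns the uptime in seconds into readable text
          summary: '{{ $labels.instance }} rebooted {{ $value | humanizeDuration }} ago'
```

There is deliberately no 'for:' here, because the delay before the alert fires is already built into the expression itself.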
The obvious way to write this condition is to use 'time()' instead of
'node_time_seconds'. The problem with this is that what the
Linux kernel actually exposes is how long the system has been up
(in /proc/uptime), not the absolute time of system boot. The
Prometheus host agent turns this relative time into an absolute
time, using the server's local time. If we use some other source
of (absolute) time to try to re-create the time since reboot (such
as Prometheus's idea of the current time), we run into problems if
and when the server's clock changes after boot. As they say, ask
me how I know; our first version used 'time()' and we had all
sorts of delayed reboot notifications and so on when servers rebooted
or powered on with bad time.
(This is likely to be less of an issue in virtualized environments because your VMs probably boot up with something close to accurate time.)
The other side of the puzzle is in Alertmanager, and comes in two
parts. The first part is simply that we want our alert destination
(the receiver) for this type of 'alerts' to not set send_resolved
to true, the way our other receivers do; we only want to get email
at the start of the 'alert', not when it quietly goes away. The
second part is defeating grouping, because Alertmanager is normally
very determined to group alerts together while we pretty much want
to get one email per 'notification'. Unfortunately you can't tell
Alertmanager to group by nothing ('group_by: []'), so instead we have a
long list of labels to 'group by' which in practice make each alert
unique. The result looks like this:
- match:
    cstype: 'notify'
  group_by: ['alertname', 'cstype', 'host', 'instance', 'job', 'probe', 'sendto']
  receiver: notify-receiver
  group_wait: 0s
  group_interval: 5m
We put the special 'cstype' label on all of our notification type
alerts in order to route them to this. Since we don't want to group
things together and we do want notifications to be immediate, there's
no point in a non-zero group_wait (it would only delay the
email). The group_interval is to reduce how much email we'd get
if a notification started flapping for some reason.
(The group interval interacts with how soon you trigger notifications, since it will effectively suppress genuine repeated notifications within that time window. This can affect how you want to write the notification alert expressions.)
Our Alertmanager templates have special handling for these
notifications. Because they aren't alerts, they generate different
Subject: lines and have message bodies that talk about
notifications instead of alerts (and know that there will never
be 'resolved' notifications that they need to tell us about).
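For illustration, a receiver to go with that route might look something like this. The email address and the Subject template are placeholders rather than what we actually use (our real templates are more involved); 'notify-receiver' is the receiver name from the route above.

```yaml
receivers:
  - name: notify-receiver
    email_configs:
      - to: 'sysadmins@example.org'     # hypothetical address
        send_resolved: false            # no email when the 'alert' quietly goes away
        headers:
          # hypothetical Subject line that talks about a notification, not an alert
          Subject: 'Notification: {{ .CommonLabels.alertname }} for {{ .CommonLabels.host }}'
```

Since each notification is effectively its own group, the common labels here are just the labels of the one alert in the email.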
All in all using Prometheus and Alertmanager for this is a bit of a hack, but it works (and works well) and doing it this way saves us from having to build a second system for it. And, as I've mentioned before, this way Prometheus handles dealing with state for us (including the state of 'there is some sort of large scale issue going on, we don't need to be deluged with notes about machines booting up').
Why we generate alert notifications about our machines having rebooted
Part of our Prometheus alerts is an alert that triggers whenever a machine has been recently rebooted. My impression is that having such alerts these days is unusual, so today I'm writing up the two reasons why we have this alert.
(This is an 'alert' in the sense that all of the output from our Prometheus and Alertmanager is an 'alert', but it is not an alert in the sense of bothering someone outside of working hours. All of our alerts go only to email, and we only pay attention to email during working hours.)
The first reason is that our machines aren't normally supposed to reboot (even most of the ones that are effectively cattle instead of pets, although there are some exceptions). Any unexpected reboot is an anomaly that we want to investigate to try to figure out what's going on. Did we have a power glitch in the middle of the night? Did something run into a kernel panic? And so on. Our mechanism for getting notified about these anomalies is email and the easiest way to send that email is as an 'alert'.
But that's only part of the story, because we don't just monitor these machines to see if they reboot, we also monitor them to see if they go down and trigger alerts if they do. Our machines don't take forever to reboot, but with all of the twiddling around the modern BIOSes perform they do take long enough that our regular 'the machine is down' alerts should fire. So the second reason that we have a specific reboot alert is because we delay the regular 'machine is down' alerts for long enough that they won't actually fire if the machine is just rebooting immediately; without an additional specific alert, we wouldn't get anything at all. We do this because we'd rather get one email message if a machine reboots instead of two (a 'down machine' alert email and then an 'it cleared up' resolved alert email).
(We consider some machines sufficiently critical that we don't do this, triggering immediate 'down machine' alerts without waiting to see if it's because of a reboot. But not very many.)
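One way to get this sort of delay in Prometheus is the 'for:' setting on an alert rule. The following is only a sketch under assumptions: it treats a failed scrape of the host agent as 'the machine is down', and the job label, the names and the ten minute delay are all invented rather than taken from our real rules.

```yaml
groups:
  - name: machine-down                  # hypothetical group name
    rules:
      - alert: MachineDown              # hypothetical alert name
        expr: up{job="node"} == 0       # assumes the host agent is scraped as job "node"
        for: 10m                        # assumed delay; longer than a normal reboot takes
```

With a rule like this, a machine that comes back within the 'for:' period never fires the down alert at all, which is why a separate reboot notification is needed to hear about it.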
There's an additional reason that I like reboot notifications, which is that I feel they're useful as a diagnostic to explain why a machine suddenly dropped off the network for a while. Whether or not we triggered an explicit alert about the machine disappearing, it did and that may have effects that show up elsewhere (in logs, in user reports, or whatever). With a reboot notification, we immediately know why without having to dig into the situation by hand.
Automating our 'bookable' compute servers with SLURM has created generic 'cattle' machines
I'll lead with the thing I realized. Several years ago I wrote about how all of our important machines were 'pets' instead of 'cattle'. One of the reasons for that was that people logged in to specific machines by name in order to use them, and so they cared if a particular machine went down (which is my view of the difference between pets and cattle). Due to recent changes in how we run a bunch of our compute servers, we've more or less transformed them into cattle machines. So here's the story.
We have some general use compute servers, but one of the traditional problems with them has been exactly that they were general use. You couldn't get one to yourself and worse, your work on the machine could be affected by whatever else other people decided to run on it too (fair share scheduling helps with this somewhat, but not completely). So for years we also had what we called 'bookable' compute servers, where you could reserve a machine for yourself for a while. At first this started small, with only a few machines, but then it started growing (and we also started adding machines with GPUs).
This created a steadily increasing problem for us, because we maintained these bookings mostly manually. There was some automation to send us email when a machine's booking status had to change, but we had to enter all of the bookings by hand and do the updates by hand. At the start of everything, with only a few machines, there were decent reasons for this; we didn't want to put together a complicated system with a bunch of local software, and it's always dangerous to set up a situation where somewhat fuzzy policies about fairness and so on are enforced through software. By the time we had a bunch of machines, both the actual work and dealing with various policy issues was increasingly a significant burden.
Our eventual solution was to adopt SLURM, configured so that it didn't try to share SLURM nodes (ie compute servers) between people. This isn't how SLURM wants to operate (it'd rather be a fine-grained scheduler), but it's the best approach for us. We moved all of our previous bookable compute servers into SLURM, wrote some documentation on how to use SLURM to basically log in to the nodes, and told everyone they had to switch over to using SLURM whether they liked it or not. Once pushed, people did move and they're probably now using our compute servers more than ever before (partly because they can now get a bunch of them at once for a few days, on the spot).
(We had previously operated a SLURM cluster with a number of nodes and tried to get people to move over from bookable compute servers to the SLURM cluster, without much success. Given a choice, most people would understandably prefer to use the setup they're already familiar with.)
This switch to allocating and managing access to compute servers through SLURM is only part of what has created genuine cattle; automated allocation of our bookable compute servers wouldn't really have had the same effects. Part of it is that how SLURM operates is that you don't book a machine and then get to log in to it; normally you run a SLURM command and you (or your script) are dumped onto the machine you've been assigned. When you quit or your script exits, your allocation is gone (and you may not be able to get the particular machine back again, if someone else is in the queue). And I feel the final bit of it is that we only let each allocation last for a few days, so no matter what you're getting interrupted before too long.
You can insist on treating SLURM nodes as pets, picking a specific one out to use and caring about it. But SLURM and our entire setup pushes people towards not caring what they get and using nodes only on a transient basis, which means that if one node goes away it's not a big deal.
(This is a good thing because it turns out that some of the donated compute server hardware we're using is a bit flaky and locks up every so often, especially under load. In the days of explicitly booked servers, this would have been all sorts of problems; now people just have to re-submit jobs or whatever, although it's still not great to have their job abruptly die part-way through.)
Vim, its defaults, and the problem this presents sysadmins
One of Vim's many options is 'hidden', which
may be off or on. The real thing that it does, behind the thicket
of technical description, is that
hidden controls whether or not
you can casually move away from a modified Vim buffer to another
one. In most editors this isn't even an option and you always can
(you'll get prompted if you try to exit with unsaved changes). In
Vim, for historical reasons, this is an option and for further
historical reasons it defaults to 'off'.
(The historical reasons are that it wasn't an option in the original BSD
vi, which behaved as if
hidden was always off. Vim cares a fair bit
about compatibility back to historical vi.)
The default of
hidden being off gets in the way of doing certain
sorts of things in Vim, like making changes to multiple files at
once, and it's also at odds with what
I want and how I like to work in my editors. So the obvious thing
for me to do would be to add 'set hidden' to my .vimrc and move
on. However, there is a problem with that, or rather two problems,
because I use Vim partly as a sysadmin's editor.
By that I mean that I use vi(m) from several different accounts
(including the root account) and on many different machines, not
all of which have a shared home directory even for my own account
(and root always has a local home directory).
In order for 'set hidden' to be useful to me, it needs to be quite
pervasive; it needs to work pretty much everywhere I use vim. Otherwise
I will periodically trip over situations where it doesn't work, which
means that I'll always have to remember the workarounds (and ideally
practice them). As a non-default setting, this is at least difficult
(although not completely impossible, since we already have an install
framework that puts various things into place on all standard machines).
This is why what programs have as defaults matters a lot to sysadmins,
in a way that it doesn't to people who only use one or a few
environments on a regular basis. Defaults are all that we can count on
everywhere, and our lives are easier if we work within them (we have
less to remember, less to customize on as many systems as possible as
early as possible, and so on). My life would be a bit easier if Vim
had decided that its default was to have hidden on.
PS: The other thing about defaults is that going with the defaults
is the course of least discussion in the case of setups used by
multiple people, which is an extremely common case for the root
account.
Sidebar: The practical flies in my nice theoretical entry
My entry is the theory, but once I actually looked at things it
turns out to be not so neat in practice. First off, my own personal
.vimrc turns out to already turn on
hidden, due to me following
the setup guide from Aristotle Pagaltzis' vim-buftabline package. Second, we already install a
.vimrc in the root account in our standard Ubuntu
installs, and reading the comments in it makes it clear that I wrote
it. I could probably add 'set hidden' to this and re-deploy it
without any objections from my co-workers, and this would cover
almost all of the cases that matter to me in practice.
It's useful to record changes that you tried and failed to do
Today, for reasons beyond the scope of this entry, I decided to try out enabling HTTP/2 on our support site. We already have HTTP/2 enabled on another internal Apache server, and both servers run Ubuntu 18.04, so I expected no problems. While I could enable everything fine and restart Apache, to my surprise I didn't get HTTP/2 on the site. Inspecting the Apache error log showed the answer:
[http2:warn] [pid 10400] AH10034: The mpm module (prefork.c) is not supported by mod_http2. The mpm determines how things are processed in your server. HTTP/2 has more demands in this regard and the currently selected mpm will just not do. This is an advisory warning. Your server will continue to work, but the HTTP/2 protocol will be inactive.
We're still using the prefork MPM on this server because when we tried to use the event MPM, we ran into a problem that is probably this Apache bug (we suspect that Ubuntu doesn't have the fix for it in their 18.04 Apache version). After I found all of this out, I reverted my Apache configuration changes; we'll have to try this later, in 20.04.
We have a 'worklog' system where we record the changes we make and the work we do in email (that gets archived and so on). Since I didn't succeed here and reverted everything involved, there is no change to record, so at first I was going to just move on to the next bit of work. Then I rethought that and wrote a worklog message anyway to record my failure and why. Sure, I didn't make a change, but our worklog is our knowledge base (and one way we communicate with each other, including people who are on vacation), and now it contains an explanation of why we don't and can't have HTTP/2 on those of our web servers that are using prefork. If or when we come back to deal with HTTP/2 again, we'll have some additional information and context for how things are with it and us.
This is similar to documenting why you didn't do attractive things, but I think of it as somewhat separate. For us, HTTP/2 isn't particularly that sort of an attractive thing; it's just there and it might be nice to turn it on.
(At one level this issue doesn't come up too often because we don't usually fail at changes this way. At another level, perhaps it should come up more often, because we do periodically investigate things, determine that they won't work for some reason, and then quietly move on. I suspect that I wouldn't have thought to write a worklog at all if I had read up on Apache HTTP/2 beforehand and discovered that it didn't work with the prefork MPM. I was biased toward writing a worklog here because I was making an actual change (that I expected to work), which implies a worklog about it.)
Using alerts as tests that guard against future errors
On Twitter, I said:
These days, I think of many of our alerts as tests, like code tests to verify that bugs don't come back. If we broke something in the past and didn't notice or couldn't easily spot what was wrong, we add an alert (and a metric or check for it to use, if necessary).
So we have an alert for 'can we log in with POP3' (guess what I broke once, and surprise, GMail uses POP3 to pull email from us), and one for 'did we forget to commit this RCS file and broke self-serve device registration', and so on.
(The RCS file alert is a real one; I mentioned it here.)
In modern programming, it's conventional that when you find a bug in your code, you usually write a test that checks for it (before you fix the bug). This test is partly to verify that you actually fixed the bug, but it's also there to guard against the bug ever coming back; after all, if you got it wrong once, you might accidentally get it wrong again in the future. You can find a lot of these tests over modern codebases, especially in tricky areas, and if you read the commit logs you can usually find people saying exactly this about the newly added tests.
As sysadmins here, how we operate our systems isn't exactly programming, but I think that some of the same principles apply. Like programmers, we're capable of breaking things or setting up something that is partially but not completely working. When that happens, we can fix it (like programmers fixing a bug) and move on, or we can recognize that if we made a mistake once, we might make the same mistake later (or a similar one that has the same effects), just like issues in programs can reappear.
(If anything, I tend to think that traditional style sysadmins are more prone to re-breaking things than programmers are because we routinely rebuild our 'programs', ie our systems, due to things like operating systems and programs getting upgraded. Every new version of Ubuntu and its accompanying versions of Dovecot, Exim, Apache, and so on is a new chance to recreate old problems, and on top of that we tend to build things with complex interdependencies that we often don't fully understand or realize.)
In this environment, my version of tests has become alerts. As I said in the tweets, if we broke something in the past and didn't notice, I'll add an alert for it to make sure that if we do it again, we'll find out right away this time around. Just as with the tests that programmers add, I don't expect these alerts to ever fire, and certainly not very often; if they do fire frequently, then either they're bad (just as tests can be bad) or we have a process problem, where we need to change how we operate so we stop making this particular mistake so often.
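As a sketch of what such a 'test' alert might look like, here is a rule for the POP3 login case. It assumes that some checker exposes a 0/1 success metric in the style of the Blackbox exporter's probe_success; the job label, the names and the timing are hypothetical and not our actual rule.

```yaml
groups:
  - name: regression-alerts             # hypothetical group name
    rules:
      - alert: POP3LoginFailing         # hypothetical alert name
        # assumes a login check that reports 1 for success and 0 for failure
        expr: probe_success{job="pop3-login"} == 0
        for: 5m                         # assumed; rides out a single failed check
        annotations:
          summary: 'POP3 login check is failing (we have broken this before)'
```

Like a regression test, the rule costs very little to keep around and should stay silent unless we repeat the original mistake.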
This is somewhat of a divergence from the usual modern theory of alerts, which is that you should have only a few alerts and they should mostly be about things that cause people pain. However, I think it's in the broad scope of that philosophy, because as I understand it the purpose of the philosophy is to avoid alerts that aren't meaningful and useful and will just annoy people. If we broke something, telling us about it definitely isn't just annoying; it's something we need to fix.
(In an environment with sophisticated alert handling, you might want to not route these sort of alerts to people's phones and the like. We just send everything to email, and generally if we're reading email it's during working hours.)
A file permissions and general deployment annoyance with Certbot
The more we use Certbot, the more I become convinced that it isn't written by people who actually operate it in anything like the kind of environment that we do (and perhaps not at all, although I hope that the EFF uses it for their own web serving). I say this because while Certbot works, there are all sorts of little awkward bits around the edges in practical operation. Today's particular issue is a two part issue concerning file permissions on TLS certificates and keys (and this can turn into a general deployment issue).
Certbot stores all of your TLS certificate information under
/etc/letsencrypt/live, which is normally owned by root and is
root-only (Unix mode 0700). Well, actually, that's false, because
normally the contents of that directory hierarchy are only symlinks
to /etc/letsencrypt/archive, which is also owned by root and
root-only. This works fine for daemons that read TLS certificate
material as root, but not all daemons do; in particular, Exim reads
them as the Exim user and group.
The first issue is that Certbot adds an extra level of permissions
to TLS private keys. As covered by Certbot's documentation, from
Certbot version 0.29.0, private keys for certificates are specifically
root-only. This means that you can't give Exim access to the TLS
keys it needs just by chgrp'ing /etc/letsencrypt/live and
/etc/letsencrypt/archive to the Exim group and then making them
mode 0750; you must also specifically chgrp and chmod the private
key files. This can be automated with a deploy hook script, which
will be run when certificates are renewed.
(Documentation for deploy hooks is hidden away in the discussion of renewing certificates.)
The second issue is that deploy hooks do exactly and only what they're documented to do, which means that deploy hooks do not run the first time you get a certificate. After all, the first time is not a renewal, and Certbot said specifically that deploy hooks run on renewal, not 'any time a certificate is issued'. This means that all of your deployment automation, including changing TLS private key permissions so that your daemons can access the keys, won't happen when you get your initial certificate. You get to do it all by hand.
(You can't easily do it by running your deployment script by hand, because your deployment script is probably counting on various environment variables that Certbot sets.)
We currently get out of this by doing the chgrp and chmod by hand when we get our initial TLS certificates; this adds an extra manual step to initial host setup and conversions to Certbot, which is annoying. If we had more intricate deployment, I think we would have to force an immediate renewal after the TLS certificate had been issued, and to avoid potentially running into rate limits we might want to make our first TLS certificate be a test certificate. Conveniently, there are already other reasons to do this.
Finding metrics that are missing labels in Prometheus (for alert metrics)
One of the things you can abuse metrics for in Prometheus is to
configure different alert levels, alert destinations, and so on for
different labels within the same metric, as I wrote about back in
my entry on using group_* vector matching for database lookups. The example in that entry used two metrics,
our_zfs_avail_gb and our_zfs_minfree_gb,
the former showing the current available space and the latter
describing the alert levels and so on we want. Once we're using
metrics this way, one of the interesting questions we could ask is
what filesystems don't have a space alert set. As it turns out, we
can answer this relatively easily.
The first step is to be precise about what we want. Here, we want
to know what 'fs' labels are missing from our_zfs_minfree_gb. A
fs label is missing if it's not present in our_zfs_minfree_gb
but is present in our_zfs_avail_gb. Since we're talking about
sets of labels, answering this requires some sort of set operation.
If our_zfs_minfree_gb only has unique values for the fs label
(ie, we only ever set one alert per filesystem), then this is
relatively straightforward:
our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb
The our_zfs_avail_gb metric generates our initial set of known
fs labels. Then we use UNLESS to subtract the set of all fs
labels that are present in our_zfs_minfree_gb. We have to use
'ON(fs)' because the only label we want to match on between the
two metrics is the fs label itself.
However, this only works if
our_zfs_minfree_gb has no duplicate
fs labels. If it does (eg if different people can set their own
alerts for the same filesystem), we'd get a 'duplicate series' error
from this expression. The usual fix is to use a one to many match,
but those can't be combined with set operators like
'unless'. Instead we must get creative. Since all we care
about is the labels and not the values, we can use an aggregation
to give us a single series for each label on the right side of the
expression:
our_zfs_avail_gb UNLESS ON(fs) count(our_zfs_minfree_gb) by (fs)
As a side effect of what they do, all aggregation operators condense
multiple instances of a label value this way. It's very convenient
if you just want one instance of it; if you care about the resulting
value being one that exists in your underlying metrics you can use
an aggregation such as min() or max() instead of count().
You can obviously invert this operation to determine 'phantom' alerts,
alerts that have
fs labels that don't exist in your underlying metric.
That expression is:
count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs) our_zfs_avail_gb
(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels;
if it doesn't, you get a simpler expression.)
Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.
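If you wanted to be told about either situation instead of checking by hand, the two expressions could be wrapped up as alert rules, roughly like this. The group and alert names are invented for the sketch; the expressions are the ones from this entry.

```yaml
groups:
  - name: zfs-alert-coverage                # hypothetical group name
    rules:
      # Filesystems that report available space but have no alert level set.
      - alert: FilesystemWithoutSpaceAlert  # hypothetical alert name
        expr: our_zfs_avail_gb unless on(fs) count(our_zfs_minfree_gb) by (fs)
      # Alert levels set for filesystems that don't (or no longer) exist.
      - alert: PhantomSpaceAlert            # hypothetical alert name
        expr: count(our_zfs_minfree_gb) by (fs) unless on(fs) our_zfs_avail_gb
```

PromQL keywords are case-insensitive, so the lower case 'unless on(fs)' here is the same as the upper case version used earlier.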
This general approach can be applied to any two metrics where some
label ought to be paired up across both. For instance, you could
cross-check that every node_uname_info metric is matched by one
or more custom per-host informational metrics that your own software
is supposed to generate and expose through the node exporter's
textfile collector.
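A sketch of that sort of cross-check follows, pairing hosts up on the 'instance' label. The 'our_host_info' metric is a made-up stand-in for whatever your own software exposes, and the group and alert names are equally hypothetical.

```yaml
groups:
  - name: host-info-coverage            # hypothetical group name
    rules:
      # Hosts where the node exporter answers but our own per-host metric
      # (a hypothetical 'our_host_info') isn't being exposed.
      - alert: HostMissingCustomInfo    # hypothetical alert name
        expr: node_uname_info unless on(instance) count(our_host_info) by (instance)
```

The count() is there for the same reason as before: it condenses any duplicates on the right side down to one series per instance.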
(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)
Bidirectional NAT and split horizon DNS in our networking setup
Like many other places, we have far too many machines to give them all public IPs (or at least public IPv4 IPs), especially since they're spread across multiple groups and each group should get its own isolated subnet. Our solution is the traditional one; we use RFC 1918 IPv4 address space behind firewalls, give groups subnets within it (these days generally /16s), and put each group in what we call a sandbox. Outgoing traffic from each sandbox subnet is NAT'd so that it comes out from a gateway IP for that sandbox, or sometimes a small range of them.
However, sometimes people quite reasonably want to have some of their sandbox machines reachable from the outside world for various reasons, and also sometimes they need their machines to have unique and stable public IPs for outgoing traffic. To handle both of these cases, we use OpenBSD's support for bidirectional NAT. We have a 'BINAT subnet' in our public IP address space and each BINAT'd machine gets assigned an IP on it; as external traffic goes through our perimeter firewall, it does the necessary translation between internal addresses and external ones. Although all public BINAT IPs are on a single subnet, the internal IPs are scattered all over all of our sandbox subnets. All of this is pretty standard.
(The public BINAT subnet is mostly virtual, although not entirely so; for various peculiar reasons there are a few real machines on it.)
However, this leaves us with a DNS problem for internal machines (machines behind our perimeter firewall) and internal traffic to these BINAT'd machines. People and machines on our networks want to be able to talk to these machines using their public DNS names, but the way our networks are set up, they must use the internal IP addresses to do so; the public BINAT IP addresses don't work. Fortunately we already have a split-horizon DNS setup, because we long ago made the decision to have a private top level domain for all of our sandbox networks, so we use our existing DNS infrastructure to give BINAT'd machines different IP addresses in the internal and external views. The external view gives you the public IP, which works (only) if you come in through our perimeter firewall; the internal view gives you the internal RFC 1918 IP address, which works only inside our networks.
(In a world where new gTLDs are created like popcorn, having our own top level domain isn't necessarily a great idea, but we set this up many years before the profusion of gTLDs started. And I can hope that it will stop before someone decides to grab the one we use. Even if they do grab it, the available evidence suggests that we may not care if we can't resolve public names in it.)
Using split-horizon DNS this way does leave people (including us) with some additional problems. The first one is cached DNS answers, or in general not talking to the right DNS servers. If your machine moves between internal and external networks, it needs to somehow flush and re-resolve these names. Also, if you're on one of our internal networks and you do DNS queries to someone else's DNS server, you'll wind up with the public IPs and things won't work. This is a periodic source of problems for users, especially since one of the ways to move on or off our internal networks is to connect to our VPN or disconnect from it.
The other problem is that we need to have internal DNS for any public name that your BINAT'd machine has. This is no problem if you give your BINAT machine a name inside our subdomain, since we already run DNS for that, but if you go off to register your own domain for it (for instance, for a web site), things can get sticky, especially if you want your public DNS to be handled by someone else. We don't have any particularly great solutions for this, although there are decent ones that work in some situations.
(Also, you have to tell us what names your BINAT'd machine has. People don't always do this, probably partly because the need for it isn't necessarily obvious to them. We understand the implications of our BINAT system, but we can't expect that our users do.)
(There's both an obvious reason and a subtle reason why we can't apply BINAT translation to all internal traffic, but that's for another entry because the subtle reason is somewhat complicated.)