2017-07-12
Understanding the .io TLD's DNS configuration vulnerability
First there was Matthew Bryant's The .io Error - Taking Control of All .io Domains With a Targeted Registration, about a configuration error that allegedly allowed you to take over control of some .io nameservers, and then there was a response to it, Matt Pounsett's The .io Error: A Problem With Bad Optics, But Little Substance, which argued that this was much ado about nothing much. While I agree that the consequences are less severe than Bryant thought, I think that Pounsett's article understates the risks itself (and I believe doesn't correctly explain what's going on in the DNS here). In any case, the whole thing confused me and other people, so I'm going to write my understanding of things up here.
Let's start with the basics of compromising a domain through dangling
nameserver delegation. Suppose you find a domain barney.io that lists
ns1.fred.ly as one of its two nameservers, and fred.ly is not registered
(worse nameserver mistakes happen). To attack barney.io you register
fred.ly and create a ns1.fred.ly A record that points to a nameserver
that you're running. Some portion of the people looking up information
in barney.io will wind up querying your nameserver, and at that point
you can give them whatever answers you want. If they're asking their
original question, you can directly lie to them (telling people that all
MX entries in barney.io point to harvestmail.fred.ly, for example). If
they're making NS queries to check for zone delegation, you can just
give them NS records that point to you and start lying some more when
they follow those NS records.
(You can then increase how many people will talk to ns1.fred.ly by
DOSing the other barney.io DNS server off the Internet.)
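As a rough illustration of the dangling delegation check described above, here is a minimal Python sketch. Everything in it is made up for the example: the set of registered domains, the second nameserver name (ns2.example.net), and the naive 'last two labels' rule for finding the registrable domain (real code would want something like the Public Suffix List and actual registry lookups).

```python
# Toy model of spotting a dangling nameserver delegation.
# All data here is hypothetical and hard-coded for illustration.
REGISTERED_DOMAINS = {"barney.io", "example.net"}

# NS host names listed for each domain (ns2.example.net is invented).
DELEGATIONS = {
    "barney.io": ["ns1.fred.ly", "ns2.example.net"],
}


def registrable_domain(host):
    """Naively treat the last two labels as the registrable domain.

    This is only good enough for names like ns1.fred.ly; it is wrong
    for suffixes such as co.uk.
    """
    return ".".join(host.rstrip(".").split(".")[-2:])


def dangling_nameservers(domain):
    """Return the NS host names whose parent domain nobody has registered."""
    return [
        ns
        for ns in DELEGATIONS.get(domain, [])
        if registrable_domain(ns) not in REGISTERED_DOMAINS
    ]


print(dangling_nameservers("barney.io"))
```

Any name this returns is one that an attacker could take over simply by registering its parent domain.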
This is more or less what the setup was for .io. Among .io's nameservers
were ns-a1.io through ns-a4.io, and all of those names could be
registered as domains in .io and then given A records in your DNS data
for your new domain(s) (and Matthew Bryant did just this with ns-a1.io).
However, there was an important difference that made this less severe
than my example, and that's that .io had active glue records in the root
zone for those names that pointed people to the IP addresses of the real
nameservers. With these glue records present, a client didn't talk to
Matthew Bryant's DNS server just because it decided to use ns-a1.io as
part of resolving a .io name; if it believed and used the glue records,
it would wind up talking to the real nameserver. You only had your query
diverted to Bryant's DNS server if you decided to send a query to
ns-a1.io but not use the IP from the glue record and instead look it up
directly.
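The difference between the two client behaviours can be sketched as a toy model in Python. The IP addresses are documentation placeholders that I've invented, not the real .io server addresses or Bryant's, and the function name is hypothetical.

```python
# Toy model: which IP does a resolver end up querying for ns-a1.io?
# Both addresses are invented placeholders (RFC 5737 documentation ranges).
ROOT_GLUE = {"ns-a1.io": "192.0.2.1"}         # real server, per the root zone glue
DIRECT_LOOKUP = {"ns-a1.io": "203.0.113.66"}  # what the newly registered domain publishes


def address_for(nameserver, trust_glue):
    """Pick the IP a resolver would send its .io queries to."""
    if trust_glue and nameserver in ROOT_GLUE:
        # The resolver believes the glue and reaches the real nameserver.
        return ROOT_GLUE[nameserver]
    # The resolver ignores or re-checks the glue by looking the name up
    # itself, so it gets whatever the registrant of ns-a1.io publishes.
    return DIRECT_LOOKUP[nameserver]


print(address_for("ns-a1.io", trust_glue=True))
print(address_for("ns-a1.io", trust_glue=False))
```

Only the second kind of client has its queries diverted, which is why the glue records limited the damage.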
Using data from glue records instead of looking things up yourself is
common but not mandatory, and there are various reasons why a resolver
would not do so. Some recursive DNS servers will deliberately try to
check glue record information as a security measure; for example,
Unbound has the harden-referral-path option (via Tony Finch). Since the
original article reported seeing real .io DNS queries being directed to
Bryant's DNS server, we know that a decent number of clients were not
using the root zone glue records. Probably a lot more clients were still
using the glue records, though.
(There are a bunch of uncertainties about just what DNS data was being
returned by whom during the incident. The original article shows a reply
from a root server and that probably didn't change, but we don't know
what the official .io servers themselves started returning as glue
records for .io during the time that ns-a1.io was active as a domain
registration. I will decline to speculate on what was the likely result
here.)
Given my history with glue record hell, it amuses me that this is a case
where dangling glue records helped instead of hurt, making a problem
less severe than it would otherwise have been. Had there been no glue
records or incomplete glue records for the .io zone, there would have
been more danger (or at least the danger would have been clearer).
(In this case the presence of the glue records was mandatory, since
these were NS names inside the zone itself. Without glue records in the
root zone, you would have a chicken and egg problem in getting the IP
address of, say, a0.nic.io.)
PS: As far as I can see from Bryant's article, he didn't realize
that the root zone glue records would cause many clients to not
query his DNS servers, significantly reducing the severity of someone
having control over the names of four of the seven .io DNS servers.
As far as Pounsett's article goes, he appears to more or less spot the
issue with root glue but doesn't explain it and appears to expect all
clients to use the glue all of the time (which is demonstrably not the
case). I think he may also be confusing the data in the .io zone with
the root zone glue for .io. Note that it's not necessary to get your IP
address for ns-a1.io included in the .io zone; to make some clients
start talking to you, it's sufficient for NS records for ns-a1.io to
show up and ideally to occlude the A and AAAA records.
(We know that Bryant's NS records showed up in the .io zone. We don't
know if they occluded the A record for ns-a1.io that was there, but it
seems likely that they did.)
Sidebar: What I suspect went wrong in .io's procedures
It seems quite likely that ns-a1.io through ns-a4.io were intended to be
purely host names of DNS servers, not domain names, much like my example
of ns1.fred.ly. However, they were placed directly in the apex of a zone
(.io) that allows people to register domains, and I suspect that the
people running the IO zone forgot to tell the people running the IO
registry that these names existed in the zone as host names and should
be locked out from domain registration. That's been fixed now,
obviously, and WHOIS tells me they're 'Reserved by Registry'.
(This is thus a different failure mode than having NS records for
your domain or TLD that point to hosts in entirely unregistered
domains. That's a pure failure, since the names don't exist at
all except perhaps through lingering glue records.
Here the names existed entirely properly, it's just that the IO
registry was allowed to override them with new data.)
The problem doesn't come up for the other .io nameservers, which are all
under nic.io, since nic.io is already a registered domain in .io.
Recursive DNS servers send the whole original query to authoritative servers
As a long term sysadmin, I usually feel that I have a solid technical grasp of DNS (apart from DNSSEC, which I ignore on principle). Then every so often, I get to find out that I'm wrong. Today is one of those days.
Before today, if you had asked me how a recursive DNS server did a lookup from authoritative servers, I would have told you what is basically the standard story. If you're looking up the A record for fred.blogs.example.com, your local recursive server first asks a random root server for the NS records for .com, then asks one of the .com DNS servers for the NS records for example.com, then asks one of those theoretically authoritative DNS servers, and so on. Although this describes the chain of NS delegations that your recursive DNS server typically gets back, it turns out that this doesn't accurately describe what your server usually sends as its queries. The normal practice today is that your recursive DNS server sends the full original query to each authoritative server. It doesn't ask the root servers 'what is the NS for .com'; instead it asks the root servers 'what is the A record for fred.blogs.example.com', and they send back an answer that is basically 'I have no idea, ask one of these .com nameservers'.
Once I thought about it, this behavior made a lot of sense because DNS
clients don't know in advance where zone delegation boundaries are. It's
common for there to be a zone boundary at each '.' in a name, but it's
not always the case; you can certainly have zones where the zone
boundaries are further apart. You can even have zones where it varies by
the name. Consider a hypothetical .ca zone operator who allows
registration of both <domain>.<province>.ca (eg fred.on.ca) and
<domain>.ca (eg bob.ca), and does not have separate <province>.ca zones;
they just carry all of the data in one zone without internal NS records.
Here bob.ca has a NS but on.ca doesn't, and your client certainly can't
know which is which in advance.
When the client has no idea where the zone boundaries are, the simple
thing to do is to send the whole original query off to each step of the
delegation chain and see what they say. This way you don't have to try
any sort of backtracking when you ask for a NS for .on.ca and get a no
data answer back.
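To make this concrete, here is a small Python simulation of the hypothetical .ca setup, where the resolver sends the full query name to each server and simply follows whatever referral comes back. The server names, the www host names, and the addresses are all invented for the example, and real authoritative servers obviously do much more than this.

```python
# Toy authoritative servers. Each has zone cuts (delegations to another
# server) and records it answers itself. All names and IPs are invented.
ZONES = {
    ".": {"cuts": {"ca": "ca-server"}, "records": {}},
    "ca-server": {
        "cuts": {"bob.ca": "bob-server"},             # bob.ca is delegated
        "records": {"www.fred.on.ca": "192.0.2.10"},  # on.ca data lives in .ca itself
    },
    "bob-server": {"cuts": {}, "records": {"www.bob.ca": "192.0.2.20"}},
}


def resolve(qname):
    """Send the full qname to each server in turn, following referrals.

    Returns (answer or None, list of servers that were asked).
    """
    server, path = ".", []
    while True:
        path.append(server)
        zone = ZONES[server]
        # A server refers us onward if qname is at or below one of its cuts.
        referral = next(
            (nxt for cut, nxt in zone["cuts"].items()
             if qname == cut or qname.endswith("." + cut)),
            None,
        )
        if referral is None:
            # No delegation applies, so this server answers (or has no data).
            return zone["records"].get(qname), path
        server = referral


print(resolve("www.fred.on.ca"))
print(resolve("www.bob.ca"))
```

The resolver never has to guess whether on.ca or bob.ca is a zone cut; the servers' answers tell it, and the first lookup takes one fewer step than the second.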
Now, you might ask if sending the full query to all DNS servers in the
chain like this has privacy implications. Why yes, it does, and there
are proposed ways to work around it, such as RFC 7816 query
minimization. Some DNS servers are already taking steps here; for
example, current versions of Unbound have the qname-minimisation option
in unbound.conf.
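As a rough sketch of the privacy difference, the following Python compares which names each server in the chain gets to see with and without minimization. It assumes you already know the zone apexes on the delegation path (a real resolver discovers them as it goes) and it ignores query types entirely; the function name and the example zones are my own invention, not anything from the RFC or from Unbound.

```python
def queries_seen(qname, zone_apexes, minimise):
    """Return (server zone, name asked) pairs in the order they are sent.

    zone_apexes lists the zones on the delegation path from the root
    down, using "" for the root. This is an assumed input for the toy;
    a real resolver does not know it in advance.
    """
    labels = qname.split(".")
    seen = []
    for i, apex in enumerate(zone_apexes):
        if not minimise:
            # Classic behaviour: every server sees the full name.
            seen.append((apex or ".", qname))
            continue
        depth = len(apex.split(".")) if apex else 0
        # Stop at the next zone cut; the last zone must see the full name.
        if i + 1 < len(zone_apexes):
            stop = len(zone_apexes[i + 1].split("."))
        else:
            stop = len(labels)
        # Reveal one more label at a time until the cut (or full name).
        for n in range(min(depth + 1, stop), stop + 1):
            seen.append((apex or ".", ".".join(labels[-n:])))
    return seen


path = ["", "com", "example.com"]
print(queries_seen("fred.blogs.example.com", path, minimise=False))
print(queries_seen("fred.blogs.example.com", path, minimise=True))
```

With minimization on, the root and the .com servers only learn that you are interested in something under com and example.com respectively, at the cost of an extra query to the example.com servers to find out that blogs.example.com is not a zone cut.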
(My discovery is due to reading this article. I believe that the article overstates things a bit itself, but that's another entry (or see this Twitter thread).)