The DNS TTL problem

March 26, 2014

It all started with a tweet by @Twirrim:

DNS TTL records exist for a reason. For the love of all that is holy, honour them. Don't presume to think you know better.

On the one hand, as a sysadmin I'm in full agreement with this view. I certainly want all of the DNS caches and recursive DNS servers out there to respect the TTLs we set on our DNS entries, and it irritates me when they don't. On the other hand, I also have to sympathize with the operators of those DNS caches, because I rather suspect that a huge number of TTLs out there are mis-set in practice.

The problem with DNS TTLs is that they are almost always an example of information that doesn't have to be correct, and we all know what eventually happens to such information: it quietly drifts into being wrong. Most people's DNS entries change very rarely and are not looked up in any huge volume, so it doesn't really matter what TTLs they have. If they have the minimum TTL you won't notice the extra lookup volume, and if they have an absurdly long TTL you won't notice the lingering old entries because you aren't changing your DNS entries anyway.

(And I'm not throwing stones here. We have a number of DNS entries with short TTLs that haven't changed for years in our zones, more or less just because. It would take work to go back through our zones, find them all, verify that we really don't need short TTLs any more, and take them out. It's simpler to let them sit there and it doesn't do us any harm.)

But I bet that operators of large scale DNS caches notice those things. I rather suspect that they get customer complaints when someone updates their DNS entries that had really long TTLs and the customers then can't get to the new servers because the old entries are still cached. And I suspect that they notice the extra load from short TTLs forcing useful DNS entries to be discarded even when those entries haven't actually changed in the past year. I also suspect that there are more people doing DNS TTLs somewhat wrong than there are people doing them completely right. So I can see the engineering logic in overriding DNS TTLs in your large scale cache, however inconvenient it is for me as a sysadmin.

I don't have any answers to this and in a sense there are no answers. By that I mean that the large scale DNS caches that are currently monkeying around with people's DNS TTLs are not going to change their behavior any time soon, so the most I can do is live with it.

(Then there is the thornier issue of DNS lookups being remembered by long running programs that may have no idea of TTLs at all; instead they did a getaddrinfo() once and have held on to the result ever since. I suspect that web browsers no longer fall into this category, although they once did.)
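
(As a concrete illustration, here's a minimal Python sketch of that pattern; the host name and port are made-up placeholders. The underlying issue is that getaddrinfo() hands back addresses with no TTL attached, so a program that caches the result has no principled way to know when to throw it away.)

    import socket

    # Resolved once at program start; getaddrinfo() returns addresses
    # with no TTL information, so the program has no idea how long the
    # answer was meant to be valid.
    _ADDR = socket.getaddrinfo("db.example.com", 5432,
                               0, socket.SOCK_STREAM)[0][4]

    def connect():
        # Every new connection for the life of the process reuses the
        # (host, port) part of the original answer, even if the DNS
        # entry changed days ago.
        return socket.create_connection(_ADDR[:2])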


Comments on this page:

By Ewen McNeill at 2014-03-26 05:26:36:

A truce (between those creating the records and those caching the records) would presumably be a best practice document for minimum and maximum TTLs that are "considered sensible". For instance, TTLs shorter than, eg, 10 seconds seem to me to border on unreasonably short (and possibly the minimum should be even longer than that). TTLs longer than, eg, one week seem unwisely long (even for something that will "never change").

Such a best practice document could perhaps indicate it is reasonable to clamp the TTL to values in those ranges on the caching side -- and equally that it is unreasonable to clamp the TTL to a narrower range of values on the caching side. At least one would have a better chance of predicting what might happen in any given situation.
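
To make that concrete, here is a rough Python sketch of such clamping, using the example bounds above (the numbers and the function are illustrative, not anything a real resolver actually ships with):

    MIN_TTL = 10             # seconds; anything shorter is clamped up
    MAX_TTL = 7 * 24 * 3600  # one week; anything longer is clamped down

    def effective_ttl(record_ttl):
        # The TTL a cache would actually use when storing a record.
        return max(MIN_TTL, min(record_ttl, MAX_TTL))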

As you say, getaddrinfo(), etc, is a harder problem, because those interfaces treat all "directory services" as identical, many of which don't have TTLs at all (eg, /etc/hosts), and so they silently assume the data "never changes" (and thus don't expose any TTL information). Perhaps the best one can do is recommend that long running processes periodically re-lookup their information (with some indication of "periodically": every N seconds, every N connections, or whatever).
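
One possible shape for that periodic re-lookup, as a rough Python sketch (the default interval is an arbitrary stand-in for "every N seconds", since getaddrinfo() exposes no TTL to honour):

    import socket
    import time

    class RefreshingAddress:
        def __init__(self, host, port, refresh=300):
            self.host, self.port, self.refresh = host, port, refresh
            self._addr = None
            self._when = 0.0

        def get(self):
            # Re-resolve once the cached answer is older than the
            # refresh interval, instead of holding it forever.
            if self._addr is None or time.time() - self._when > self.refresh:
                self._addr = socket.getaddrinfo(self.host, self.port,
                                                0, socket.SOCK_STREAM)[0][4]
                self._when = time.time()
            return self._addr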

If there is such a best practice document I'm not currently aware of it. But I am aware that there is, eg, a move within some Internet organisations to try to get more Best Practice documentation written.

Ewen

By Twirrim at 2014-03-26 17:44:01:

Original tweeter chiming in..

Part of the problem with caching DNS beyond the record's TTL is it can make it very difficult for companies to reliably provide you a service.

As a consequence of some work we've been doing recently, we discovered that even with a TTL of 60 seconds, people were still hitting a removed entry up to some 12 hours after the change. These weren't cases of existing long running connections or similar, but new connections being created by end users.

As well as allowing service providers to react automatically to various circumstances and reduce the impact on customers, honoured TTLs also allow work to be carried out in a non-disruptive fashion. The net result of TTLs being ignored to such an extent is that we can only say "well, we took all reasonable steps" (e.g. when prepping for work to be done, wait an hour after the DNS change before executing the task). It's far from a satisfactory position, but there isn't much more that can sanely be done.
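
For what it's worth, the "wait, then check" step can at least be sanity-checked against your own resolver path. Here's a rough Python sketch (the retired address and the timings are placeholders) that polls the name until the old address stops coming back, or gives up after the grace period:

    import socket
    import time

    OLD_ADDR = "192.0.2.10"  # placeholder for the address being retired
    GRACE = 3600             # the "wait an hour" grace period, in seconds

    def old_entry_gone(name, deadline=GRACE, interval=60):
        # This only proves the caches on our own resolver path have
        # moved on; it says nothing about every cache on the Internet.
        stop = time.time() + deadline
        while time.time() < stop:
            addrs = set(ai[4][0] for ai in socket.getaddrinfo(name, None))
            if OLD_ADDR not in addrs:
                return True
            time.sleep(interval)
        return False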
