Some general things and views on DNS over HTTPS
Over on Twitter I said something and then Brad Beyenhof asked me a sensible question related to DNS over HTTPS. Before I elaborate my Twitter answers to that specific question, I want to do an overview of some general views on DNS over HTTPS on the whole.
DNS over HTTPS (hereafter DoH) is what it sounds like; it's a protocol for making DNS queries over a HTTPS connection. There is also the older DNS over TLS, but my impression is that DoH has become more popular, perhaps partly because it's more likely to make it through middleware firewalls and so on. DoH and DoT are increasingly popular ideas for the straightforward reason that on the modern Internet, ISPs are one of your threats. A significant number of ISPs snoop on your DNS traffic to harvest privacy-invasive things about your Internet activities, and some of them actively tamper with DNS results. DoH (and DoT) mostly puts a stop to both DNS snooping and DNS interference.
(It's likely that DoH doesn't completely stop these because of the possibility of traffic analysis of the encrypted traffic, which has already been shown to reveal information about HTTPS browsing, but we probably won't know how serious an issue this is until DoH is enough used in the field to attract research attention.)
As far as I can tell, DoH (and DoT) are currently used and intended to be used between clients and resolving DNS servers. If there is work to protect queries between a resolving DNS server and authoritative servers, I think it's using a different protocol (Wikipedia mentions DNSCurve, but it's not widely adopted). This doesn't matter for typical users, who use someone else's DNS server, but it matters for people who run their own local resolving server that wants to query directly to authoritative servers instead of delegating to another resolving server.
DoH protects you from your ISP monitoring or tampering with your DNS queries, but it doesn't protect you from your DoH server of choice doing either. To protect against tampering, you'd need some form of signed DNS (but then see Against DNSSEC). It also doesn't currently protect you against your ISP knowing what HTTPS websites you go on to visit, because of SNI; until we move to ESNI, we're always sending the website name (but not the URL) in the clear as part of the early TLS negotiation, so your ISP can just capture it there. In fact, if your ISP wants to know what HTTPS sites you visit, harvesting the information from SNI is easier than getting your DNS queries and then correlating things.
(There is very little protection possible against your DoH server monitoring your activity; you either need to have a trustworthy DNS server provider or run your server yourself in a situation where its queries to authoritative servers will probably not be snooped on.)
More or less hard-coded use of DNS over HTTPS servers in client programs like browsers poses the same problems for sysadmins as basically any hard-coded use of an upstream resolver does, which is that it will bypass any attempts to use split-horizon DNS or other ways of providing internal DNS names and name bindings. Since these are obvious issues, I can at least optimistically hope that organizations like Mozilla are thinking about how to make this work without too much pain (see the current Mozilla wiki entry for some discussion of this; it appears that Mozilla's current setup will work for unresolvable names but not names that get a different result between internal and external DNS).
PS: There are a number of recursive DNS servers that can be configured to use DoH to an upstream recursive server; see eg this brief guide for Unbound. Unbound can also be configured to be a DoH server itself for clients, but this doesn't really help the split horizon case for reasons beyond the scope of this entry.
A Linux machine with a strict overcommit limit can still trigger the OOM killer
We've been running our general use compute servers with strict overcommit handling for total virtual memory for years, because on compute servers we feel we have to assume that if you ask for a lot of memory, you're going to use it for your compute job. As we discovered last fall, hitting the strict overcommit limit doesn't trigger the OOM killer, which can be inconvenient since instead all sorts of random processes start failing since they can't get any more memory. However, we've recently also discovered that our machines with strict overcommit turned on can still sometimes trigger the OOM killer.
At first this made no sense to me and I thought that something was
wrong, but then I realized what is probably going on. You see,
strict overcommit really has two parts, although we don't often
think about the second one; there's the setting itself, ie having
vm.overcommit_memory be 2, and then how much your commit limit
is, set by
vm.overcommit_ratio as your swap space plus some
percentage of RAM. Because we couldn't find an overcommit percentage
that worked for us across our disparate fleet of compute servers
with very varying amounts of RAM, we set this to '100' some years
ago, theoretically allowing our machines with strict overcommit to
use all of RAM plus swap space. Of course, this is not actually
possible in practice, because the kernel needs some amount of memory
to operate itself; how much memory is unpredictable and possibly
This gap between what we set and what's actually possible creates three states the system can wind up in. If you ask for as much memory as you can allocate (or in general enough memory), you run the system into the strict overcommit limit; either your request fails immediately or other processes start failing later when their memory allocation requests fail. If you don't ask for too much memory, everything is happy; what you asked for plus what the kernel needs fits into RAM and swap space. But if you ask for just the right large amount of memory, you push the system into a narrow middle ground; you're under the strict overcommit limit so your allocations succeed, but over what the kernel can actually provide, so when processes start trying to use enough memory, the kernel will trigger the OOM killer.
There is probably no good way to avoid this for us, so I suspect we'll just live with the little surprise of the OOM killer triggering every so often and likely terminating a RAM-heavy compute process. I don't think it happens very often, and these days we have a raft of single-user compute servers that avoid the problem.
Sidebar: The problems with attempting to turn down the memory limit
First, we don't have any idea how much memory we'd need to reserve for the kernel to avoid OOM. Being cautious here means that some of the RAM will go idle unless we add a bunch of swap space (and risk death through swap trashing).
Further, not only would the
vm.overcommit_ratio setting be
machine specific and have to be derived on the fly from the amount
of memory, but it's probably too coarse-grained. 1% of RAM on a 256
GB machine is 2.5 GB, although I suppose perhaps the kernel might
need that much reserved to avoid OOM.
We could switch to using the more recent
vm.overcommit_kbytes (cf), but since
its value is how much RAM to allow instead of how much RAM to reserve
for the kernel, we would definitely have to make it machine specific
and derived from how much RAM is visible when the machine boots.
On the whole, living with the possibility of OOM is easier and less troublesome.