Uncertainties over adding IP ratelimits to your local DNS resolvers
In a comment on my entry on splitting our local DNS resolvers apart to serve different audiences, David Magda asked if using per-IP ratelimiting was a potential solution. My feeling is that it would be difficult for us to do this today with any confidence, and it's not clear to me that reasonable per-IP ratelimits would stop all the problems.
We use Unbound as our DNS resolver, partly because that's what OpenBSD seems to like for this (our local DNS resolvers are OpenBSD machines). Unbound's per-IP ratelimiting is currently considered experimental, but we've had good luck with 'experimental' Unbound features before (we were using general ratelimiting when that was marked as experimental). However, this still leaves us with two or perhaps three problems.
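For concreteness, here is a minimal sketch of what turning this on looks like in unbound.conf. The numbers are arbitrary placeholders for illustration, not recommendations (picking real values is exactly the problem discussed below):

```
server:
    # Experimental: per-client-IP ratelimit, in queries per second.
    # The value here is a placeholder, not a recommendation.
    ip-ratelimit: 20
    # Memory for the per-IP tracking table.
    ip-ratelimit-size: 4m
    # When over the limit, let 1 in this many queries through
    # instead of dropping everything (0 drops them all).
    ip-ratelimit-factor: 10
```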
The first is trying to determine what per-IP ratelimit we should set. You can certainly pick 'reasonable' numbers, but that's just guessing; what you really need is something like a histogram of how many IPs hit what peak QPS rates how often. That would let you pick a limit with some confidence that even unusual systems wouldn't hit it in legitimate operation. We've started to gather some information based on OpenBSD pf state counts on our firewalls, and it turns out that the numbers are a bit surprising.
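If you do have per-client query logs, building that histogram is mechanically simple. The following is a sketch under the assumption that you can reduce your logs to (unix second, client IP) pairs; the function name and data format are mine, not anything Unbound provides:

```python
from collections import Counter, defaultdict

def peak_qps_histogram(queries):
    """Given (unix_second, client_ip) pairs, return a Counter mapping
    each IP's peak queries-per-second to how many IPs peaked at that
    rate. This is the distribution you'd want before picking a limit."""
    per_ip_second = defaultdict(Counter)  # ip -> Counter(second -> count)
    for second, ip in queries:
        per_ip_second[ip][second] += 1
    # Each IP's peak is its busiest single second.
    peaks = {ip: max(counts.values()) for ip, counts in per_ip_second.items()}
    return Counter(peaks.values())

# Example: one client bursting to 3 qps, one steady at 1 qps.
queries = [
    (100, "10.0.0.1"), (100, "10.0.0.1"), (100, "10.0.0.1"),
    (101, "10.0.0.1"),
    (100, "10.0.0.2"), (101, "10.0.0.2"),
]
print(peak_qps_histogram(queries))  # Counter({3: 1, 1: 1})
```

A limit chosen from the far tail of this histogram is one you can defend; a limit chosen without it is the guess described above.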
(A similar issue applies for general ratelimiting. We don't actually know our general queries per second distribution, so our current setting is a guess, and might be either too high or too low.)
The second issue is that the problem may not be with single IPs that flood us with a high query volume (or may not just be that). In today's environment, it might be that we're seeing issues where certain sorts of devices all get into a bad state at the same time and start sending a bunch more queries than usual, but not so many that they would be unreasonable for any single IP by itself. This kind of bad behavior might be hard to trigger and hard to see (if, for example, it only happens when there's the right sort of network glitch). There's a lot of software monoculture these days and that provides plenty of opportunities for problems to be amplified.
(Getting insight into collective behavior needs fairly detailed statistics or monitoring, which is not feasible for us for our DNS.)
The third potential issue is that currently Unbound's IP ratelimiting is a global setting. There's no support for giving some IPs one ratelimit and different IPs another ratelimit (or no ratelimit). With no ability to set different ratelimits for different IPs, we'd have to set very conservative ratelimits to ensure that our critical machines would never be locked out from doing DNS queries even under some unpredictable situation of high (DNS) load.
(Unbound may change this in the future.)
My overall feeling is that per-IP ratelimiting for local DNS clients is currently quite hard to get right if you aren't willing to either do a lot of complex monitoring and crunching of statistics in advance, or set somewhat arbitrary limits and cut clients off if they hit those limits. The latter is certainly an option in some environments, but is not ideal in a setting where you're trying to be friendly and helpful (as we strive to be).
(One thing that could help this is a dry-run mode for ratelimits, where you could set your DNS resolver to simply log if a client would have hit rate limits but not actually limit them. Then you could experiment to see how often a particular ratelimit would act if it was real, and on how many clients.)
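Lacking such a mode in the resolver itself, you can approximate a dry run offline against query logs. This is a sketch with a deliberately crude model (a flat per-second cap, which is simpler than Unbound's actual algorithm); the function name and (unix second, client IP) input format are my own assumptions:

```python
from collections import Counter, defaultdict

def dry_run_ratelimit(queries, limit_qps):
    """Replay (unix_second, client_ip) pairs against a flat per-second
    cap and report how many queries each client would have had dropped.
    Purely offline analysis; nothing is actually blocked."""
    per_ip_second = defaultdict(Counter)  # ip -> Counter(second -> count)
    for second, ip in queries:
        per_ip_second[ip][second] += 1
    dropped = Counter()
    for ip, counts in per_ip_second.items():
        for count in counts.values():
            if count > limit_qps:
                dropped[ip] += count - limit_qps
    return dropped

# Example: would a 3 qps limit have bitten a client that sent
# 5 queries in a single second?
print(dry_run_ratelimit([(0, "192.0.2.5")] * 5, 3))
# Counter({'192.0.2.5': 2}) -- two queries would have been dropped
```

Running this over real logs with a range of candidate limits gives you roughly the answer a built-in dry-run mode would: how often a given limit would act, and on how many clients.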