A hazard of using synthetic data in tests, illustrated by me

August 29, 2014

My current programming enthusiasm is a 'sinkhole' SMTP server that exists to capture incoming email for spam analysis purposes. As part of this it supports matching DNS names against hostnames or hostname patterns that you supply, so you can write rules like:

@from reject host .boring.spammer with message "We already have enough"

Well, that was the theory. The practice was that until very recently this feature didn't actually work; hostname matches always failed. The reason I spent so much time not noticing this is that the code's automated tests passed. Like a good person I had written the code to do this matching and then written tests for it, including (and especially) tests in the context of these rules. All of these tests passed with flying colours, so everything looked great (right up until it clearly failed in practice while I was lucky enough to be watching).

One of the standard problems of testing DNS-based features (such as testing matching against the DNS names of an IP address) is that DNS is an external dependency and a troublesome one. If you make actual DNS queries to actual Internet DNS servers, you're dependent on both a working Internet connection and the specific details of the results returned by those DNS servers. As a result people often mock out DNS query results in tests, especially low level tests. I was no exception here; my test harness made up a set of DNS results for a set of IPs.

(Mocking DNS query results is especially useful if you want to test broken things, such as IP addresses with predictably wrong reverse DNS.)
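As a rough illustration of this kind of mocking in Go, the sketch below routes reverse lookups through a function value so that tests can substitute canned results. The names `lookupAddrFunc`, `fakeDNS`, and `errNoSuchHost` are hypothetical and are not taken from the sinkhole server's code; only `net.LookupAddr` is a real library call.

```go
package main

import (
	"errors"
	"net"
)

// lookupAddrFunc has the same shape as net.LookupAddr, so the real
// resolver and a test fake are interchangeable.
type lookupAddrFunc func(addr string) ([]string, error)

// liveLookup is what non-test code would use.
var liveLookup lookupAddrFunc = net.LookupAddr

// errNoSuchHost is a hypothetical stand-in for a failed lookup.
var errNoSuchHost = errors.New("fake dns: no such host")

// fakeDNS is a hypothetical test fixture mapping IP addresses to canned
// reverse DNS names. The canned names should be written exactly as the
// real resolver would return them, trailing dot included.
type fakeDNS map[string][]string

// LookupAddr returns the canned names for addr, or an error if there
// are none, which imitates an IP address with no reverse DNS.
func (f fakeDNS) LookupAddr(addr string) ([]string, error) {
	names, ok := f[addr]
	if !ok {
		return nil, errNoSuchHost
	}
	return names, nil
}
```

Code under test would take a `lookupAddrFunc` and be handed either `liveLookup` or a `fakeDNS` value's `LookupAddr` method.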

Unfortunately I got those DNS results wrong. The standard library for my language (Go) puts a '.' at the end of all reverse DNS results, eg the name it (currently) returns for one of Google's public DNS servers is 'google-public-dns-a.google.com.' (note the '.' at the end). Most standard libraries for most languages don't do that, and while I knew that Go's was an exception I had plain overlooked this while writing the synthetic DNS results in my tests. So my code was being tested against 'DNS names' without the trailing dot and matched them just fine, but it could never match actual DNS results in live usage because of the surprise final '.'.
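One way to make matching indifferent to the trailing dot is to canonicalize both sides before comparing. This is a minimal sketch, not the sinkhole server's actual matching code; `canonHost` and `matchHost` are hypothetical names, and the rule that a pattern starting with '.' matches the bare domain plus all of its subdomains is an assumption made for illustration.

```go
package main

import "strings"

// canonHost lowercases a DNS name and drops one trailing dot, so that
// "Mail.Example.COM." and "mail.example.com" compare equal.
func canonHost(name string) string {
	return strings.ToLower(strings.TrimSuffix(name, "."))
}

// matchHost reports whether name matches pattern.
// Assumption: a pattern with a leading "." matches the domain itself
// and any subdomain of it; any other pattern must match exactly.
func matchHost(name, pattern string) bool {
	n, p := canonHost(name), canonHost(pattern)
	if strings.HasPrefix(p, ".") {
		return n == p[1:] || strings.HasSuffix(n, p)
	}
	return n == p
}
```

With this, `matchHost("mail.boring.spammer.", ".boring.spammer")` and `matchHost("mail.boring.spammer", ".boring.spammer")` give the same answer, so a mistake in the synthetic data's trailing dot no longer changes the outcome.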

This shows one hazard of using synthetic data in your tests: if you use synthetic data, you need to carefully check that it's accurate. I skipped doing that and I paid the price for it here.

(The gold standard for synthetic data is to make it real data that you capture once and then use forever after. This is relatively easy in languages with a REPL but is kind of a pain in a compiled language where you're going to have to write and debug some one-use scratch code.)

Sidebar: how the Go library tests deal with this

I got curious and looked at the tests for Go's standard library. It appears that they deal with this by making DNS and other tests that require external resources be optional (and by hardcoding some names and eg Google's public DNS servers). I think that this is a reasonably good solution to the general issue, although it wouldn't have solved my testing challenges all by itself.

(Since I want to test results for bad reverse DNS lookups and so on, I'd need a DNS server that's guaranteed to return (or not return) all sorts of variously erroneous things in addition to some amount of good data. As far as I know there are no public ones set up for this purpose.)

Comments on this page:

By Ewen McNeill at 2014-08-29 03:10:15:

Two observations:

1. A common way to get hard-to-generate "good" test results is to put in a fake "valid" answer, run the test once, hand check that the actual result is what you expected, and then adopt the "actually got" output as the expected result (eg, cut'n'paste it or put it into an "expected" file). It at least avoids writing one-use code to get the valid test data.

2. As a thought, starting up an on-host DNS server preloaded with fake zones from the root might be a reasonable solution. Although if you can't easily force DNS queries to something other than port 53 (or override the /etc/resolv.conf contents) it may be harder to hook in. (On Linux I've sometimes used iptables NAT rules in the OUTPUT chain to solve that sort of problem, but that's not entirely portable.)


My suggestion would be to write the test so that it works against live DNS, with a short-circuit path that yields mock data from a stub when enabled. And you don’t bother to make up a mock data set, you run the test live the first time and record those results to use as mock data.

That way it’s not one-use scratch code, it’s the canonical test, and you can rerun it any time, and you can redo the mock data capture at any time, to reflect any new realities.

(This kinda ties in to your point to SaveYourTests.)

Written on 29 August 2014.

Last modified: Fri Aug 29 00:32:03 2014