A hazard of using synthetic data in tests, illustrated by me

August 29, 2014

My current programming enthusiasm is a 'sinkhole' SMTP server that exists to capture incoming email for spam analysis purposes. As part of this it supports matching DNS names against hostnames or hostname patterns that you supply, so you can write rules like:

@from reject host .boring.spammer with message "We already have enough"

Well, that was the theory. The practice was that until very recently this feature didn't actually work; hostname matches always failed. The reason I spent so much time not noticing this is that the code's automated tests passed. Like a good person I had written the code to do this matching and then written tests for it, in fact tests for it even (and especially) in the context of these rules. All of these tests passed with flying colours, so everything looked great (right up until it clearly failed in practice while I was lucky enough to be watching).

One of the standard problems of testing DNS-based features (such as testing matching against the DNS names of an IP address) is that DNS is an external dependency and a troublesome one. If you make actual DNS queries to actual Internet DNS servers, you're dependent on both a working Internet connection and the specific details of the results returned by those DNS servers. As a result people often mock out DNS query results in tests, especially low level tests. I was no exception here; my test harness made up a set of DNS results for a set of IPs.

(Mocking DNS query results is especially useful if you want to test broken things, such as IP addresses with predictably wrong reverse DNS.)

Unfortunately I got those DNS results wrong. The standard library for my language puts a . at the end of all reverse DNS queries, eg the result of looking up the name of is (currently) 'google-public-dns-a.google.com.' (note the end). Most standard libraries for most languages don't do that, and while I knew that Go's was an exception I had plain overlooked this while writing the synthetic DNS results in my tests. So my code was being tested against 'DNS names' without the trailing dot and matched them just fine, but it could never match actual DNS results in live usage because of the surprise final '.'.

This shows one hazard of using synthetic data in your tests: if you use synthetic data, you need to carefully check that it's accurate. I skipped doing that and I paid the price for it here.

(The gold standard for synthetic data is to make it real data that you capture once and then use forever after. This is relatively easy in algnauges with a REPL but is kind of a pain in a compiled language where you're going to have to write and debug some one-use scratch code.)

Sidebar: how the Go library tests deal with this

I got curious and looked at the tests for Go's standard library. It appears that they deal with this by making DNS and other tests that require external resources be optional (and by hardcoding some names and eg Google's public DNS servers). I think that this is a reasonably good solution to the general issue, although it wouldn't have solved my testing challenges all by itself.

(Since I want to test results for bad reverse DNS lookups and so on, I'd need a DNS server that's guaranteed to return (or not return) all sorts of variously erroneous things in addition to some amount of good data. As far as I know there are no public ones set up for this purpose.)

Written on 29 August 2014.
« One reason why we have to do a major storage migration
How to change your dm-cache write mode on the fly in Linux »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Aug 29 00:32:03 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.