Wandering Thoughts archives

2014-08-29

A hazard of using synthetic data in tests, illustrated by me

My current programming enthusiasm is a 'sinkhole' SMTP server that exists to capture incoming email for spam analysis purposes. As part of this it supports matching DNS names against hostnames or hostname patterns that you supply, so you can write rules like:

@from reject host .boring.spammer with message "We already have enough"

Well, that was the theory. The practice was that until very recently this feature didn't actually work; hostname matches always failed. The reason I spent so much time not noticing this is that the code's automated tests passed. Like a good person I had written the code to do this matching and then written tests for it, including (and especially) tests of it in the context of these rules. All of these tests passed with flying colours, so everything looked great (right up until the feature clearly failed in practice while I was lucky enough to be watching).

One of the standard problems of testing DNS-based features (such as testing matching against the DNS names of an IP address) is that DNS is an external dependency and a troublesome one. If you make actual DNS queries to actual Internet DNS servers, you're dependent on both a working Internet connection and the specific details of the results returned by those DNS servers. As a result people often mock out DNS query results in tests, especially low level tests. I was no exception here; my test harness made up a set of DNS results for a set of IPs.

(Mocking DNS query results is especially useful if you want to test broken things, such as IP addresses with predictably wrong reverse DNS.)
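
As a rough sketch of what this sort of mocking can look like in Go (this is not my actual test harness; the lookupFunc type, namesFor(), and the canned data are all made up for illustration), the matching code can take a lookup function and the tests can hand it canned results:

package main

import (
   "fmt"
   "net"
)

// lookupFunc is whatever does reverse DNS lookups for the matching code.
type lookupFunc func(ip string) ([]string, error)

// namesFor returns the DNS names of an IP via the supplied lookup function.
func namesFor(ip string, lookup lookupFunc) ([]string, error) {
   return lookup(ip)
}

func main() {
   // The real code uses the real resolver.
   realLookup := lookupFunc(func(ip string) ([]string, error) { return net.LookupAddr(ip) })
   _ = realLookup

   // Tests substitute canned results (and canned failures).
   fake := lookupFunc(func(ip string) ([]string, error) {
      canned := map[string][]string{
         "192.0.2.1": {"mail.example.com."}, // note the trailing dot
      }
      if names, ok := canned[ip]; ok {
         return names, nil
      }
      return nil, fmt.Errorf("no reverse DNS for %s", ip)
   })

   names, _ := namesFor("192.0.2.1", fake)
   fmt.Println(names)
}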

Unfortunately I got those DNS results wrong. The standard library for my language puts a '.' at the end of all reverse DNS lookup results, eg the result of looking up the name of 8.8.8.8 is (currently) 'google-public-dns-a.google.com.' (note the trailing dot). Most standard libraries for most languages don't do that, and while I knew that Go's was an exception I had plain overlooked this while writing the synthetic DNS results in my tests. So my code was being tested against 'DNS names' without the trailing dot and matched them just fine, but it could never match actual DNS results in live usage because of the surprise final '.'.
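
To illustrate the mismatch concretely (the name that 8.8.8.8 reverse-resolves to may well have changed by the time you read this), here is what a naive comparison against net.LookupAddr() results runs into:

package main

import (
   "fmt"
   "net"
   "strings"
)

func main() {
   names, err := net.LookupAddr("8.8.8.8")
   if err != nil {
      fmt.Println("lookup failed:", err)
      return
   }
   for _, name := range names {
      // This never matches, because name ends with a '.'.
      fmt.Println(name == "google-public-dns-a.google.com")
      // Normalizing one side or the other fixes it.
      fmt.Println(strings.TrimSuffix(name, ".") == "google-public-dns-a.google.com")
   }
}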

This shows one hazard of using synthetic data in your tests: if you use synthetic data, you need to carefully check that it's accurate. I skipped doing that and I paid the price for it here.

(The gold standard for synthetic data is to make it real data that you capture once and then use forever after. This is relatively easy in languages with a REPL but is kind of a pain in a compiled language where you're going to have to write and debug some one-use scratch code.)

Sidebar: how the Go library tests deal with this

I got curious and looked at the tests for Go's standard library. It appears that they deal with this by making DNS and other tests that require external resources be optional (and by hardcoding some names and eg Google's public DNS servers). I think that this is a reasonably good solution to the general issue, although it wouldn't have solved my testing challenges all by itself.

(Since I want to test results for bad reverse DNS lookups and so on, I'd need a DNS server that's guaranteed to return (or not return) all sorts of variously erroneous things in addition to some amount of good data. As far as I know there are no public ones set up for this purpose.)
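
As a sketch of the general approach (not necessarily exactly how the standard library arranges it), a test that needs external DNS can be skipped under 'go test -short'; the package name here is made up:

package mypkg

import (
   "net"
   "testing"
)

func TestReverseDNS(t *testing.T) {
   if testing.Short() {
      t.Skip("skipping test that needs external DNS in -short mode")
   }
   // Hardcoded external data, eg Google's public DNS server.
   names, err := net.LookupAddr("8.8.8.8")
   if err != nil {
      t.Fatalf("reverse lookup of 8.8.8.8 failed: %v", err)
   }
   if len(names) == 0 {
      t.Fatal("no names returned for 8.8.8.8")
   }
}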

SyntheticTestDataHazard written at 00:32:03

2014-08-20

Explicit error checking and the broad exception catching problem

As I was writing yesterday's entry on a subtle over-broad try in Python, it occurred to me that one advantage of a language with explicit error checking, such as Go, is that a broad exception catching problem mostly can't happen, especially accidentally. Because you check errors explicitly after every operation, it's very hard to aggregate error checks together in the way that a Python try block can fall into.

As an example, here's more or less idiomatic Go code for the same basic operation:

// Print every user whose home directory can't be stat()'d or isn't
// a directory with mode 000.
for _, u := range userlist {
   fi, err := os.Stat(u.hdir)
   if err != nil || !(fi.IsDir() && fi.Mode().Perm() == 0) {
      fmt.Println(u.name)
   }
}

(Note that I haven't actually tried to run this code so it may have a Go error. It does compile, which in a statically typed language is at least a decent sign.)

This does the stat() of the home directory and then prints the user name if either there was an error or the homedir is not a mode 000 directory, corresponding to what happened in the two branches of the Python try block. When we check for an error, we're explicitly checking the result of the os.Stat() call and it alone.

Wait, I just pulled a fast one. Unlike the Python version, this code's printing of the username is not checking for errors. Sure, the fmt.Println() is not accidentally being caught up in the error check intended for the os.Stat(), but we've exchanged this for not checking the error at all, anywhere.

(And this is sufficiently idiomatic Go that the usual tools like go vet and golint won't complain about it at all. People ignore the possibility of errors from fmt.Print* functions all the time; presumably complaining about them would create too much noise for a useful checker.)

This silent ignoring of errors is not intrinsic to explicit error checking in general. What enables it here is that Go, like C, allows you to quietly ignore all return values from a function if you want instead of forcing you to explicitly assign them to dummy variables. The real return values of fmt.Println() are:

n, err := fmt.Println(u.name)

But in my original Go code there is nothing poking us in the nose about the existence of the err return value. Unless we think about it and remember that fmt.Println() can fail, it's easy to overlook that we're completely ignoring an error here.

(We can't do the same with os.Stat() because the purpose of calling it is one of the return values, which means that we have to at least explicitly ignore the err return instead of just not remembering that it's there.)

(This is related to how exceptions force you to deal with errors, of course.)

PS: I think that Go made the right pragmatic call when it allowed totally ignoring return values here. It's not completely perfect but it's better than the real alternatives, especially since there are plenty of situations where there's nothing you can do about an error anyways.

Sidebar: how you can aggregate errors in an explicit check language

Languages with explicit error checks still allow you to aggregate errors together if you want to, but now you have to do it explicitly. The most common pattern is to have a function that returns an error indicator and performs multiple different operations, each of which can fail. Eg:

func oneuser(u user) error {
   var err error
   fi, err := os.Stat(u.hdir)
   if err != nil {
      return err
   }
   if !(fi.IsDir() && fi.Mode().Perm() == 0) {
      _, err = fmt.Println(u.name)
   }
   return err
}

If we then write code that assumes that a non-nil result from oneuser() means that the os.Stat() has failed, we've done exactly the same error aggregation that we did in Python (and with more or less the same potential consequences).
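
For instance, a caller written like this (a fragment in the same style as the snippets above) is quietly assuming that any error it gets back came from the os.Stat():

for _, u := range userlist {
   if err := oneuser(u); err != nil {
      // Wrong assumption: the error may actually be from fmt.Println(),
      // not from os.Stat().
      fmt.Printf("cannot stat %s: %v\n", u.hdir, err)
   }
}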

ExplicitErrorsAndBroadCatches written at 01:57:21

2014-08-18

The potential issue with Go's strings

As I mentioned back in Things I like about Go, one of the Go things that I really like is its strings (and slices in general). From the perspective of a Python programmer, what makes them great is that creating strings is cheap because they often don't require a copy. In Python, any time you touch a string you're copying some or all of it and this can easily have a real performance impact. Writing performant Python code requires considering this carefully. In Go, pretty much any string operation that just takes a subset of the string (eg trimming whitespace from the front and the end) is copy-free, so you can throw around string operations much more freely. This can make a straightforward algorithm both the right solution to your problem and pretty efficient.

(Not all interesting string operations are copy-free, of course. For example, converting a string to all upper case requires a copy, although Go's implementation is clever enough to avoid this if the string doesn't change, eg because it's already all in upper case.)

But this goodness necessarily comes with a potential badness, which is that those free substrings keep the entire original string alive in memory. What makes Go strings (and slices) so cheap is that they are just references to some chunk of underlying storage (the real data for the string or the underlying array for a slice); making a new string is just creating a new reference. But Go doesn't (currently) do partial garbage collection of string data or arrays, so if even one tiny bit of it is referred to somewhere the entire object must be retained. In other words, a string that's a single character is (currently) enough to keep a big string from being garbage collected.
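
Here's a sketch of how easy it is to fall into this; the file name and the claimed file size are made up for illustration:

package main

import (
   "fmt"
   "io/ioutil"
   "strings"
)

// firstField returns the first space-separated field of the file's contents.
func firstField(fname string) (string, error) {
   data, err := ioutil.ReadFile(fname) // imagine this is a 100 MB file
   if err != nil {
      return "", err
   }
   s := string(data)
   i := strings.IndexByte(s, ' ')
   if i < 0 {
      i = len(s)
   }
   // This tiny substring shares s's backing data, so holding on to it
   // keeps the entire (say) 100 MB of string data alive.
   return s[:i], nil
}

func main() {
   field, err := firstField("/tmp/bigfile") // hypothetical file name
   if err != nil {
      fmt.Println(err)
      return
   }
   fmt.Println(field)
}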

This is not an issue that many people will run into, of course. To hit it you need to either be dealing with very big original strings or care a lot about memory usage (or both), and on top of that you have to create persistent small substrings of original strings that you want to be non-persistent. Many usage patterns won't hit this; your original strings are not large, your subsets cover most of the original string anyways (for example if you break it up into words), or the substrings don't live very long. In short, if you're an ordinary Go programmer you can ignore this. The people who care are handling big strings and keeping small chunks of them for a long time.

(This is the kind of thing that I notice because I once spent a lot of effort to make a Python program use as little memory as possible even though it was parsing and storing chunks out of a big configuration file. This made me extra-conscious about things like string lifetimes, single-copy interned strings, and so on. Then I wrote a parser in Go, which made me consider all of these issues all over again and caused me to realize that the big string representing my entire input file was going to be kept in memory due to the bits of it that my parser was clipping out and keeping.)

By the way, I think that this is the right tradeoff for Go to make. Most people using strings will never run into this, while it's very useful that substrings are cheap. And cheap substrings of this sort also make less work for the garbage collector; instead of a churn of variable-length strings when code uses a lot of substrings (as happens in Python), you just have a churn of fixed-size string references.

Of course there's the obvious fix if your code starts running into this: create a function that 'minimizes' a string by turning it into a []byte and then back. This creates a minimized string at the cost of an extra copy over the theoretical ideal implementation and can be trivially done in Go today.
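
For illustration, a trivial version of such a 'minimize' function:

// minimize returns a copy of s with its own minimal backing storage,
// detaching it from whatever big string s was clipped out of. It does two
// copies (string -> []byte and []byte -> string), one more than a
// theoretically ideal implementation would need.
func minimize(s string) string {
   return string([]byte(s))
}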

Sidebar: How strings.ToUpper() et al avoid unnecessary copies

All of the active transformation functions like ToUpper() and ToTitle() are implemented using strings.Map() and functions from the unicode package. Map() is smart enough to not start making a new string until the mapping function returns a different rune than the existing one. As a result, any similar direct use of Map() that your code has will get this behavior for free.
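
For illustration, here's a direct use of Map() along these lines; asciiUpper() is a made-up example, but it gets the same no-copy behaviour when nothing actually changes:

package main

import (
   "fmt"
   "strings"
)

// asciiUpper upper-cases ASCII letters via strings.Map(). If no rune needs
// changing, Map() hands back the original string without copying it.
func asciiUpper(s string) string {
   return strings.Map(func(r rune) rune {
      if r >= 'a' && r <= 'z' {
         return r - ('a' - 'A')
      }
      return r
   }, s)
}

func main() {
   fmt.Println(asciiUpper("ALREADY UPPER"), asciiUpper("mixed Case"))
}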

GoStringsMemoryHolding written at 00:45:35

