I don't understand how to test complex data structures

December 20, 2010

One of my weaknesses as a modern programmer is that I don't really understand how to do test-driven development. I understand the basic ideas of automated testing and unit testing, and have used them to reasonable effect every so often; where I fall down is in understanding how to test things in more advanced and challenging circumstances.

My current example of this is our ZFS spares handling system. Simplified slightly, the core program works by reading state information on all of the ZFS pools and their components (some of which may be incomplete), making multiple passes over this state information to refine it and generate a collection of higher level information, and then using all of this to detect what problems exist and decide what should be done about them. Because of how ZFS organizes its configuration information, the ZFS pool data structures wind up being multi-level and relatively large and complex (ZFS is in love with things nested inside things nested inside things). Because spares replacement is a global thing, the decisions the spares system makes are based on the entire system state, not on small local attributes of one particular bit of these data structures.
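As a rough illustration of the shape of the problem (not the real program), the following Python sketch shows a tiny nested state structure and two passes over it. The dictionary layout, the device names, and the functions find_failed_disks and plan_replacements are all assumptions made up for this example; the real state has hundreds of attributes and deeper nesting.

```python
# Hypothetical, heavily trimmed stand-in for the real ZFS state information.
STATE = {
    "tank": {
        "state": "DEGRADED",
        "vdevs": [
            {
                "type": "mirror",
                "state": "DEGRADED",
                "disks": [
                    {"dev": "c1t0d0", "state": "ONLINE"},
                    {"dev": "c2t0d0", "state": "FAULTED"},
                ],
            },
        ],
    },
    "scratch": {
        "state": "DEGRADED",
        "vdevs": [
            {
                "type": "mirror",
                "state": "DEGRADED",
                "disks": [
                    {"dev": "c4t0d0", "state": "ONLINE"},
                    {"dev": "c5t0d0", "state": "FAULTED"},
                ],
            },
        ],
    },
}

# Hypothetical pool of free spares, shared by every pool on the system.
SPARES = ["c9t0d0"]


def find_failed_disks(pools):
    """First pass: flatten the nesting into (pool, dev) pairs for bad disks."""
    failed = []
    for pool_name, pool in sorted(pools.items()):
        for vdev in pool["vdevs"]:
            for disk in vdev["disks"]:
                if disk["state"] != "ONLINE":
                    failed.append((pool_name, disk["dev"]))
    return failed


def plan_replacements(pools, spares):
    """Second pass: a global decision that pairs failed disks with spares."""
    failed = find_failed_disks(pools)
    # zip() stops at the shorter list, so excess failures get no spare.
    return [(pool, dev, spare) for (pool, dev), spare in zip(failed, spares)]


plan = plan_replacements(STATE, SPARES)
```

With two failed disks and one spare, the plan depends on both pools at once, so neither pool's data can be tested in isolation. That is the sense in which the decisions are global.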

Doing proper test-driven development of this code certainly seems to require somehow manufacturing an entire set of artificially damaged pool configurations, ideally ones that accurately reflect our production fileservers. I don't know how I am supposed to do this in a TDD world, and I don't see any particularly good way to do it.

There are two vaguely plausible approaches. First, I could try to write the base state information from scratch. The problem is that the state information is very large; even for a relatively small production fileserver it's over 500 attributes (some nested), and a full scale production fileserver that's experiencing problems will have well over a thousand attributes. Hand-writing configurations of this size is sufficiently time consuming and tedious (and likely error-prone) that I am simply not going to get good coverage of the possible situations.

Alternately, I could start with real state information for a working system and then selectively and automatically break it in various ways, so that it looks like disks have failed, other disks are in the process of being replaced with spares, and so on. The problem is that such modifications to the state information are themselves relatively complex once you get beyond simple situations. I would have to write an entire chunk of code to carefully mutate these data structures, including adding entirely new synthetic nesting elements that were created from scratch. This has much the same problems as complex mock objects; how do I know that my mutation code is correct?
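A minimal sketch of what such mutation code might look like in Python, assuming the state is plain nested dictionaries. The layout, the HEALTHY baseline, and the helpers fail_disk and start_sparing are all hypothetical names invented for illustration. The synthetic "spare" element loosely imitates how ZFS nests a failed disk and its spare together, and is not meant as an exact model.

```python
import copy

# Hypothetical healthy baseline, trimmed to a single two-way mirror.
HEALTHY = {
    "tank": {
        "state": "ONLINE",
        "vdevs": [
            {
                "type": "mirror",
                "state": "ONLINE",
                "disks": [
                    {"dev": "c1t0d0", "state": "ONLINE"},
                    {"dev": "c2t0d0", "state": "ONLINE"},
                ],
            },
        ],
    },
}


def fail_disk(state, pool, dev):
    """Return a deep copy of state in which one disk appears FAULTED."""
    broken = copy.deepcopy(state)
    for vdev in broken[pool]["vdevs"]:
        for disk in vdev["disks"]:
            if disk.get("dev") == dev:
                disk["state"] = "FAULTED"
                # The damage must propagate upward to stay self-consistent.
                vdev["state"] = "DEGRADED"
                broken[pool]["state"] = "DEGRADED"
                return broken
    raise KeyError(f"no disk {dev!r} in pool {pool!r}")


def start_sparing(state, pool, dev, spare):
    """Return a deep copy where dev is wrapped in a synthetic spare element."""
    changed = copy.deepcopy(state)
    for vdev in changed[pool]["vdevs"]:
        for i, disk in enumerate(vdev["disks"]):
            if disk.get("dev") == dev:
                # A new nesting level built from scratch: old disk plus spare.
                vdev["disks"][i] = {
                    "type": "spare",
                    "state": "DEGRADED",
                    "disks": [disk, {"dev": spare, "state": "ONLINE"}],
                }
                return changed
    raise KeyError(f"no disk {dev!r} in pool {pool!r}")


damaged = start_sparing(
    fail_disk(HEALTHY, "tank", "c2t0d0"), "tank", "c2t0d0", "c9t0d0"
)
```

Even this toy version has to keep three levels of state in sync and invent a new nested element, and nothing in it proves that the result matches what a real fileserver would report.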

(One plausible answer from testing people is that I should not have passive attribute-based data structures but instead hierarchies of objects with complex behaviors, and then I should substitute mock objects to represent broken objects. One of the many issues with this is that it proceeds straight to the complex mock objects issue.)

Presumably test-driven development has an answer to this problem. I just don't know enough about TDD to know what it is.

Sidebar: what I do right now

Right now, I test by hand, using ad hoc techniques such as temporarily adding code to deliberately make specific bits go wrong. This has the advantage that I can make bits of the program lie to itself, but it also has all sorts of disadvantages and limitations: I can't automate it, I can't test truly complex things, my testing is necessarily somewhat indirect and artificial, and so on.

Oh, and I have to take the test code out before pushing updates to production (then put it back in the next time I have something to add or debug).

Comments on this page:

From at 2010-12-20 05:26:51:

It sounds like the ad-hoc test code that you now insert and remove manually could serve as mock components in an automated testing system.

From at 2010-12-20 09:35:26:

When I read the title, I was originally a little taken aback. After reading the article, I realized why: I saw "I don't understand how" and immediately assumed you meant "I don't understand why". Aside from StackOverflow, I don't recall many incidents of programmers admitting ignorance. ;)

For what it's worth, I find your process-of-discovery sequencing of posts interesting and your candor refreshing. Thanks for writing.

By trs80 at 2010-12-21 03:39:18:

This sounds like a specific case of the sysadmin testing problem.

By cks at 2010-12-21 10:48:43:

Full end to end tests of a spares replacement system might be an example of that, but I don't think that code tests are. Code tests can and should be run without actually replacing any spares or looking at any live configuration data; as I've written before, you should be able to give your program the state information for any real or semi-real scenario without having to actually create the scenario.

(This is no different from testing anything else. You almost never do unit tests or functional tests with live production data or copies of it; instead you want something simpler and smaller.)

Written on 20 December 2010.


Last modified: Mon Dec 20 01:44:36 2010