A lot of my bugs are conceptual oversights

September 30, 2010

This is in part a war story.

We've written our own system to handle deploying spares on our ZFS fileservers. One of the decisions we made was how much activity to have the system start at once, because we already knew that trying to resilver too many mirrors at once killed performance for everyone. What we decided was that we only wanted one resilver to happen at once and further that we would abort any ZFS scrubs that were in progress if we needed to activate a spare, because getting a mirror back to redundancy was more important than a precautionary check of a mirror's consistency.

So I wrote two functions, scrubbing_pools and resilvering_pools. Because you need to know the name of the pool (or pools) that are scrubbing in order to abort the scrub, scrubbing_pools returned a list of pools that were scrubbing. resilvering_pools did too because it was a trivial variant of scrubbing_pools, and why not? The code that refused to start a second resilver when there was one already running used the obvious check of 'is the list of resilvering pools empty?'

(The system is written in Python, so returning lists of names is a natural thing to do.)

Today an entire backend went down and rebooted out of the blue, causing ZFS to declare all of its disks bad, and we needed to replace it with a spare backend that we trusted. This means resilvering every disk currently in use, which we did through the spares system. After a while we decided that resilvering only one disk at a time was too slow and we could probably survive doing three at once, and we wanted the spares system to do the work for us.

So it was time for a quick yet obvious and simple code change; instead of checking for a non-null list, just count how many entries are in it and see if this is larger or equal to a 'maximum resilvers' parameter (which defaulted to 1 for backwards compatibility). We tested this, deployed it, watched it work, and left it going. Tonight, as I checked in on the system state, I realized that there was a bug in what I had done.

Can you see it? (You have a big advantage; I've told you that there is one.)

Here is the bug: we want to limit how many disks are resilvering at once, but the code is counting how many pools are resilvering. If a pool has more than one disk that needs resilvering, the code will wind up happily resilvering all of them at once, no matter how many there are.

There is no coding bug here, and I would argue that there is not even bad design; the code returned resilvering pools instead of disks for completely sensible reasons, and the difference originally didn't matter (a pool that is resilvering has at least one resilvering disk, which meant that we didn't want to start another). The bug is a conceptual oversight, a mismatch between how I thought of the code and how the code was actually working.

Many of the bugs that I find in my code are conceptual oversights, not straightforward errors or mistakes of implementation. How often I hit conceptual oversights is one of the reasons that I am not as enthused about testing (unit and otherwise) as I could be; I don't think that testing, especially unit testing, is a good way to find them. Conceptual oversights are bugs where you don't think about something at all, and if you don't think about something, how can you write tests that check for it? If your tests turn up a conceptual oversight, it is probably a lucky accident.

(This is one argument for a separate testing group, because they may well not make the same conceptual oversight that you did. Especially given that they are not as immersed in the logic of the code as you are.)

Written on 30 September 2010.
« A surprise about random IO versus unnecessary IO
Stopping kernel updates on Ubuntu »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Sep 30 02:37:24 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.