Wandering Thoughts archives

2010-09-30

A lot of my bugs are conceptual oversights

This is in part a war story.

We've written our own system to handle deploying spares on our ZFS fileservers. One of the decisions we made was how much activity to have the system start at once, because we already knew that trying to resilver too many mirrors at once killed performance for everyone. What we decided was that we only wanted one resilver to happen at once and further that we would abort any ZFS scrubs that were in progress if we needed to activate a spare, because getting a mirror back to redundancy was more important than a precautionary check of a mirror's consistency.

So I wrote two functions, scrubbing_pools and resilvering_pools. Because you need to know the name of the pool (or pools) that are scrubbing in order to abort the scrub, scrubbing_pools returned a list of pools that were scrubbing. resilvering_pools did too because it was a trivial variant of scrubbing_pools, and why not? The code that refused to start a second resilver when there was one already running used the obvious check of 'is the list of resilvering pools empty?'

(The system is written in Python, so returning lists of names is a natural thing to do.)

Today an entire backend went down and rebooted out of the blue, causing ZFS to declare all of its disks bad, and we needed to replace it with a spare backend that we trusted. This means resilvering every disk currently in use, which we did through the spares system. After a while we decided that resilvering only one disk at a time was too slow and we could probably survive doing three at once, and we wanted the spares system to do the work for us.

So it was time for a quick yet obvious and simple code change; instead of checking for a non-empty list, just count how many entries are in it and see if this is greater than or equal to a 'maximum resilvers' parameter (which defaulted to 1 for backwards compatibility). We tested this, deployed it, watched it work, and left it going. Tonight, as I checked in on the system state, I realized that there was a bug in what I had done.

Can you see it? (You have a big advantage; I've told you that there is one.)

Here is the bug: we want to limit how many disks are resilvering at once, but the code is counting how many pools are resilvering. If a pool has more than one disk that needs resilvering, the code will wind up happily resilvering all of them at once, no matter how many there are.
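To make the mismatch concrete, here is a minimal Python sketch. It is not the actual code from our spares system; the status dictionary, the disk names, and every function other than resilvering_pools are hypothetical stand-ins for illustration.

```python
# Hypothetical status data: pool name -> list of disks currently resilvering.
status = {
    "tank": ["disk1", "disk2", "disk3", "disk4"],
    "scratch": [],
}


def resilvering_pools(status):
    """Return the names of pools with at least one resilvering disk."""
    return [pool for pool, disks in status.items() if disks]


def resilvering_disks(status):
    """Return every disk that is currently resilvering, across all pools."""
    return [disk for disks in status.values() for disk in disks]


def can_start_by_pools(status, max_resilvers=1):
    """The check as changed: counts pools, not disks."""
    return len(resilvering_pools(status)) < max_resilvers


def can_start_by_disks(status, max_resilvers=1):
    """The check that matches the intent: counts resilvering disks."""
    return len(resilvering_disks(status)) < max_resilvers


# With a limit of 3, the pool-based check still allows another resilver
# even though four disks in one pool are already resilvering.
print(can_start_by_pools(status, max_resilvers=3))
print(can_start_by_disks(status, max_resilvers=3))
```

With the default limit of 1 the two checks always agree, which is why the difference was invisible until the limit was raised.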

There is no coding bug here, and I would argue that there is not even bad design; the code returned resilvering pools instead of disks for completely sensible reasons, and the difference originally didn't matter (a pool that is resilvering has at least one resilvering disk, which meant that we didn't want to start another). The bug is a conceptual oversight, a mismatch between how I thought of the code and how the code was actually working.

Many of the bugs that I find in my code are conceptual oversights, not straightforward errors or mistakes of implementation. How often I hit conceptual oversights is one of the reasons that I am not as enthused about testing (unit and otherwise) as I could be; I don't think that testing, especially unit testing, is a good way to find them. Conceptual oversights are bugs where you don't think about something at all, and if you don't think about something, how can you write tests that check for it? If your tests turn up a conceptual oversight, it is probably a lucky accident.

(This is one argument for a separate testing group, because they may well not make the same conceptual oversight that you did, especially given that they are not as immersed in the logic of the code as you are.)

ConceptualBugExample written at 02:37:24

2010-09-19

Your on the fly control system should not use toggles

Here's a request to programmers: if you are building a system with some form of dynamic control and adjustment commands, like Bind 9 and its rndc program for controlling the daemon's behavior on the fly, your interface should not include 'toggles', commands or options that invert the current state of a setting. In rndc, an example is 'rndc querylog'; this turns DNS query logging on if it is off, and turns it off if it is on.

(In fact this extends to startup options too.)

A toggle is a terrible interface because it is a mechanism, not an outcome, and people don't care about the former and do care about the latter. There are basically no situations in which sysadmins want to change the state of query logging without caring what the previous and new states are; instead, sysadmins either want to turn it on or to turn it off. By providing only a mechanism, you force people to either assume what the current state is or to explicitly check it before (and perhaps after) they use your interface.

So please don't do that. Don't build interfaces like 'rndc querylog'; make interfaces that are 'rndc querylog on' and 'rndc querylog off'. Your users will thank you for it.
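The difference between the two styles can be sketched in a few lines of Python. This is only an illustration of the idea, not how rndc or Bind is implemented; the QueryLogControl class and its method names are made up for the example.

```python
class QueryLogControl:
    """Hypothetical control object for a single on/off setting."""

    def __init__(self, enabled=False):
        self.enabled = enabled

    def toggle(self):
        # Mechanism: the result depends on state the caller may not know.
        self.enabled = not self.enabled
        return self.enabled

    def set(self, enabled):
        # Outcome: the result is what the caller asked for, whatever the
        # previous state was, and repeating the command is harmless.
        self.enabled = bool(enabled)
        return self.enabled


# A sysadmin who wants logging on, on a daemon where it is already on:
ctl = QueryLogControl(enabled=True)
ctl.toggle()        # logging is now off, the opposite of what was wanted
ctl.set(True)       # logging is on
ctl.set(True)       # still on; running it twice does no damage
```

The explicit form is idempotent, so it is safe to put in scripts and safe to run again when you are not sure whether the first attempt went through.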

(Toggles work in GUIs because you can see their state as well as change them, and you see their state before you change them.)

A related sin is interfaces that increase or decrease a setting (such as verbosity or debug level), instead of setting it directly. While there are some uses for these, they should be the secondary way that you change the setting in question, not the primary way; you should always provide a way to directly set the setting's value, and expect it to be used most of the time. Again, usually people know the value that they want. It is rare that increasing or decreasing the level is the actual outcome people want, instead of just the mechanism for getting it.

(The exception is when you are not certain what debug level will start producing the messages you want without also producing too much detail. There it makes sense to increase the debug level step by step until you hit the right point. But frankly, with modern shells with command line editing you can do this almost as well with a 'set to level X' interface.)

Also, a personal note: sysadmins will kill you if the only way to disable debugging is to repeatedly run 'cmd decrease-debugging' until it stops entirely. Well, really we'll kill your program and restart it, but we'll sure wish that we could say very grumpy things to you.
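The same point for levels, as a small Python sketch. The DebugControl class and the helper function are hypothetical and exist only to show the cost of a relative-only interface.

```python
class DebugControl:
    """Hypothetical control object for a numeric debug level."""

    def __init__(self, level=0):
        self.level = level

    def set_level(self, level):
        # Primary interface: say what you want and get it in one command.
        if level < 0:
            raise ValueError("debug level cannot be negative")
        self.level = level
        return self.level

    def increase(self):
        # Secondary interface: step up by one.
        self.level += 1
        return self.level

    def decrease(self):
        # Secondary interface: step down by one, stopping at zero.
        self.level = max(0, self.level - 1)
        return self.level


def commands_to_disable_by_decrease(ctl):
    """Count how many 'decrease' commands it takes to turn debugging off."""
    count = 0
    while ctl.level > 0:
        ctl.decrease()
        count += 1
    return count


stepwise = DebugControl(level=5)
print(commands_to_disable_by_decrease(stepwise))   # one command per level

direct = DebugControl(level=5)
direct.set_level(0)                                # one command, any level
```

With only the relative commands, the sysadmin also has to know the current level to know when to stop; with set_level they do not.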

OnTheFlyNoToggles written at 23:10:41

2010-09-10

Go's network package and IPv6 (and my ideal version thereof)

As the result of making a couple of chainsaw passes at really fixing my issues from yesterday, I've now got a better idea of what Go's network package is doing with IPv6 and what I think I want it to do.

Go's net package has a very Plan 9 (and Research Unix) approach to networking APIs, where it makes no particular attempt to expose some version of the standard sockets API but instead goes its own, higher level way. As part of this it has three flavours of each sort of IP-based networking; for TCP, for example, these are tcp4 ('IPv4-only'), tcp6 ('IPv6-only'), and the generic tcp (cf the description of net.Dial()).

The question is what Unix socket type you use for each setting; IPv4, IPv6 with dual binding turned off, or IPv6 with dual binding turned on. How I think it would ideally work is this:

  • the *4 flavours use AF_INET (IPv4) sockets. (This is the current behavior and works fine.)

  • the *6 flavours use AF_INET6 (IPv6) sockets with dual binding off. This is consistent with the net package's documentation and is the only way to allow you to listen separately for IPv6 and IPv4 versions of the same port (including cooperating with non-Go programs that are listening on the IPv4 version of a port).

  • the generic flavours use IPv6 sockets with dual binding on (because that is the 'allows anything' option), with two exceptions. They use IPv4 sockets if either the system doesn't have IPv6 enabled at all or the connection involves only IPv4 addresses (because I think that the socket type used should match the actual wire level protocol that will get used when we know what it is).
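The rules above can be written down as a small decision function. This is a sketch in Python (to stay in one language with the rest of these examples), not Go's net package code and not my hacked up version of it. The function names and parameters are invented for illustration, and using socket.has_ipv6 is only a rough stand-in for 'the system has IPv6 enabled', since it really reports whether Python itself was built with IPv6 support. 'Dual binding on' corresponds to the IPV6_V6ONLY socket option being off.

```python
import socket


def choose_socket(network, ipv4_only_addrs=False, have_ipv6=socket.has_ipv6):
    """Map a Go-style network name to (address family, v6only setting).

    The v6only value is None for IPv4 sockets, where the option does not
    apply. True means dual binding off; False means dual binding on.
    """
    if network.endswith("4"):
        # The *4 flavours: plain IPv4 sockets.
        return (socket.AF_INET, None)
    if network.endswith("6"):
        # The *6 flavours: IPv6 sockets with dual binding off.
        return (socket.AF_INET6, True)
    # Generic flavours: fall back to IPv4 when there is no IPv6 or when
    # only IPv4 addresses are involved; otherwise IPv6 with dual binding on.
    if not have_ipv6 or ipv4_only_addrs:
        return (socket.AF_INET, None)
    return (socket.AF_INET6, False)


def make_tcp_socket(network, ipv4_only_addrs=False):
    """Create a TCP socket configured according to choose_socket()."""
    family, v6only = choose_socket(network, ipv4_only_addrs)
    sock = socket.socket(family, socket.SOCK_STREAM)
    if v6only is not None:
        # Set the option explicitly instead of trusting the system default.
        sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY,
                        1 if v6only else 0)
    return sock
```

The important property is that the dual binding setting is always set explicitly for IPv6 sockets, so the behavior does not change with the system default.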

The one wrinkle with getting to my ideal situation is that the current Go self tests explicitly make connections to IPv4 addresses using tcp6 networking. This is intrinsically unportable, as there are Unixes where you cannot enable dual binding (OpenBSD is the one I know of, although Go has not been ported to it). If you ignore that, you can make the code turn dual binding on for this case when you detect it.

(Sufficiently perverse tests will defeat this.)

Right now what Go does for the non *4 flavours on an IPv6-enabled system is to use IPv6 with the system default dual binding settings; this works okay on a machine that defaults to dual binding being on, but very badly on a machine with it defaulting to off. The Go people seem to want to turn dual binding on all the time when using IPv6 sockets in order to fix this; I maintain that this is the wrong decision, and besides it's not completely portable (cf OpenBSD).

I have a hacked up version of the Go net package that is most of the way to my ideal situation, but I think I'm missing a corner case or two. It turned out to be relatively easy to do but also somewhat ugly.

(As a language, I am half enthused and half unenthused about Go for reasons that don't fit within the margins of this entry.)

GoIpv6MyDesire written at 01:32:53

2010-09-04

The laziness of a programmer, illustrated

At work, I have fallen into the bad habit of keeping a lot of iconified Firefox windows around, full of various things that I am going to read sometime (honest). As I've mentioned before, I have all of these iconified windows very carefully placed and organized so that I can find them again and keep track of them.

Naturally, this makes quitting and restarting Firefox kind of a pain. I have Firefox set to preserve all of the active windows and tabs over restarts, but it doesn't preserve the positions of the iconified windows (and it doesn't entirely preserve the regular window position either); any time I have to start Firefox again I have to re-position all of those icons. Generally this means that I don't; I never exit Firefox unless I'm forced to, because it's such a pain to get everything set up again.

(Which implies that I never log out, either; I just leave my screen locked.)

Recently I got tired of this (in the aftermath of my Fedora 13 upgrade, I've been restarting things more than usual). Thus I decided that clearly there had to be a way to fish around in the depths of X to find the current icon positions, so I could write a quick script that recorded them in a file and then shuffled the icons back into the right spots for me.

(This is less crazy than it sounds; I already have command line utilities to reposition windows, and X comes with a fair number of commands to poke at various aspects of window state.)

I'll cut to the chase: yes, except that it wasn't exactly a quick script. The most convenient way of doing this turned out to be writing an FVWM module in Perl that finds out all of this information and writes a file of FVWM commands that can be loaded back in to FVWM to (re)position and (re)iconify all of my Firefox windows just right. In the process of doing this I had to remember my Perl, look up a certain amount of Perl's OO support (my last serious Perl programming pretty much predates it), and figure out how to work with FVWM's underdocumented Perl bindings.

(FVWM has no current Python bindings for would-be module authors.)

But all of this was less work than continuing to re-position all of my Firefox windows by hand. Honest.

(The resulting module is sort of theoretically general. If you are really interested, see here. As a bonus, you get to laugh at my hack-job Perl.)

ProgrammerLaziness written at 01:10:22


