Fixing low command error distances
Suppose that you have a command with an unnervingly low error distance, either because a vendor stuck you with it or because it's the natural way to structure the command's arguments. The way to fix this is to change the sort of error required to make a mistake, so that you move from a likely change to an unlikely one.
(If you are working with a vendor command, you will need to do this with some sort of a cover script or program. If you are working with a local command, you can just change the arguments directly.)
For a concrete example, let's look at the ZFS zpool command to add a spare disk to a ZFS pool: 'zpool add POOL spare DEVICE'. Much like adding mirrors to a ZFS pool, this is one omitted word away from a potential disaster. The simple fix in a cover script is to change it to a separate command, making it something like 'sanpool spare POOL DEVICE'; this changes the error distance from an omitted word to a changed word, a less likely mistake (especially because the word you'd have to change is in a sense the focus of what you're doing).
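As a sketch of what such a cover script might look like (a hedged, minimal example; 'sanpool' is the hypothetical command name from above, and the script echoes the underlying zpool command instead of running it):

```shell
# Hypothetical 'sanpool' cover command, sketched as a shell function.
# Wrapping 'zpool add POOL spare DEVICE' in its own verb means an
# omitted word now produces a usage error, not a different zpool command.
sanpool() {
    if [ "$#" -ne 3 ] || [ "$1" != "spare" ]; then
        echo "usage: sanpool spare POOL DEVICE" >&2
        return 1
    fi
    # Echoed for illustration; a real script would run the command.
    echo "zpool add $2 spare $3"
}

sanpool spare tank c1t9d0
```

Note that leaving out any word now fails loudly with a usage message rather than quietly becoming a different, valid zpool operation.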
To make this mistake even less likely, modify the command that the cover script uses to expand a ZFS pool: instead of using 'add' (which is general but raises the question of 'add what?'), use 'grow'. Contrast:

    sanpool grow POOL DEVICE
    sanpool spare POOL DEVICE
Now the commands are fairly strongly distinct and harder to substitute for each other, because it is a much bigger mental distance from 'add a spare' to 'grow the pool' than from 'add a spare' to 'add a device'.
(When trying to prevent errors, it is useful to approach the commands from a high-level view of what people are trying to do, rather than looking for low-level similarities in how it gets done. In a sense, the way to avoid errors is to avoid similarities between things that are actually different.)
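The two distinct verbs might be dispatched in the hypothetical cover script like this (a sketch only; commands are echoed rather than executed, and a real 'grow' would presumably insist on a redundant device layout):

```shell
# Sketch of a 'sanpool' cover command with distinct 'grow' and 'spare'
# verbs, so that confusing one operation for the other requires
# changing the verb itself rather than just dropping a word.
sanpool() {
    verb="$1" pool="$2" dev="$3"
    if [ "$#" -ne 3 ]; then
        echo "usage: sanpool grow|spare POOL DEVICE" >&2
        return 1
    fi
    case "$verb" in
        # A real 'grow' would likely require a mirrored pair of devices;
        # both branches just echo the zpool command for illustration.
        grow)  echo "zpool add $pool $dev" ;;
        spare) echo "zpool add $pool spare $dev" ;;
        *)     echo "sanpool: unknown verb '$verb'" >&2; return 1 ;;
    esac
}

sanpool grow tank c1t9d0
sanpool spare tank c1t9d0
```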
Another way for cover scripts to help you avoid errors is to simply not allow them in the first place. System commands may have to be general and thus allow even the questionable, but your scripts can be more restrictive; for example, if you know you should never have non-redundant ZFS pool devices, you can just make your cover script refuse to add them.
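Such a restriction might look like this in a cover script (a hedged sketch; the set of acceptable vdev keywords is an assumption, not an exhaustive list):

```shell
# Sketch: only allow additions that start with a vdev keyword providing
# (or associated with) redundancy; refuse bare, non-redundant devices.
allow_add() {
    case "$1" in
        mirror|raidz|raidz1|raidz2|spare)
            return 0 ;;
        *)
            echo "refusing non-redundant addition starting with '$1'" >&2
            return 1 ;;
    esac
}

allow_add mirror && echo "mirror: allowed"
allow_add c1t9d0 || echo "c1t9d0: refused"
```

With this check in front of 'zpool add', the dangerous omitted-'mirror' command from earlier simply cannot be issued through the cover script.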
The concept of error distance in sysadmin commands
I have recently started thinking about the concept of what I will call the 'error distance' of sysadmin commands: how much do you have to change a perfectly normal command in order to do something undesirable or disastrous (instead of just failing with an error)?
(As an example, consider the ZFS command to expand a ZFS pool with a new pair of mirrored disks, which is 'zpool add POOL mirror DEV1 DEV2'. If you accidentally omit the 'mirror', you will add two unmirrored disks to the ZFS pool, and you can't shrink ZFS pools to remove devices. So the error distance here is one omitted word.)
You want the error distance for commands to be as large as possible, because this avoids accidents when people make their inevitable errors. Low error distance is also more dangerous in commonly used commands than uncommonly used ones, because you are less likely to carefully check a command that you use routinely (especially if you don't consider it inherently dangerous).
When considering the error distance, my belief is that certain sorts of changes are more likely than others (and thus make the error distance closer). My gut says:
- omitting words is more likely than changing words (using 'cat' when you mean 'dog'), which in turn is more likely than adding words.
(I am not sure where transposing words should fit in, where you write 'cat dog' instead of 'dog cat'.)
- commonly used things are more likely than uncommon things; for example, if you commonly add an option to one command, you are more likely to add it to another command.
(I suspect that this has been studied formally at some point, probably by the HCI/Human Factors people.)