A drawback to handling errors via exceptions
Recently I discovered an interesting and long standing bug in DWiki. DWiki is essentially a mature program, so this one was uncovered through the common mechanism of someone using invalid input, in this case a specific sort of invalid URL. DWiki creates time-based views of this blog through synthetic parts of the URLs that end in things like, for example, '.../2014/10/' for entries from October 2014. Someone came along and requested a URL that looked like '.../2014/99/', and DWiki promptly hit an uncaught Python exception (well, technically it was caught and logged by my general error code).
(A mature program usually doesn't have bugs handling valid input, even uncommon valid input. But the many forms of invalid input are often much less well tested.)
To be specific, it promptly coughed up:
calendar.IllegalMonthError: bad month number 99; must be 1-12
Down in the depths of the code that handled a per-month view I was using calendar.monthrange() to determine how many days a given month has, which was throwing an exception because '99' is of course not a valid month of the year. The exception escaped because I wasn't doing anything in my code to either catch it or keep invalid months from getting that far in the code.
The standard advantage of handling errors via exceptions definitely applied here. Even though I had totally overlooked this error possibility, the error did not get quietly ignored and go on to corrupt further program state; instead I got smacked over the nose with the existence of this bug so I could find it and fix it. But it also exposes a drawback of handling errors with exceptions, which is that it makes it easier to overlook the possibility of errors because that possibility isn't explicit.
The calendar module doesn't document what exceptions it raises, either in general or for monthrange() specifically (where it would be easy to spot while reading about the function). Because an exception is effectively an implicit extra return 'value' from a function, it's easy to overlook the possibility that you'll actually get one; in Python, there's nothing there to rub your nose in it and make you think about it. And so I never even thought about what happened if monthrange() was handed invalid input, in part because of the usual silent assumption that the code would only be called with valid input; of course DWiki doesn't generate date range URLs with bad months in them.
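As a sketch of the sort of fix involved, catching the exception explicitly makes the error possibility visible in the code (the function and names here are my own illustration, not DWiki's actual code):

```python
import calendar

def days_in_month(year, month):
    """Return the number of days in a month, or None for invalid input."""
    try:
        # monthrange() returns (weekday of the 1st, number of days).
        return calendar.monthrange(year, month)[1]
    except calendar.IllegalMonthError:
        # IllegalMonthError is a subclass of ValueError, so catching
        # ValueError would also work here.
        return None

days_in_month(2014, 10)   # 31
days_in_month(2014, 99)   # None instead of an uncaught exception
```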
Explicit error returns may require a bunch of inconvenient work to handle them individually instead of letting you aggregate exception handling together, but the mere presence of an explicit error return in a method's or function's signature serves as a reminder that yes, the function can fail and so you need to handle it. Exceptions for errors are more convenient and more safe for at least casual programming, but they do mean you need to ask yourself what-if questions on a regular basis (here, 'what if the month is out of range?').
(It turns out I've run into this general issue before, although that time the documentation had a prominent notice that I just ignored. The general issue of error handling with exceptions versus explicit returns is on my mind these days because I've been doing a bunch of coding in Go, which has explicit error returns.)
Quick notes on the Linux iptables 'ipset' extension
For a long time Linux's iptables firewall had an annoying lack in that it had no way to do efficient matching against a set of IP addresses. If you had a lot of IP addresses to match things against (for example if you were firewalling hundreds or thousands of IP addresses and IP address ranges off from your SMTP port), you needed one iptables rule for each entry and then they were all checked sequentially. This didn't make your life happy, to put it one way. In modern Linuxes, ipsets are finally the answer to this; they give you support for efficient sets of various things, including random CIDR netblocks.
(This entry suggests that ipsets only appeared in mainline Linux kernels as of 2.6.39. Ubuntu 12.04, 14.04, Fedora 20, and RHEL/CentOS 7 all have them while RHEL 5 appears to be too old.)
To work with ipsets, the first thing you need is the user level tool for
creating and manipulating them. For no particularly sensible reason your
Linux distribution probably doesn't install this when you install the
standard iptables stuff; instead you'll need to install an additional
package, usually called
ipset. Iptables itself contains the code to
use ipsets, but without
ipset to create the sets you can't actually
install any rules that use them.
(I wish I was kidding about this but I'm not.)
The basic use of ipsets is to make a set, populate it, and match against it. Let's take an example:
ipset create smtpblocks hash:net counters
ipset add smtpblocks 188.8.131.52/19
ipset add smtpblocks 184.108.40.206/24
iptables -A INPUT -p tcp --dport 25 -m set --match-set smtpblocks src -j DROP
(Both entries are currently on the Spamhaus EDROP list.)
Note that the set must exist before you can add iptables rules that refer to it. The ipset manpage has a long discussion of the various types of sets that you can use, and the iptables-extensions manpage has a discussion of --match-set and the SET target for adding entries to sets from iptables rules. The hash:net type I'm using here holds random CIDR netblocks (including /32s, ie single hosts), and the 'counters' option makes each entry keep packet and byte counters.
It would be nice if there was a simple command to get just a listing of the members of an ipset. Unfortunately there isn't, as a plain 'ipset list' insists on outputting a few lines of summary information before it lists the members. Since I don't know if those lines are constant, I'm using 'ipset list -t save | grep "^add "', which seems ugly but seems likely to keep working forever.
Unfortunately I don't think there's an officially supported way of adding multiple entries to a set at once in a single command invocation; instead you're apparently expected to run 'ipset add ...' repeatedly. You can abuse the 'ipset restore' command for this if you want to by creating appropriately formatted input; check the output of 'ipset save' to see what it needs to look like. This may even be considered a stable interface, for all I know.
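For illustration, a bulk load via 'ipset restore' might look like the following. This is a sketch only; it needs root and the ipset package, and the set name and netblocks are made-up examples:

```shell
# Sketch: feeding save-format 'add' lines to 'ipset restore' to
# populate a set in one invocation instead of repeated 'ipset add's.
ipset create smtpblocks hash:net counters
ipset restore <<EOF
add smtpblocks 192.0.2.0/24
add smtpblocks 198.51.100.0/24
EOF
```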
Ipset syntax and usage appears to have changed over time, so old discussions of it that you find online may not work quite as written (and someday these notes may be out of date that way as well).
PS: I can sort of see a lot of clever uses for ipsets, but I've only
started exploring them right now and my iptables usage is fairly basic
in general. I encourage you to read the
ipset manpage and go wild.
Sidebar: how I think you're supposed to use list sets
As an illustrated example:
ipset create spamhaus-drop hash:net counters
ipset create spamhaus-edrop hash:net counters
[... populate both from spamhaus ...]
ipset create spamhaus list:set
ipset add spamhaus spamhaus-drop
ipset add spamhaus spamhaus-edrop
iptables -A INPUT -p tcp --dport 25 -m set --match-set spamhaus src -j DROP
This way your iptables rules can be indifferent about exactly what goes into the 'spamhaus' ipset, although of course this will be slightly less efficient than checking a single merged set.
Unnoticed nonportability in Bourne shell code (and elsewhere)
In response to my entry on how Bashisms in #!/bin/sh scripts aren't necessarily bugs, FiL wrote:
If you gonna use bashism in your script why don't you make it clear in the header specifying #!/bin/bash instead [of] #!/bin/sh? [...]
One of the historical hard problems for Unix portability is people writing non-portable code without realizing it, and Bourne shell code is no exception. This is true for even well intentioned people writing code that they want to be portable.
One problem, perhaps the root problem, is that very little you do on Unix will come with explicit (non-)portability warnings and you almost never have to go out of your way to use non-portable features. This makes it very hard to know whether or not you're actually writing portable code without trying to run it on multiple environments. The other problem is that it's often both hard to remember and hard to discover what is non-portable versus what is portable. Bourne shell programming is an especially good example of both issues (partly because Bourne shell scripts often use a lot of external commands), but there have been plenty of others in Unix's past (including 'all the world's a VAX' and all sorts of 64-bit portability issues in C code).
So one answer to FiL's question is that a lot of people are using bashisms in their scripts without realizing it, just as a lot of people have historically written non-portable Unix C code without intending to. They think they're writing portable Bourne shell scripts, but because their /bin/sh is Bash and nothing in Bash warns about these things, the issues sail right by. Then one day you wind up changing /bin/sh to be Dash and all sorts of bits of the world explode, sometimes in really obscure ways.
All of this sounds abstract, so let me give you two examples of accidental Bashisms I've committed. The first and probably quite common one is using '==' instead of '=' in '[ ... ]' conditions. Many other languages use == as their string equality check, so at some point I slipped and started using it in 'Bourne' shell scripts. Nothing complained, everything worked, and I thought my shell scripts were fine.
The second I just discovered today. Bourne shell pattern matching allows character classes, using the usual '[...]' notation, and it even has negated character classes. This means that you can write something like the following to see if an argument has any non-number characters in it:
case "$arg" in
    *[^0-9]*) echo contains non-number; exit 1;;
esac
Actually I lied in that code. The official POSIX Bourne shell doesn't negate character classes with the usual '^' character that Unix regular expressions use; instead it uses '!'. But Bash accepts '^' as well. So I wrote code that used '^', tested it, had it work, and again didn't realize that I was non-portable.
(Since having a '^' in your character class is not an error in a POSIX Bourne shell, the failure mode for this one is not an error message but a pattern that quietly matches something other than what you intended.)
This is also a good example of how hard it is to test for non-portability, because even when you use 'set -o posix' Bash still accepts and matches this character class in its way (with '^' interpreted as class negation). The only way to test or find this non-portability is to run the script under a different shell entirely. In fact, the more theoretically POSIX compatible shells you test on the better.
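The portable version of that check, using '!' for class negation, can be wrapped up as a small sh function (my own sketch, not anything from a standard):

```shell
# Portable digit check using POSIX '!' class negation, not Bash's '^'.
# Returns 0 if the argument is entirely digits, 1 otherwise.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *) return 0 ;;
    esac
}
```

This form behaves the same in dash, Bash, and other POSIX shells.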
(In theory you could try to have a perfect memory for what is POSIX compliant and not need any testing at all, or cross-check absolutely everything against POSIX and never make a mistake. In practice humans can't do that any more than they can write or check perfect code all the time.)
My current somewhat tangled feelings on operator.attrgetter
Is there a reason you're not using operator.attrgetter for the key functions? It's faster than a lambda.
One answer is that until now I hadn't heard of operator.attrgetter. Now that I have, it's something I'll probably consider in the future.
But another answer is embedded in the reason Peter Donis gave for using it. Using operator.attrgetter is clearly a speed optimization, but speed isn't always the important thing. Sometimes, even often, the most important thing to optimize is clarity. Right now, for me attrgetter is less clear than the lambda approach because I've just learned about it; switching to it would probably be a premature optimization for speed at the cost of clarity.
In general, well, 'attrgetter' is a clear enough name that I suspect I'll never be confused about what 'lst.sort(key=operator.attrgetter("field"))' does, even if I forget about it and then reread some code that uses it; it's just pretty obvious from context and the name itself. There's a visceral bit of me that doesn't like it as much as the lambda approach because I don't think it reads as well, though. It's also more black magic than lambda: lambda is a general language construct and attrgetter is a magic module function.
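To make the comparison concrete, here are the two forms side by side in a small self-contained sketch (the Entry class and field name are made up for illustration):

```python
from operator import attrgetter

class Entry:
    def __init__(self, field):
        self.field = field

lst = [Entry("c"), Entry("a"), Entry("b")]

# These two key functions behave identically for sorting:
lst.sort(key=attrgetter("field"))
# versus
lst.sort(key=lambda x: x.field)
```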
(And as a petty thing it has less natural white space. I like white space since it makes things more readable.)
On the whole this doesn't leave me inclined to switch to using
attrgetter for anything except performance sensitive code (which these
sort()s aren't so far). Maybe this is the wrong decision, and if the
Python community as a whole adopts attrgetter as the standard and usual
way to do
.sort() key access it certainly will become a wrong decision.
At that point I hope I'll notice and switch myself.
(This is in a sense an uncomfortable legacy of CPython's historical performance issues with Python code. Attrgetter is clearly a performance hack in general; if lambda was just as fast as it, I'd argue that you should clearly use lambda because it's a general language feature instead of a narrowly specialized one.)
Practical security and automatic updates
One of the most important contributors to practical, real world security is automatically applied updates. This is because most people will not take action to apply security fixes; in fact most people will probably not do so even if asked directly and just required to click 'yes, go ahead'. The more work people have to go through to apply security fixes, the fewer people will do so. Ergo you maximize security fixes when people are required to take no action at all.
(Please note that sysadmins and developers are highly atypical users.)
But this relies on users being willing to automatically apply updates, and that in turn requires that updates must be harmless. The ideal update either changes nothing besides fixing security issues and other bugs or improves the user's life. Updates that complicate the user's life at the same time that they deliver security fixes, like Firefox updates, are relatively bad. Updates that actually harm the user's system are terrible.
Every update that does harm to someone's system is another impetus for people to disable automatic updates. It doesn't matter that most updates are harmless and it doesn't matter that most people aren't affected by even the harmful updates, because bad news is much more powerful than good news. We hear loudly about every update that has problems; we very rarely hear about updates that prevented problems, partly because it's hard to notice when it happens.
(The other really important thing to understand is that mythology is extremely powerful and extremely hard to dislodge. Once mythology has set in that leaving automatic updates on is a good way to get screwed, you have basically lost; you can expect to spend huge amounts of time and effort persuading people otherwise.)
If accidentally harmful updates are bad, actively malicious updates are worse. An automatic update system that allows malicious updates (whether the maliciousness is the removal of features or something worse) is one that destroys trust in it and therefore destroys practical security. As a result, malicious updates demand an extremely strong and immediate response. Sadly they often don't receive one, and especially when the 'update' removes features it's often even defended as a perfectly okay thing. It's not.
PS: corollaries for, say, Firefox and Chrome updates are left as an exercise to the reader. Bear in mind that for many people their web browser is one of the most crucial parts of their computer.
(This issue is why people are so angry about FTDI's malicious driver appearing in Windows Update (and FTDI has not retracted their actions; they promise future driver updates that are almost as malicious as this one). It's also part of why I get so angry when Unix vendors fumble updates.)
Things that can happen when (and as) your ZFS pool fills up
There's a shortage of authoritative information on what actually happens if you fill up a ZFS pool, so here is what I've gathered about it, both from other people's information and from my own experience.
The most often cited problem is bad performance, with the usual cause being ZFS needing to do an increasing amount of searching through ZFS metaslab space maps to find free space. If not all of these are in memory, a write may require pulling some or all of them into memory, searching through them, and perhaps finding not enough space. People cite various fullness thresholds for this starting to happen, eg anywhere from 70% full to 90% full. I haven't seen any discussion about how severe this performance impact is supposed to be (and on what sort of vdevs; raidz vdevs may behave differently than mirror vdevs here).
(How many metaslabs you have turns out to depend on how your pool was created and grown.)
A nearly full pool can also have (and lead to) fragmentation, where the free space is in small scattered chunks instead of large contiguous runs. This can lead to ZFS having to write 'gang blocks', which are a mechanism where ZFS fragments one large logical block into smaller chunks (see eg the mention of them in this entry and this discussion which corrects some bits). Gang blocks are apparently less efficient than regular writes, especially if there's a churn of creation and deletion of them, and they add extra space overhead (which can thus eat your remaining space faster than expected).
If a pool gets sufficiently full, you stop being able to change most filesystem properties; for example, to set or modify the mountpoint or change NFS exporting. In theory it's not supposed to be possible for user writes to fill up a pool that far. In practice all of our full pools here have resulted in being unable to make such property changes (which can be a real problem under some circumstances).
You are supposed to be able to remove files from a full pool (possibly barring snapshots), but we've also had reports from users that they couldn't do so and their deletion attempt failed with 'No space left on device' errors. I have not been able to reproduce this and the problem has always gone away on its own.
(This may be due to a known and recently fixed issue, Illumos bug #4950.)
I've never read reports of catastrophic NFS performance problems for all pools or total system lockup resulting from a full pool on an NFS fileserver. However both of these have happened to us. The terrible performance issue only happened on our old Solaris 10 update 8 fileservers; the total NFS stalls and then system lockups have now happened on both our old fileservers and our new OmniOS based fileservers.
By the way: if you know of other issues with full or nearly full ZFS pools (or if you have additional information here in general), I'd love to know more. Please feel free to leave a comment or otherwise get in touch.
The difference in available pool space between 'zfs list' and 'zpool list'
For a while I've noticed that 'zpool list' would report that our pools had more available space than 'zfs list' did, and I've vaguely wondered why. We recently had a very serious issue due to a pool filling up, so suddenly I became very interested in the whole issue and did some digging. It turns out that there are two sources of the difference, depending on how your vdevs are set up.
For raidz vdevs, the simple version is that 'zpool list' reports more or less the raw disk space before the raidz overhead, while 'zfs list' applies the standard estimate that you expect (ie that N disks worth of space will vanish for a raidz level of N). Given that raidz overhead is variable in ZFS, it's easy to see why the two commands are behaving this way.
In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy on write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers this appears to be roughly 1/64th of the pool and definitely not 1/32nd of it, and I don't know why we're seeing this difference.
(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)
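The arithmetic itself is simple. Here's a hypothetical sketch of the reservation calculation (my own illustration of the 1/32nd figure, not the actual Illumos function):

```python
def slop_space(pool_size, shift=5):
    # A 1/32nd reservation is pool_size >> 5; the roughly 1/64th
    # figure I actually observe would correspond to shift=6.
    return pool_size >> shift

GiB = 2 ** 30
slop_space(64 * GiB)           # 2 GiB reserved at 1/32nd
slop_space(64 * GiB, shift=6)  # 1 GiB at the observed 1/64th
```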
The actual situation with what operations can (or should) use what space is complicated. Roughly speaking, user level writes and ZFS operations like 'zfs create' and 'zfs snapshot' that make things should respect the full 1/32nd reserved space figure; file removes and 'neutral' ZFS operations should be allowed to use half of the slop space (running the pool down to 1/64th of its size); and some operations (like 'zfs destroy') have no limit whatever and can theoretically run your pool permanently and unrecoverably out of space.
The final authority is the Illumos kernel code and its comments. These days it's on Github so I can just link to the two most relevant bits: the discussion in spa_misc.c and the discussion in dsl_synctask.h.
(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)
(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)
In Go I've given up and I'm now using standard packages
In my Go programming, I've come around to an attitude that I'll summarize as 'there's no point in fighting city hall'. What this means is that I'm now consciously using standard packages that I don't particularly like just because they are the standard packages.
I'm on record as disliking the standard flag package, for example, and while I still believe in my reasons for this I've decided that it's simply not worth going out of my way over it. The flag package works and it's there. Similarly, I don't think that the log package is necessarily a great solution for emitting messages from Unix style command line utilities, but in my latest Go program I used it anyways. It was there and it wasn't worth the effort to code up my own functions and so on. Using log is standard Go practice, so it's going to be both familiar to and expected by anyone who might look at my code someday. There's a definite social benefit to doing things the standard way for anything that I put out in public, much like most everyone uses gofmt on their code.
In theory I could find and use some alternate getopt package (these days the go to place to find one would be godoc.org). In practice I find using external packages too much of a hassle unless I really need them. This is an odd thing to say about Go, considering that it makes them so easy and accessible, but depending on external packages comes with a whole set of hassles and concerns right now. I've seen a bit too much breakage to want that headache without a good reason.
(This may not be a rational view for Go programming, given that Go deliberately makes using people's packages so easy. Perhaps I should throw myself into using lots of packages just to get acclimatized to it. And in practice I suspect most packages don't break or vanish.)
PS: note that this is different from the people who say you should eg stick with the standard testing package for your testing because you don't really need anything more than what it provides, and stick with the standard library's HTTP stuff rather than getting a framework. As mentioned, I still think that flag is not the right answer; it's just not wrong enough to be worth fighting city hall over.
Sidebar: Doing standard Unix error and warning messages with log
Here's what I do:
log.SetPrefix("<progname>: ")
log.SetFlags(0)
If I was doing this better I would derive the program name from
os.Args instead of hard-coding it, but if I did that I'd have to
worry about various special cases and no, I'm being lazy here.
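Deriving the prefix from os.Args would look something like this (a sketch of that less-lazy approach; filepath.Base handles the leading path components but not the other special cases I'd worry about):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// progPrefix derives a "progname: " log prefix from an argv[0] value,
// stripping any leading directory path.
func progPrefix(argv0 string) string {
	return filepath.Base(argv0) + ": "
}

func main() {
	log.SetPrefix(progPrefix(os.Args[0]))
	log.SetFlags(0)
	log.Println("this is a standard Unix style message")
}
```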
The clarity drawback of allowing comparison functions for sorting
I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.
DWiki's code is sufficiently old that it uses only comparison functions for sorting, simply because, well, that's what I had (or at least that's what I was used to). As a result, in two widely scattered spots in different functions its code base contains the following:
def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....
Apart from the field name, did you see the difference there? I didn't today while I was doing some modernization in DWiki's codebase and converted both of these to the '.sort(key=lambda x: x.FIELD)' form. The difference is that the first is a reverse sort, not a forward sort, because it flips x and y in the cmp() call.
(This code predates .sort() having a reverse= argument, or at least my general awareness and use of it.)
And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:
dl.sort(key=lambda x: x.timestamp, reverse=True)
coms.sort(key=lambda x: x.time)
Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.
I now feel that in practice you want to avoid using a comparison
function as much as possible even if one exists for exactly this
reason. Try very hard to directly say what you mean instead of
hiding it inside your
cmp function unless there's no way out.
A direct corollary of this is that sorting interfaces should
try to let you directly express as much as possible instead of
forcing you to resort to tricks.
(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)
PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.
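For what it's worth, Python 3's escape hatch is functools.cmp_to_key, and a sketch of what the old reverse sort becomes illustrates the 'a lot more code' point (using plain numbers here instead of DWiki's objects):

```python
from functools import cmp_to_key

def cmp(a, b):
    # Python 3 also dropped the cmp() builtin; this is the
    # conventional replacement.
    return (a > b) - (a < b)

times = [3, 1, 2]
# The old dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp)) style:
times.sort(key=cmp_to_key(lambda x, y: cmp(y, x)))
# versus directly saying what you mean:
# times.sort(reverse=True)
```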
Exim's (log) identifiers are basically unique on a given machine
Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?
The answer turns out to be no. Under normal circumstances Exim's message IDs will be permanently unique on a single machine, although you can't count on global uniqueness across multiple machines (though the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.
(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)
This means that aggregation of multi-line logs can be done with
simple brute force approaches that rely on ID uniqueness. Heck, to
group all the log lines for a given message together you can just
sort on the ID field, assuming you do a stable sort so that things
stay in timestamp order when the IDs match.
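A sketch of that brute-force grouping in Python (the lines here are made-up examples, and treating the third whitespace-separated field as the ID is my assumption about a typical main log line, not something Exim guarantees):

```python
# Group Exim log lines by message ID with a stable sort.
lines = [
    "2014-10-21 10:00:01 1XgWdJ-00020d-7g <= from@example.com ...",
    "2014-10-21 10:00:02 1XgWdK-00020e-8h <= other@example.com ...",
    "2014-10-21 10:00:03 1XgWdJ-00020d-7g => to@example.com ...",
]
# Python's sort is stable, so lines sharing an ID keep their
# original (timestamp) order relative to each other.
lines.sort(key=lambda line: line.split()[2])
```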
(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here ensures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)
PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.