Wandering Thoughts

2014-10-31

A drawback to handling errors via exceptions

Recently I discovered an interesting and long-standing bug in DWiki. DWiki is essentially a mature program, so this one was uncovered through the common mechanism of someone feeding it invalid input, in this case a specific sort of invalid URL. DWiki creates time-based views of this blog through synthetic parts of URLs that end in things like '.../2014/10/' for entries from October 2014. Someone came along and requested a URL that looked like '.../2014/99/', and DWiki promptly hit an uncaught Python exception (well, technically it was caught and logged by my general error handling code).

(A mature program usually doesn't have bugs handling valid input, even uncommon valid input. But the many forms of invalid input are often much less well tested.)

To be specific, it promptly coughed up:

calendar.IllegalMonthError: bad month number 99; must be 1-12

Down in the depths of the code that handled the per-month view, I was calling calendar.monthrange() to determine how many days a given month has; it was throwing an exception because '99' is of course not a valid month of the year. The exception escaped because my code did nothing to either catch it or keep invalid months from getting that far.

The standard advantage of handling errors via exceptions definitely applied here. Even though I had totally overlooked this error possibility, the error did not get quietly ignored and go on to corrupt further program state; instead I got smacked over the nose with the existence of this bug so I could find it and fix it. But it also exposes a drawback of handling errors with exceptions, which is that it makes it easier to overlook the possibility of errors because that possibility isn't explicit.

The calendar module doesn't document what exceptions it raises, either in general or in the documentation for monthrange() specifically (where it would be easy to spot while reading about the function). Because an exception is effectively an implicit extra return 'value' from a function, it's easy to overlook the possibility that you'll actually get one; in Python, there's nothing there to rub your nose in it and make you think about it. And so I never even thought about what happened if monthrange() was handed invalid input, partly because of the usual silent assumption that the code would only be called with valid input; after all, DWiki doesn't generate date range URLs with bad months in them.
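
To make this concrete, here's a minimal sketch of the sort of guard that was missing (an illustration, not the actual DWiki fix; days_in_month() is a made-up helper):

import calendar

def days_in_month(year, month):
    # calendar.IllegalMonthError is a subclass of ValueError, and wildly
    # out of range years also raise ValueError, so one except covers both.
    try:
        return calendar.monthrange(year, month)[1]
    except ValueError:
        return None

A caller that gets None back can then generate a proper 'no such page' response instead of a stack trace.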

Explicit error returns may require a bunch of inconvenient work to handle them individually, instead of letting you aggregate your error handling together the way exceptions do, but the mere presence of an explicit error return in a method's or function's signature serves as a reminder that yes, the function can fail, so you need to handle that. Exceptions for errors are more convenient and safer, at least for casual programming, but they do mean you need to ask yourself what-if questions on a regular basis (here, 'what if the month is out of range?').

(It turns out I've run into this general issue before, although that time the documentation had a prominent notice that I just ignored. The general issue of error handling with exceptions versus explicit returns is on my mind these days because I've been doing a bunch of coding in Go, which has explicit error returns.)

python/ExceptionsOverlookProblem written at 01:00:38

2014-10-29

Quick notes on the Linux iptables 'ipset' extension

For a long time, Linux's iptables firewall had an annoying gap: it had no way to do efficient matching against a set of IP addresses. If you had a lot of IP addresses to match things against (for example, if you were firewalling hundreds or thousands of IP addresses and IP address ranges off from your SMTP port), you needed one iptables rule for each entry and all of them were checked sequentially. This didn't make your life happy, to put it one way. On modern Linuxes, ipsets are finally the answer to this; they give you support for efficient sets of various things, including random CIDR netblocks.

(This entry suggests that ipsets only appeared in mainline Linux kernels as of 2.6.39. Ubuntu 12.04, 14.04, Fedora 20, and RHEL/CentOS 7 all have them while RHEL 5 appears to be too old.)

To work with ipsets, the first thing you need is the user level tool for creating and manipulating them. For no particularly sensible reason your Linux distribution probably doesn't install this when you install the standard iptables stuff; instead you'll need to install an additional package, usually called ipset. Iptables itself contains the code to use ipsets, but without ipset to create the sets you can't actually install any rules that use them.

(I wish I was kidding about this but I'm not.)

The basic use of ipsets is to make a set, populate it, and match against it. Let's take an example:

ipset create smtpblocks hash:net counters
ipset add smtpblocks 27.112.32.0/19
ipset add smtpblocks 204.8.87.0/24
iptables -A INPUT -p tcp --dport 25 -m set --match-set smtpblocks src -j DROP

(Both entries are currently on the Spamhaus EDROP list.)

Note that the set must exist before you can add iptables rules that refer to it. The ipset manpage has a long discussion of the various types of sets that you can use and the iptables-extensions manpage has a discussion of --match-set and the SET target for adding entries to sets from iptables rules. The hash:net I'm using here holds random CIDR netblocks (including /32s, ie single hosts) and is set to have counters.

It would be nice if there was a simple command to get just a listing of the members of an ipset. Unfortunately there isn't; plain 'ipset list' insists on outputting a few lines of summary information before it lists the members. Since I don't know if those lines are constant, I'm using 'ipset list -o save | grep "^add "', which is ugly but seems likely to keep working indefinitely.

Unfortunately I don't think there's an officially supported and documented ipset command for adding multiple entries to a set in a single command invocation; instead you're apparently expected to run 'ipset add ...' repeatedly. You can abuse the 'ipset restore' command for this if you want to, by creating appropriately formatted input; check the output of 'ipset save' to see what it needs to look like. This may even be considered a stable interface by the ipset authors.
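
For illustration, here's roughly what feeding 'ipset restore' looks like with a here-document (a sketch based on the save format; if the set already exists, ipset's -exist flag keeps the create line from being treated as an error):

ipset restore <<EOF
create smtpblocks hash:net counters
add smtpblocks 27.112.32.0/19
add smtpblocks 204.8.87.0/24
EOF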

Ipset syntax and usage appears to have changed over time, so old discussions of it that you find online may not work quite as written (and someday these notes may be out of date that way as well).

PS: I can sort of see a lot of clever uses for ipsets, but I've only started exploring them right now and my iptables usage is fairly basic in general. I encourage you to read the ipset manpage and go wild.

Sidebar: how I think you're supposed to use list sets

As an illustrated example:

ipset create spamhaus-drop hash:net counters
ipset create spamhaus-edrop hash:net counters
[... populate both from spamhaus ...]

ipset create spamhaus list:set
ipset add spamhaus spamhaus-drop
ipset add spamhaus spamhaus-edrop

iptables -A INPUT -p tcp --dport 25 -m set --match-set spamhaus src -j DROP

This way your iptables rules can be indifferent about exactly what goes into the 'spamhaus' ipset, although of course this will be slightly less efficient than checking a single merged set.

linux/IptablesIpsetNotes written at 23:30:13

Unnoticed nonportability in Bourne shell code (and elsewhere)

In response to my entry on how Bashisms in #!/bin/sh scripts aren't necessarily bugs, FiL wrote:

If you gonna use bashism in your script why don't you make it clear in the header specifying #!/bin/bash instead [of] #!/bin/sh? [...]

One of the historically hard problems for Unix portability is people writing non-portable code without realizing it, and Bourne shell code is no exception. This is true even for well-intentioned people writing code that they want to be portable.

One problem, perhaps the root problem, is that very little you do on Unix will come with explicit (non-)portability warnings and you almost never have to go out of your way to use non-portable features. This makes it very hard to know whether or not you're actually writing portable code without trying to run it on multiple environments. The other problem is that it's often both hard to remember and hard to discover what is non-portable versus what is portable. Bourne shell programming is an especially good example of both issues (partly because Bourne shell scripts often use a lot of external commands), but there have been plenty of others in Unix's past (including 'all the world's a VAX' and all sorts of 64-bit portability issues in C code).

So one answer to FiL's question is that a lot of people are using Bashisms in their scripts without realizing it, just as a lot of people have historically written non-portable Unix C code without intending to. They think they're writing portable Bourne shell scripts, but because their /bin/sh is Bash and nothing in Bash warns about such things, the issues sail right by. Then one day you wind up changing /bin/sh to be Dash and all sorts of bits of the world explode, sometimes in really obscure ways.

All of this sounds abstract, so let me give you two examples of accidental Bashisms I've committed. The first, probably quite common one is using '==' instead of '=' in '[ ... ]' conditions. Many other languages use == as their string equality check, so at some point I slipped and started using it in 'Bourne' shell scripts. Nothing complained, everything worked, and I thought my shell scripts were fine.

The second I just discovered today. Bourne shell pattern matching allows character classes, using the usual '[...]' notation, and it even has negated character classes. This means that you can write something like the following to see if an argument has any non-number characters in it:

case "$arg" in
   *[^0-9]*) echo contains non-number; exit 1;;
esac

Actually I lied in that code. Official POSIX Bourne shell doesn't negate character classes with the usual '^' character that Unix regular expressions use; instead it uses '!'. But Bash accepts '^' as well. So I wrote code that used '^', tested it, had it work, and again didn't realize that I was non-portable.
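
The portable version of the check uses '!':

case "$arg" in
   *[!0-9]*) echo contains non-number; exit 1;;
esac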

(Since having a '^' in your character class is not an error in a POSIX Bourne shell, the failure mode for this one is not a straightforward error.)

This is also a good example of how hard it is to test for non-portability, because even when you use 'set -o posix' Bash still accepts and matches this character class in its way (with '^' interpreted as class negation). The only way to test or find this non-portability is to run the script under a different shell entirely. In fact, the more theoretically POSIX compatible shells you test on the better.
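
One low-tech way to do that is to loop over whatever other shells you happen to have around (a sketch; dash and mksh are assumptions about what's installed, and ./myscript is a hypothetical script name):

for sh in dash mksh; do
    echo "== $sh"
    "$sh" ./myscript
done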

(In theory you could try to have a perfect memory for what is POSIX compliant and not need any testing at all, or cross-check absolutely everything against POSIX and never make a mistake. In practice humans can't do that any more than they can write or check perfect code all the time.)

unix/UnnoticedNonportability written at 00:43:47

2014-10-28

My current somewhat tangled feelings on operator.attrgetter

In a comment on my recent entry on sort comparison functions, Peter Donis asked a good question:

Is there a reason you're not using operator.attrgetter for the key functions? It's faster than a lambda.

One answer is that until now I hadn't heard of operator.attrgetter. Now that I have it's something I'll probably consider in the future.

But another answer is embedded in the reason Peter Donis gave for using it. Using operator.attrgetter is clearly a speed optimization, but speed isn't always the important thing. Sometimes, even often, the most important thing to optimize is clarity. Right now, for me attrgetter is less clear than the lambda approach because I've just learned about it; switching to it would probably be a premature optimization for speed at the cost of clarity.

In general, well, 'attrgetter' is a clear enough thing that I suspect I'll never be confused about what 'lst.sort(key=operator.attrgetter("field"))' does, even if I forget about it and then reread some code that uses it; it's just pretty obvious from context and the name itself. There's a visceral bit of me that doesn't like it as much as the lambda approach because I don't think it reads as well, though. It's also more black magic than lambda, since lambda is a general language construct and attrgetter is a magic module function.
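
To illustrate the two forms side by side (a toy sketch; the Com class here is made up):

import operator

class Com(object):
    def __init__(self, time):
        self.time = time

coms = [Com(30), Com(10), Com(20)]
coms.sort(key=operator.attrgetter("time"))  # the attrgetter version
coms.sort(key=lambda x: x.time)             # the lambda version; same order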

(And as a petty thing it has less natural white space. I like white space since it makes things more readable.)

On the whole this doesn't leave me inclined to switch to using attrgetter for anything except performance sensitive code (which these sort()s aren't so far). Maybe this is the wrong decision, and if the Python community as a whole adopts attrgetter as the standard and usual way to do .sort() key access it certainly will become a wrong decision. At that point I hope I'll notice and switch myself.

(This is in a sense an uncomfortable legacy of CPython's historical performance issues with Python code. Attrgetter is clearly a performance hack in general; if lambda were just as fast, I'd argue that you should clearly use lambda because it's a general language feature instead of a narrowly specialized one.)

python/AttrgetterVsLamba written at 00:11:20

2014-10-27

Practical security and automatic updates

One of the most important contributors to practical, real world security is automatically applied updates. This is because most people will not take action to apply security fixes; in fact, most people probably won't do so even if they're asked directly and just required to click 'yes, go ahead'. The more work people have to go through to apply security fixes, the fewer people will do so. Ergo you maximize the number of applied security fixes when people are required to take no action at all.

(Please note that sysadmins and developers are highly atypical users.)

But this relies on users being willing to automatically apply updates, and that in turn requires that updates must be harmless. The ideal update either changes nothing besides fixing security issues and other bugs or improves the user's life. Updates that complicate the user's life at the same time that they deliver security fixes, like Firefox updates, are relatively bad. Updates that actually harm the user's system are terrible.

Every update that does harm to someone's system is another impetus for people to disable automatic updates. It doesn't matter that most updates are harmless and it doesn't matter that most people aren't affected by even the harmful updates, because bad news is much more powerful than good news. We hear loudly about every update that has problems; we very rarely hear about updates that prevented problems, partly because it's hard to notice when it happens.

(The other really important thing to understand is that mythology is extremely powerful and extremely hard to dislodge. Once mythology has set in that leaving automatic updates on is a good way to get screwed, you have basically lost; you can expect to spend huge amounts of time and effort persuading people otherwise.)

If accidentally harmful updates are bad, actively malicious updates are worse. An automatic update system that allows malicious updates (whether the maliciousness is the removal of features or something worse) is one that destroys trust in it and therefore destroys practical security. As a result, malicious updates demand an extremely strong and immediate response. Sadly they often don't receive one, and especially when the 'update' removes features it's often even defended as a perfectly okay thing. It's not.

PS: corollaries for, say, Firefox and Chrome updates are left as an exercise to the reader. Bear in mind that for many people their web browser is one of the most crucial parts of their computer.

(This issue is why people are so angry about FTDI's malicious driver appearing in Windows Update (and FTDI has not retracted their actions; they promise future driver updates that are almost as malicious as this one). It's also part of why I get so angry when Unix vendors fumble updates.)

tech/UpdatesAndSecurity written at 01:41:09

2014-10-26

Things that can happen when (and as) your ZFS pool fills up

There's a shortage of authoritative information on what actually happens when you fill up a ZFS pool, so here is what I've gathered about it from other people's information and from our own experience.

The most often cited problem is bad performance, with the usual cause being ZFS needing to do an increasing amount of searching through ZFS metaslab space maps to find free space. If not all of these are in memory, a write may require pulling some or all of them into memory, searching through them, and perhaps still not finding enough space. People cite various fullness thresholds for when this starts to happen, eg anywhere from 70% full to 90% full. I haven't seen any discussion of how severe this performance impact is supposed to be (or of what sort of vdevs it depends on; raidz vdevs may behave differently from mirror vdevs here).

(How many metaslabs you have turns out to depend on how your pool was created and grown.)

A nearly full pool can also have (and lead to) fragmentation, where the free space is in small scattered chunks instead of large contiguous runs. This can lead to ZFS having to write 'gang blocks', which are a mechanism where ZFS fragments one large logical block into smaller chunks (see eg the mention of them in this entry and this discussion which corrects some bits). Gang blocks are apparently less efficient than regular writes, especially if there's a churn of creation and deletion of them, and they add extra space overhead (which can thus eat your remaining space faster than expected).

If a pool gets sufficiently full, you stop being able to change most filesystem properties; for example, to set or modify the mountpoint or change NFS exporting. In theory it's not supposed to be possible for user writes to fill up a pool that far. In practice all of our full pools here have resulted in being unable to make such property changes (which can be a real problem under some circumstances).

You are supposed to be able to remove files from a full pool (possibly barring snapshots), but we've also had reports from users that they couldn't do so and their deletion attempt failed with 'No space left on device' errors. I have not been able to reproduce this and the problem has always gone away on its own.

(This may be due to a known and recently fixed issue, Illumos bug #4950.)

I've never read reports of catastrophic NFS performance problems for all pools or of total system lockups resulting from a full pool on an NFS fileserver. However, both of these have happened to us. The terrible performance issue only happened on our old Solaris 10 update 8 fileservers; the total NFS stalls and then system lockups have now happened on both our old fileservers and our new OmniOS-based fileservers.

(Actually let me correct that; I've seen one report of a full pool killing a modern system. In general, see all of the replies to my tweeted question.)

By the way: if you know of other issues with full or nearly full ZFS pools (or if you have additional information here in general), I'd love to know more. Please feel free to leave a comment or otherwise get in touch.

solaris/ZFSFullPoolProblems written at 01:35:39

2014-10-25

The difference in available pool space between zfs list and zpool list

For a while I've noticed that 'zpool list' would report that our pools had more available space than 'zfs list' did and I've vaguely wondered about why. We recently had a very serious issue due to a pool filling up, so suddenly I became very interested in the whole issue and did some digging. It turns out that there are two sources of the difference depending on how your vdevs are set up.

For raidz vdevs, the simple version is that 'zpool list' reports more or less the raw disk space before the raidz overhead, while 'zfs list' applies the standard estimate you'd expect (ie that N disks' worth of space will vanish for a raidz level of N). Given that raidz overhead is variable in ZFS, it's easy to see why the two commands behave this way.

In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy on write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers this appears to be roughly 1/64th of the pool and definitely not 1/32nd of it, and I don't know why we're seeing this difference.
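
To put rough numbers on this (back of the envelope arithmetic for a hypothetical pool, not measurements from our fileservers):

# How much space each possible reservation eats on a 10 TB pool.
pool = 10 * 1000**4
print(pool // 32)   # ~312 GB, the 1/32nd figure from the code
print(pool // 64)   # ~156 GB, the 1/64th figure I actually seem to see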

(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)

The actual situation with what operations can (or should) use what space is complicated. Roughly speaking, user level writes and ZFS operations like 'zfs create' and 'zfs snapshot' that make things should use the 1/32nd reserved space figure, file removes and 'neutral' ZFS operations should be allowed to use half of the slop space (running the pool down to 1/64th of its size), and some operations (like 'zfs destroy') have no limit whatever and can theoretically run your pool permanently and unrecoverably out of space.

The final authority is the Illumos kernel code and its comments. These days it's on Github so I can just link to the two most relevant bits: spa_misc.c's discussion of spa_slop_shift and dsl_synctask.h's discussion of zfs_space_check_t.

(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level, so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)

(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)

solaris/ZFSSpaceReportDifference written at 02:05:09

2014-10-24

In Go I've given up and I'm now using standard packages

In my Go programming, I've come around to an attitude that I'll summarize as 'there's no point in fighting city hall'. What this means is that I'm now consciously using standard packages that I don't particularly like just because they are the standard packages.

I'm on record as disliking the standard flag package, for example, and while I still believe in my reasons for this I've decided that it's simply not worth going out of my way over it. The flag package works and it's there. Similarly, I don't think that the log package is necessarily a great solution for emitting messages from Unix style command line utilities but in my latest Go program I used it anyways. It was there and it wasn't worth the effort to code warn() and die() functions and so on.

Besides, using flag and log is standard Go practice so it's going to be both familiar to and expected by anyone who might look at my code someday. There's a definite social benefit to doing things the standard way for anything that I put out in public, much like most everyone uses gofmt on their code.

In theory I could find and use some alternate getopt package (these days the go-to place to find one would be godoc.org). In practice I find using external packages too much of a hassle unless I really need them. This is an odd thing to say about Go, considering that it makes them so easy and accessible, but depending on external packages comes with a whole set of hassles and concerns right now. I've seen a bit too much breakage to want that headache without a good reason.

(This may not be a rational view for Go programming, given that Go deliberately makes using people's packages so easy. Perhaps I should throw myself into using lots of packages just to get acclimatized to it. And in practice I suspect most packages don't break or vanish.)

PS: note that this is different from the people who say that you should, eg, use the testing package for your testing because you don't really need anything more than what it provides, or that you should stick with the standard library's HTTP stuff rather than getting a framework. As mentioned, I still think that flag is not the right answer; it's just not wrong enough to be worth fighting city hall over.

Sidebar: Doing standard Unix error and warning messages with log

Here's what I do:

log.SetPrefix("<progname>: ")
log.SetFlags(0)

If I was doing this better I would derive the program name from os.Args[0] instead of hard-coding it, but if I did that I'd have to worry about various special cases and no, I'm being lazy here.
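
For the record, here's a sketch of what the non-lazy version might look like (it ignores special cases such as an empty os.Args[0]):

package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// filepath.Base strips any directory part from the program name.
	log.SetPrefix(filepath.Base(os.Args[0]) + ": ")
	log.SetFlags(0)
	log.Println("this comes out as a standard Unix style message")
}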

programming/GoUsingStandardPackages written at 01:15:57

2014-10-23

The clarity drawback of allowing comparison functions for sorting

I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.

DWiki's code is sufficiently old that it uses only .sort() cmp functions simply because, well, that's what I had (or at least that's what I was used to). As a result, in two widely scattered spots in different functions its code base contains the following lines:

def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....

Apart from the field name, did you see the difference there? I didn't today when I was doing some modernization in DWiki's codebase and converted both of these to the '.sort(key=lambda x: x.FIELD)' form. The difference is that the first is a reverse sort, not a forward sort, because it flips x and y in the cmp().

(This code predates .sort() having a reverse= argument or at least my general awareness and use of it.)

And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:

   dl.sort(key=lambda x: x.timestamp, reverse=True)
   coms.sort(key=lambda x: x.time)

Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.

I now feel that in practice you want to avoid using a comparison function as much as possible, even when one is available, for exactly this reason. Try very hard to directly say what you mean instead of hiding it inside your cmp function unless there's no way around it. A direct corollary of this is that sorting interfaces should try to let you directly express as much as possible instead of forcing you to resort to tricks.

(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)

PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.

python/SortCmpFunctionClarityIssue written at 00:14:32

2014-10-22

Exim's (log) identifiers are basically unique on a given machine

Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?

The answer turns out to be no. Under normal circumstances Exim's message IDs here will be permanently unique on a single machine, although you can't count on global uniqueness across multiple machines (although the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.

(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)
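
As an illustration of what's inside one of these IDs, here's a quick Python sketch (it assumes the normal base-62 encoding with the digit order 0-9, A-Z, a-z; as the Exim documentation notes, some platforms use base 36 instead):

import string
import time

B62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def unb62(s):
    # Plain base-62 digit string to integer conversion.
    n = 0
    for ch in s:
        n = n * 62 + B62.index(ch)
    return n

tpart, pidpart, frac = "1XgWdJ-00020d-7g".split("-")
print(time.ctime(unb62(tpart)))   # a timestamp in late October 2014
print(unb62(pidpart))             # the PID of the receiving process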

This means that aggregation of multi-line logs can be done with simple brute force approaches that rely on ID uniqueness. Heck, to group all the log lines for a given message together you can just sort on the ID field, assuming you do a stable sort so that things stay in timestamp order when the IDs match.
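
For example, with GNU sort (a sketch; it assumes a Debian-style log location and the stock main log format where the ID is the third whitespace-separated field, and lines without an ID will clump together oddly):

sort -s -k3,3 /var/log/exim4/mainlog | less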

(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here insures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)

PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.

sysadmin/EximLogIdUniqueness written at 00:20:33
