Multiple set matches with Linux's iptables 'ipset' extension
Recently, Luke Atchew approached me on Twitter to ask if I had any ideas for solving an ipset challenge. Suppose that you have a set of origin hosts, a set of destination servers, and you want to allow traffic from the origin hosts to the destination servers while blocking all other traffic. As Luke noted, most ipset examples, including mine, are about blocking access to or from a set of hosts; they don't have a matrix setup like Luke's.
One obvious option is a complex multi-rule setup, where you have a
series of rules that try to block and accept more and more of the
traffic that you want; for example, you can start out by blocking
all traffic to '! --match-set Destinations dst'. This gets complex
if you have other rules involved, though. Another option is one of
ipset's more complicated set types, such as one that matches
source and destination pairs (eg the hash:net,net type); but
then you get into hassles populating and maintaining the set (since
it's basically the cross product of your allowed source and destination
addresses).
As Luke Atchew discovered while working on this, all of this complexity is unnecessary because you can match against multiple ipset sets in the same iptables rule. It is perfectly legitimate to write rules like:
iptables -A FORWARD -m set --match-set Locals src -m set --match-set Remotes dst -j ACCEPT
iptables -A FORWARD -m set ! --match-set Locals src -m set --match-set Remotes dst -j DROP
(These rules use different keys, here the source and destination
IPs, but it's probably legitimate to have several matches
that reuse the same key. You might need a source IP to be in an
'allowed' set and not in a 'temporarily blocked' set, for example.)
An important note here is that you absolutely have to use two
'-m set' arguments, one before each --match-set. If you leave the
second one out you will get the somewhat obscure error message
'iptables v1.4.21: --match-set can be specified only once', which
may mislead you into believing that this isn't supported at all.
This is a really easy mistake to make because it sure smells like
the second -m set is surplus; after all, you already told iptables
that you wanted to use the ipset extension here.
(I assume that the second '-m set' causes the parser to start
setting up a new internal ipset matching operation instead of
accumulating more options for the first one.)
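As a concrete sketch of the whole setup, the sets that rules like these refer to could be created and populated with the ipset command. The set names, addresses, and the hash:ip set type here are illustrative, not from Luke's actual setup:

```shell
# Create the two sets that the iptables rules refer to; hash:ip is
# one common set type (names and addresses are examples).
ipset create Locals hash:ip
ipset create Remotes hash:ip

# Populate them with the hosts involved.
ipset add Locals 192.168.10.5
ipset add Locals 192.168.10.6
ipset add Remotes 10.20.0.15
```

This is firewall configuration, so it needs root and kernel ipset support to actually run.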
I've finally turned SELinux fully off even on my laptop
As I've mentioned before, I started out with
SELinux turned on on my laptop because it's essentially a stock
Fedora install and that's how Fedora defaults, and using SELinux
felt virtuous. Last year I reached the end
of my patience with running SELinux in
enforcing mode, where it
actually denies access to things; instead I switched it to
permissive mode, where it just whines about things that it would have forbidden and
then a whole complicated pile of software springs into action to
tell you about these audit failures with notifications, popup dialogs
and so on.
Today I gave up on that. My laptop
now has SELinux disabled entirely (as my desktop machines have for
years). The cause is simple: too many SELinux violations kept
happening and especially too many new and different ones kept coming
up. I am only willing to play whack-a-mole on notification alerts for
so long before I stop caring entirely, and I reached that point today.
The simplest and most easily reversed way to stop getting notifications
about SELinux violations is to set the SELinux policy to 'disabled' in
/etc/selinux/config, so that's what I did.
It's possible that some of the problem is due to just upgrading to
Fedora 22 with
yum instead of, say,
fedup, and perhaps it could
be patched up somewhat with 'restorecon -R /'. Perhaps a wholesale
reinstall would reduce it even more (at the cost of putting me
through a wholesale reinstall and then figuring out how to set up
my environment and my account and keys
and wifi access and VPNs and so on all over again). Certainly I
assume that SELinux has to work for some people on Fedora. But I
no longer care. I am done with being quixotically virtuous and
suffering for it.
(I originally put a rant about Fedora and SELinux here, but after
thinking about it I took it out again. It's nothing I haven't said
before and I can't be sure that my SELinux
problems would still be there if I did absolutely everything the
officially approved Fedora way. Since I'm never going to stop eg
doing Fedora version updates with
yum, well, that case will never
apply to me.)
A Bash 'test' limitation and the brute force way around it
Suppose that you are writing a Bash script (specifically Bash) and
that you want to get a number from a command that may fail and might
output '0' (which is a bad value here).
No problem, you say, you can write this like so:
sz=$(zpool list -Hp -o size $pool)
if [ $? -ne 0 -o "$sz" -eq 0 ]; then
	echo failure
	....
fi
Because you are a smart person, you test that this does the right thing when it fails. Lo and behold:
$ ./scrpt badarg
./scrpt: line 3: [: : integer expression expected
At one level, this is a 'well of course';
-eq specifically requires
numbers on both sides, and when the command fails it does not output
a number (in fact
$sz winds up empty). At another level it is very
annoying, because what we want here is the common short-circuiting
evaluation of boolean conditions.
The reason we're not getting the behavior we want is that test
(in the built-in form in Bash) is parsing and validating its entire
set of arguments before it starts determining the boolean values
of the overall expression. This is not necessarily a bad idea (and
test has a bunch of smart argument processing), but it is inconvenient.
(Note that Bash doesn't claim that test's -a and -o
are short-circuiting operators. In fact the idea is relatively
meaningless in the context of
test, since there's relatively
little to short-circuit. A quick test suggests that at least some
versions of Bash check every condition, eg
stat() files, even
when they could skip some.)
My brute force way around this was:
if [ $? -ne 0 -o -z "$sz" ] || [ "$sz" -eq 0 ]; then
	....
fi
Since '[' is sort of just another program, it's perfectly
valid to chain '[' invocations together with the shell's actual
short-circuiting logical operators. This way the second '[' doesn't
even get run if $sz looks bad, so it can't complain about 'integer
expression expected'.
(This may not be the right way to do it. I just felt like using brute force at the time.)
PS: Given that Bash emits this error message whether you like it
or not, it would be nice if it had a
test operator for 'this
thing is actually a number'. My current check here is a bit of
a hack, as it assumes
zpool emits either a number or nothing.
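Lacking such an operator, one way to sketch an explicit 'is this actually a number' check is with a case pattern, which sidesteps test's complaint entirely (the is_number helper is my own invention, not anything standard):

```shell
#!/bin/bash
# A hypothetical is_number helper: succeed only if the argument
# is a non-empty string of digits.
is_number() {
	case "$1" in
		''|*[!0-9]*) return 1 ;;	# empty, or contains a non-digit
		*)           return 0 ;;
	esac
}

sz="12345"
if ! is_number "$sz" || [ "$sz" -eq 0 ]; then
	echo failure
else
	echo "size is $sz"		# prints: size is 12345
fi
```

Because is_number runs before the '[ "$sz" -eq 0 ]' test and the shell's || genuinely short-circuits, a failed or empty $sz never reaches the -eq comparison.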
(Updated: minor wording clarification due to reddit, because they're right, 'return 0' is the wrong way to phrase that; I knew what I meant but I can't expect other people to.)
Modern *BSDs have a much better init system than I was expecting
For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what
was essentially the classical BSD init system, with all of its weaknesses. They made
things a little bit simpler by having things like a configuration
file where you could set whether standard daemons were started or
not (and what arguments they got), instead of having to hand edit
/etc/rc, but that was about the extent of their niceness.
When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that
I got used to (to the extent that I had anything to do with it at all).
Well, guess what. While I wasn't looking, the *BSDs have introduced
a much better system called rc.d. The rc.d system is basically
a lightweight version of System V init; it strips out all of the
rcN.d directories, SNN and KNN symlinks, and so on to
wind up with just shell scripts in /etc/rc.d and some additional
supporting infrastructure.
As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.
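To give a flavor of it, here is roughly what dealing with a standard daemon looks like on a modern FreeBSD (a sketch; sshd is a stock service, but I haven't verified the exact behavior of these commands on every release):

```shell
# Arrange for sshd to start at boot; sysrc is FreeBSD's safe way
# to edit /etc/rc.conf instead of hand-editing it.
sysrc sshd_enable="YES"

# The rc.d script then gives you start/stop/restart/status as verbs.
service sshd start
service sshd status
service sshd restart
```

These are system administration commands that need root on an actual FreeBSD machine.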
Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.
(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)
Why System V init's split scripts approach is better than classical BSD
Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.
The classical BSD approach to system init is that there are two
shell scripts, /etc/rc and /etc/rc.local, that are run on boot.
All daemon starting and other boot time processing is done from one or the
other. There is no special shutdown processing; to shut the machine
down you just kill all of the processes (and then make a system
call to actually reboot). This has the positive virtue that it's
really simple, but it's got some drawbacks.
This approach works fine starting the system (orderly system shutdown
was out of scope originally). It also works fine for restarting
daemons, provided that your daemons are single process things that
can easily be shut down with 'kill' and then restarted with more
or less 'daemon &'. Initially this was the case in 4.xBSD, but
as time went on and Unix vendors added complications like NFS, more
and more things departed from this simple 'start a process; kill a
process; start a process again' model of starting and restarting.
The moment people started to have more complicated startup and
shutdown needs than 'kill' and 'daemon &', we started to have
problems. Either you carefully memorized all of this stuff or you
kept having to read
/etc/rc to figure out what to do to restart
or redo thing X. Does something need a multi-step startup? You're
going to be entering those multiple steps yourself. Does something
need you to kill four or five processes to shut it down properly?
Get used to doing that, and don't forget one. All of this was a
pain even in the best cases (which was single daemon processes that
merely required the right magic command line arguments).
(In practice people not infrequently wrote their own scripts that
did all of this work, then ran the scripts from
/etc/rc.local. But there was always a temptation to skip that
step because after all your thing was so short, you could put it
directly in /etc/rc.local.)
By contrast, the System V init approach of separate scripts puts
that knowledge into reusable components. Need to stop or start or
restart something? Just run '/etc/init.d/<whatever> <what>' and
you're done. What the init.d scripts are called is small enough
knowledge that you can probably keep it in your head, and if you
forget it's usually easy enough to look it up with an
'ls /etc/init.d'.
Of course you don't need the full complexity of System V init in
order to realize these advantages. In fact, back in the long ago
days when I dealt with a classical BSD init system I decided that
the split scripts approach was such a big win that I was willing
to manually split up
/etc/rc into separate scripts just to get a
rough approximation of it. The result was definitely worth the
effort; it made my sysadmin life much easier.
(This manual split of much of
/etc/rc is the partial init system
I mentioned here.)
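The shape of one of those split scripts is simple; a minimal System V style control script might look like this (the daemon, its paths, and its PID file are all made up for illustration):

```shell
#!/bin/sh
# /etc/init.d/mydaemon -- a hypothetical minimal control script.
# All the start/stop knowledge lives here instead of in your head.

case "$1" in
	start)
		/usr/sbin/mydaemon --daemonize
		;;
	stop)
		# Perhaps a multi-step shutdown; it's captured here once.
		kill "$(cat /var/run/mydaemon.pid)"
		;;
	restart)
		"$0" stop
		sleep 1
		"$0" start
		;;
	*)
		echo "usage: $0 {start|stop|restart}" >&2
		exit 1
		;;
esac
```

The reusable-component win is exactly that 'restart' can encode the pause or multi-step dance that people would otherwise have to remember.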
Thinking about people's SSD inflection points in general
What I'm calling the (or a) SSD inflection point is the point where SSDs get big enough and cheap enough for people to switch from spinning rust to SSDs. Of course this is already happening for some people some of the time, so the real question is when it's going to happen for lots of people.
(Well, my real question is when it's going to happen for me, but that's another entry.)
I don't have any answers. I don't even have any particular guesses or opinions. What I do have is an obvious observation.
For most people and most systems, the choice of HDs versus SSDs is not about absolute performance (which clearly goes to SSDs today) or absolute space (which is still by far in the hands of HDs both in terms of price per GB and how much TB you can get in N drives). Instead, unsurprisingly, it is about getting enough space and then if possible making it go faster. People can and do make tradeoffs there based on their feelings about the relative importance of more space and more speed, including ones that make their systems more complicated (like having a small, affordable SSD for speed while offloading much of your data to a slow(er) but big HD). This makes the inflection point complicated and thus the migration from HDs to SSDs is probably going to be a drawn out affair.
We've already seen one broad inflection point happen here, in good laptops; big enough SSDs have mostly displaced HDs, even though people may not have all the space they want. I doubt many laptop users would trade back, even if they have to carefully manage disk space on their laptop SSD.
My suspicion is that the next inflection point will hit when affordable SSDs become big enough to hold all of the data a typical person puts on their computer; at that point you can give people much faster computers without them really noticing any drawbacks. But I don't have any idea how much space that is today; a TB? A couple of TB? Less than a TB for many people?
(My impression is that for many people the major space consumer on home machines is computer games. I'm probably out of touch on this.)
Sometimes looking into spam for a blog entry has unexpected benefits
Today, I was all set to write an entry about how I especially hate slimy companies that gain access to people's address books. In fact I had a particular company in mind, because it's clear that they did this to one of our users recently. As part of starting to write that entry, I decided to do some due diligence research on the company involved. What I found turned out to be rather more alarming than I expected.
There are two usual run of the mill ways to steal people's address books. The 'not even sort of theft' way is to just ask people to give you their address books so you can connect them to any of their friends on your service, and then perhaps send some invitation mails yourself. The underhanded way is to persuade people to give you access to their GMail or Yahoo or whatever email account for some innocent-sounding purpose, then take a copy of their address book while you're there.
These people went the extra mile; they made a browser extension. Of course it does a lot more than just take copies of your address book and none of what it does seems particularly pleasant (at least to me). Getting a browser extension into people's browsers is probably harder than getting their address books in the usual way, but I imagine it's much more lucrative (and much more damaging).
What this means is that our user didn't just give a company access to their address book; instead they've wound up infected by something that is more or less malware (and of course this means that their machine may also have other problems). And I wouldn't have found any of this if I hadn't decided to turn over this particular rock as part of writing a blog entry.
(It turns out this company has a Wikipedia entry. It's rather eyebrow raising in a 'this seems so whitewashed it's blinding' kind of way. Since it was so obviously white, I dipped into the edit history and the talk page and found both rather interesting, ie there was and may still be a roiling controversy that is not reflected in the page contents. I'm kind of sad to see Wikipedia (ab)used this way, but I'm not wading into that particular swamp for any reason.)
The cost of OmniOS not having /etc/cron.d
Systems without /etc/cron.d just make my sysadmin life harder and more annoying. OmniOS, I'm looking at you.
For those people who have not encountered it, this is a Linux cron
feature where you can basically put additional crontab files in
/etc/cron.d. To many people this may sound like a minor feature;
let me assure you it is not.
Here is why it is an important feature: it makes adding, modifying,
or deleting your crontab entries as trivial as copying a file.
It is very easy to copy files (or create them). You can trivially
script it, there are tons of tools to do this for you in various
ways and from various sources (from
rsync on up), and it is very
easy to scale file copies up for a fleet of machines.
Managing crontab entries without this is either painfully manual,
involves attempts to do reliable automated file editing through
interfaces not designed for it, or requires you to basically build
your own custom equivalent of it and then treat the system crontab
file as an implementation detail inside your own automation.
This is a real cost and it matters for us. With
/etc/cron.d, adding a new custom-scheduled service on some
or all of our fileservers would be trivial
and guaranteed to not perturb anything else. Especially, adding it
to all of them is no more work than adding it to one or two (and
may even be slightly less work). With current OmniOS cron, it is
dauntingly and discouragingly difficult. We have to log in to each
fileserver, run 'crontab -e' by hand, worry about an accidental
edit mistake damaging other things, and then update our fileserver
install instructions to account for the new crontab edits. Changed
your mind and need to revise just what your crontab entry is (eg
to change when it runs)? You get to do all that all over again.
The result is that we'll do a great deal to avoid having to update
OmniOS crontabs. I actually found myself thinking about how I would
invent my own job scheduling system in central shell scripts that
we already run out of cron, just because doing that seemed like
less work and less annoyance than slogging around to run 'crontab
-e' even once (and it probably wouldn't have been just once).
(Updates to the shell scripts et al are automatically distributed to our OmniOS machines, so they're 'change once centrally and we're done'.)
Note that it's important that
/etc/cron.d supports multiple files,
because that lets you separate each crontab entry (or logically
related chunk of entries) into an independently managed thing. If
it was only one single file, multiple separate things that all
wanted crontab entries would have to coordinate updates to the file.
This would get you back to all sorts of problems, like 'can I
reliably find or remove just my entries?' and 'are my entries
theoretically there?'. With
/etc/cron.d, all you need is for
people (and systems) to pick different filenames for their particular
entries. This generally happens naturally because you get to use
descriptive names for them.
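For illustration, a single /etc/cron.d fragment is just an ordinary crontab file plus a user field, so it can be dropped into place with a plain file copy. This hypothetical one (the script path and schedule are invented) is the kind of thing I mean:

```shell
# /etc/cron.d/zfs-scrub-check -- one self-contained cron fragment.
# Unlike user crontabs, /etc/cron.d entries name the user to run as.
# m  h  dom mon dow  user  command
30   4  *   *   0    root  /opt/local/sbin/zfs-scrub-check
```

This is a crontab configuration fragment, not an executable script.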
Exploring the irritating thing about Python's .join()
Let's start out with the tweets:
AttributeError: 'list' object has no attribute 'join'
@thatcks: It's quite irritating that you can't ask lists to join themselves w/ a string, you have to ask a string to join a list with itself.
Python has some warts here and there. Not necessarily big warts, but warts that make you give it a sideways look and wonder what people were thinking. One of them is how you do the common operation of turning a sequence of strings into a single string, with the individual strings separated by some common string like ','. As we see here, a lot of people expect this to be a list operation; you ask the list 'turn yourself into a string with the following separator character'. But that's not how Python does it; instead it's a string operation where you do the odd thing of asking the separator string to assemble a list around itself. This is at least odd and some people find it bizarre. Arguably the logic is completely backwards.
There are two reasons Python wound up here. The first is that back
in the old days there was no
.join() method on strings and this
was just implemented as a function in the string module, as
string.join(). This makes perfect sense as a place to put this
operation, as it's a string-making operation. But when Python did
its great method-ization of various module functions, it of course
made most of the
string module functions into methods on the
string type, so we wound up with the current <str>.join(). Since
then it's become Python orthodoxy to invoke list to string joining
as 'sep.join(lst)' instead of 'string.join(lst, sep)'.
The other reason can be illuminated by noting that if Python did
it the other way around you wouldn't have just list.join(); you'd
also have to have tuple.join() and in fact a
.join() method on
every sequence compatible type or even iterators. Anything that you
wanted to join together into a string this way would have to implement
.join(), which would be a lot of types even in the standard
library. And because of how both CPython and Python are structured,
a lot of this would involve re-implementation and duplication of
identical or nearly identical code. If you have to have .join()
as a method on something, putting it on the few separator types
means that you have far less code duplication and that any new
sequence type automatically supports doing this in the correct way.
(I'm sure that people would write iterator or sequence types that
didn't have a
.join() method if it was possible to do so, because
sooner or later people leave out every method they don't think
they're going to use.)
Given the limitations of Python, I'll reluctantly concede that the
.join() approach is the better alternative. I don't think
you can even get away with having just
string.join() and no string
.join() method (however much an irrational bit of me would like
to throw the baby out with the bathwater here). Even ignoring
people's irritation with having to do 'import string' just to get
string.join(), there would be some CPython implementation
challenges.
Sidebar: The implementation challenges
String joining is a sufficiently frequent operation that you want
it to be efficient. Doing it efficiently requires doing it in C so
that you can do tricks like pre-compute the length of the final
string, allocate all of the memory once, and then memcpy() all
of the pieces into place. However, you also have both byte strings
and Unicode strings, and each needs their own specialized C level
string joining implementation (especially as modern Unicode strings
have a complex internal storage structure).
The catch is that the string module is actually a Python-level module. So
how do you go from an in-Python
string.join() function to specific
C code for byte strings or Unicode strings, depending on what you're
joining? The best mechanism CPython has for this is actually 'a
method on the C level class that the Python code can call', at which
point you're back to strings having a
.join() method under some
name. And once you have the method under some name, you might as
well expose it to Python programmers and call it .join(), at which point
you're back to the current situation.
I may not entirely like
.join() in its current form, but I have
to admit that it's an impeccably logically assembled setup where
everything is basically the simplest and best choice I can see.
NFS writes and whether or not they're synchronous
In the original NFS v2, the situation with
writes was relatively simple. The protocol specified that the server
could only acknowledge write operations when it had committed them
to disk, both for file data writes and for metadata operations such
as creating files and directories, renaming files, and so on.
Clients were free to buffer writes locally before sending them to
the server and generally did, just as they buffered writes before
sending them to local disks. As usual, when a client program did
a sync() or an fsync(), this caused the client kernel to flush
any locally buffered writes to the server, which would then commit
them to disk and acknowledge them.
(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)
This worked and was simple (a big virtue in early NFS), but didn't really go very fast under a lot of circumstances. NFS server vendors did various things to speed writes up, from battery backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).
In NFS v3 the protocol was revised to add asynchronous writes and
a new operation,
COMMIT, to force the server to really flush your
submitted asynchronous writes to disk. A NFS v3 server is permitted
to lose submitted asynchronous writes up until you issue a successful
COMMIT operation; this implies that the client must hang on to a
copy of the written data so that it can resend it if needed. Of
course, the server can start writing your data earlier if it wants
to; it's up to the server. In addition clients can specify that
their writes are synchronous, reverting NFS v3 back to the v2 behavior.
(See RFC 1813 for the gory details. It's actually surprisingly readable.)
In the simple case the client kernel will send a single COMMIT
at the end of writing the file (for example, when your program
closes it or
fsync()s it). But if your program writes a large
enough file, the client kernel won't want to buffer all of it in
memory and so will start sending
COMMIT operations to the server
every so often so it can free up some of those write buffers. This
can cause unexpected slowdowns under some circumstances, depending on a lot of factors.
(Note that just as with other forms of writeback disk IO, the client
kernel may do these
COMMITs asynchronously from your program's
activity. Or it may opt to not try to be that clever and just force
a COMMIT pause on your program every so often. There
are arguments either way.)
If you write NFS v3 file data synchronously on the client, either
with O_SYNC or by appropriate NFS mount options, the client
will not just immediately send it to the server without local
buffering (the way it did in NFS v2), it will also insist that the
server write it to disk synchronously. This means that forced
synchronous client IO in NFS v3 causes a bigger change in performance
than in NFS v2; basically you reduce NFS v3 down to NFS v2 end to
end synchronous writes. You're not just eliminating client buffering,
you're eliminating all buffering and increasing how many IOPs the
server must do (well, compared to normal NFS v3 write IO).
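On a Linux client, this forced-synchronous behavior comes from the sync mount option (the server name and paths here are illustrative):

```shell
# Mount an NFS v3 filesystem so that all writes are synchronous,
# both on the client and on the server (illustrative names).
mount -t nfs -o vers=3,sync fileserver:/export/home /mnt/home
```

This is mount configuration; it requires root and a reachable NFS server.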
All of this is just for file data writes. NFS v3 metadata operations
are still just as synchronous as they were in NFS v2, so things
like 'rm -rf' on a big source tree are just as slow as they used to be.
(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)