Wandering Thoughts archives


Multiple set matches with Linux's iptables 'ipset' extension

Recently, Luke Atchew approached me on Twitter to ask if I had any ideas for solving an ipset challenge. Suppose that you have a set of origin hosts, a set of destination servers, and you want to allow traffic from the origin hosts to the destination servers while blocking all other traffic. As Luke noted, most ipset examples, including mine, are about blocking access to or from a set of hosts; they don't have a matrix setup like Luke's.

One obvious option is a complex multi-rule setup, where you have a series of rules that try to block and accept more and more of the traffic that you want; for example, you can start out by blocking all traffic to '! --match-set Destinations dst'. This gets complex if you have other rules involved, though. Another option is one of ipset's more complicated set types, such as hash:net,net, but then you get into hassles populating and maintaining the set (since it's basically the cross product of your allowed source and destination hosts).

As Luke Atchew discovered while working on this, all of this complexity is unnecessary because you can match against multiple ipset sets in the same iptables rule. It is perfectly legitimate to write rules like:

iptables -A FORWARD -m set --match-set Locals src -m set --match-set Remotes dst -j ACCEPT
iptables -A FORWARD -m set ! --match-set Locals src -m set --match-set Remotes dst -j DROP

(These rules use different keys, here the source and destination IPs, but it's probably legitimate to have several --match-set's that reuse the same key. You might need a source IP to be in an 'allowed' set and not in a 'temporarily blocked' set, for example.)

An important note here is that you absolutely have to use two '-m set' arguments, one before each --match-set. If you leave the second one out you will get the somewhat obscure error message 'iptables v1.4.21: --match-set can be specified only once', which may mislead you into believing that this isn't supported at all. This is a really easy mistake to make because it sure smells like the second -m set is surplus; after all, you already told iptables you wanted to use the ipset extension here.

(I assume that the second '-m set' is causing the parser to start setting up a new internal ipset matching operation instead of accumulating more options for the first one.)
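To make the two-set rules concrete, here is a minimal sketch of how the sets themselves might be populated alongside them. It only prints the ipset and iptables commands so they can be reviewed before anything is run as root. The helper names (emit_set, emit_matrix_rules), the choice of the hash:ip set type, and the documentation-range addresses are all illustrative assumptions, not Luke's actual setup.

```shell
#!/bin/sh
# Sketch only: prints commands instead of running them.
# Helper names, the hash:ip set type and the addresses are assumptions.

# emit_set NAME IP...: print commands to create a hash:ip set and fill it.
emit_set() {
    set_name=$1
    shift
    echo "ipset create $set_name hash:ip"
    for ip in "$@"; do
        echo "ipset add $set_name $ip"
    done
}

# emit_matrix_rules SRC_SET DST_SET: print the two-set FORWARD rules.
# Note the separate '-m set' in front of each --match-set.
emit_matrix_rules() {
    echo "iptables -A FORWARD -m set --match-set $1 src -m set --match-set $2 dst -j ACCEPT"
    echo "iptables -A FORWARD -m set ! --match-set $1 src -m set --match-set $2 dst -j DROP"
}

emit_set Locals 192.0.2.10 192.0.2.11
emit_set Remotes 198.51.100.5
emit_matrix_rules Locals Remotes
```

Once the output looks right, piping it to 'sh' as root would apply it. Adding or removing a host is then a one-line 'ipset add' or 'ipset del' against the relevant set, with no changes to the iptables rules.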

linux/IptablesIpsetsMultipleMatches written at 02:03:13


I've finally turned SELinux fully off even on my laptop

As I've mentioned before, I started out with SELinux turned on on my laptop because it's essentially a stock Fedora install and that's how Fedora defaults, and using SELinux felt virtuous. Last year I reached the end of my patience with running SELinux in enforcing mode, where it actually denies access to things; instead I switched it to permissive, where it just whines about things that it would have forbidden and then a whole complicated pile of software springs into action to tell you about these audit failures with notifications, popup dialogs and so on.

Today I gave up on that. My laptop now has SELinux disabled entirely (as my desktop machines have for years). The cause is simple: too many SELinux violations kept happening and especially too many new and different ones kept coming up. I am only willing to play whack a mole on notification alerts for so long before I stop caring entirely, and I reached that point today. The simplest and most easily reversed way to stop getting notifications about SELinux violations is to set SELINUX=disabled in /etc/selinux/config, so that's what I did.
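As a rough sketch of what that change amounts to: the relevant setting is the SELINUX= line in /etc/selinux/config, which takes enforcing, permissive, or disabled. The helper below (set_selinux_mode is a made-up name) rewrites that line in whatever file it is given, so it can be tried on a copy first. Switching to or from disabled only takes effect after a reboot.

```shell
#!/bin/sh
# set_selinux_mode MODE [FILE]: rewrite the SELINUX= line in FILE.
# Hypothetical helper; FILE defaults to the real config, which needs root.
set_selinux_mode() {
    mode=$1
    conf=${2:-/etc/selinux/config}
    case "$mode" in
        enforcing | permissive | disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # Only the SELINUX= line changes; SELINUXTYPE= and comments are kept.
    if sed "s/^SELINUX=.*/SELINUX=$mode/" "$conf" > "$tmp"; then
        # Overwrite in place so the file keeps its owner and permissions.
        cat "$tmp" > "$conf"
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

Reversing it is the same call with 'permissive' or 'enforcing', which is much of why this is the easiest switch to flip.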

It's possible that some of the problem is due to just upgrading to Fedora 22 with yum instead of, say, fedup, and perhaps it could be patched up somewhat with 'restorecon -R /'. Perhaps a wholesale reinstall would reduce it even more (at the cost of putting me through a wholesale reinstall and then figuring out how to set up my environment and my account and keys and wifi access and VPNs and so on all over again). Certainly I assume that SELinux has to work for some people on Fedora. But I no longer care. I am done with being quixotically virtuous and suffering for it.

(I originally put a rant about Fedora and SELinux here, but after thinking about it I took it out again. It's nothing I haven't said before and I can't be sure that my SELinux problems would still be there if I did absolutely everything the officially approved Fedora way. Since I'm never going to stop eg doing Fedora version updates with yum, well, that case will never apply to me.)

linux/SELinuxFinallyFullyOff written at 02:28:51


A Bash test limitation and the brute force way around it

Suppose that you are writing a Bash script (specifically Bash) and that you want to get a number from a command that may fail and might also output '0' (which is a bad value here). No problem, you say, you can write this like so:

sz=$(zpool list -Hp -o size "$pool")
if [ $? -ne 0 -o "$sz" -eq 0 ]; then
   echo failure
fi

Because you are a smart person, you test that this does the right thing when it fails. Lo and behold:

$ ./scrpt badarg
./scrpt: line 3: [: : integer expression expected

At one level, this is a 'well of course'; -eq specifically requires numbers on both sides, and when the command fails it does not output a number (in fact $sz winds up empty). At another level it is very annoying, because what we want here is the common short-circuiting logical operators.

The reason we're not getting the behavior we want is that test (in the built in form in Bash) is parsing and validating its entire set of arguments before it starts determining the boolean value of the overall expression. This is not necessarily a bad idea (and test has a bunch of smart argument processing), but it is inconvenient.

(Note that Bash doesn't claim that test's -a and -o operators are short-circuiting operators. In fact the idea is relatively meaningless in the context of test, since there's relatively little to short-circuit. A quick test suggests that at least some versions of Bash check every condition, eg stat() files, even when they could skip some.)

My brute force way around this was:

if [ $? -ne 0 -o -z "$sz" ] || [ "$sz" -eq 0 ]; then

After all, [ is sort of just another program, so it's perfectly valid to chain [ invocations together with the shell's actual short circuiting logical operators. This way the second [ doesn't even get run if $sz looks bad, so it can't complain about 'integer expression expected'.

(This may not be the right way to do it. I just felt like using brute force at the time.)

PS: Given that Bash emits this error message whether you like it or not, it would be nice if it had a test operator for 'this thing is actually a number'. My current check here is a bit of a hack, as it assumes zpool emits either a number or nothing.
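Lacking such an operator, one way to check for 'this is actually a number' is a small function built on the shell's own pattern matching. This is a sketch under assumptions: is_uint and check_size are hypothetical helper names, and only non-negative decimal integers are accepted. In Bash specifically, [[ $sz =~ ^[0-9]+$ ]] would do the same job.

```shell
#!/bin/sh
# is_uint VALUE: succeed only if VALUE is a non-empty string of decimal digits.
# (Hypothetical helper; not something Bash or test provides.)
is_uint() {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# check_size STATUS OUTPUT: the same check as above, but safe on bad output.
# Reports failure if the command failed, printed a non-number, or printed 0.
# The shell's || short-circuits, so -eq only ever sees a real number.
check_size() {
    if [ "$1" -ne 0 ] || ! is_uint "$2" || [ "$2" -eq 0 ]; then
        echo failure
        return 1
    fi
    echo ok
}
```

With this, the original script becomes sz=$(zpool list -Hp -o size "$pool") followed by check_size $? "$sz", and it no longer depends on zpool printing either a number or nothing.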

(Updated: minor wording clarification due to reddit, because they're right, 'return 0' is the wrong way to phrase that; I knew what I meant but I can't expect other people to.)

programming/BashTestLimitation written at 00:44:27


Modern *BSDs have a much better init system than I was expecting

For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what was essentially the classical BSD init system, with all of its weaknesses. They made things a little bit simpler by having things like a configuration file where you could set whether standard daemons were started or not (and what arguments they got), instead of having to hand edit your /etc/rc, but that was about the extent of their niceness. When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that I got used to (to the extent that I had anything to do with it at all).

Well, guess what. While I wasn't looking, the *BSDs have introduced a much better system called rc.d. The rc.d system is basically a lightweight version of System V init; it strips out all of the runlevels, rcN.d directories, SNN and KNN symlinks, and so on to wind up with just shell scripts in /etc/rc.d and some additional support stuff.

As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.

Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.

(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)

unix/ModernBSDInitSurprise written at 01:46:52


Why System V init's split scripts approach is better than classical BSD

Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.

The classical BSD approach to system init is that there are /etc/rc and /etc/rc.local shell scripts that are run on boot. All daemon starting and other boot time processing is done from one or the other. There is no special shutdown processing; to shut the machine down you just kill all of the processes (and then make a system call to actually reboot). This has the positive virtue that it's really simple, but it's got some drawbacks.

This approach works fine for starting the system (orderly system shutdown was out of scope originally). It also works fine for restarting daemons, provided that your daemons are single process things that can easily be shut down with 'kill' and then restarted with more or less 'daemon &'. Initially this was the case in 4.xBSD, but as time went on and Unix vendors added complications like NFS, more and more things departed from this simple 'start a process; kill a process; start a process again' model of starting and restarting.

The moment people started to have more complicated startup and shutdown needs than 'kill' and 'daemon &', we started to have problems. Either you carefully memorized all of this stuff or you kept having to read /etc/rc to figure out what to do to restart or redo thing X. Does something need a multi-step startup? You're going to be entering those multiple steps yourself. Does something need you to kill four or five processes to shut it down properly? Get used to doing that, and don't forget one. All of this was a pain even in the best cases (which were single daemon processes that merely required the right magic command line arguments).

(In practice people not infrequently wrote their own scripts that did all of this work, then ran the scripts from /etc/rc or /etc/rc.local. But there was always a temptation to skip that step because after all your thing was so short, you could put it in directly.)

By contrast, the System V init approach of separate scripts puts that knowledge into reusable components. Need to stop or start or restart something? Just run '/etc/init.d/<whatever> <what>' and you're done. What the init.d scripts are called is small enough knowledge that you can probably keep it in your head, and if you forget it's usually easy enough to look it up with an ls.
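A minimal sketch of what such a split-out script might look like, for a hypothetical single-process daemon. The daemon path, the pidfile location, and the svc_* function names are all assumptions for illustration; real init.d scripts usually carry more (status reporting, dependency headers, distribution helper functions).

```shell
#!/bin/sh
# Sketch of an init.d-style script for a made-up daemon called 'exampled'.
# DAEMON and PIDFILE are placeholders, overridable from the environment.
DAEMON=${DAEMON:-/usr/local/sbin/exampled}
PIDFILE=${PIDFILE:-/var/run/exampled.pid}

# Start the daemon in the background and remember its PID.
svc_start() {
    "$DAEMON" >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
}

# Stop the daemon if we have a pidfile for it, then forget the PID.
svc_stop() {
    [ -f "$PIDFILE" ] || return 0
    kill "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
}

# Dispatch on the action name; this is the '<what>' in the text above.
svc_ctl() {
    case "$1" in
        start) svc_start ;;
        stop) svc_stop ;;
        restart) svc_stop; svc_start ;;
        *) echo "usage: {start|stop|restart}" >&2; return 2 ;;
    esac
}

# An installed script would end with:  svc_ctl "$1"
```

The point is that the knowledge of how to start and stop this one thing lives in one small file, so a restart is a single command instead of a trip through /etc/rc.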

(Separate scripts are also easier to manage than a single monolithic file.)

Of course you don't need the full complexity of System V init in order to realize these advantages. In fact, back in the long ago days when I dealt with a classical BSD init system I decided that the split scripts approach was such a big win that I was willing to manually split up /etc/rc into separate scripts just to get a rough approximation of it. The result was definitely worth the effort; it made my sysadmin life much easier.

(This manual split of much of /etc/rc is the partial init system I mentioned here.)

unix/BSDInitSingleFileWeakness written at 02:05:13


Thinking about people's SSD inflection points in general

What I'm calling the (or a) SSD inflection point is the point where SSDs get big enough and cheap enough for people to switch from spinning rust to SSDs. Of course this is already happening for some people some of the time, so the real question is when it's going to happen for lots of people.

(Well, my real question is when it's going to happen for me, but that's another entry.)

I don't have any answers. I don't even have any particular guesses or opinions. What I do have is an obvious observation.

For most people and most systems, the choice of HDs versus SSDs is not about absolute performance (which clearly goes to SSDs today) or absolute space (which is still by far in the hands of HDs both in terms of price per GB and how many TB you can get in N drives). Instead, unsurprisingly, it is about getting enough space and then if possible making it go faster. People can and do make tradeoffs there based on their feelings about the relative importance of more space and more speed, including ones that make their systems more complicated (like having a small, affordable SSD for speed while offloading much of your data to a slow(er) but big HD). This makes the inflection point complicated and thus the migration from HDs to SSDs is probably going to be a drawn out affair.

We've already seen one broad inflection point happen here, in good laptops; big enough SSDs have mostly displaced HDs, even though people may not have all the space they want. I doubt many laptop users would trade back, even if they have to carefully manage disk space on their laptop SSD.

My suspicion is that the next inflection point will hit when affordable SSDs become big enough to hold all of the data a typical person puts on their computer; at that point you can give people much faster computers without them really noticing any drawbacks. But I don't have any idea how much space that is today; a TB? A couple of TB? Less than a TB for many people?

(My impression is that for many people the major space consumer on home machines is computer games. I'm probably out of touch on this.)

tech/SSDInflectionPoint written at 02:21:42


Sometimes looking into spam for a blog entry has unexpected benefits

Today, I was all set to write an entry about how I especially hate slimy companies that gain access to people's address books. In fact I had a particular company in mind, because it's clear that they did this to one of our users recently. As part of starting to write that entry, I decided to do some due diligence research on the company involved. What I found turned out to be rather more alarming than I expected.

There are two usual run of the mill ways to get hold of people's address books. The 'not even sort of theft' way is to just ask people to give you their address books so you can connect them to any of their friends on your service, and then perhaps send some invitation mails yourself. The underhanded way is to persuade people to give you access to their GMail or Yahoo or whatever email account for some innocent-sounding purpose, then take a copy of their address book while you're there.

These people went the extra mile; they made a browser extension. Of course it does a lot more than just take copies of your address book and none of what it does seems particularly pleasant (at least to me). Getting a browser extension into people's browsers is probably harder than getting their address books in the usual way, but I imagine it's much more lucrative (and much more damaging).

What this means is that our user didn't just give a company access to their address book; instead they've wound up infected by something that is more or less malware (and of course this means that their machine may also have other problems). And I wouldn't have found any of this if I hadn't decided to turn over this particular rock as part of writing a blog entry.

(It turns out this company has a Wikipedia entry. It's rather eyebrow raising in a 'this seems so whitewashed it's blinding' kind of way. Since it was so obviously white, I dipped into the edit history and the talk page and found both rather interesting, ie there was and may still be a roiling controversy that is not reflected in the page contents. I'm kind of sad to see Wikipedia (ab)used this way, but I'm not wading into that particular swamp for any reason.)

spam/SpamInvestigationBenefit written at 02:05:16


The cost of OmniOS not having /etc/cron.d

I tweeted:

Systems without /etc/cron.d just make my sysadmin life harder and more annoying. OmniOS, I'm looking at you.

For those people who have not encountered it, this is a Linux cron feature where you can basically put additional crontab files in /etc/cron.d. To many people this may sound like a minor feature; let me assure you it is not.

Here is why it is an important feature: it makes adding, modifying, or deleting your crontab entries as trivial as copying a file. It is very easy to copy files (or create them). You can trivially script it, there are tons of tools to do this for you in various ways and from various sources (from rsync on up), and it is very easy to scale file copies up for a fleet of machines.

Managing crontab entries without this is either painfully manual, involves attempts to do reliable automated file editing through interfaces not designed for it, or requires you to basically build your own custom equivalent of it and then treat the system crontab file as an implementation detail inside your cron.d equivalent. This is a real cost and it matters for us.

With /etc/cron.d, adding a new custom-scheduled service on some or all of our fileservers would be trivial and guaranteed to not perturb anything else. Especially, adding it to all of them is no more work than adding it to one or two (and may even be slightly less work). With current OmniOS cron, it is dauntingly and discouragingly difficult. We have to log in to each fileserver, run 'crontab -e' by hand, worry about an accidental edit mistake damaging other things, and then update our fileserver install instructions to account for the new crontab edits. Changed your mind and need to revise just what your crontab entry is (eg to change when it runs)? You get to do all that all over again.

The result is that we'll do a great deal to avoid having to update OmniOS crontabs. I actually found myself thinking about how I would invent my own job scheduling system in central shell scripts that we already run out of cron, just because doing that seemed like less work and less annoyance than slogging around to run 'crontab -e' even once (and it probably wouldn't have been just once).

(Updates to the shell scripts et al are automatically distributed to our OmniOS machines, so they're 'change once centrally and we're done'.)

Note that it's important that /etc/cron.d supports multiple files, because that lets you separate each crontab entry (or logically related chunk of entries) into an independently managed thing. If it was only one single file, multiple separate things that all wanted crontab entries would have to coordinate updates to the file. This would get you back to all sorts of problems, like 'can I reliably find or remove just my entries?' and 'are my entries already there?'. With /etc/cron.d, all you need is for people (and systems) to pick different filenames for their particular entries. This generally happens naturally because you get to use descriptive names for them.
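To illustrate how little machinery this needs on a system that has /etc/cron.d, here is a minimal sketch. The function names and the CRON_D override are assumptions made for the example; the one format detail that matters is that cron.d entries have a user field between the five time fields and the command.

```shell
#!/bin/sh
# Directory to manage; overridable so the sketch can be tried outside /etc.
CRON_D=${CRON_D:-/etc/cron.d}

# install_cron_job NAME SCHEDULE USER COMMAND
# Writes one self-contained crontab fragment named NAME.
# Re-running it with new arguments simply replaces that one file.
install_cron_job() {
    printf '%s %s %s\n' "$2" "$3" "$4" > "$CRON_D/$1" &&
        chmod 644 "$CRON_D/$1"
}

# remove_cron_job NAME: removing a job is just deleting its file.
remove_cron_job() {
    rm -f "$CRON_D/$1"
}
```

A call might look like install_cron_job scrub-report '15 3 * * *' root /opt/local/bin/scrub-report (a made-up job name and command). Nothing here ever has to parse or edit an existing crontab, and in practice you would more likely just rsync a directory of such files to every machine.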

solaris/NoCronDCost written at 00:51:05


Exploring the irritating thing about Python's .join()

Let's start out with the tweets:


AttributeError: 'list' object has no attribute 'join'


@thatcks: It's quite irritating that you can't ask lists to join themselves w/ a string, you have to ask a string to join a list with itself.

Python has some warts here and there. Not necessarily big warts, but warts that make you give it a sideways look and wonder what people were thinking. One of them is how you do the common operation of turning a sequence of strings into a single string, with the individual strings separated by some common string like ','. As we see here, a lot of people expect this to be a list operation; you ask the list 'turn yourself into a string with the following separator character'. But that's not how Python does it; instead it's a string operation where you do the odd thing of asking the separator string to assemble a list around itself. This is at least odd and some people find it bizarre. Arguably the logic is completely backwards.

There are two reasons Python wound up here. The first is that back in the old days there was no .join() method on strings and this was just implemented as a function in the string module, string.join(). This makes perfect sense as a place to put this operation, as it's a string-making operation. But when Python did its great method-ization of various module functions, it of course made most of the string module functions into methods on the string type, so we wound up with the current <str>.join(). Since then it's become Python orthodoxy to invoke list to string joining as 'sep.join(lst)' instead of 'string.join(lst, sep)'.

The other reason can be illuminated by noting that if Python did it the other way around you wouldn't have just lst.join(), you'd also have to have tuple.join() and in fact a .join() method on every sequence compatible type or even iterators. Anything that you wanted to join together into a string this way would have to implement a .join(), which would be a lot of types even in the standard library. And because of how both CPython and Python are structured, a lot of this would involve re-implementation and duplication of identical or nearly identical code. If you have to have .join() as a method on something, putting it on the few separator types means that you have far less code duplication and that any new sequence type automatically supports doing this in the correct orthodox way.

(I'm sure that people would write iterator or sequence types that didn't have a .join() method if it was possible to do so, because sooner or later people leave out every method they don't think they're going to use.)

Given the limitations of Python, I'll reluctantly concede that the current .join() approach is the better alternative. I don't think you can even get away with having just string.join() and no string .join() method (however much an irrational bit of me would like to throw the baby out with the bathwater here). Even ignoring people's irritation with having to do 'import string' just to get access to string.join(), there would be some CPython implementation challenges.

Sidebar: The implementation challenges

String joining is a sufficiently frequent operation that you want it to be efficient. Doing it efficiently requires doing it in C so that you can do tricks like pre-compute the length of the final string, allocate all of the memory once, and then memcpy() all of the pieces into place. However, you also have both byte strings and Unicode strings, and each needs its own specialized C level string joining implementation (especially as modern Unicode strings have a complex internal storage structure).

The existing string module is actually a Python level module. So how do you go from an in-Python string.join() function to specific C code for byte strings or Unicode strings, depending on what you're joining? The best mechanism CPython has for this is actually 'a method on the C level class that the Python code can call', at which point you're back to strings having a .join() method under some name. And once you have the method under some name, you might as well expose it to Python programmers and call it .join(), ie you're back to the current situation.

I may not entirely like .join() in its current form, but I have to admit that it's an impeccably logically assembled setup where everything is basically the simplest and best choice I can see.

python/JoinDesignDecisions written at 02:08:42


NFS writes and whether or not they're synchronous

In the original NFS v2, the situation with writes was relatively simple. The protocol specified that the server could only acknowledge write operations when it had committed them to disk, both for file data writes and for metadata operations such as creating files and directories, renaming files, and so on. Clients were free to buffer writes locally before sending them to the server and generally did, just as they buffered writes before sending them to local disks. As usual, when a client program did a sync() or a fsync(), this caused the client kernel to flush any locally buffered writes to the server, which would then commit them to disk and acknowledge them.

(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)

This worked and was simple (a big virtue in early NFS), but didn't really go very fast under a lot of circumstances. NFS server vendors did various things to speed writes up, from battery backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).

In NFS v3 the protocol was revised to add asynchronous writes and a new operation, COMMIT, to force the server to really flush your submitted asynchronous writes to disk. A NFS v3 server is permitted to lose submitted asynchronous writes up until you issue a successful COMMIT operation; this implies that the client must hang on to a copy of the written data so that it can resend it if needed. Of course, the server can start writing your data earlier if it wants to; it's up to the server. In addition clients can specify that their writes are synchronous, reverting NFS v3 back to the v2 behavior.

(See RFC 1813 for the gory details. It's actually surprisingly readable.)

In the simple case the client kernel will send a single COMMIT at the end of writing the file (for example, when your program closes it or fsync()s it). But if your program writes a large enough file, the client kernel won't want to buffer all of it in memory and so will start sending COMMIT operations to the server every so often so it can free up some of those write buffers. This can cause unexpected slowdowns under some circumstances, depending on a lot of factors.

(Note that just as with other forms of writeback disk IO, the client kernel may do these COMMITs asynchronously from your program's activity. Or it may opt to not try to be that clever and just force a synchronous COMMIT pause on your program every so often. There are arguments either way.)

If you write NFS v3 file data synchronously on the client, either by using O_SYNC or by appropriate NFS mount options, the client will not just immediately send it to the server without local buffering (the way it did in NFS v2), it will also insist that the server write it to disk synchronously. This means that forced synchronous client IO in NFS v3 causes a bigger change in performance than in NFS v2; basically you reduce NFS v3 down to NFS v2 end to end synchronous writes. You're not just eliminating client buffering, you're eliminating all buffering and increasing how many IOPs the server must do (well, compared to normal NFS v3 write IO).
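From the client side, the two write styles can be compared roughly with GNU dd (an assumption; other dd implementations may lack these flags). With conv=fsync, dd does ordinary buffered writes and then a single fsync() at the end, while oflag=sync opens the output file with O_SYNC so that every write is synchronous. The function names are made up, and how this maps onto COMMITs and synchronous WRITEs on the wire depends on the client kernel.

```shell
#!/bin/sh
# Both helpers write COUNT blocks of 64 KiB of zeros to FILE. Requires GNU dd.

# write_then_fsync FILE COUNT: buffered writes, one fsync() before dd exits.
write_then_fsync() {
    dd if=/dev/zero of="$1" bs=64k count="$2" conv=fsync 2>/dev/null
}

# write_o_sync FILE COUNT: output opened with O_SYNC, so each write() only
# returns once the data has been written synchronously.
write_o_sync() {
    dd if=/dev/zero of="$1" bs=64k count="$2" oflag=sync 2>/dev/null
}
```

Timing each helper against a file on an NFS v3 mount (for example with Bash's 'time' keyword) should show the gap described above, with the O_SYNC version paying for a synchronous server write on every 64 KiB block.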

All of this is just for file data writes. NFS v3 metadata operations are still just as synchronous as they were in NFS v2, so things like 'rm -rf' on a big source tree are just as slow as they used to be.

(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)

unix/NFSWritesAndSync written at 00:44:42

