Wandering Thoughts archives

2013-04-25

How SuS probably requires the 'run at least once' xargs behavior

A commentator left a long comment on my entry about how xargs behaves with no input, arguing that the Single Unix Specification for xargs actually requires it to not run if standard input is empty. I think it's more likely to be the other way around, so today I want to run down why I think the SuS probably requires this annoying behavior.

There are two important sections of the SuS xargs specification here and I'm going to quote both, bolding important bits:

The xargs utility shall construct a command line consisting of the utility and argument operands specified followed by as many arguments read in sequence from standard input as fit in length and number constraints specified by the options. The xargs utility shall then invoke the constructed command line and wait for its completion. This sequence shall be repeated until one of the following occurs:

  • An end-of-file condition is detected on standard input.

[... other conditions elided ...]

[...] The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of-file string is found. [...]

Now we get to play the fun game of interpreting standards. The easiest place to play this game is with the last sentence I quoted, which says both that the utility shall be executed at least once and that this happens until end-of-file is reached. If end of file is reached immediately, which takes precedence? In the style of reading standards that I've absorbed, explicit statements generally trump implications; that would mean that the explicit promise that the utility shall be executed at least once trumps the potential implication of not running it on immediate EOF.

The first paragraph as a whole offers a similar conflict. It is easy to read it as a series of steps: first read in as many arguments as you can that fit, then run the command, and only then check for exit conditions and repeat if they are not met. You don't check for exit conditions before you run the command once because that's not what the series of steps tells you to do, and 'zero arguments' is not ruled out as a valid number of arguments to read from standard input; ergo, xargs runs the command line once even on immediate EOF.

You can also read it as a general description instead of a series of steps, with the 'this sequence shall be repeated until ...' forming the framing procedure around the two specific steps used to form and run each command line; in this reading it's correct to run zero times if there is an immediate end of file on standard input, since the framing loop's exit condition has been met.

If we read the first paragraph using an 'explicit trumps implicit' rule, then I think we have to conclude that the paragraph is the set of steps that xargs is intended to follow as it executes, because that is exactly how the paragraph is written. This interpretation is reinforced by the explicit 'one or more times' language in the later section.

None of this is unambiguous; the SuS specification never comes out and says outright 'xargs runs once even if it reads no arguments'. But given how much the usual extremely legalistic, 'every word and phrase and ordering decision counts' approach to reading standards pushes us towards the 'xargs runs once on EOF' interpretation, I think it's probably what SuS actually requires.

(Note that none of this matters in practice. As covered in the first entry, existing systems have no common behavior. The closest you can get is to always specify -r so that xargs does not run the command at all on empty input; this works on GNU findutils, sufficiently recent FreeBSD, and OpenBSD.)
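As a concrete illustration of the difference, here is how this plays out with GNU xargs (other implementations will vary):

$ xargs echo ran anyway </dev/null
ran anyway
$ xargs -r echo ran anyway </dev/null
$

Without -r the command runs once with no extra arguments; with -r it doesn't run at all.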

PS: this is not the most crazy thing in the SuS xargs specification. If you care about xargs portability and want to be horrified, read the description of -E carefully.

(Also, these crazy things are almost certainly not the fault of the SuS authors.)

XargsZeroArgsIssueII written at 01:02:16

2013-04-10

Some important things about OpenBSD PF's max-* options

In older versions of the OpenBSD pf.conf manpage (such as the one you may be running on a firewall that is too important to reboot, much less put through a chancy upgrade), the 'max <number>' option for stateful tracking is described this way:

Limits the number of concurrent states the rule may create. When this limit is reached, further packets that would create state will not match this rule until existing states time out.

This is, how shall I put it, a lie (as before). In current versions of pf.conf this phrasing has been declared inoperative and revised to 'further packets that would create state are dropped until existing states time out'. This new phrasing is correct as far as it goes, but it leaves several important things out.

First, this also applies to all of the max-* variants (max-src-nodes, max-src-states, max-src-conn, and max-src-conn-rate). You could maybe deduce this from the fact that the manpage doesn't say anything about what they do when the limit is hit, so clearly they inherit max's behavior (this is the way of Unix manpages).
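As an illustration, here is the kind of hypothetical rule I have in mind (the macros and numbers are invented). Once a single source has 100 state entries from this rule, its further state-creating packets are silently dropped:

pass in on $ext_if proto udp to $dns_srv port 53 \
        keep state (max-src-states 100)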

Next, how things are logged (if they are logged) depends on your OpenBSD version. In OpenBSD 4.4, this dropping is completely silent and in fact happens after the point where packets are logged (so if you specify log on such a rule, what it logs will sometimes be a lie; it will claim that packets were accepted when they were in fact dropped). Because overload <table> is only used (or allowed) for the TCP connection limits, this means that there is essentially no way to tell when a UDP ratelimit (perhaps one to limit traffic to your DNS server) has triggered or what it affects.

(You can watch some pfctl -si counters tick up. This is not very much use if you want to know what your ratelimit is affecting and whether it is too small, too big, or just right.)
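(For the record, I believe the counters in question are pf's state-limit and src-limit counters, so something like this will show them:

pfctl -si | grep -E 'state-limit|src-limit'

But these are global counters; they won't tell you which rule or which source is involved.)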

In OpenBSD 5.2 the logs are now honest, as far as I can tell from the kernel source code (I don't have a handy 5.2-based firewall where I can test this). The logs will now accurately record both that a packet was dropped and that it was dropped due to connection limits. There is still no way to log just dropped packets but at least you can now log all traffic and sort out the mess later (assuming that your logs do not explode from the volume).

(The overload <table> clause still only applies to TCP connections. As far as I can tell this is a completely artificial limitation in PF and I personally think it's a stupid one. I would certainly like to be able to automatically put IPs that are hammering on our DNS server with UDP queries into a table to be blocked wholesale for a while.)
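For contrast, here is the sort of thing PF does let you do for TCP (a hypothetical SSH example; the macros, numbers, and table name are invented):

pass in on $ext_if proto tcp to $ssh_host port 22 \
        keep state (max-src-conn 10, max-src-conn-rate 5/30, \
                overload <bruteforce> flush global)

Sources that exceed either limit get dumped into the bruteforce table, where you can block or examine them at your leisure. There is no UDP equivalent.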

My overall conclusion from my recent experiences (this included) is that OpenBSD PF is not very good for UDP ratelimiting. For instance, actual volume-per-time limits can only be constructed indirectly and only work for some UDP-based protocols (and, I think, often only for cooperative clients).

(I'm not completely sure how OpenBSD matches states for UDP packets, but I have a sneaking suspicion that a DDoS program that reused the same UDP source port for all its forged DNS queries would match an existing PF UDP state table entry and so never hit PF's state table entry based rate limits. You can't play this trick with TCP connections because they have actual connection state.)

OpenBSDPfMaxNotes written at 00:15:53

2013-04-04

An irritating OpenBSD PF limitation on redirections

I am generally fond of OpenBSD's PF packet filter but every so often I run across a seemingly arbitrary limitation that drives me up the wall. Today's limitation is on where you can redirect packets to as part of NAT'ing and general address translation. I'll start by sketching out a simplified version of the problem I'm trying to solve.

Part of our complex networking setup is a scheme where specific internal machines, sitting on 'sandbox' subnets in private address space, can be reached by the outside world through public IP addresses that sit on what is effectively a virtual subnet. Through a complex dance involving two firewalls, these machines are bidirectionally NAT'd to their public IPs when they talk to the outside world. Our problem is that sometimes internal machines try to use the public IPs, and we'd like to make that work. What we want to do is conceptually simple: when a packet from the internal network addressed to the public IP shows up on the sandbox firewall, it should be rewritten to the internal IP instead and put back on the internal network. Something like this, in pf-ese:

pass in quick on $int_if from <int_lan> to $PUBIP rdr-to $INTIP route-to $int_if

(It's not necessary to rewrite the source address and in fact it's a feature to not do so. Update: as covered in comments, it may be necessary to rewrite the source address to force return traffic to flow through the firewall to be fixed up.)

As it happens, OpenBSD PF is specifically documented (in the pf.conf manpage) to not allow this:

Redirections cannot reflect packets back through the interface they arrive on, they can only be redirected to hosts connected to different interfaces or to the firewall itself.

In the fine OpenBSD tradition this is in fact not completely true. The specific LAN segment that is $int_if actually has two separate subnets on it for historical reasons and machines on the other subnet can talk to $INTIP through this rdr-to rule without problems. It's only machines on the same subnet that can't (and not because PF blocks the packets; I've checked).

What I assume is happening is that PF and OpenBSD's routing stack are interacting badly. Under normal circumstances a router will not route a packet from host A on a subnet to host B on the same subnet (at most it will send an ICMP redirect). In an ideal world PF would be able to bypass this restriction when it rdr-to's something, especially with an explicit route-to (in my books, route-to should mean 'shut up and send the packet out that interface no matter what'). In this world PF apparently can't, which is an irritating limitation that gets in the way of what I maintain is a perfectly sensible thing to want to do.

(There are any number of cases where you might want to redirect traffic nominally to the outside world back to an internal machine.)

PS: as the pf.conf manpage notes, theoretically the way around this is to add NAT'ing with a pass out rule. I was unable to get this to work when I tried it but I might have been using options that were slightly wrong. I assume that this NAT'ing process is enough to fool the routing system into accepting the packet as something that could be validly routed.
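For what it's worth, here is a sketch of what I understand the manpage to be suggesting (the exact form and options of the second rule are my guess, and as I said I couldn't get this approach to work):

pass in quick on $int_if from <int_lan> to $PUBIP rdr-to $INTIP
pass out quick on $int_if to $INTIP nat-to $int_if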

(On the other hand, if 'pass out' is applied after routing is done I don't see how this can work. It would make sense for it to be a post-routing action, since routing is what normally decides the outbound interface, but the pf.conf manpage doesn't document whether this is the case or whether some deep magic is happening.)

OpenBSDPfRedirIssue written at 02:54:24

2013-04-02

Why listen(2)'s backlog parameter has such an odd meaning

In light of what the listen(2) backlog parameter actually means (i.e. very little), you might sensibly wonder why it has such an odd and basically useless definition. For instance, it might be quite useful to be able to put a real limit on the number of TCP connections that have been fully established but not accept()'d by your server.

The simple answer is that the listen(2) backlog is not about helping your program out; it is about limiting how many kernel resources can be (invisibly) consumed by connections that haven't yet been accept()'d and surfaced in your program, where they would at least be limited by file descriptor limits. Such connections are basically invisible to normal tools, programs, and Unix limits because they haven't yet been materialized as file descriptors and exist only in the depths of the kernel socket stack. This means that the kernel needs to limit them somehow. In theory the kernel could have applied the same limit to everything and not provided any way for applications to change it. In practice, I suspect that the early developers of BSD wanted to allow a way for some select daemons they expected to be unusually active to raise the normal limit (and perhaps for very inactive daemons to lower it to save kernel memory).

This leads to a simple but not particularly useful rule for what the listen(2) backlog actually limits: anything that your kernel thinks uses enough resources to care about. And this has changed over time. As kernels have found clever ways to handle various things that have traditionally consumed resources (such as half-open TCP connections), they've stopped counting against the backlog limits. Some of this evolution has been driven by necessity (such as people on the Internet exploiting half-open TCP connections as one of the first denial of service attacks) and some of it has simply been driven by the cleverness of kernel programmers. This has led to the current situation where understanding the effects of any specific backlog requires knowing something about the kernel implementation of the specific socket type involved and what things in it do and don't use up kernel resources.
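For concreteness, here is roughly where the backlog number goes in a server; this is a minimal sketch in C with error handling mostly elided:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int make_listener(int port)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
        return -1;
    /* 128 here is the backlog. Exactly what it limits (and whether
       the kernel silently rounds or caps it) depends entirely on
       the kernel involved. */
    if (listen(fd, 128) < 0)
        return -1;
    return fd;
}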

(See also Derrick Petzold's somaxconn - That pesky limit, which has some interesting quotes from Stevens' Unix Network Programming.)

WhyListenBacklog written at 01:04:29

