Wandering Thoughts archives

2012-12-31

GNU sort's -h option

I only recently became aware of GNU sort's -h option, which strikes me as a beautiful encapsulation of everything (both good and bad) that people attribute to GNU programs and their profusion of options.

GNU sort's -h is like -n (sort numerically), except that it sorts GNU's 'humane' (human-readable) numbers, as produced by (for example) GNU du's -h option. This leads naturally to a variant of a little script that I've already talked about:

du -h | sort -hr | less

On the one hand, -h is clearly useful in both commands. Humane numbers are a lot easier to read and grasp than plain numbers, and now GNU sort will order them correctly for you. On the other hand, you can see the need for a -h argument to sort as evidence of an intrinsic problem with du -h; in this view, GNU is piling hack on top of hack. The arguably more Unixy approach might be a general hum command that humanized all numbers (or specific columns of numbers, if you wanted); that would make the example into 'du | sort -nr | hum | less', which gives you a general tool at the price of making people add an extra command to their pipelines.
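
(Purely for illustration, here is a sketch of what a minimal hum might look like in Python. The name, the 1024-based units, and the 'humanize only the first field' behavior are all just assumptions for the sketch; a real tool would want options for picking columns and unit sizes.)

#!/usr/bin/env python
# Hypothetical 'hum' filter: humanize the first field of each line if it
# is a plain integer, passing everything else through untouched.
import sys

def humanize(n):
    if n < 1024:
        return str(n)
    for unit in ('K', 'M', 'G', 'T', 'P', 'E'):
        n /= 1024.0
        if n < 1024:
            break
    return '%.1f%s' % (n, unit)

for line in sys.stdin:
    parts = line.split(None, 1)
    if len(parts) == 2 and parts[0].isdigit():
        sys.stdout.write(humanize(int(parts[0])) + '\t' + parts[1])
    else:
        sys.stdout.write(line)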

I don't have any particular view on whether GNU sort's -h option is Unixly wrong or not. I do think that it's (seductively) convenient, and now that I've become aware of it, it's probably going to work its way into various things I do.

(This could spark a great debate on what the true Unix way is, but I'm not going to touch that one right now.)

GNUSortHOption written at 03:12:33

2012-12-15

A few small notes about OpenBSD PF (as of 4.4 and 5.1)

Suppose that you read the pf.conf manpage (in OpenBSD 4.4 or 5.1) and stumble across the following:

max-src-conn <number>
Limits the maximum number of simultaneous TCP connections which have completed the 3-way handshake that a single host can make.

Great, you say, this is just what you need to make sure that bad people are not holding too many connections to your web server open at once. So you write a PF rule more or less like this:

table <BRUTES> persist
block quick log on $EXT_IF proto tcp from <BRUTES> to any port 80
pass in quick on $EXT_IF proto tcp from any to any port 80 \
     keep state \
     (max-src-conn 20, overload <BRUTES> flush)

Shortly after you activate this rule you may discover an ever-increasing number of web crawler IPs listed in your BRUTES table, which will probably surprise you. What is going on is that the OpenBSD manpage is misleading you. max-src-conn does not limit the number of concurrent TCP connections. It limits the number of state table entries for TCP connections that have been fully established. If you examine the state tables as a web crawler is walking your site, you will discover any number of entries sitting around in FIN_WAIT_2. These connections are thoroughly closed but, guess what, they count against max-src-conn until they expire completely.
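
(You can watch this happen with pfctl. These two commands just list the lingering states and the contents of the table from the example rule above; adjust the table name to whatever you actually used.)

pfctl -ss | grep FIN_WAIT_2
pfctl -t BRUTES -T show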

An extremely technical reading of the pf.conf manpage's wording might let you claim that this is allowed (if you hold that a TCP connection still exists in FIN_WAIT_2), but at the least I think it is going to surprise almost everyone. It also renders this max-src-conn rule useless for limiting the number of concurrent real TCP connections. Given that states linger in FIN_WAIT_2 for on the order of a minute or more, there is no feasible max-src-conn setting that will let a crawler make one or two requests a second without getting blocked while still giving you a useful limit on concurrent connections.

(This almost certainly applies to max-src-states too, but at least that is explicitly documented in terms of state table entries.)

But wait, the fun isn't done yet. You decide that you really do need to limit the number of concurrent real TCP connections. You don't particularly care if stray out-of-sequence packets from fully closed connections get rejected by the firewall (they'd only get rejected by the host anyway), so the obvious solution is to set a very fast timeout for those lingering FIN_WAIT_2 states. You read the fine pf.conf manpage again and spot some timeout settings (which can be set either globally or per state-creating rule):

tcp.closed
The state after one endpoint sends an RST.
tcp.finwait
The state after both FINs have been exchanged and connection is closed. [...]

There is no pleasant way to put this: the pf.conf manpage is lying to you. Setting tcp.finwait to a very low value will do exactly nothing to help you; you need to set tcp.closed. The state timeouts are actually:

tcp.closed Both sides in FIN_WAIT_2 or TIME_WAIT.
tcp.finwait Both sides in CLOSING, or one side CLOSING and the other side has progressed a bit further.
tcp.closing One but not both sides in CLOSING, ie a FIN has been sent.
tcp.established Both sides ESTABLISHED.
tcp.opening At least one side not ESTABLISHED yet.

(All of this is expressed in terms of what 'pfctl -ss' will print as the states. There are a few intermediate transient states that may show up which I am eliding because my head hurts. See the logic in sys/net/pf.c and the list of states in sys/netinet/tcp_fsm.h if you really care.)

The manpage is partly technically correct in that after an RST is sent, PF puts the state into TIME_WAIT and tcp.closed applies. This is also the only time that a state winds up in TIME_WAIT.
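
(As a sketch only, setting tcp.closed looks something like this in pf.conf; the 5 second value here is just an example, not a recommendation.)

# either globally:
set timeout tcp.closed 5

# or only for states created by this rule:
pass in quick on $EXT_IF proto tcp from any to any port 80 \
     keep state \
     (max-src-conn 20, overload <BRUTES> flush, tcp.closed 5)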

(I have verified this behavior on OpenBSD 4.4. I have not verified it on OpenBSD 5.1, but the relevant sys/net/pf.c code reads essentially the same as the 4.4 version; in fact my table above was generated by reading the 5.1 pf.c source code (and my manpage quotes are from the 5.1 manpages). I have not looked at the 5.2 source or manpages.)

OpenBSDPfStateBits written at 00:14:03

2012-12-12

fork() and closing file descriptors

As I noted in Why fork() is a good API, back in the bad old days Unix had a problem of stray file descriptors leaking from processes into commands that they ran (for example, rsh used to gift your shell process with any number of strays). In theory the obvious way to solve this is to have code explicitly close all file descriptors before it exec()s something. In practice Unix has chosen to solve this with a special flag on file descriptors, FD_CLOEXEC, which causes them to be automatically closed when the process exec()s.
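
(As a concrete illustration, here is roughly what setting FD_CLOEXEC looks like from Python; the file being opened is just an arbitrary example.)

import fcntl
import os

fd = os.open("/etc/hostname", os.O_RDONLY)   # any file will do for the example

# Mark the descriptor close-on-exec; it will be closed automatically
# when this process (or a child that inherits it) calls exec().
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)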

In that entry I mentioned that there was a good reason for this alternate solution in practice. At the start of planning this followup entry I had a nice story all put together in my head about why this was so, involving thread-based concurrency races. Unfortunately that story is wrong (although a closely related concurrency race story is the reason for things like O_CLOEXEC in Linux's open()).

FD_CLOEXEC is not necessary to deal with a concurrency race between thread A creating a new file descriptor and thread B fork()ing and then exec()ing in the child process, because the child's file descriptors are frozen at the moment that it's created by fork() (with a standard fork()). It's perfectly safe for the child process to manually close all stray open file descriptors in user-level code, because no matter what it does thread A can never make new file descriptors appear in the child process partway through this. Either they're there at the start (and will get closed by the user-level code), or they'll never be there at all.

There are, however, several practical reasons that FD_CLOEXEC exists. First and foremost, it proved pragmatically easier to get code (often library code) to set FD_CLOEXEC than to get every bit of code that did a fork() and exec() sequence to always clean up file descriptors properly. It also means that you don't have to worry about file descriptors being created in the child process in various ways, especially by library code (which might be threaded code, for extra fun). Finally, it deals with the problem that Unix has no API for finding out what file descriptors your process has open, so your only way of closing all stray file descriptors in user code is the brute force approach of looping trying to close each one in turn (and on modern Unixes, that can be a lot of potential file descriptors).
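
(For illustration, here is a sketch of that brute force approach in Python. The command being run is hypothetical and the fallback limit is an arbitrary guess; real code would also want some error handling.)

import os
import resource

def spawn(argv):
    pid = os.fork()
    if pid == 0:
        # Child: close everything above stdin/stdout/stderr before exec'ing.
        # The RLIMIT_NOFILE hard limit is an upper bound on how many
        # descriptors could possibly be open, and it can be quite large.
        maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
        if maxfd == resource.RLIM_INFINITY:
            maxfd = 4096          # arbitrary fallback
        os.closerange(3, maxfd)   # quietly skips descriptors that aren't open
        os.execvp(argv[0], argv)
    return pid

spawn(["some-command", "some-argument"])   # hypothetical command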

Once you have FD_CLOEXEC and programs that assume they can use it to just fork() and exec(), you have the thread races that lead you to needing things like O_CLOEXEC. Any time a file descriptor can come into existence without FD_CLOEXEC being set on it, you have a race between thread A creating the file descriptor and then setting FD_CLOEXEC and thread B doing a fork() and exec(). If thread B 'wins' this race, it will inherit a new file descriptor that does not have FD_CLOEXEC set and this file descriptor will leak through the exec().
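
(In Python terms the race-free version is a single call, because the descriptor is created with close-on-exec already set; os.O_CLOEXEC needs Python 3.3+ and OS support, and the path is again just an example.)

import os

# There is no window here where another thread's fork() + exec() can
# inherit the descriptor without FD_CLOEXEC set.
fd = os.open("/etc/hostname", os.O_RDONLY | os.O_CLOEXEC)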

(All of this is well known in the Unix programming community that pays attention to this stuff. I'm writing it down here so that I can get it straight and firmly fixed into my head, since I almost made an embarrassing mistake about it.)

ForkFDsAndRaces written at 23:03:50

2012-12-02

What goes into the terminal's 'cbreak' and 'raw' modes

Recently, Eevee tweeted:

things i never thought i'd need to know: the difference between 'raw' and 'cbreak' is not just some flag. it's like 20! [link]

This inspires me to talk about what 'cbreak' and 'raw' modes are, both at a high level and then at the low level of exactly what terminal settings go into each mode.

The traditional 'raw' mode is the easier one to explain; it does no in-kernel processing of input or output at all. All characters are returned immediately when typed and all output is produced as-is. The traditional 'cbreak' mode is used for things like password entry; it returns characters immediately as they're typed, doesn't echo characters, and doesn't do any in-kernel line editing (which mostly means that your program can actually see the various editing characters). At a high level, there are two major differences between 'cbreak' and 'raw'. First, cbreak leaves output handling unchanged, which may be relatively 'cooked'. Second, cbreak still allows you to interrupt the program with ^C, suspend it, and so on. You can see this in action with most programs (such as passwd, su, or sudo) that ask for a password; you can immediately interrupt them with ^C in a way that, eg, vi does not respond to.

The low-level settings for cbreak are:

  • disable ECHO; this stops typed characters from being echoed.
  • disable ICANON; this turns off line editing.
  • set VMIN to 1 and VTIME to 0; this makes it so that a read() returns immediately once there's (at least) one character available.
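
(For concreteness, here is a small Python sketch of those three changes using the termios module; this is essentially what Python's tty.setcbreak() does.)

import sys
import termios

def set_cbreak(fd):
    # tcgetattr returns [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    attrs = termios.tcgetattr(fd)
    attrs[3] &= ~(termios.ECHO | termios.ICANON)  # no echo, no line editing
    attrs[6][termios.VMIN] = 1    # read() returns once at least one character...
    attrs[6][termios.VTIME] = 0   # ...is available, without any extra delay
    termios.tcsetattr(fd, termios.TCSADRAIN, attrs)

set_cbreak(sys.stdin.fileno())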

These changes are about the minimum you can make to get anything like cbreak behavior, so pretty much everyone agrees on them. The low-level settings for raw mode start with cbreak's changes and add a bunch more, but there can be some variation in exactly what settings get added; I'm going to use the Python version from eevee's link. This disables a bunch of additional tty options:

  • BRKINT: with this disabled, a serial line break no longer generates an interrupt and instead shows up as a null byte (assuming breaks aren't being ignored outright). In the modern world where most ttys are pseudo-ttys instead of serial lines, this generally isn't going to make any difference.
  • ICRNL: with this disabled, carriage returns (^M, '\r') are not turned into newlines (^J, '\n') on input (normally you can't tell them apart and both will terminate the current line).
  • INPCK: input parity checking is disabled. Again, not an option that is relevant on pseudo-ttys.
  • IXON: with this disabled, ^S and ^Q do not pause and then restart output.
  • OPOST: disables any 'implementation-defined' output processing. On Linux (and probably many others) this is the setting that normally turns a newline into a CR-NL sequence.
  • PARENB: disables parity generation on output and apparently also parity checking on input, making it overlap a bit with INPCK.
  • IEXTEN: disables additional input processing and line editing characters. Things like word erase were not part of the original Unix tty line editing, so they have to be enabled separately from the basic line editing characters that are covered by ICANON. It's common for extended line editing to be enabled only if both ICANON and IEXTEN are on.

    (Unixes vary on what effect IEXTEN has beyond enabling the additional line editing characters. Linux pretty much only uses it for that, but Solaris does additional stuff with it.)

  • ISIG: with this disabled, things like ^C do not generate interrupts when they're typed.

Raw mode also does stuff with CSIZE, which is unusual because it's a mask instead of a flag; it's the set of bits (in one of the fields) that determine the bit size of characters. You mask off the CSIZE bits first and then set one of the available sizes; 'raw' mode sets CS8, for 8-bit characters.

(This is a little bit confusing in the Python code, which masks off the CSIZE bits at the same time as it's disabling PARENB.)
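
(To put it all in one place, here is a Python sketch that makes every change listed above, plus the cbreak ones; Python's own tty.setraw() is very close to this.)

import sys
import termios

def set_raw(fd):
    iflag, oflag, cflag, lflag, ispeed, ospeed, cc = termios.tcgetattr(fd)
    iflag &= ~(termios.BRKINT | termios.ICRNL | termios.INPCK | termios.IXON)
    oflag &= ~termios.OPOST
    cflag &= ~(termios.CSIZE | termios.PARENB)   # mask off the size bits, then...
    cflag |= termios.CS8                         # ...set 8-bit characters
    lflag &= ~(termios.ECHO | termios.ICANON | termios.IEXTEN | termios.ISIG)
    cc[termios.VMIN] = 1
    cc[termios.VTIME] = 0
    termios.tcsetattr(fd, termios.TCSADRAIN,
                      [iflag, oflag, cflag, lflag, ispeed, ospeed, cc])

set_raw(sys.stdin.fileno())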

Because Unix tty handling has a huge amount of historical baggage this collection of flags is split across four fields (input, output, 'control', and 'local'). Which field a flag is in is somewhat arbitrary and generally confusing.

(Update: as eevee notes, pretty much all the detailed documentation you could ask for is in termios(3).)

Update, July 1st 2014: I've now noticed that I flipped ^J and ^M in my description of ICRNL. Oops. Fixed.

CBreakAndRaw written at 01:16:10

