2012-12-31
GNU sort's -h option
I only recently became aware of
GNU sort's -h option, which strikes me as a beautiful encapsulation of
everything (both good and bad) that people attribute to GNU programs and
their profusion of options.
GNU sort's -h is like -n (sort numerically) except that it sorts
GNU's 'humane' numbers (things like '1.5K' and '2.0G'), as produced by (for example) GNU
du's -h option. This leads naturally to a variant of a little script
that I've already talked about:
du -h | sort -hr | less
On the one hand, -h is clearly useful in both commands. Humane numbers
are a lot easier to read and grasp than plain numbers, and now GNU sort
will order them correctly for you. On the other hand you can see the
need for a -h argument to sort as evidence of an intrinsic problem
with du -h; in this view, GNU is piling hack on top of hack. The
arguable Unix way might be a general hum command that humanized all
numbers (or specific columns of numbers if you wanted); that would make
the example into 'du | sort -nr | hum | less', which creates a
general tool at the price of making people add an extra command to their
pipelines.
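To make the idea concrete, here's a sketch of what a minimal hum might look like. This is entirely hypothetical (no such standard command exists); it only humanizes a leading number on each line, and a real version would need to know about units and columns:

/* hum: humanize a leading number on each line of stdin.
 * A hypothetical sketch; a real version would want column
 * selection and some idea of what units the numbers are in. */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static void humanize(double n) {
    const char *suffixes = "KMGTPE";
    int i = -1;
    while (n >= 1024 && i < 5) {
        n /= 1024;
        i++;
    }
    if (i < 0)
        printf("%.0f", n);
    else
        printf("%.1f%c", n, suffixes[i]);
}

int main(void) {
    char line[8192];
    while (fgets(line, sizeof line, stdin)) {
        char *rest;
        double n = strtod(line, &rest);
        if (rest != line && (*rest == '\0' || isspace((unsigned char)*rest))) {
            humanize(n);           /* the leading number, humanized */
            fputs(rest, stdout);   /* the rest of the line, as-is */
        } else {
            fputs(line, stdout);   /* no leading number; pass through */
        }
    }
    return 0;
}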
I don't have any particular view on whether GNU sort's -h option is
Unixly wrong or not. I do think that it's (seductively) convenient, and
now that I've become aware of it, it's probably going to work its way
into various things I do.
(This could spark a great debate on what the true Unix way is, but I'm not going to touch that one right now.)
2012-12-15
A few small notes about OpenBSD PF (as of 4.4 and 5.1)
Suppose that you read the pf.conf manpage (in OpenBSD 4.4 or 5.1) and
stumble across the following:
max-src-conn <number>
    Limits the maximum number of simultaneous TCP connections which have completed the 3-way handshake that a single host can make.
Great, you say, this is just what you need to make sure that bad people are not holding too many connections to your web server open at once. So you write a PF rule more or less like this:
table <BRUTES> persist
block in log quick on $EXT_IF proto tcp from <BRUTES> to any port 80
pass in quick on $EXT_IF proto tcp from any to any port 80 \
     keep state \
     (max-src-conn 20, overload <BRUTES> flush)
Shortly after you activate this rule you may discover an
ever-increasing number of web crawler IPs listed in your BRUTES table,
which will probably surprise you. What is going on is that the OpenBSD
manpage is misleading you.
max-src-conn does not limit the number of concurrent TCP connections.
It limits the number of state table entries for TCP connections that
have been fully established. If you examine the state tables as a web
crawler is walking your site, you will discover any number of entries
sitting around in FIN_WAIT_2. These connections are thoroughly
closed but, guess what, they count against max-src-conn until they
expire completely.
An extremely technical reading of the wording of the pf.conf manpage
might lead you to claim that this is allowed by the manpage (if you
say that a TCP connection still exists in FIN_WAIT_2), but at the
least I think this is going to surprise almost everyone. It also renders
this max-src-conn rule useless in limiting the number of concurrent
real TCP connections. Given that states linger in FIN_WAIT_2 for
on the order of a minute or more, there is no feasible setting for
max-src-conn that will allow a crawler to make one or two requests a
second without getting blocked while also giving you a useful concurrent
connections limit. (At one request a second and a sixty-second lifetime
for FIN_WAIT_2 states, a crawler is 'holding' roughly 60 connections
at any given time as far as max-src-conn is concerned.)
(This almost certainly applies to max-src-states too, but at least
that is explicitly documented in terms of state table entries.)
But wait, the fun isn't done yet. You decide that you really need to
limit the number of concurrent real TCP connections. You don't really
care if stray out-of-sequence packets from fully closed connections
get rejected by the firewall (they'd only get rejected by the host
anyways), so the obvious solution is to set a very fast timeout for
those lingering FIN_WAIT_2 states. You read the fine pf.conf
manpage again and spot some timeout settings (which can be either global
or per-state-creating-rule):
tcp.closed
    The state after one endpoint sends an RST.
tcp.finwait
    The state after both FINs have been exchanged and connection is closed. [...]
There is no pleasant way to put this: the pf.conf manpage is lying to
you. Setting tcp.finwait to a very low value will do exactly nothing
to help you; you need to set tcp.closed. The state timeouts are
actually:
tcp.closed
    Both sides in FIN_WAIT_2 or TIME_WAIT.
tcp.finwait
    Both sides in CLOSING, or one side CLOSING and the other side has progressed a bit further.
tcp.closing
    One but not both sides in CLOSING, ie a FIN has been sent.
tcp.established
    Both sides ESTABLISHED.
tcp.opening
    At least one side not ESTABLISHED yet.
(All of this is expressed in terms of what 'pfctl -ss' will print
as the states. There are a few intermediate transient states that may
show up which I am eliding because my head hurts. See the logic in
sys/net/pf.c and the list of states in sys/netinet/tcp_fsm.h if
you really care.)
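To make the fix concrete, here's roughly what it looks like in pf.conf terms; the five second value is purely illustrative, not a recommendation:

# globally:
set timeout tcp.closed 5

# or only for states created by one rule:
pass in quick on $EXT_IF proto tcp from any to any port 80 \
     keep state \
     (max-src-conn 20, overload <BRUTES> flush, tcp.closed 5)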
The manpage is partly technically correct in that after an RST is
sent, PF puts the state into TIME_WAIT and tcp.closed applies.
This is also the only time that a state winds up in TIME_WAIT.
(I have verified this behavior on OpenBSD 4.4. I have not verified
the behavior on OpenBSD 5.1 but the sys/net/pf.c code involved is
basically the same and reads just the same as the 4.4 version; in fact
my table above is generated by reading the 5.1 pf.c source code (and
my manpage quotes are from the 5.1 manpages). I have not looked at 5.2
source or manpages.)
2012-12-12
fork() and closing file descriptors
As I noted in Why fork() is a good API, back in the
bad old days Unix had a problem of stray file descriptors leaking from
processes into commands that they ran (for example, rsh used to gift
your shell process with any number of strays). In theory the obvious
way to solve this is to have code explicitly close all file descriptors
before it exec()s something. In practice Unix has chosen to solve this
with a special flag on file descriptors, FD_CLOEXEC, which causes
them to be automatically closed when the process exec()s.
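Setting the flag on an existing file descriptor is a small fcntl() dance (a minimal sketch):

#include <fcntl.h>

/* Mark fd close-on-exec so the kernel closes it on exec() for us. */
int set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}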
In that entry I mentioned that there was a good reason
for this alternate solution in practice. At the start of planning this
followup entry I had a nice story all put together in my head about why
this was so, involving thread-based concurrency races. Unfortunately
that story is wrong (although a closely related concurrency race story
is the reason for things like O_CLOEXEC in Linux's open()).
FD_CLOEXEC is not necessary to deal with a concurrency race between
thread A creating a new file descriptor and thread B fork()ing
and then exec()ing in the child process, because the child's file
descriptors are frozen at the moment that it's created by fork()
(with a standard fork()). It's perfectly safe for the child process
to manually close all stray open file descriptors in user-level code,
because no matter what it does thread A can never make new file
descriptors appear in the child process partway through this. Either
they're there at the start (and will get closed by the user-level code),
or they'll never be there at all.
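A sketch of that pattern, with the manual cleanup done in the child (a hypothetical helper; the close() loop is exactly the brute force approach I talk about below):

#include <sys/types.h>
#include <unistd.h>

extern char **environ;

/* Run prog with only stdin, stdout, and stderr. The child's
 * descriptor table is a snapshot taken at fork() time, so this
 * close loop cannot race with other threads in the parent. */
pid_t spawn_clean(const char *prog, char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        long maxfd = sysconf(_SC_OPEN_MAX);
        for (long fd = 3; fd < maxfd; fd++)
            close(fd);
        execve(prog, argv, environ);
        _exit(127);     /* exec failed */
    }
    return pid;         /* -1 on fork() failure */
}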
There are, however, several practical reasons that FD_CLOEXEC
exists. First and foremost, it proved pragmatically easier to get
code (often library code) to set FD_CLOEXEC than to get every bit
of code that did a fork() and exec() sequence to always clean up
file descriptors properly. It also means that you don't have to worry
about file descriptors being created in the child process in various
ways, especially by library code (which might be threaded code, for
extra fun). Finally, it deals with the problem that Unix has no API for
finding out what file descriptors your process has open, so your only
way of closing all stray file descriptors in user code is the brute
force approach of looping trying to close each one in turn, as in the
earlier sketch (and on modern Unixes, that can be a lot of potential
file descriptors; some Unixes eventually grew a closefrom() call for
exactly this).
Once you have FD_CLOEXEC and programs that assume they can use it
to just fork() and exec(), you have the thread races that lead you
to needing things like O_CLOEXEC. Any time a file descriptor can
come into existence without FD_CLOEXEC being set on it, you have
a race between thread A creating the file descriptor and then setting
FD_CLOEXEC and thread B doing a fork() and exec(). If thread B
'wins' this race, it will inherit a new file descriptor that does not
have FD_CLOEXEC set and this file descriptor will leak through the
exec().
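In code form, the race window is the gap between two calls (a sketch; both functions are hypothetical illustrations):

#include <fcntl.h>

int open_then_mark(const char *path) {
    int fd = open(path, O_RDONLY);
    /* If another thread fork()s and exec()s right here, the child
     * inherits fd without FD_CLOEXEC and it leaks through exec(). */
    if (fd >= 0)
        fcntl(fd, F_SETFD, FD_CLOEXEC);
    return fd;
}

int open_atomic(const char *path) {
    /* O_CLOEXEC closes the window; the descriptor is never
     * visible without the flag set. */
    return open(path, O_RDONLY | O_CLOEXEC);
}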
(All of this is well known in the Unix programming community that pays attention to this stuff. I'm writing it down here so that I can get it straight and firmly fixed into my head, since I almost made an embarrassing mistake about it.)
2012-12-02
What goes into the terminal's 'cbreak' and 'raw' modes
Recently, Eevee tweeted:
things i never thought i'd need to know: the difference between 'raw' and 'cbreak' is not just some flag. it's like 20! [link]
This inspires me to talk about what 'cbreak' and 'raw' modes are, both at a high level and then at the low level of exactly what terminal settings go into each mode.
The traditional 'raw' mode is the easier one to explain; it does no
in-kernel processing of input or output at all. All characters are
returned immediately when typed and all output is produced as-is. The
traditional 'cbreak' mode is used for things like password entry;
it returns characters immediately as they're typed, doesn't echo
characters, and doesn't do any in-kernel line editing (which mostly means
that your program can actually see the various editing characters).
At a high level, there are two major differences between 'cbreak' and
'raw'. First, cbreak leaves output handling unchanged, which may be
relatively 'cooked'. Second, cbreak still allows you to interrupt the
program with ^C, suspend it, and so on. You can see this in action
with most programs (such as passwd, su, or sudo) that ask for a
password; you can immediately interrupt them with ^C in a way that, eg,
vi does not respond to.
The low-level settings for cbreak are:

- disable ECHO; this stops typed characters from being echoed.
- disable ICANON; this turns off line editing.
- set VMIN to 1 and VTIME to 0; this makes it so that a read() returns immediately once there's (at least) one character available.
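In C terms, this is a minimal sketch with POSIX termios (error handling kept short):

#include <termios.h>

/* Enter cbreak mode on fd, saving the old settings for restoration. */
int enter_cbreak(int fd, struct termios *saved) {
    struct termios t;
    if (tcgetattr(fd, &t) < 0)
        return -1;
    *saved = t;
    t.c_lflag &= ~(ECHO | ICANON);   /* no echo, no line editing */
    t.c_cc[VMIN] = 1;                /* read() returns with one char... */
    t.c_cc[VTIME] = 0;               /* ...and doesn't wait for more */
    return tcsetattr(fd, TCSAFLUSH, &t);
}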
This is about the minimum you can do to get anything like this, so pretty much everyone is going to agree on these settings. The low-level settings for raw mode start with cbreak's changes and add a bunch more, but there can be some variation in exactly what settings get added; I'm going to use the Python version from eevee's link. It disables a bunch of additional tty options:
- BRKINT: serial breaks are ignored and converted to null bytes. In the modern world where most ttys are pseudo-ttys instead of serial lines, this generally isn't going to make any difference.
- ICRNL: with this disabled, carriage returns (^M, '\r') are not turned into newlines (^J, '\n') on input (normally you can't tell them apart and both will terminate the current line).
- INPCK: input parity checking is disabled. Again, not an option that is relevant on pseudo-ttys.
- IXON: with this disabled, ^S and ^Q do not pause and then restart output.
- OPOST: disables any 'implementation-defined' output processing. On Linux (and probably many others) this is the setting that normally turns a newline into a CR-NL sequence.
- PARENB: disables parity generation on output and apparently also parity checking on input, making it overlap a bit with INPCK.
- IEXTEN: disables extra additional input processing and line editing characters. Things like word erase were not part of the original Unix tty line editing, so they have to be enabled separately from the basic line editing characters that are covered by ICANON. It's common for extended line editing to be enabled only if both ICANON and IEXTEN are on. (Unixes vary on what effect IEXTEN has beyond enabling the additional line editing characters. Linux pretty much only uses it for that, but Solaris does additional stuff with it.)
- ISIG: with this disabled, things like ^C do not generate interrupts when they're typed.
Raw mode also does stuff with CSIZE, which is unusual because it's
a mask instead of a flag; it's the set of bits in one of the fields
that are used to determine the bitsize of characters. You mask off the
CSIZE bits first and then set one of the available settings of bits;
'raw' mode sets CS8, for 8-bit characters.
(This is a little bit confusing in the Python code, which masks off the
CSIZE bits at the same time as it's disabling PARENB.)
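Putting the whole collection together, here's a hedged C sketch of raw mode that follows this flag list (not necessarily exactly what any particular implementation does):

#include <termios.h>

/* Raw mode: cbreak's changes plus the additional flags above. */
int enter_raw(int fd) {
    struct termios t;
    if (tcgetattr(fd, &t) < 0)
        return -1;
    t.c_iflag &= ~(BRKINT | ICRNL | INPCK | IXON);   /* input flags */
    t.c_oflag &= ~OPOST;                             /* output flags */
    t.c_cflag &= ~(CSIZE | PARENB);                  /* 'control' flags */
    t.c_cflag |= CS8;                                /* 8-bit characters */
    t.c_lflag &= ~(ECHO | ICANON | IEXTEN | ISIG);   /* 'local' flags */
    t.c_cc[VMIN] = 1;
    t.c_cc[VTIME] = 0;
    return tcsetattr(fd, TCSAFLUSH, &t);
}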
Because Unix tty handling has a huge amount of historical baggage, this collection of flags is split across four fields (input, output, 'control', and 'local'; the c_iflag through c_lflag fields in the sketch above). Which field a flag is in is somewhat arbitrary and generally confusing.
(Update: as eevee notes, pretty much all
the detailed documentation you could ask for is in termios(3).)
Update, July 1st 2014: I've now noticed that I flipped ^J and ^M in my description of ICRNL. Oops. Fixed.