2006-07-25
An awk idiom: getting fields backwards from the end of the line
It's easy in awk to get fields counting from the start of a line;
the first field is $1, the second field is $2, and so on. But
periodically I'm not interested in fields at the start of the line, I'm
interested in a field at the end of the line; it's much easier to see
that it's the third-last field than to carefully count how many fields
it is from the start.
(And sometimes you have a variable number of fields in a line but you know that what you want is always the Nth field from the end, in which case counting up from the front doesn't help at all. One common case is when a logical field can sometimes have whitespace, so awk will turn it into a variable number of fields.)
Fortunately awk has a way out: '$' is actually an operator (a very
high precedence one), so it takes expressions as well as just numbers,
and the NF variable is the number of fields in the current line.
Because I keep forgetting this: the last field in the line is '$(NF)',
and '$(NF-1)' is the second last field. (Because awk counts fields
from 1 instead of from 0, unlike Python and Perl.)
(Okay, technically you can use the $[ magical variable to make Perl
1-based, or in fact arbitrarily based. Don't.)
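For comparison, here is a rough Python sketch of the same idea. The helper name field_from_end and the sample line are made up for illustration; str.split() with no arguments splits on runs of whitespace, which is close to (but not guaranteed identical to) awk's default field splitting.

```python
def field_from_end(line, n):
    """Return the n-th field counting back from the end of the line.

    n=0 is the last field (awk's $(NF)), n=1 is the second last
    (awk's $(NF-1)), and so on.
    """
    fields = line.split()     # split on runs of whitespace
    return fields[-(n + 1)]   # Python counts back from -1, not from NF


# Hypothetical line where a logical field (the quoted name) contains
# whitespace, so counting from the front is unreliable.
line = 'upload "my summer photos.zip" 48213 OK'
print(field_from_end(line, 0))  # the status field
print(field_from_end(line, 1))  # the size field
```

Because Python indexes from 0 and negative indexes start at -1, the off-by-one runs the other way from awk: $(NF-1) corresponds to fields[-2].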
Reading Unix manpages
One of the important skills for Unix programming is the ability to parse manpages carefully. This is not as easy as it looks, because manpages are traditionally written in a style where everything is important and you have to think carefully about the implications of the exact wording used.
(This can be useful for other things than Unix manpages, since a lot of specifications are written in the same style.)
For example, today I was emailed a comment on my Python
socket module irritation entry
pointing out the existence of the .makefile() method,
which is documented as:
Return a file object associated with the socket. [...] The file object references a
dup()ped version of the socket file descriptor, so the file object and socket object may be closed or garbage-collected independently.
Thinking about how I would use this, one of the things I found myself
wondering about was what would happen if you dup()ped a socket
file descriptor and called shutdown() on only one of the file
descriptors. (Bearing in mind that you have to close() all of the
file descriptors for a socket before the socket goes away.)
So I consulted the manpage. The Linux shutdown(2) manpage contains the
following description (emphasis mine):
The shutdown call causes all or part of a full-duplex connection on the socket associated with fd to be shut down.
(Similar wording appears in the Solaris and FreeBSD manual pages.)
Once I put on my spec-reading hat, it was clear that saying 'the socket
associated with fd' instead of something like 'the file descriptor
fd' was important. Thus shutdown(2) is not like close(): it acts on the
socket itself and takes effect immediately when called, no matter how
many times the file descriptor has been dup()ped.
(And some quick Python later, I had confirmed this.)
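The experiment is easy to repeat. What follows is a minimal sketch rather than the original test: it assumes a Unix system and uses socket.socketpair() as a stand-in for a real network connection, and the helper name peer_sees_eof is made up for illustration.

```python
import socket


def peer_sees_eof(action):
    """Apply `action` to a dup of one end of a connected socket pair.

    Returns True if the other end then reads end-of-file, False if
    the connection is still open and there is simply no data yet.
    """
    a, b = socket.socketpair()
    a_dup = a.dup()              # a second descriptor for the same socket
    try:
        action(a_dup)
        b.setblocking(False)
        try:
            return b.recv(1) == b""   # an empty read means EOF
        except BlockingIOError:
            return False              # still open, just nothing to read
    finally:
        for s in (a, a_dup, b):
            s.close()                 # closing twice is harmless


# shutdown() acts on the socket itself, so the peer sees EOF even
# though the original descriptor was never touched.
print(peer_sees_eof(lambda s: s.shutdown(socket.SHUT_WR)))

# close() only drops one descriptor; the original still holds the
# socket open, so the peer sees nothing.
print(peer_sees_eof(lambda s: s.close()))
```

The first call should report EOF and the second should not, which is the difference between 'the socket associated with fd' and 'the file descriptor fd'.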
2006-07-18
Why you can't mix wildcard IP port binding with other bindings
A while back I wrote, about TCP port binding:
Unfortunately, you can't bind a port to the generic wildcard address if anyone is using the port on a specific IP address, hence the Apache restart problem. (Why this port binding limitation is sensible is beyond the scope of this margin, but it is, however annoying it periodically is.)
Today I feel like explaining why. The core problem is deciding who gets handed a new connection (or for UDP, a new packet); if you allow multiple bindings, all of the possible answers create security issues.
Imagine that a hypothetical system allowed two different programs
to listen for connections on port 3000, one with a wildcard address
(aka INADDR_ANY) and one specifically with the local address
127.0.0.1. There are generally three useful answers for who gets the
connection when someone connects to port 3000 on 127.0.0.1:
- the program that bound the port with the 127.0.0.1 address.
- the program that bound the port with the wildcard address.
- something different for each connection (a random choice, the first
program to call accept(), strict alternation, etc).
All of these let a malicious program steal incoming connections from a legitimate program and do nasty things with them. The first two are mirrors of each other and each can hose you in the right circumstances, and the third just means that the malicious program steals only some of the connections instead of all of them.
The one limited way out that I can see is to make binding the second port succeed only if the process doing so is privileged, and have it forcefully close the first port. (If you don't force-close the first port, you open daemons up to subtle connection stealing attacks when they're restarted or started after boot.)
But, really, the simplest way out is to prevent the whole situation coming up in the first place, which is exactly what Unix does. (I suspect Windows does likewise, but I don't know for sure.)
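The restriction can be seen directly. This is a minimal sketch, assuming a Unix system with a working loopback interface; the function name is made up, and socket options such as SO_REUSEADDR and SO_REUSEPORT (which change the rules on some systems) are deliberately left unset.

```python
import errno
import socket


def try_wildcard_after_specific():
    """Bind a port on 127.0.0.1, then try the wildcard address.

    Returns the errno from the second bind(), or None if it succeeded.
    """
    specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        specific.bind(("127.0.0.1", 0))   # port 0: the kernel picks a free port
        specific.listen(1)
        port = specific.getsockname()[1]
        try:
            wildcard.bind(("0.0.0.0", port))   # INADDR_ANY, same port
        except OSError as e:
            return e.errno
        return None
    finally:
        specific.close()
        wildcard.close()


result = try_wildcard_after_specific()
print(errno.errorcode.get(result, result))
```

On Linux the second bind() should fail with EADDRINUSE, so the question of who gets a connection to 127.0.0.1 never comes up.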
(PS: the title of this entry illustrates how I sometimes can't write good titles.)
2006-07-06
More on tabs
The one thing that changing hard tabstops allows is for the same file to come out different for different people. This may explain its enduring popularity, because in theory you can use it to view source files in your favorite indentation level while letting me view them in mine. If you like two-space indents and I'm a mutant who likes 8-space ones, we can both be happy.
Of course this is only an illusion; it breaks down explosively any time you need sub-tab indentation (and any time you're using indentation to align with other text).
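A small sketch of the breakdown, using Python's str.expandtabs() to stand in for how a viewer renders tabs. The sample source and the helper name indent_widths are made up for illustration; the sample is written the way an editor with 8-column tabs and 4-column indent steps would store it.

```python
# Written under 8-column tabs with 4-column indent steps: the deeper
# level is stored as a single tab, the shallower one as four spaces.
SOURCE = "if x:\n    if y:\n\tz()\n"


def indent_widths(text, tabstop):
    """Return the visible indentation of each line at a given tabstop."""
    widths = []
    for line in text.splitlines():
        shown = line.expandtabs(tabstop)
        widths.append(len(shown) - len(shown.lstrip(" ")))
    return widths


print(indent_widths(SOURCE, 8))  # [0, 4, 8]: nests the way the author saw it
print(indent_widths(SOURCE, 2))  # [0, 4, 2]: the innermost line jumps left
```

With the tabstop changed to 2, the innermost line ends up less indented than the line that contains it.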
(Alas, this entire chain of logic is partly an illusion. I'm pretty sure
that vi's tabstop setting predates any sort of support for soft tabs,
and was created so that people who wanted to indent by four spaces could
do so in the easiest way possible.)
I'll also take this opportunity to link to Jamie Zawinski's Tabs versus Spaces: An Eternal Holy War, which talks a lot of sense about the whole situation and does a lot less ranting than I did yesterday.
The whole issue may sound like a tempest in a teapot, but I think it's important. People spend a lot of time reading source code, and we know that whitespace is important for readability in general. If we want readable code (and we should), we need to get the whitespace right enough; it's just that no one is entirely convinced about what 'right enough' is.
(And having written this I'm forced to wonder what the typesetting wonks might have to teach us on all of this, since whitespace issues do come up when typesetting ordinary books and so on.)
2006-07-05
On tabs
I rarely have violent visceral reactions to things. But I have a few hot buttons, and here's one of them. Quoting from Joel Spolsky's summary:
Nick Gravgaard: "Rather than saying that a tab character (a "hard tab") will move the cursor until the cursor's position is a multiple of N characters, we should say that a tab character is a delimiter between table cells..."
Wrong. Wrong wrong wrong. The core wrongness is visible in a single line in the original:
The solution then is to redefine how tabs are interpreted by the text editor.
The problem is that it's not just your text editor that matters.
Actual real tab characters have a well-defined meaning, enforced by
terminal emulators, pagers like less, many editors, your browser (in
<pre> text), and programs that print things out. If you decide, like
Humpty Dumpty, to make Control-I mean something else for you, you have
adopted an entirely quixotic quest and I want you nowhere near anything
I ever have to read.
(Because I certainly do not want to have to use your editor, assuming it is even available on my platform, to read your text in a way that makes it look decent and comprehensible.)
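The conventional meaning is small enough to write down. This is a sketch of the usual rule (advance to the next column that is a multiple of the tabstop, with 8 as the traditional default), not the code of any particular terminal or pager; the function name expand_tabs is made up.

```python
def expand_tabs(text, tabstop=8):
    """Expand tabs the conventional way: each tab advances to the
    next column that is a multiple of `tabstop`."""
    out = []
    col = 0
    for ch in text:
        if ch == "\t":
            pad = tabstop - (col % tabstop)   # always at least one column
            out.append(" " * pad)
            col += pad
        elif ch == "\n":
            out.append(ch)
            col = 0                           # new line, back to column 0
        else:
            out.append(ch)
            col += 1
    return "".join(out)


sample = "a\tb\n12345678\tx\n"
# Python's built-in str.expandtabs() follows the same rule.
print(expand_tabs(sample) == sample.expandtabs(8))
```

Every program that agrees on this rule shows the same text the same way, which is what redefining Control-I in one editor gives up.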
This need not cramp anyone's creativity, since I don't care what the tab key on your keyboard does and I don't care what mixture of characters your editor uses to embody your favorite indentation (provided it agrees with the rest of the world on what a Control-I character does). This leaves a great deal of room in the user interface, as GNU Emacs has been demonstrating for years.
(If you want to be able to edit Unix Makefiles, your editor had better be able to create and preserve real tab characters. Arguably this was a mistake way back in V7, but we're certainly stuck with it now.)
From this, you may gather that my opinion of vi's tabstop setting and
the corresponding GNU Emacs tab-width variable is rather low.