The problem of being overcautious
Today's fire drill was caused by our printing system not printing; since it is the first day of classes, it was not a good time to discover this. After fixing a couple of small problems, the big stumbling block was authentication not working.
Our printing system has a central machine that handles quota management and a per-lab machine (a 'labmaster') that handles the actual print spooling and printing. This requires the labmasters to talk to the quota server to tell it about pages that got printed.
Because I am paranoid, the quota server insists that connections from the labmasters be somewhat authenticated (otherwise a clever student could ruin someone else's day by telling the system they'd just printed 10,000 pages). Because I am lazy, the authentication is done by the RFC 1413 'ident protocol', which gives the nominal owner of one end of a TCP connection. In this case, the quota server only accepts print updates from the user 'lp' on the labmasters.
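To make the ident exchange concrete, here is a minimal sketch in Python of the two halves the quota server needs: building the RFC 1413 query line and picking the user out of the reply. The function names are made up for illustration and this is not our actual quota server code; it also trims whitespace around the user ID, which real-world clients commonly do but which is looser than a strict reading of the RFC.

```python
from typing import Optional


def build_ident_query(port_on_identd_host: int, port_on_querier: int) -> bytes:
    """Build the one-line RFC 1413 query sent to TCP port 113.

    The first port is the one on the machine running the ident daemon
    (here, the labmaster); the second is the port on the machine asking
    (here, the quota server).
    """
    return f"{port_on_identd_host} , {port_on_querier}\r\n".encode("ascii")


def parse_ident_reply(reply: str) -> Optional[str]:
    """Return the user ID from a USERID reply, or None for ERROR or junk."""
    parts = reply.rstrip("\r\n").split(":", 3)
    # A USERID reply has four fields: ports, "USERID", opsys, user ID.
    # An ERROR reply has only three, so it falls through to None.
    if len(parts) == 4 and parts[1].strip().upper() == "USERID":
        # Simplification: trim whitespace around the user ID.
        return parts[3].strip()
    return None


def is_print_update_allowed(reply: str, allowed_user: str = "lp") -> bool:
    """Accept the update only if ident names the expected user."""
    return parse_ident_reply(reply) == allowed_user
```

Note that an ERROR reply (or no usable reply at all) is treated as a refusal, which is exactly what bit us here.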
Examining logs showed that authentication was failing because authd (the 'ident protocol' daemon) on the labmasters wasn't returning information about the connection. Worse, this wasn't a general failure; if I tried it by hand, it worked. Only the quota checking script run as part of printing provoked the authd failures.
It took careful examination of authd's code and a certain amount of staring at debugging output and capturing snapshots of system files like /proc/net/tcp to find the problem: excessive caution.
TCP connections are uniquely identified by the quad of 'source host, source port, destination host, destination port'. But when it reads /proc/net/tcp to find the right connection, authd checks more than that; it also requires that the state of the TCP connection be 'ESTABLISHED'. However, if you tell the kernel that you are done writing data to the connection and will henceforth only read data, the kernel moves your connection to the 'FIN_WAIT1' state.
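The effect of that extra state check can be sketched in a few lines of Python. This is an illustration, not authd's actual code: the function name and the sample data (addresses, ports, uid 4, inode numbers) are invented, and addresses are compared as the raw hex strings that /proc/net/tcp prints rather than being decoded. The state codes are the kernel's: 01 is ESTABLISHED and 04 is FIN_WAIT1.

```python
from typing import Optional

TCP_ESTABLISHED = 0x01
TCP_FIN_WAIT1 = 0x04

# Invented sample in /proc/net/tcp layout: a listener, plus one connection
# owned by uid 4 that has already moved to FIN_WAIT1 (state 04).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0A01A8C0:C350 0201A8C0:1F90 04 00000000:00000000 00:00000000 00000000     4        0 1002
"""


def find_owner_uid(table: str, local: str, remote: str,
                   require_established: bool) -> Optional[int]:
    """Return the uid owning the connection local -> remote, or None.

    local and remote are 'HEXADDR:HEXPORT' strings exactly as they appear
    in /proc/net/tcp.
    """
    for line in table.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 8:
            continue
        if fields[1] != local or fields[2] != remote:
            continue
        state = int(fields[3], 16)
        if require_established and state != TCP_ESTABLISHED:
            continue  # the overcautious check: a half-closed connection is skipped
        return int(fields[7])
    return None
```

With the sample above, find_owner_uid(SAMPLE, "0A01A8C0:C350", "0201A8C0:1F90", True) finds nothing, while the same call with require_established set to False returns the owning uid. The quad alone was enough to identify the connection.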
The script uses a program that opens a connection, sends a line to the other end, and then immediately tells the kernel it's done writing. By the time the quota server program got around to asking authd who was making the connection, the kernel had already put the connection into 'FIN_WAIT1' and authd skipped over it. (When I tried by hand I wasn't using a program that finished writing immediately, and my connection stayed in 'ESTABLISHED'.)
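For illustration, the client-side pattern looks roughly like the Python sketch below. This is not the actual program the script uses; the function names are invented, and the small loopback server in demo() exists only so the example runs on its own. The important call is shutdown(SHUT_WR), which sends a FIN and takes the client's end of the connection out of ESTABLISHED while the client is still waiting to read the reply.

```python
import socket
import threading


def send_line_then_half_close(host: str, port: int, line: str) -> str:
    """Send one line, declare we are done writing, then read until EOF."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii") + b"\n")
        # Half-close: we will only read from now on. The kernel sends a FIN
        # and our end of the connection leaves the ESTABLISHED state.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii")


def demo() -> str:
    """Run the client against a throwaway loopback server."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_once() -> None:
        conn, _ = listener.accept()
        with conn:
            received = b""
            while True:  # read until the client's half-close shows up as EOF
                data = conn.recv(4096)
                if not data:
                    break
                received += data
            conn.sendall(b"got: " + received)

    worker = threading.Thread(target=serve_once)
    worker.start()
    reply = send_line_then_half_close("127.0.0.1", port, "hello")
    worker.join()
    listener.close()
    return reply
```

A server that does its ident lookup only after reading the request, as the stand-in server here reads to EOF before replying, will always be looking at a connection whose client side has already left ESTABLISHED.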
I'm sure the author of authd felt he was being careful about the whole thing by checking the connection state as well as everything else. However, his caution led to a problem, because his check wasn't complete.
Every time you check something you have to be accurate and complete. The more things you check, the more work you have to do and the greater the chance that you've gotten something wrong. Thus, more checks can actually mean more bugs, instead of fewer.
Being complete can be difficult. For example, I'm not sure which connection states a valid TCP connection can be in under the Linux kernel, and finding out would probably require a bunch of research. (Which I could make mistakes in.)
Because of this, rather than make authd check for FIN_WAIT1 as well as ESTABLISHED, I just took the check out entirely.