2008-07-31
A crude system verification method
Suppose that you have a system that you are not entirely confident of, and you want to see whether bits of it have been modified from stock. The easiest way is to use your packaging system's verification support, but let us suppose that your package system doesn't have such support (or at least that the support is optional and not installed at the moment).
If you happen to have another theoretically identical system lying
around (as we do), you can do a crude system verification with rsync:
rsync -n -a -i --delete -IOc root@hostA:/usr/ /usr/
Here hostA should be the machine that you want to verify, not the
machine that you want to verify it against. It also assumes that you
can do ssh root logins to hostA.
Some of these options are not obvious; -O makes rsync ignore changed
directory times, while -I and -c force rsync to always checksum
files to see whether they differ, instead of trusting the size
and the timestamp. (You also want -i or -v; without one of them a
dry run prints nothing about what differs.)
(Package systems generally don't reset the directory modification time
when they update programs in a directory, so directories like /usr/bin
can naturally have different timestamps on different machines. Ignoring
them saves you from drowning in noise.)
This isn't likely to work on Linux machines that use prelinking, because prelinking can create different binaries even on machines with identical package sets.
Disclaimer: as a crude verification method, this should only be used if you are mostly confident in the system to start with. If you are not, remember the zeroth law of compromised systems.
2008-07-20
Thinking about uses for (system) activity tracers
System activity tracers are a hot topic, with the best known one being Sun's DTrace. In thinking about this issue recently, I believe that there are three sorts of questions that they can be used to answer, or at least that I'm interested in having answered:
- what is my system doing?
Performance related tracing is one obvious subset of this, both in the 'what is taking all the time' sense and in the 'how long does some operation take' sense.
- why is my system doing X, in the sense of 'what is doing X on my
system'?
Here you have some peculiar thing happening on your system and you want to trace it back to the program or system or action that causes it. For example, laptop people are interested in questions like 'what is accessing my hard drive' and 'what is waking up all the time'.
- why is some part of my system doing what it is, or at least what information is it using to make the decisions about what it does?
The latter is important for solving specific problems; often you know roughly what is going wrong and what program is responsible, but you don't know why and how it is going wrong because you can't see the program's decision making process or even the information it is getting to make the decision. For example, consider 'I can't NFS-mount a filesystem that I think I should be able to'.
In theory you could deal with this by having programs optionally log a lot of information. My personal feeling (partly from having dealt with programs that did copious logging if asked) is that it is better to have a single central interface for deciding what you want to watch and log than to try to give every program options to control all of this; it just scales better, and it's probably easier for program authors too (since they just have to make some hooks available, instead of building a dynamically reconfigurable debug logging system).
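As a tiny illustration of the first sort of question, here is a crude, tracer-free sketch (Linux-specific; the /proc/diskstats field positions are assumed from its documented format) that samples disk I/O counters around some activity to see which devices were busy:

```shell
#!/bin/sh
# Crude 'what is my system doing?' probe without a real tracer: sample
# /proc/diskstats (field 3 = device, 4 = reads completed, 8 = writes
# completed), generate some disk activity, sample again, and diff.
before=$(mktemp); after=$(mktemp)
awk '{print $3, $4 + $8}' /proc/diskstats 2>/dev/null > "$before"
dd if=/dev/zero of=/tmp/probe.$$ bs=4096 count=16 conv=fsync 2>/dev/null
awk '{print $3, $4 + $8}' /proc/diskstats 2>/dev/null > "$after"
# Count devices whose completed-I/O counters moved between samples.
changed=$(paste "$before" "$after" | awk '$2 != $4' | wc -l)
summary="devices with I/O activity during the write: $changed"
echo "$summary"
rm -f "$before" "$after" /tmp/probe.$$
```

A real tracer answers the same question with far more detail (which process, which file), but this sample-and-diff pattern is the core of it.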
2008-07-11
The case of the mysteriously failing connections
One of the strange networking mysteries around here is that every so often, one of our login servers will report that outgoing mail was delayed because it could not connect to the mail server's SMTP port. There are several things that make this puzzling:
- the connection is failing with 'host not reachable' errors, not 'connection refused' or the like
- the mail server is up, running fine, and not loaded at all
- the login servers and the mail server are on the same subnet, although they are not connected to the same switch
This happens very infrequently, and every time we've seen it happen it's gone away when the mailer retried a bit later (which is one reason we haven't worried about it more).
Like the last mystery, I don't have any answers, but I do have a theory. First, the background: our login servers are all on a single switch, along with our compute servers. We know that during periods of high activity the switch is sending 'stop transmitting' Ethernet flow control frames to the login servers; we believe that the switch's uplink is saturated, since it's only got a gigabit uplink and is connecting eight or nine actively used machines that get all the important filesystems over NFS.
(We actually split the machines between two switches moderately recently; I don't know if we've seen the problem since then.)
So my theory is that during periods of high network activity when the switch is choked, the login server's ARP requests for the mail server's Ethernet address are getting dropped (either by the switch or by the login server's network driver). Linux does report 'host unreachable' if there's no answer to its ARP queries, and people send email from the login servers sufficiently infrequently that the necessary information could drop out of the local ARP cache.
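If this theory is right, the failure should be visible in the kernel's neighbor (ARP) cache at the moment the error happens. A sketch of checking that on Linux with iproute2 (192.0.2.25 is a made-up stand-in for the mail server's address):

```shell
#!/bin/sh
# Inspect the neighbor cache for one host; a missing, FAILED, or
# INCOMPLETE entry is what 'host unreachable' from a dropped ARP
# exchange would look like. MAILHOST is a hypothetical address.
MAILHOST=${MAILHOST:-192.0.2.25}
state=$(ip neigh show "$MAILHOST" 2>/dev/null | awk '{print $NF}')
case "$state" in
    ""|FAILED|INCOMPLETE)
        msg="$MAILHOST: no usable ARP entry (would look unreachable)" ;;
    *)
        msg="$MAILHOST: in the ARP cache, state $state" ;;
esac
echo "$msg"
```

Running something like this from cron (or from the mailer's error handler) when the delays happen would confirm or kill the theory.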
2008-07-06
A small drawback to Wietse Venema's TCP Wrappers
Wietse Venema's tcpwrappers is mostly used for controlling access to services run from inetd or your local equivalent, where you are not expecting high performance or high load. However, it can be built as a library that you link your daemon against, and some daemons are.
(At least on Linux, both OpenSSH and the portmapper are built this way.)
It turns out that there is a small drawback to using tcpwrappers this way in some sorts of high-performance situations, specifically in applications where you expect a lot of connections or want to be able to dispatch connections very fast. The drawback is this:
Tcpwrappers does no caching of the hosts.allow and hosts.deny files.
Every time you call the tcpwrappers routines to check for host access, they open, read, and parse the files completely from scratch. If your files are small, this doesn't matter, but if they're large, you may be burning more CPU time on this than you expect.
(You're very unlikely to be hit with disk IO for reading the files; if you're getting any sort of connection volume they'll be in the filesystem cache.)
One useful thing to know for best performance is that tcpwrappers deals with the files strictly a line at a time (instead of parsing the entire file, then evaluating the parsed rules). This implies that it's worth putting the rules for the most common cases first if you have big files.
(Big hosts.allow and hosts.deny are probably uncommon, but I once
had a hosts.deny file that was over 4,000 lines long. That was
admittedly a special case, and eventually got replaced with better
technology.)
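Since the rules are evaluated a line at a time with the first match winning, ordering matters. A toy first-match scanner (not tcpwrappers itself; the file format here is simplified and made up) makes the point:

```shell
#!/bin/sh
# Toy first-match scan over a big rules file: like tcpwrappers, stop at
# the first matching line, so rules for common clients near the top are
# much cheaper to reach than ones near the bottom.
rules=$(mktemp)
{
    echo "sshd : 10.0.0. : allow"            # the common case, first
    i=0
    while [ "$i" -lt 1000 ]; do              # 1,000 filler deny rules
        echo "sshd : 198.51.100.$i : deny"
        i=$((i + 1))
    done
} > "$rules"
# Print the first rule whose client field is a prefix of the client.
match_client() {
    while IFS= read -r line; do
        pat=$(echo "$line" | awk -F' : ' '{print $2}')
        case "$1" in "$pat"*) echo "$line"; return 0 ;; esac
    done < "$rules"
    return 1
}
first=$(match_client 10.0.0.5)   # matches on the very first line read
echo "$first"
rm -f "$rules"
```

With the common rule at the bottom instead, every lookup would read and split all 1,001 lines before answering.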
2008-07-03
Why system administrators like interpreted languages
Or, more specifically, why sysadmins like programs written in interpreted languages. I say this because we do; it is one reason for the enduring popularity of writing things in the Bourne shell, since in a sense the Bourne shell is the platonic ideal of an interpreted language on Unix systems.
Here's why I at least really like such programs:
- you don't need a different copy for every different sort of system that you have, because
- you don't have to compile your programs.
- thus there is no bootstrapping required on new systems; you can simply grab a copy of the program and run.
- also, that means that you don't have to keep the source somewhere; the program is the source and can't get lost.
Strictly speaking, some of these aren't advantages of writing in interpreted languages, they're advantages of using self-contained programs. You can certainly write large programs in interpreted languages that require being installed before they can run (and that have large lists of fragile dependencies).
(However, some language environments make it easy to skip the
installation step, at least on a temporary basis, with features such
as search paths that include the current directory by default. And
these days any sensible system should make copying over a directory as
easy as copying over a file, because everyone should ship with a
version of rsync, right? Yes, I'm looking at you, Sun.)
This also explains why I would generally rather use a hacked together and limited shell script than a much nicer C program; the shell script is a lot less hassle for casual use.
Sidebar: the source code advantage (again)
Take it from me: the last advantage is a big one. When the program is its own source, not only can it never get lost but there is never any question about which version of the source was used to build the binary (and with what compile options and so on).
(And let's not get started about old source that won't build in a modern environment, leaving you with a bunch of work to do if the old binaries stop working someday.)
This isn't to say that interpreted languages make portability issues go away entirely; of course they don't (and sometimes shell scripts make it worse). But they at least make it obvious, because your program blows up right away.
(The more technical way to put it is that with interpreted programs, you don't have any difference between source compatibility and binary compatibility. Compiled programs can preserve binary compatibility while breaking source compatibility.)