Wandering Thoughts archives

2014-10-22

Exim's (log) identifiers are basically unique on a given machine

Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?

The answer turns out to be no. Under normal circumstances Exim's message IDs here will be permanently unique on a single machine, although you can't count on global uniqueness across multiple machines (although the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.

(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)

This means that aggregation of multi-line logs can be done with simple brute force approaches that rely on ID uniqueness. Heck, to group all the log lines for a given message together you can just sort on the ID field, assuming you do a stable sort so that things stay in timestamp order when the IDs match.

(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here insures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)

PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.

EximLogIdUniqueness written at 00:20:33; Add Comment

2014-10-18

During your crisis, remember to look for anomalies

This is a war story.

Today I had one of those valuable learning experiences for a system administrator. What happened is that one of our old fileservers locked up mysteriously, so we power cycled it. Then it locked up again. And again (and an attempt to get a crash dump failed). We thought it might be hardware related, so we transplanted the system disks into an entirely new chassis (with more memory, because there was some indications that it might be running out of memory somehow). It still locked up. Each lockup took maybe ten or fifteen minutes from the reboot, and things were all the more alarming and mysterious because this particular old fileserver only had a handful of production filesystems still on it; almost all of them had been migrated to one of our new fileservers. After one more lockup we gave up and went with our panic plan: we disabled NFS and set up to do an emergency migration of the remaining filesystems to the appropriate new fileserver.

Only as we started the first filesystem migration did we notice that one of the ZFS pools was completely full (so full it could not make a ZFS snapshot). As we were freeing up some space in the pool, a little light came on in the back of my mind; I remembered reading something about how full ZFS pools on our ancient version of Solaris could be very bad news, and I was pretty sure that earlier I'd seen a bunch of NFS write IO at least being attempted against the pool. Rather than migrate the filesystem after the pool had some free space, we selectively re-enabled NFS fileservice. The fileserver stayed up. We enabled more NFS fileservice. And things stayed happy. At this point we're pretty sure that we found the actual cause of all of our fileserver problems today.

(Afterwards I discovered that we had run into something like this before.)

What this has taught me is during an inexplicable crisis, I should try to take a bit of time to look for anomalies. Not specific anomalies, but general ones; things about the state of the system that aren't right or don't seem right.

(There is a certain amount of hindsight bias in this advice, but I want to mull that over a bit before I wrote more about it. The more I think about it the more complicated real crisis response becomes.)

CrisisLookForAnomalies written at 00:54:50; Add Comment

2014-10-15

Why system administrators hate security researchers every so often

So in the wake of the Bash vulnerability I was reading this Errata Security entry on Bash's code (via due to an @0xabad1dea retweet) and I came across this:

So now that we know what's wrong, how do we fix it? The answer is to clean up the technical debt, to go through the code and make systematic changes to bring it up to 2014 standards.

This will fix a lot of bugs, but it will break existing shell-scripts that depend upon those bugs. That's not a problem -- that's what upping the major version number is for. [...]

I cannot put this gently, so here it goes: FAIL.

The likely effect of any significant amount of observable Bash behavior changes (for behavior that is not itself a security bug) will be to leave security people feeling smug and the problem completely unsolved. Sure, the resulting Bash will be more secure. A powered off computer in a vault is more secure too. What it is not is useful, and the exact same thing is true of cavalierly breaking things in the name of security.

Bash's current behavior is relied on by a great many scripts written by a great many people. If you change any significant observable part of that behavior, so that scripts start breaking, you have broken the overall system that Bash is a part of. Your change is not useful. It doesn't matter if you change Bash's version number because changing the version number does nothing to magically fix those broken scripts.

Fortunately (for sysadmins), the Bash maintainers are extremely unlikely to take changes that will cause significant breakage in scripts. Even if the Bash maintainers take them, many distribution maintainers will not take them. In fact the distributions who are most likely to not take the fixes are the distributions that most need them, ie the distributions that have Bash as /bin/sh and thus where the breakage will cause the most pain (and Bashisms in such scripts are not necessarily bugs). Hence such a version of Bash, if one is ever developed by someone, is highly likely to leave security researchers feeling smug about having fixed the problem even if people are too obstinate to pick up their fix and to leave systems no more secure than before.

But then, this is no surprise. Security researchers have been ignoring the human side of their nominal field for a long time.

(As always, social problems are the real problems. If your proposed technical solution to a security issue is not feasible in practice, you have not actually fixed the problem. As a corollary, calling for such fixes is much the same as hoping magical elves will fix the problem.)

SecurityResearcherFail written at 01:12:55; Add Comment

2014-10-14

Bashisms in #!/bin/sh scripts are not necessarily bugs

In the wake of Shellshock, any number of people have cropped up in any number of places to say that you should always be able to change a system's /bin/sh to something other than Bash because Bashisms in scripts that are specified to use #!/bin/sh are a bug. It is my heretical view that these people are wrong in general (although potentially right in specific situations).

First, let us get a trivial root out of the way: a Unix distribution is fully entitled to assume that you have not changed non-adjustable things. If a distribution ships with /bin/sh as Bash and does not have a supported way to change it to some other shell, then the distribution is fully entitled to write its own #!/bin/sh shell scripts so that they use Bashisms. This may be an unwise choice on the distribution's part, but it's not a bug unless they have an official policy that all of their shell scripts should be POSIX-only.

(Of course the distribution may act on RFEs that their #!/bin/sh scripts not use Bashisms. But that's different from it being a bug.)

Next, let's talk about user scripts. On a system where /bin/sh is always officially Bash, ordinary people are equally entitled to assume that your systems have not been manually mangled into unofficial states. As a result they are also entitled to write their #!/bin/sh scripts with Bashisms in them, because these scripts work properly on all officially supported system configurations. As with distributions, this may not be a wise choice (since it may cause pain if and when they ever move those scripts to another Unix system) but it is not a bug. The only case when it even approaches being a bug is when the distribution has officially included large warnings saying '/bin/sh is currently Bash but it may be something else someday, you should write all /bin/sh shell scripts to POSIX only, and here is a tool to help with that'.

There are some systems where this is the case and has historically been the case, and on those systems you can say that people using Bashisms in #!/bin/sh scripts clearly have a bug by the system's official policy. There are also quite a number of systems where this is or has not been the case, where the official /bin/sh is Bash and always has been. On those systems, Bashisms in #!/bin/sh scripts are not a bug.

(By the way, only relatively recently have you been able to count on /bin/sh being POSIX compatible; see here. Often it's had very few guarantees.)

By the way, as a pragmatic matter a system with only Bash as /bin/sh is likely to have plenty of /bin/sh shell scripts with Bashisms in them even if the official policy is that you should only use POSIX features in such scripts. This is a straightforward application of one of my aphorisms of system administration (and perhaps also this one). These scripts have a nominal bug, but of course people are not going to be happy if you break them.

BashAsShAndBashisms written at 02:06:38; Add Comment

2014-10-13

System metrics need to be documented, not just to exist

As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.

At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.

(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)

Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what the metrics means to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program and pretty soon it is being passed around the Internet since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).

At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stackoverflow, we put up Github repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.

(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)

StatsNeedDocumentation written at 01:24:54; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.