2008-08-31
There is a balance between optimism and paranoia for compromised machines
There is an important proviso for the first principle of analyzing compromised machines: in practice, most attacks are not that good or that thorough. In real life, as opposed to mathematically correct security advice, there is a tradeoff between your caution level and the amount of work you have to do (either in analyzing a machine or in reinstalling it) and sometimes it is appropriate to take some risk in exchange for doing less work.
The truly paranoid will reinstall machines from scratch (and never mind the disruption) if there is a chance of system compromise, such as knowing that an account has been compromised. You do this because you make the most cautious assumptions possible: you assume that the attacker knows some way of getting root that is undetected and unpatched, that they were sophisticated and successfully hid their traces, and so on. Even if you detect something, you can never assume that you have detected everything.
(At this level of paranoia, you should probably be using two factor authentication with smartcards and randomly reinstalling machines every so often, just in case. I suspect that few people are this paranoid.)
In my world it is often appropriate to be more optimistic about our attackers' lack of competence (especially if we have detected the compromise in some obvious way, such as noticing a password cracker eating CPU on a login server). So I wind up doing things like simple system verification, and then conclude that the absence of evidence is evidence of absence (to put it one way), despite the zeroth law.
(This is another of the fuzzy compromises necessary for real computer security.)
2008-08-27
How I think about how important security updates are
Probably like many places, we weigh the potential or actual disruption of things like kernel updates against the risks of not updating when deciding how urgent it is to apply them. As part of this, I have developed a personal way of sorting security issues into different categories that I care about, with an end result ranging from not so bad to really bad (and if we are lucky, 'no impact, we can ignore').
For various reasons I feel like writing out the things I care about and look at today:
- is this a local only issue, or is it remotely exploitable?
- what's the consequence of the vulnerability?
What I usually care about is an escalating scale of denial of service (crashing programs, locking up the kernel, deleting files), giving access to files or information (for example, being able to read kernel memory), or giving root or other elevated privileges.
(For a remote attack, 'gives access' counts as elevated privileges.)
- what component is the vulnerability in?
Especially in something like the Linux kernel, a lot of the issues are in drivers and code that we don't come anywhere near, so they don't affect us at all (a quick way of checking this is sketched just after this list).
(Note that this applies to more than just security issues; I tend to evaluate all bugs this way if the update is disruptive or potentially risky.)
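As a small sketch of the 'what component is it in' check, on Linux distributions that ship the kernel configuration in /boot you can quickly see whether the affected code is even built for your kernel, and whether its module is currently loaded; the driver names here are made-up placeholders.

    # Is the (hypothetical) vulnerable driver even built for this kernel?
    grep CONFIG_EXAMPLE_DRIVER "/boot/config-$(uname -r)"
    # Is its module actually loaded right now?
    lsmod | grep -i example_driver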
Note that you should not get complacent about 'local only' security issues, because if you have a substantial user population you should just cut to the chase and assume that attackers already have the keys to some of your local accounts. Plus, a local security issue is a great way to leverage a small vulnerability in some network service into a huge gaping hole in your system.
(I've been there and dealt with that, and it was no fun at all.)
Also, it's important to understand that deciding not to upgrade is a risky decision for more reasons than the obvious. There have been any number of security vulnerabilities that turned out to be more exploitable than was initially believed.
2008-08-25
Fixing low command error distances
Suppose that you have a command with an unnervingly low error distance, either because a vendor stuck you with it or because it's the natural way to structure the command's arguments. The way to fix this is to change the sort of error required to make a mistake, so that you move from a likely change to an unlikely one.
(If you are working with a vendor command, you will need to do this with some sort of a cover script or program. If you are working with a local command, you can just change the arguments directly.)
For a concrete example, let's look at the ZFS zpool command to add a spare disk to a ZFS pool: 'zpool add POOL spare DEVICE'. Much like adding mirrors to a ZFS pool, this is one omitted word away from a potential disaster. The simple fix in a cover script is to change it to a separate command, making it something like 'sanpool spare POOL DEVICE'; this changes the error distance from an omitted word to a changed word, a less likely mistake (especially because the word you'd have to change is in a sense the focus of what you're doing).
To make a mistaken substitution even less likely, also change the command that the cover script uses to expand a ZFS pool; instead of using 'add' (which is general, and raises the question of 'add what?'), use 'grow'. Contrast:
    sanpool grow POOL DEVICE
    sanpool spare POOL DEVICE
Now the commands are fairly strongly distinct and harder to substitute for each other, because it is a much bigger mental distance from 'add a spare' to 'grow the pool' than from 'add a spare' to 'add a device'.
(When trying to prevent errors, it is useful to approach the commands from a high level view of what people are trying to do rather than look for low-level similarities in how it gets done. In a sense the way to avoid errors is to avoid similarities in things that are actually different.)
Another way for cover scripts to help you avoid errors is to simply not allow them in the first place. System commands may have to be general and thus allow even the questionable, but your scripts can be more restrictive; for example, if you know you should never have non-redundant ZFS pool devices, you can just make 'sanpool grow' require that.
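To make this concrete, here is a minimal sketch of what such a cover script might look like; 'sanpool' and its exact checks are hypothetical, and a real script would want more validation and better error reporting.

    #!/bin/sh
    # sanpool: hypothetical cover script for routine ZFS pool changes.
    # Usage: sanpool grow POOL DEV1 DEV2
    #        sanpool spare POOL DEVICE
    usage() { echo "usage: sanpool grow POOL DEV1 DEV2 | sanpool spare POOL DEVICE" >&2; exit 1; }
    [ "$#" -ge 3 ] || usage
    cmd="$1"; pool="$2"; shift 2
    case "$cmd" in
    grow)
        # We never want non-redundant pool devices, so 'grow' always adds a mirror pair.
        [ "$#" -eq 2 ] || usage
        exec zpool add "$pool" mirror "$1" "$2"
        ;;
    spare)
        [ "$#" -eq 1 ] || usage
        exec zpool add "$pool" spare "$1"
        ;;
    *)
        usage
        ;;
    esac

Because 'grow' supplies the 'mirror' keyword itself and 'spare' is a separate subcommand, the one-omitted-word mistake that the raw 'zpool add' syntax invites simply can't happen through this script.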
The concept of error distance in sysadmin commands
I have recently started thinking about the concept of what I will call the 'error distance' of sysadmin commands: how much do you have to change a perfectly normal command in order to do something undesirable or disastrous (instead of just failing with an error)?
(As an example, consider the ZFS command to expand a ZFS pool with a new pair of mirrored disks, which is 'zpool add POOL mirror DEV1 DEV2'. If you accidentally omit the 'mirror', you will add two unmirrored disks to the ZFS pool, and you can't shrink ZFS pools to remove devices. So the error distance here is one omitted word.)
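To make the one-word gap concrete, here is the pair of commands side by side; the pool and device names are purely illustrative.

    # Intended: add a mirrored pair of disks to the pool.
    zpool add tank mirror c1t2d0 c1t3d0
    # One omitted word later: the disks go in as separate, non-redundant devices.
    zpool add tank c1t2d0 c1t3d0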
You want the error distance for commands to be as large as possible, because this avoids accidents when people make their inevitable errors. Low error distance is also more dangerous in commonly used commands than uncommonly used ones, because you are less likely to carefully check a command that you use routinely (especially if you don't consider it inherently dangerous).
When considering the error distance, my belief is that certain sorts of changes are more likely than others (and thus make the effective error distance shorter). My gut says:
- omitting words is more likely than changing words (using 'cat' when you mean 'dog'), which in turn is more likely than adding words.
(I am not sure where transposing words should fit in, where you write 'cat dog' instead of 'dog cat'.)
- commonly used things are more likely than uncommon things; for example, if you commonly add an option to one command, you are more likely to add it to another command.
(I suspect that this has been studied formally at some point, probably by the HCI/Human Factors people.)
2008-08-12
The first principle of analyzing compromised machines
The first principle of analyzing compromised machines is simple:
You can't trust anything running on a compromised machine.
If you are sufficiently paranoid, everything on a compromised machine might be compromised and lying to you: the programs, the shared libraries, the data files, even the kernel. You cannot trust any of it. This means that you cannot use a compromised machine to examine itself. Not even a little bit, however convenient it would be.
(The partial exception is that if a test on the compromised machine says that something is compromised, it probably is. But don't stop there.)
If you need to examine a thoroughly compromised machine you must do it completely from the outside, using some sort of environment that is known beforehand to be good. The traditional way is to connect the disks up to another system and use it to poke around; these days you can also use a live CD, if you are sure that the system is actually running it. Only then can you be confident that you are not being lied to, that your tools are actually reading what is really there, and that they aren't being fooled when they say that something isn't compromised.
(This sort of paranoia can run very deep. For example, are you sure that the BIOS hasn't been compromised to boot a hypervisor that then boots your live CD in a fake environment to hide bits of the disk from it? Really, it's simpler to add the disks to another system.)
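For what it's worth, the start of this on the examining system usually looks something like the following; the device name is whatever the suspect disk shows up as there, and is purely illustrative.

    # On the known-good system, mount the suspect filesystem read-only,
    # with no device files and nothing executable, before poking around.
    mkdir -p /mnt/suspect
    mount -o ro,noexec,nosuid,nodev /dev/sdb1 /mnt/suspect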
The time and effort required for this sort of complete external verification is why the common advice for compromised machines is to just reinstall them from scratch.
(And you have to perform a complete verification, because an attacker might have hidden their compromise anywhere. For example, your kernel could be intact on the disk but the attacker has added a startup script that dynamically installs their rootkit through a kernel bug.)
2008-08-03
A performance gotcha with syslogd
Stated simply: many versions of syslogd will fsync() logfiles after writing a message to them, in an attempt to make sure that the message makes it to disk in case something happens to the system immediately afterwards (it crashes, loses power, etc). This obviously can have an impact (sometimes a significant one) on any other IO activity going on at the time.
On some but not all systems with this feature, you can turn this off for specific syslog files by sticking a '-' in front of them; this is especially handy for high volume, low importance log files, such as ones you're just using for statistical analysis. (For example, one system around here has a relatively active nameserver that syslogs every query. You can bet that we have fsync() turned off for that logfile, and when we accidentally didn't we noticed right away.)
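For illustration, with the traditional Linux syslog daemon this is a one-character change in /etc/syslog.conf; the facility and file names here are made up.

    # High volume, low importance: the leading '-' tells syslogd not to
    # fsync() the file after every message.
    local3.*                        -/var/log/named-queries.log
    # Important messages still get synchronous writes.
    auth,authpriv.*                 /var/log/auth.log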
(From the moderate amount of poking I've done, Solaris always does this and has no option to turn it off, FreeBSD only does this for kernel messages and can turn it off, and Linux's traditional syslog daemon does this by default but can turn it off. I don't know about the new syslog daemon in Fedora. OpenBSD doesn't say anything in its manpages, but appears to always fsync().)
As a side note, if you really need syslog messages to be captured, I recommend also forwarding them to a remote syslog server. That way you have a much higher chance of capturing messages like 'inconsistency detected in /var, turning it read-only' (which has happened to us), and you have a certain amount of insurance against the clock on the machine going crazy.
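In traditional syslog.conf syntax this is also a one-line addition; 'loghost.example.com' is of course a placeholder for your central syslog server.

    # Also send a copy of everything to the central syslog server.
    *.*                             @loghost.example.com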
(A central syslog server is also a convenient place to watch all of your systems at once and easily correlate events across them.)