2010-04-30
A rule of thumb: Automate where you can make mistakes
One of my sysadmin rules of thumb for deciding what to automate in scripts and programs is this: automate where you can make mistakes. In particular, automate where you are specifying redundant information.
This will make more sense with an example, so let's talk about configuring iSCSI targets. We use static target configuration, which means that you tell the system a target name and the IP address that it can be found on; each iSCSI server has two IPs, so we configure each of its targets twice, once for each IP address.
(An iSCSI server machine can have several targets, each of which has several LUNs. In our environment, each iSCSI target represents a single physical disk, with the target names divided into a per-host and a per-disk portion.)
There's an obvious redundancy here; we know for sure that server A's targets are never going to be found on any IPs besides those that belong to server A. But when we specify this redundant information by hand, we allow errors to creep in; we could accidentally configure one of A's targets with an IP address for server B. (And indeed we did this once, and it turned out to be surprisingly difficult to sort out.)
So we automated the process of adding iSCSI targets to make sure that we couldn't make this mistake, or other related ones (failing to configure each target for both IP addresses or failing to configure all of a server's targets). Our program for this now just takes the server's name; from this it can determine all of the server's targets and both IP addresses for the server, and add all 24 separate static target entries for us.
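As an illustration of the idea (this is not our actual program), a minimal sketch in shell might look like the following, assuming a Linux open-iscsi environment with iscsiadm; the target naming convention, the IP addresses, and the disk count are all made up:
#!/bin/sh
# sketch: add all static iSCSI target entries for one backend server.
# The target naming, the IP addresses, and the use of Linux open-iscsi's
# iscsiadm static configuration are illustrative assumptions.
server="$1"
# in reality these would be derived from the server name; placeholders here
ip1="192.168.100.10"
ip2="192.168.101.10"
# assumed convention: one target per physical disk, disk01 .. disk12
for disk in $(seq -w 1 12); do
    target="iqn.2010-04.com.example:${server}:disk${disk}"
    for ip in "$ip1" "$ip2"; do
        # create a static node record for this target/portal pair
        iscsiadm -m node -o new -T "$target" -p "$ip:3260"
    done
done
For a server with twelve disks, this adds the 24 separate static entries mentioned above without anyone having to type a single target name or IP address.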
Such automation is clearly not general; instead it relies on very specific knowledge of how we always set up our systems. But since we do have strong conventions about how we set up iSCSI targets, we might as well exploit them (in a script) to avoid errors and to make our lives easier.
(And if we ever have to break our conventions, well, we can always use the underlying system commands directly. Escape hatches are important too.)
2010-04-27
Evolving our mail system step 1: adding an external mail gateway
The first change we made in the evolution of our mail system was to add a separate machine to be our external mail gateway, putting it in front of our central mail machine for outside email. We did this for two reasons: first, we wanted to introduce some sort of system-wide anti-spam features, and second, the central mail machine was heavily overloaded by directly handling external email (partly because it had a very old MTA that had an unfortunately heavyweight way of handling incoming connections and partly because it was on a very old and small machine).
Shifting the work of handling mail from the outside world to a separate machine meant that we could have a modern mailer (on modern hardware and a modern operating system) deal with all of the difficult and troublesome parts of taking email from the outside world; talking to the hordes of spam zombies, rejecting bad local addresses, and so on. It also gave us an obvious and simple place to add mail filtering, one where we didn't have to change our existing central mail machine.
In the abstract, this is a simple change; we just had to build a machine with suitable spam identification, configure it to forward all email to the central mail machine, and make it accept only valid local addresses. That last bit is what made life interesting, because our central mail machine was a black-box mailer configuration, so the first thing we had to do was reverse engineer enough of our existing mailer configuration so that we could create a white-box mailer setup.
Once this reverse engineering had been done, we built (and tested) the actual machine and mailer configuration. We selected Ubuntu as the operating system (because we already knew we were going to move to Ubuntu Linux in general) and used Exim as the MTA (see here for why). I didn't try to fit our configuration into the whole Debian/Ubuntu 'split' Exim configuration; instead I started with the single-file Ubuntu Exim configuration and rewrote things from there, engineering in the checks and anti-spam things that we needed.
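As a small aside on the mechanics (this is not our actual procedure, and the configuration path and test addresses are made up), Exim makes it easy to sanity-check a candidate configuration from the shell before you deploy it; -bV verifies the configuration and -bt does address routing tests:
#!/bin/sh
# sketch: sanity-check a candidate Exim configuration before deploying it.
# The configuration path and the test addresses are hypothetical.
# (On Debian/Ubuntu the Exim binary is typically named exim4.)
CONF=/etc/exim4/exim4.conf.candidate

# verify that the configuration file is syntactically valid
exim4 -C "$CONF" -bV || exit 1

# a valid local address should route to the central mail machine,
# and a bogus one should be rejected
exim4 -C "$CONF" -bt someuser@cs.example.edu
exim4 -C "$CONF" -bt no-such-user@cs.example.edu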
In the process of both the reverse engineering and setting up the machine, we wound up making a lot of decisions about how we'd manage our new Ubuntu Linux machines, because this was the first such machine. In turn, this required building a certain amount of infrastructure.
The actual deployment was simple; we changed MX entries from the central mail server to the new external mail gateway. We started with a few lesser-used domains and hosts as a final precaution, so that we hopefully wouldn't lose too much email if something exploded, and once that went fine we made the mass change of all of our MX entries. In the process we learned some things about the limits of bridging NAT and about annoying spammer behavior; in the end we had to firewall off the old central mail machine from the outside world in order to force spammers to stop talking to it.
The change had positive effects right from the start. We noticed the much smaller load on our central mail machine (which meant that it processed email much more promptly); users noticed the spam tagging we had introduced and had immediate, positive reactions to it. Overall it was a great and much-needed success that got our users to feel much more positive about our email system.
2010-04-26
How we moved from a black-box mailer configuration to a white-box one
One of the problems with our old central mail system was that it was a black-box mailer configuration; the valid local addresses and domains were defined by what the mailer accepted, and what the mailer accepted was more or less in code, not in handy configuration files. This presented an obvious problem for the evolution of our mail system.
So our first job was to reverse engineer what addresses the central mailer accepted, and figure out how to turn these into reusable information for the new mail machines we wanted to build. Once the dust settled, it turned out that this was not too difficult to machine generate, as all local addresses were equally valid on all local host and domain names. The local addresses were (mostly) made up from 'real' accounts in /etc/passwd, system aliases, and simple mailing lists. Some brute force shell scripting was able to generate a merged list of all possible local addresses.
(We 'parsed' the system aliases with some very brute force code that took advantage of the fact that we had adopted a very rigid formatting convention for the system aliases file.)
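A minimal sketch of that sort of brute force scripting might look like the following; the file locations, the aliases format, and the idea that mailing lists live one file per list in a directory are assumptions for illustration, not our real code:
#!/bin/sh
# sketch: generate a merged list of all possible local addresses.
# File locations and formats here are assumptions for illustration.
{
    # 'real' accounts from /etc/passwd
    awk -F: '{print $1}' /etc/passwd

    # system aliases; relies on a rigid 'alias: whatever' one-per-line format
    awk -F: '/^[a-zA-Z]/ {print $1}' /etc/mail/aliases

    # simple mailing lists, assumed to be one file per list
    ls /var/lists 2>/dev/null
} | sort -u > local-addresses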
The local domain list was harder, because a lot of it was basically implicit in the central mailer's configuration. In the end we took the simple approach: we generated our master list of local hosts by scanning our DNS information to look for hosts and domains that were MX'd to our mail handling machines. To avoid having to parse all of the format variations allowed in BIND's DNS zone files, we did DNS zone transfers from our master server for all of our domains (with 'dig axfr'); this got us the data in a format that was consistent and easy to scan. This did not quite get us all of the machines that the central mail system would handle mail for, but it did get us the list of machines that we wanted the central mail machine to handle mail for, so we called this a feature.
(And without an MX, A, or CNAME record, no outside machine should ever be trying to send email for other machines to our mail machines.)
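As a sketch of the idea (our real script was rather more involved, and the zones, master server, and mail host name here are made up), scanning zone transfers for MX records that point at your mail machines only takes dig and awk:
#!/bin/sh
# sketch: find hosts and domains MX'd to our mail handling machine.
# The zones, the DNS master, and the mail host name are hypothetical.
for zone in cs.example.edu example.edu; do
    dig axfr "$zone" @dns-master.cs.example.edu
done |
    awk '$4 == "MX" && $6 ~ /^mailhost\.cs\.example\.edu\.$/ { print $1 }' |
    sort -u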
However, the difficult part of this work was not creating the code to generate the lists but reading through the mailer configuration to figure out just what addresses it accepted and how this was controlled. As a black-box mailer, our central mail machine had some dark and frustrating corners; for example, it turned out that it thought it should handle all email for machines in our domains except for a specific exemption list of machines.
(This was a sensible decision way back when, when there were no other mailers inside our domains and this saved maintaining a long list of machines and domains, but became less and less good over time as more and more machines cropped up that wanted to handle their own email.)
2010-04-24
Evolving our mail system: the overview and our goals
Way back in the beginning of time, or at least when I arrived here, we had a rather old mail system configuration that was very much an artifact of its time. In other words, it desperately needed to be modernized and replaced. And replace it we did, winding up with our current mail system.
There are two ways to describe how we planned the migration from the old system to the new one. Let me write up the nice one first:
There are at least two possible approaches you can take in a migration like this; you can go for a big bang migration, where one day you yank out the old system and drop in the new one, or you can migrate from the old system to the new one incrementally, replacing bits and pieces of the old system until it's been completely eliminated. After considering the issues, we decided to take an incremental approach to our upgrades.
(This is not a straightforward decision in general. An incremental upgrade is more work and more complexity than a big bang upgrade; you'll probably need to build a number of intermediate pieces that you'll throw away later, and you have to figure out how to wedge your new pieces into the old system with as few changes as possible.)
There were several advantages to the incremental approach for us. One was that it would let us start improving things for our users almost immediately, instead of forcing them to wait until we had completely designed, implemented, and tested all of the new mail system; since our users were unhappy with our mail system, this was quite important. Another was that a series of incremental upgrades struck us as a less risky way to tackle changing a mail system that no one really understood any more, because we'd only have to understand a limited area and get it correct for each change.
This set our overall goal and our approach: we wanted to completely migrate our mail system to a new and modernized design while having the users notice as little as possible. This dictated a fundamental approach of basically nibbling around the edges of the old system to steal bits of functionality from it without it noticing; the plan was that over time, the old system would handle less and less mail and pieces of the new one would handle more and more.
(Since the change involved an operating system and CPU architecture change (from Solaris 8 on SPARC to Linux on x86), users that ran programs from their .forwards would have to take action, but we didn't want other users to have to care about the changes.)
All right, that was the nice, orderly description. The honest description is that we decided on this approach (and planned out how to steal bits of functionality from the old mail system) only after we had made our first major change. While the first change did fit in with the plan, it was made for other reasons; we were in a state where we had to do something to improve several aspects of the mail system and make our users happier with it, so we made the quickest, easiest fix that we could by introducing some spam filtering.
(At that it was not entirely quick or easy, but that's because it involved building a bunch of infrastructure.)
2010-04-14
The myth of a completely shared knowledge base across sysadmins
There is a quietly pervasive myth in multi-sysadmin groups that there can be an environment without specialists, where every sysadmin can do every job that the group does equally well. If you do much work of any complexity, this idea is not just false but crazy.
I call it crazy because of what it requires. In general, it invariably takes a significant amount of research, learning, and experimentation to become a (semi-)expert in a technical area such as Exim configuration, OpenBSD firewalls, ZFS, and so on. To say that everyone should be more or less equally good at these tasks is to say that everyone needs to invest all of that learning time for all of the technical areas that your group is involved in. Unless you are not involved in very many complex areas or you're not going very deep in them, this is a lot of time; learning even a single area can require an investment of weeks of time and effort.
If not all sysadmins invest this large amount of time, there will inevitably be specialists for each particular area. Everyone may understand Exim in general and can do straightforward stuff, but one or two people will be the only ones who've invested the time to read all of the Exim documentation and really understand how your complex configuration works, and so on.
(At the same time, it is absolutely useful for everyone to have as much knowledge as feasible of every area you work in. Speaking from personal experience, I think that non-siloed environments are better than siloed ones.)
It's tempting to think that you can pass on this knowledge rapidly by teaching your co-workers, but I don't think that this works; while you can give people an introduction, you can't expect to teach deep knowledge very fast. I can write somewhat superficial introductions to Exim all I want and wave my hands in front of a whiteboard, but for real, deep knowledge you have to read the Exim manual (and play around with configurations), just as I did. All I can do is save you from going down dead ends and maybe clarify any confusions that you wind up with. This will save some time, but I don't think it will save a lot of it.
(Exceptions to this mean that the particular area's documentation is bad, or perhaps that you are a really excellent teacher.)
2010-04-12
The importance of figuring out low-level symptoms of problems
Suppose that you have an IMAP server; it has mirrored local system disks, a bunch of memory, and a data filesystem (where the mailboxes are) in a RAID-10 array provided by a SAN. One day, it starts falling over unpredictably; the load average goes to the many hundreds, IMAP service times go into the toilet, and eventually the machine has to be force-booted. But this isn't consistent, and when it happens it happens very rapidly, going from a normal tiny load average to a load average of hundreds in a few minutes.
Believe it or not, this is a very high level and abstract description of your issue (although it may sound quite precise). But, clearly, it doesn't tell you what's wrong and what you need to do to fix things. What is not necessarily obvious until you've been through this a few times is that one of the important steps to solving things is finding out lower-level symptoms of the problem (and in the process finding out all of the lower-level things that are unrelated).
Finding lower-level symptoms has several important effects. First, it gives you good diagnostics to determine when the problem is happening. High-level diagnostics are necessarily broad and unspecific; they can lag behind when the problem actually starts, and causing them to manifest can often depend on your entire production environment, which makes them hard to use in test scenarios.
Next, it gives you a good way to measure if you've reproduced the problem in artificial test scenarios. Anyone can drive an IMAP server into the ground with sufficient load; the trick is to be sure that you're driving it into the ground in the same way that your production environment is being driven into the ground. (Even if you are running test simulations using captured trace data, you don't know for sure.)
Finally, it gives you a good lead on tracking down why the problem is happening. Now that you know a lower-level symptom or two, you can start asking focused why questions to figure out how the symptom comes about, and it becomes sensible to dig into detailed trace data, kernel source, and so on. For example, if the time it takes to touch and remove a file is a big indicator of the problem, you can now start looking at what can make that slow.
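For example, a trivial probe along these lines (the path, interval, and log file are made up, and a real probe would want finer-grained timing) both tells you when the problem is happening and gives you something concrete to ask why questions about:
#!/bin/sh
# sketch: periodically time a touch-and-remove cycle on the mail
# filesystem; the path and interval are made up for illustration.
PROBE=/var/mail/.timing-probe
while sleep 60; do
    start=$(date +%s)
    touch "$PROBE" && rm -f "$PROBE"
    end=$(date +%s)
    # seconds resolution is crude; a real probe would time more precisely
    echo "$(date): touch+rm took $((end - start)) seconds"
done >> /var/tmp/mail-probe.log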
Without those low-level symptoms, you can spend weeks going around in circles, running cross-correlations against every statistic that you can gather, trying artificial test after artificial test to see if you can reproduce something that looks like the problem, and descending to guesswork and superstition in making system changes to see if they fix the problem (eg, 'maybe adding more memory will do it').
Of course, some amount of this activity is useful to actually find those low-level symptoms, but don't lose sight of your first goal amidst the yelling and the looking. In particular, I've come to feel that trying to do artificial reproduction of the problem is mostly a waste of time until you actually understand what the problem is; you really need a good diagnosis and a better understanding of what is really going on.
(Honesty compels me to admit that this is pretty hard to carry off when people are yelling in your ear because your IMAP server is a smoking crater in the ground on a regular basis and they need a fix now.)
Sidebar: the problem with general why questions
One of the ways of working on this overall problem is to ask why questions: why is IMAP response time slow? When someone issues an IMAP command, what operations that the IMAP server does are taking so long? Why is the load average high?
The problem with why questions is that they often either run you into dead ends or rapidly become extremely difficult and complex to get the answers to. The load average is high because you have hundreds of processes in disk wait despite iostat reporting relatively normal numbers; a single IMAP operation makes a thousand system calls, and there's no clear pattern as to which ones take 'too long' given that the system has a load average in the hundreds. Asking lots of questions (and getting lots of answers) is a distraction, because it leaves you with the job of picking through the clutter to find the important bits.
(Having said that, there is an important clue in this list of symptoms. You could get interesting results from asking 'so, why are processes in disk wait despite good iostat numbers?'.)
2010-04-10
Why commands can never afford to get it wrong in a version
Netcat is a nice, handy program; there are any number of circumstances in scripts and the like where what it does is just what I want. I don't use it, though; instead, I have my own simple netcat-like program (called tcp).
There are a number of reasons for this, but one significant reason is that some versions of netcat get the end of file logic wrong. When they see end of file on standard input, they just close standard input; they don't signal the network server that no further input is coming. This behavior can be worked around in some but not all circumstances.
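To make the problem concrete, the canonical 'network copy to a server that reads until end of file' usage looks something like this (the host, port, and file are made up); it only works if netcat tells the server that no more input is coming, and the '-q' workaround below only exists on some netcat variants:
# feed a file to a service that reads until end of file
nc somehost 9999 < /tmp/dumpfile

# workaround on some netcat variants: quit shortly after EOF on stdin
nc -q 1 somehost 9999 < /tmp/dumpfile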
This netcat bug has since been fixed, but a buggy version of netcat made it into some Linux distributions and is still on some of the machines that I use. Since the buggy netcat exists on some machines, I can't trust netcat in general on an arbitrary machine and so I can't use netcat in any script that I want to run unedited on all of our machines; on some machines I will need to use a substitute that doesn't have the bug.
(Like many sysadmins I try to write generic scripts, especially seeing as we have a heavily NFS-based environment.)
And you know, if I've got to use a substitute on some machines it's simpler to use the substitute everywhere. I need to push around the substitute as well as the scripts, but once I've done that I have one less thing to remember, and I have a building block that works consistently everywhere.
This is why commands can't afford to get it wrong, ever. As far as sysadmins with lots of machines are concerned, if it's broken anywhere it's broken everywhere, and accidentally broken commands have a distressing habit of getting packaged and distributed on you. Once a flawed version of your command is out there in the field, well, see the deprecation schedule. I'm sure the netcat people had good intentions, but the end result is that netcat often might as well not exist for me.
(Actually, it's really too late in general once sysadmins start having to use a substitute. I'm extremely unlikely to revise my scripts in three or four years to start using netcat instead of my substitute, given that everything still works fine and it would thus be pointless make-work.)
PS: I'm aware that netcat has many, many more features than acting as a simple network copy command. But I generally don't need those features and I do periodically need a network copy command.
2010-04-08
A little script: sshup
(It's been a while since the last little script.)
One of the things I do a fair bit around here is reboot machines. Well, to be more specific, I reboot machines and then wait for them to come back up so that I can continue my testing, do more work on them, or verify that everything is fine. Because I am not crazy I do not do this in the machine room; I do it from my desk.
Waiting for machines to come up and checking periodically to see if they have is tedious, repetitive, and boring. Like any lazy sysadmin, I long ago automated this in a handy little script that I call sshup. Since what I care about is when I can ssh in to newly-rebooted machines, the script tests to see if a machine is up by checking to see if it can connect to the machine's ssh port.
(You can do this with netcat; the version of the script that I actually use is written in rc and uses a different netcat-like utility program for reasons that don't fit in this margin.)
I generally run sshup in an xterm with zIconBeep set; start a new xterm, run sshup, iconify it, and do something else until either the iconified xterm notifies me that the machine is up (because sshup printed something) or I realize that too much time has passed and go look into what's wrong. It's turned out to be quite handy.
Here is a version of sshup in Bourne shell:
#!/bin/sh
# usage: sshup host
# poll the host's ssh port until we can connect to it
reachable() { nc -z "$1" ssh 2>/dev/null; }
while ! reachable "$1"; do
    sleep 15
done
echo "$1" UP
(A real version would have some error checking and maybe not hard-code the sleep interval.)
2010-04-07
Our current mail system's configuration
A while back I described our old mail system's configuration. Now it's time to describe our current mail system's configuration ('current' as of April 2010, although it's been pretty stable for the past year or two).
Unlike our old mail system, we now trust NFS; we keep /var/mail on our fileservers, along with everything else important, and the mail machines that need to deal with it use NFS. This has significantly simplified things.
The current email system looks like this:
- mail from the outside world comes in to our MX gateway, where it is run through a spam checking process and then forwarded to our central mail machine.
- our central mail machine handles all aspects of email to local addresses; it delivers to /var/mail (and to people's 'oldmail', which keeps a copy of all email to them for 14 days or so), expands user .forwards and local mailing lists, and so on. It normally delivers email directly to the outside world (using a variety of IP addresses); however, we found it necessary to forward spam-tagged email for the outside world to a separate machine for delivery. Users are now encouraged to have procmail and so on deliver directly to /var/mail instead of using the old special addresses that we used to use (although those addresses are still supported).
- the spam-forwarding machine accepts email from the central mail machine and sends it to the outside world.
There is still a separate mail submission machine for outgoing email (whether from user PCs or our servers). As before, it routes email for our domains to the central mail machine and otherwise sends email straight to the outside world.
There is a separate IMAP/POP server; it accesses everything over NFS, with user inboxes in the NFS-mounted /var/mail and user mail folders stored in their home directories. We have not had any problems with NFS locking between the IMAP server and the central mail machine.
That the MX gateway is separate from the central mail machine is an accident of history, but I think that it simplifies the mailer configuration for both of them. It also means that the system is more resilient in the face of NFS fileserver problems. Since the central mail server accesses /var/mail and user home directories, it is entirely dependent on all of our fileservers working; by contrast, the MX gateway is basically indifferent to NFS, since all it does with email is forward it to the central mail server.
(All of these machines have mirrored system disks, because they do have email sitting in their local spool areas while it's in the process of being delivered or shuffled around.)