2012-08-30
My perspective on why we do in-place reinstalls of machines
Given what I mentioned yesterday, you might wonder why we are doing what I called 'in-place' reinstalls of machines, where we reinstall a machine with the name and IP address that it will use in production. From my perspective there are two or three reasons for this.
The first reason, the big reason, is that we've run out of spare hardware. In previous upgrades we installed the new version of a machine on completely new hardware, got it running, and then switched everything around during the 'upgrade' downtime; the old version of the machine got renamed or powered down (and was later reused for something else) and the OS install on the new hardware got itself renamed and so on. This was kind of a pain but it was worth it for genuinely fast and hassle-free switchovers. But this requires a bunch of spare servers and we've steadily used most of them up.
(Some of the 'used' servers are actually just reserved for certain future uses, but most of them are actually running in production.)
The other reason or two boils down to 'we're just lazy enough to take the risks'. In theory we could (re)install the new version of a machine under a temporary name and IP address, get it almost fully up, and then switch its name, IP address, and so on to the production one(s); we could do this either on the production hardware or on another identical server and then move the disk(s) over to the production hardware. In practice it's just enough of an extra pain to install machines under temporary names and then rename them (and re-IP them, remember to give them all of the IP aliases, and so on) that we're willing to take the risks of an in-place reinstall. Using another server for the initial install and moving the disks afterwards generally adds an extra layer of pain to the process, mostly because common operating systems are increasingly binding things tightly to the specific hardware they happen to be on at the moment; when you move the disks, you get to find and fix all of these things too.
(When we don't have spare hardware, it should also be noted that a genuine in-place reinstall results in a somewhat shorter downtime and requires fewer manual steps that can blow up in our face. This is probably a reasonable tradeoff.)
It's worth noting that you should only have this issue in the kind of infrastructure we do, or at least the kind of infrastructure where people and services are talking directly to machines by name or IP address and are not indirecting through a load balancer or any other sort of directory service. If you have an indirection step, taking a machine in or out of production service should be a trivial step that's independent of installing the machine and you should effectively never have in-place reinstalls of anything except perhaps the load balancers and directory servers themselves.
2012-08-29
A realization: install configuration files before packages
We have what is probably a somewhat unusual problem. Every so often we do what I'll call an 'in-place' reinstall of a machine, where we reinstall a machine with the name and IP address that it will use in production, and as part of such reinstalls we install packages for things like mailers (if the machine is a mail machine), web servers, and so on. The problem with this is that sometimes a package will not just install the daemon but auto-start it for you. This auto-start happens with the package's default configuration file, which generally won't work right for your environment; an extreme case is a mailer, where everyone installs very conservative, no-relay configurations. If you start the mailer on one of your custom mail machines with such a conservative configuration, it will immediately start bouncing incoming email. So far we've dealt with this by install instructions that say things like 'shut down the following services on other machines' or 'before you install these additional packages, add the following IP blocks'. This has generally worked but it's always made me a little nervous about how fragile it was.
I'm slow sometimes, so it was only today that it occurred to me that there's another way. Normally we install software packages, stop the daemon, and then install our local configuration files. Well, you know, there's no reason we have to do it in that order; instead, we can install our local configuration files before installing the packages. That way even if installing the package auto-starts the daemon it will auto-start with our configuration files instead of the stock ones and things will work or at least fail gracefully, even if we slip up and don't apply the IP blocks or don't shut down everything.
Whether or not this works depends on how the packaging system you're using behaves and whether or not the files you're pre-installing are marked as 'configuration' files (I know this works in Debian's dpkg, beyond that I haven't yet experimented). Hopefully every package system is smart enough to go along with this, because this is a useful trick.
(One potential problem with this is file ownership, if installing the package also creates some users or groups. There are potential ways around this if it matters, but in general what you care about is not necessarily that the daemon works but that it doesn't hard-fail anything that it shouldn't hard-fail. Having a setup that just says 'sorry, internal problem, come back later' is generally at least as good as having a fully working one.)
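A minimal sketch of the reordering, in Python: copy a tree of local configuration files into place first, and only then run the package install. The function name, the directory layout, and the exim4 path are hypothetical and chosen only for illustration; on a real machine the target root would be / and the copy would need root privileges. It does not deal with the file ownership issue mentioned above.

```python
import shutil
import tempfile
from pathlib import Path


def stage_configs(config_src: Path, root: Path) -> list[Path]:
    """Copy every file under config_src to the same relative path under root.

    This is the 'install our configuration files first' step. The package
    install itself is deliberately not done here.
    """
    installed = []
    for src in sorted(config_src.rglob("*")):
        if not src.is_file():
            continue
        dest = root / src.relative_to(config_src)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copies contents and permission bits, not ownership
        installed.append(dest)
    return installed


def demo() -> str:
    """Stage one hypothetical config file into a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "local-configs"
        root = Path(tmp) / "root"
        (src / "etc" / "exim4").mkdir(parents=True)
        (src / "etc" / "exim4" / "exim4.conf").write_text("# our local configuration\n")
        root.mkdir()

        # Step 1: our configuration files go in first.
        staged = stage_configs(src, root)

        # Step 2 (not shown): run the usual package install command. If the
        # packaging system leaves existing configuration files alone, any
        # auto-started daemon reads our file instead of the stock one.
        return staged[0].relative_to(root).as_posix()


if __name__ == "__main__":
    print(demo())
```

Whether step 2 really preserves the pre-installed files depends on the packaging system, so this is worth testing on a scratch machine before relying on it.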
2012-08-28
When Exim generates bounce messages
When I started looking at the numbers from our recent spam incident, an inconsistency jumped out at me: we had significantly more bounce messages in the queue than my count of submitted spam messages. Fortunately for my peace of mind, it turns out the cause of this wasn't missing spam messages but that Exim handles bounces differently than I was expecting.
What I was expecting is how I believe that our old mailer behaved; I was expecting Exim to generate a single bounce message at the end of fully processing a message, with all of the addresses that had had errors. In this model of bounces there's at most one genuine bounce generated for each message (plus possibly some number of 'your email has been delayed X amount of time' warning notification messages).
Exim doesn't work that way. Instead, how Exim seems to work is that it generates a bounce message every time it attempts to deliver a message and at least one destination address has an error. If you have a message with a lot of recipients (as many of the spam messages did) where some fail immediately and delivery to others is first delayed and then fails, Exim can generate a whole series of bounce messages from the same original message. Spam runs are probably especially likely to provoke this because they're the situation where remote mail servers are most likely to throttle you back on contact instead of doing anything that would give you an immediate rejection.
(It turns out that Exim logs enough information to actually see this. Some messages in the spam run generated six bounce messages.)
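To make the difference between the two models concrete, here is a toy sketch in Python. It is not Exim's code or algorithm, just a model of the behaviour as described above; the function names, the representation of delivery attempts, and the addresses are all made up for illustration. Each delivery attempt is represented as the set of addresses that failed permanently during that attempt.

```python
def bounces_per_attempt(attempts):
    """Model of the behaviour described for Exim: one bounce message for
    every delivery attempt in which at least one address failed."""
    return [sorted(failed) for failed in attempts if failed]


def bounce_at_end(attempts):
    """Model of what I was expecting: a single bounce at the end of
    processing, listing every address that failed along the way."""
    all_failed = sorted(set().union(*attempts))
    return [all_failed] if all_failed else []


# One message, four delivery attempts. Some addresses fail immediately,
# others are delayed and only fail on later attempts.
attempts = [
    {"a@one.example"},                     # immediate rejection
    set(),                                 # everything still deferred
    {"b@two.example", "c@three.example"},  # delayed, then failed
    {"d@four.example"},                    # delayed longer, then failed
]

per_attempt = bounces_per_attempt(attempts)
at_end = bounce_at_end(attempts)
```

In this example the per-attempt model produces three bounce messages for the one original message, while the bounce-at-end model produces a single bounce listing all four addresses. That is the kind of multiplication that left us with more bounces in the queue than spam messages submitted.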
In combination with how Exim does retries, this can have the odd consequence that you can get bounces for addresses that are not on the final delivery list (in as much as you can be said to have a final delivery list in the Exim world). This happens if the bad address was in some expansion (a .forward, a simple mailing list, whatever), the message was not completely delivered to that expansion, and then the expansion was edited to remove the bad address (perhaps because of the bounce message).
I suspect that Exim takes this approach to bounces because it means that Exim doesn't have to save as much information from delivery attempt to delivery attempt. If an address fails it simply has to note that without having to also record why (which may include things like verbose SMTP replies from the other end). All of the verbose information about an address failure can simply be held in memory during a single delivery run, used to generate the bounce, and then implicitly discarded when the delivery process exits.
2012-08-27
You should log all successful user authentication
Here's something that I've recently had my nose smacked with:
Every place where users can authenticate over the network to your systems should log successful authentications, including the source IP address.
Every place. No exceptions. And all of the pieces (minimally user name, remote IP, and time) should be logged explicitly in one place; you should not have to piece together this information by inference from a collection of different logs, because sooner or later you will go mad from having to do this. You should do this logging even in systems that do not directly authenticate the user but delegate it to another system and then just use the authentication result.
(For example, suppose that you have a webmail system that is actually a frontend on your IMAP server and so relies on the IMAP server for authentication and mail access. The webmail system should still log all successful authentications, partly because your IMAP server logs are likely to report the source of all of the connections as being from the webmail system instead of the user's actual remote IP address.)
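As an illustration of what 'logged explicitly in one place' might look like, here is a small Python sketch that puts the time, service, user name, and remote IP on a single line. The field names, the function names, and the 'webmail' service label are assumptions made for the example and are not taken from any particular system.

```python
import logging
from datetime import datetime, timezone


def auth_success_record(service, user, remote_ip, when=None):
    """Build one self-contained log line for a successful authentication."""
    when = when or datetime.now(timezone.utc)
    return (
        f"{when.isoformat()} service={service} event=auth-success "
        f"user={user} remote_ip={remote_ip}"
    )


def log_auth_success(logger, service, user, remote_ip, when=None):
    """Log the record and return it, so that callers can also keep it."""
    record = auth_success_record(service, user, remote_ip, when)
    logger.info(record)
    return record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    # In the webmail case this would be called with the user's real remote
    # IP, not the address the IMAP server sees the connection coming from.
    log_auth_success(logging.getLogger("auth"), "webmail", "alice", "192.0.2.10")
```

The point of the single line is that nobody has to correlate several logs later in order to answer 'who logged in from where, and when'.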
It's hopefully obvious why you need this everywhere, on every service (no matter how insignificant). Partly this is because if you ever have an attacker exploiting a compromised account, you'll want to know everything the attacker did. Partly this is because you can never entirely predict what service an attacker is going to find interesting and then (ab)use. Partly this is because it's easier to make sure that something important doesn't slip through the cracks if you have a policy with no exceptions; otherwise, it's all too easy to deploy something without this logging on the grounds that you need the service now and you'll put the logging in later. (Remember, later never comes.)
Logging failed authentications for existing users is optional (my personal feelings on this are heretical, so I will skip them for now). For hopefully obvious reasons, you should omit the login name if you log failed authentications for nonexistent users.
(This is likely something that experienced large-scale sysadmins know automatically and is part of best practices and so on. Sometimes I'm slow, and it's always tempting to say 'ehh, we're too small and it doesn't really matter for this service because ...' or, for that matter, just not think about the issue when you're deploying a service. Security compromises are low-probability events for most places and most services, and security in general is an overhead.)
2012-08-20
Sysadmins hate updates (more or less)
In the middle of 'Everybody hates Firefox updates' is the following (about Firefox updates specifically):
Only after I heard from dozens of different users that the rapid release process had ruined Firefox did I finally get it through my thick skull: releasing an update is practically an act of aggression against your users. The developer perspective is "You guys are going to love this new update we've been working on!" The user perspective is "Oh god here comes another update, is there any way I can postpone the agony for a few more days?"
Yes. This. Change 'users' to 'sysadmins' and apply it to all software and you have a good snapshot of the sysadmin reaction to updates. There are pragmatic reasons for this but most sysadmins haven't deeply considered them; we just know that we don't like them. Well, we usually don't like them in practice.
What it generally comes down to is that except in rare cases, our systems are working right now. When you have a system that works now, an update is normally not a huge improvement; you have a working system now and hopefully you'll have a working system afterwards too. So most updates are a kind of make-work where you take a bunch of time and effort to wind up where you started (with the chance of winding up worse off). This is not exactly a compelling pitch.
(In theory, there are certain sorts of updates that we should like, things like updates that only fix bugs and improve performance. Sadly such updates basically don't exist in the real world; even when people try to create them, the possibility of mistakes in the update means that sysadmins have to test and qualify it.)
2012-08-16
Why I hate vendors, printers edition
(I was going to call this 'why I hate printers', but that turns out to not be quite accurate.)
Some people here need to buy a new printer, and they need it to hook into our general printing system. Our print system has what I think of as undemanding requirements; what we need is printers that will accept PostScript sent to them over the network (we would like for them to have ACLs and so on, but it isn't essential). We normally use HPs, and the people found a nice looking HP and approached us to ask if it would work with our print system.
This has turned out to be a surprisingly difficult question to answer. The printer has network support, lists itself as having 'HP postscript level 3 emulation', and it claims to require a proprietary binary driver in order to work on Linux. There are two interpretations of this. The optimistic one is that the printer will work fine if treated as a generic network PostScript printer, because that's what the specifications say it can do. The pessimistic one is that the binary driver is a big warning sign that this is actually some variant of what used to be called 'winprinters'; the printer doesn't actually natively support PostScript or anything fancy and all of that smarts is in the driver (which could be why the PostScript support is called an 'emulation').
The short way of putting this is that the combination of 'network printer', 'PostScript support', and 'required binary driver' is a contradiction; at least one of these three attributes must actually be false. Since the vendor specifications imply a contradiction, they aren't straightforwardly trustworthy; either I'm missing something or the vendor is obfuscating the true state of affairs for some reason. Unfortunately printer vendors have a long history of obfuscating their technical specifications in order to disguise things that would-be buyers don't like, one of which is 'some of these features aren't actually in the printer, they're in the driver'.
And that is why I don't like printer vendors today. They've managed to create a Schrödinger's printer, where we won't know what it can really do until and unless we buy one.
(I would say 'until someone buys one', but it's almost impossible to find real information from actual people about products in Internet searches any more. Search results are overrun by vendor pages (untrustworthy for the reasons discussed) or the pages of companies who want you to buy from them.)
(Of course the easy answer is to tell the people 'sorry, we can't guarantee that that printer will work with our print system'. This is not very satisfying because in practice it actually means 'do not buy this printer', which is a pretty strong thing to say about a printer that might well work fine.)
2012-08-05
Reasoning backwards (a story about what can happen to SATA disks)
This is a story, presented as a narrative.
Oh dear, one backup server hasn't finished yet despite running all night and all (work) day; in fact, Amanda is still doing three initial full backups of three filesystems on the single fileserver this backup server handles, and hasn't even started anything else. First thought: is there anything obviously wrong on the fileserver or the iSCSI backends? The answer is no; load is low, IO stats are low, the disks do not seem to be at all close to saturation in things like IO operations a second, network usage is low on the fileserver and the backends, and there is nothing reported in syslog for any of the machines.
It's time to reason backwards and use brute force.
- Use truss on one of the tar processes that Amanda is running on the fileserver to see if anything it's doing is slow. Nothing obvious jumps out (and in particular it seems to be reading data from the disks fine), but every so often one of the write()s it does stalls for two seconds or so.
Well, why? I know that tar is writing through a pipe to a local amandad process (which then forwards the data over the network to Amanda on the backup server). Time to move a step backwards.
- Truss the amandad process to see what's slowing it down; the answer is one of its write()s to the network stalls every so often, just like tar's write() does.
I know the fileserver's network isn't saturated, but that's only one side of the traffic flow.
- check the network usage on the backup server. Nope, there's no peculiar source of traffic volume that would be saturating it. For a bit I was wondering if the network link had come up at 100 Mbits instead of a gigabit, but there are periodic usage spikes of over 10 Mbytes/sec so that can't be it. So it has to be something local on the backup server.
Amanda on the backup server writes all incoming backups to a holding disk before writing them to 'tape' (our tapes are disks, but our version of Amanda is still strongly living in the tape world).
- check the IO stats on the holding disk and they are through the roof and well into crazy land. The big problem is that the disk is only writing data at around 5.2 Mbytes/sec (and 13 IO operations a second), despite always having over a hundred writes queued. As expected, utilization is at 100% and write service times are immense (over 8 seconds).
Ding. We have a winner (well, a loser).
As it happens we've seen SATA disks fail this way before; they get (very) slow but don't report any errors or SMART failures. Replacing the holding disk with a new one made the backup server happy with life again.
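As a rough sanity check, the holding disk numbers above hang together. This is a back-of-the-envelope sketch in Python that assumes 1 Mbyte = 1000 Kbytes and a queue of exactly 100 writes (the real figure was only 'over a hundred'); the function names are made up for the example.

```python
def avg_write_size_kb(throughput_mb_s, iops):
    """Average size of each IO in Kbytes, taking 1 Mbyte as 1000 Kbytes."""
    return throughput_mb_s * 1000 / iops


def queue_wait_seconds(queued_ios, iops):
    """Time for a newly queued IO to reach the front of the queue, if the
    disk completes IOs at a steady rate (queue length / completion rate)."""
    return queued_ios / iops


size_kb = avg_write_size_kb(5.2, 13)  # about 400 Kbytes per write
wait_s = queue_wait_seconds(100, 13)  # about 7.7 seconds
```

So each write was around 400 Kbytes, which is a perfectly reasonable size, and the problem was that the disk was only completing 13 of them a second. With a bit more than a hundred writes always queued, a wait of roughly eight seconds per write is about what you'd expect, which is in line with the observed service times.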
What I found most interesting about this problem was how indirect the symptoms were (in a sense). 'Really slow backups' is usually something caused by fileserver problems, not something quiet on the backup server itself. I had to reason backwards through a number of layers to arrive at the real culprit (and it helped a lot that we'd already had experience with clearly slow SATA disks).