2016-04-25
Why you mostly don't want to do in-place Linux version upgrades
I mentioned yesterday that we don't do in-place distribution upgrades, eg to go from Ubuntu 12.04 to 14.04; instead we rebuild starting from scratch. It's my view that in-place upgrades of at least common Linux distributions are often a bad idea for a server fleet even when they're supported. I have three reasons for this, in order of increasing importance.
First, an in-place upgrade generally involves more service downtime or at least instability than a server swap. In-place upgrades generally take some time (possibly in the hours range), during which things may be at least a little bit unstable as core portions of the system are swapped around (such as core shared libraries, Apache and MySQL/PostgreSQL installs, the mailer, your IMAP server, and so on). A server swap is a few minutes of downtime and you're done.
Second, it's undeniable that an in-place upgrade is a bit more risky than a server replacement. With a server replacement you can build and test the replacement in advance, and you also can revert back to the old version of the server if there are problems with the new one (which we've had to do a few times). For most Linux servers, an in-place OS upgrade is a one-way thing that's hard to test.
(In theory you can test it by rebuilding an exact duplicate of your current server and then running it through an in-place upgrade, but if you're going to go to that much more work why not just build a new server to start with?)
But those are relatively small reasons. The big reason to rebuild from scratch is that an OS version change means that it's time to re-evaluate whether what you were customizing on the old OS still needs to be done, whether you're doing it the right way, and whether you now need additional customizations because of new things in the OS. Or, for that matter, because your own environment has changed and something you were reflexively doing is now pointless or wrong. Sometimes this is an obvious need, such as Ubuntu's shift from Upstart in 14.04 LTS to systemd in 16.04, but often it can be more subtle than that. Do you still need that sysctl setting, that kernel module blacklist, or that bug workaround, or has the new release made it obsolete?
Again, in theory you can look into this (and prepare new configuration files for new versions of software) by building out a test server before you do in-place upgrades of your existing fleet. In practice I think it's much easier to do this well and to have everything properly prepared if you start from scratch with the new version. Starting from scratch gives you a totally clean slate where you can carefully track and verify every change you do to a stock install.
Of course all of this assumes that you have spare servers that you can use for this. You may not for various reasons, and in that case an in-place upgrade can be the best option in practice despite everything I've written. And when it is your best option, it's great if your Linux (or other OS) actively supports it (Debian and I believe Ubuntu), as opposed to grudging support (Fedora) or no support at all (RHEL/CentOS).
2016-04-24
Why we have CentOS machines as well as Ubuntu ones
I'll start with the tweets that I ran across semi-recently (via @bridgetkromhout):
@alicegoldfuss: "If you're running Ubuntu and some guy comes in and says 'we should use Redhat'...fuck that guy." - @mipsytipsy #SREcon16
@mipsytipsy: alright, ppl keep turning this into an OS war; it is not. supporting multiple things is costly so try to avoid it.
This is absolutely true. But, well, sometimes you wind up with exceptions despite how you may feel.
We're an Ubuntu shop; it's the Linux we run and almost all of our machines are Linux machines. Despite this we still have a few CentOS machines lurking around, so today I thought I'd explain why they persist despite their extra support burden.
The easiest machine to explain is the one machine running CentOS 6. It's running CentOS 6 for the simple reason that that's basically the last remaining supported Linux distribution that Sophos PureMessage officially runs on. If we want to keep running PureMessage in our anti-spam setup (and we do), CentOS 6 is it. We'd rather run this machine on Ubuntu and we used to before Sophos's last supported Ubuntu version aged out of support.
Our current generation iSCSI backends run CentOS 7 because of the long support period it gives us. We treat these machines as appliances and freeze them once installed, but we still want at least the possibility of applying security updates if there's a sufficiently big issue (an OpenSSH exposure, for example). Because these machines are so crucial to our environment we want to qualify them once and then never touch them again, and CentOS has a long enough support period to more than cover their expected five year lifespan.
Finally, we have a couple of syslog servers and a console server that run CentOS 7. This is somewhat due to historical reasons, but in general we're happy with this choice; these are machines that are deliberately entirely isolated from our regular management infrastructure and that we want to just sit in a corner and keep working smoothly for as long as possible. Basing them on CentOS 7 gives us a very long support period and means we probably won't touch them again until the hardware is old enough to start worrying us (which will probably take a while).
The common feature here is the really long support period that RHEL and CentOS give us. If all we want is basic garden variety server functionality (possibly because we're running our own code on top, as with the iSCSI backends), we don't really care about using the latest and greatest software versions and it's an advantage to not have to worry about big things like OS upgrades (which for us actually mean 'build a completely new instance of the server from scratch'; we don't attempt in-place upgrades of that scale and they probably wouldn't really work anyways, for reasons outside the scope of this entry).
2016-04-09
Why your Ubuntu server stalls a while on boot if networking has problems
Yesterday I wrote on how to shoot yourself in the foot by making
a mistake in /etc/network/interfaces.
I kept digging into this today, and so now I can tell you why this
happens and what you can do about it. The simple answer is that it
comes from /etc/init/failsafe.conf.
What failsafe.conf is trying to do is kind of hard to explain
without a background in Upstart (Ubuntu's 'traditional' init system).
A real System V init system is always in a 'runlevel', and this
drives what it does (eg it determines which /etc/rcN.d directory
to process). Upstart sort of half abandons runlevels; they are not
built into Upstart itself and some /etc/init jobs don't use them,
but there's a standard Upstart event to set the runlevel and
many /etc/init jobs are started and stopped based on this runlevel
event.
Let's simplify that: Upstart's runlevel stuff is a way of avoiding
specifying real dependencies for /etc/init jobs and handling them
for /etc/rcN.d scripts. Instead jobs can just say 'start on
runlevel [2345]' and get started once the system has finished its
basic boot processing, whatever that is and whatever it takes.
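To make that concrete, here is a minimal sketch of a hypothetical job
that uses this pattern; the job name and daemon path are invented for
illustration and aren't taken from a real Ubuntu system:

# /etc/init/exampled.conf -- hypothetical job, for illustration only
description "example daemon"
# start once something emits the runlevel event for normal
# multi-user operation
start on runlevel [2345]
# stop when we leave those runlevels, eg during shutdown
stop on runlevel [!2345]
respawn
exec /usr/sbin/exampled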
Since the Upstart runlevel is not built in, something must generate
an appropriate 'runlevel N' event during boot at an appropriate
time. That thing is /etc/init/rc-sysinit.conf, which in turn
must be careful to run only at some appropriate point in Upstart's
boot process, once this basic boot processing is done. When is basic
boot processing done? Well, the rc-sysinit.conf answer is 'when
filesystems are there and static networking is up', which in Upstart
terms means when the filesystem and static-network-up Upstart events
are emitted by something.
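As a concrete illustration, the relevant start condition in
rc-sysinit.conf looks roughly like this (paraphrased from memory, not
a verbatim copy of the file):

# /etc/init/rc-sysinit.conf (excerpt, approximate)
start on (filesystem and static-network-up) or failsafe-boot

The 'or failsafe-boot' alternative is what the next bit is about.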
So what happens if networking doesn't come fully up, for instance
if your /etc/network/interfaces has a mistake in it? If Upstart
left things as they were, your system would just hang in early boot;
rc-sysinit.conf would be left waiting for an Upstart event that
would never happen. This is what failsafe.conf is there for. It
waits a while for networking to come up, and if that doesn't happen
it emits a special Upstart event that tells rc-sysinit.conf to
go on anyways.
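Stripped down, the heart of failsafe.conf looks something like the
following sketch; this is simplified and written from memory, so don't
take it as an exact copy of what Ubuntu ships:

# /etc/init/failsafe.conf (simplified sketch, not verbatim)
start on filesystem and net-device-up IFACE=lo
# stop waiting early if static networking comes up on its own
stop on static-network-up
emits failsafe-boot
console output
script
    # wait a while, in stages, for static networking to appear
    sleep 20
    sleep 40
    sleep 60
    # give up and tell rc-sysinit.conf to carry on regardless
    exec initctl emit --no-wait failsafe-boot
end script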
In the abstract this is a sensible idea. In the concrete, failsafe.conf
has a number of problems:
- the timeout is hardcoded, which means that it's guaranteed to
be too long for some people and probably not long enough for
others.
- it doesn't produce any useful messages when it has to delay,
and if you're not using Plymouth
it's totally silent. Servers typically don't run Plymouth.
- Upstart as a whole has a very inflexible view of what 'static
networking is up' means. It apparently requires that every 'auto'
interface listed in
/etc/network/interfaces both exist and have link signal (have a cable plugged in and be connected to something); see eg this bug and this bug. You don't get to say 'proceed even without link signal' or 'this interface is optional' or the like.
For Ubuntu versions that use Upstart, you can fix this by changing
/etc/init/failsafe.conf to shorten the timeouts and print out
actual messages (anything you output with eg echo will wind up
on the console). We're in the process of doing this locally; I
opted to print out a rather verbose message for my usual reasons.
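To give you an idea of the sort of local change involved, the modified
script section might wind up looking roughly like this; the specific
timeout and messages here are just an example, not exactly what Ubuntu
ships or what we settled on:

# script section of a locally modified /etc/init/failsafe.conf (sketch)
script
    echo "failsafe: waiting up to 30 seconds for static networking to come up"
    sleep 30
    echo "failsafe: static networking still isn't up; continuing the boot anyway"
    exec initctl emit --no-wait failsafe-boot
end script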
Of course, all of this is going to be inapplicable in the upcoming
Ubuntu 16.04, since Ubuntu switched from Upstart to systemd as of
15.04 (cf).
However, Ubuntu has put something similar to failsafe.conf
into their systemd setup and thus I expect that we'll wind up making
similar modifications to it in some way.
(A true native systemd setup has a completely different and generally more granular way of handling failures to bring up networking, but I don't expect Ubuntu to make that big of a change any time soon.)
2016-04-08
How to shoot yourself in the foot with /etc/network/interfaces on Ubuntu
Today I had one of those self-inflicted learning experiences that I get myself into from time to time. I will start with the summary and then tell you the story of how I did this to myself.
The summary is that errors in /etc/network/interfaces can cause
your system to stall silently during boot for a potentially significant
amount of time.
One sort of error is a syntax error or omitting a line. Another sort of error is accidentally duplicating an IP address between an interface's primary address and one of its aliases. If you do the latter, you will get weird errors in log files and from tools that don't actually help you.
How I discovered this is that today I was doing a test install of a new web server in a VM image. Our standard practice for web server hosts is that we don't make their hostname be the actual website name; instead they have a real hostname and then one or more website names as aliases. On most of our web servers, these are IP aliases. However, we're running short of IP addresses on our primary network and when I set up this new host I decided to make its single website just be another A record to its single IP address.
When I reached the end of the install process, I'd forgotten this
detail; instead I thought the server needed the website name added as
an IP alias. So I looked up the IP address for the website name and
slavishly added to /etc/network/interfaces something like:
auto eth0:0
address <IP>
netmask 255.255.255.0
network <blah>.0
(The sharp eyed will notice that there are two errors here.)
Then I rebooted the machine and it just sat there for quite a while.
After a couple of reboots and poking several things (eg, trying an
older kernel) I wound up looking at interfaces in a rescue shell
and noticed my silly mistake. Or rather, my obvious silly mistake:
I'd left out the 'iface eth0:0 inet static' before the address
et al. So I fixed that and rebooted the machine.
Imagine my surprise when the machine still hung during boot. But
this time I let it sit for long enough that the Ubuntu boot process
timed out whatever it needed to, and the machine actually came up.
When it did, I poked around to try to find out what was wrong and
eventually noticed that I had no eth0:0 alias device. This led
me to notice that the IP address I was trying to give to eth0:0
was the same address that eth0 already had, at which point I
finally figured out what was wrong and was able to fully correct
it.
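For the record, a correct alias stanza for a genuinely different IP
address would look roughly like the following; this is just the
standard ifupdown syntax, with the 'iface' line present and with an
address that eth0 doesn't already have:

auto eth0:0
iface eth0:0 inet static
address <some other IP>
netmask 255.255.255.0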
The good news is that now I know another place to look if an Ubuntu machine has mysterious 'hang during boot' problems. (Technically it was a stall, but stalling several minutes with no messages about it is functionally equivalent to a hang from the sysadmin perspective.)
(This is why I test my install instructions in virtual machines before going to the bother of getting real hardware set up. Sometimes it winds up feeling overly nitpicky, and sometimes very much not.)