2014-01-31
Linux has at least two ways that disks can die
We lost a disk on one of our iSCSI backends last
night. Normally when an iSCSI data disk dies on a backend, what happens
at the observable system level is that the disk vanishes. If it used to
be, say, sdk, then there is no sdk any more. I'm not quite sure what
happens at the kernel level as far as our iSCSI target software goes,
but the reference that the iSCSI target kernel module holds doesn't
work any more. This is basically just the same as what happens when you
physically pull a live disk and I assume that the same kernel and udev
mechanisms are at work.
(When you swap out the dead disk and put a new one in, the new one shows up as a new disk under some name. Even if it winds up with the same sdX name, it's sufficiently different a device that our iSCSI target software still won't automatically talk to it; we have to carefully poke the software by hand.)
This is not what happened this time around. Instead the kernel seems
to have basically thrown up its hands and declared the disk dead but
not gone. The disk was still there in /dev et al and you could
open the disk device, but any attempt to do IO to it produced IO
errors. Physically removing the dead disk and inserting a new one did
nothing to change this; there doesn't seem to have been any hotplug
activity triggered or anything. All we got was a long run of errors
like:
kernel: sd 4:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: end_request: I/O error, dev sdm, sector 504081380
(Kernel log messages suggest that possibly this happened because the kernel was unable to successfully reset the channel, but that's reading tea leaves very closely.)
I was going to speculate about this sort of split making sense, but I
don't actually know what level of the kernel this DID_BAD_TARGET
error comes from. So this could be a general kernel feature to declare
disks as 'present but bad' or this could be a low level driver reporting
a hardware status up the stack (or it could be something in between, where
a low-level driver knows the disk is not there but this news got lost at a
higher level).
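If I cared enough to dig, grepping a kernel source tree would probably narrow down where it comes from. Something like this, run from the top of the tree, is the obvious starting point (although interpreting what turns up is another matter):
grep -rl DID_BAD_TARGET include drivers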
Regardless of what this error means and where it comes from, we were still left with a situation where the kernel thought a disk was present when we had already physically removed it. In the end we managed to fix it by forcing a rescan of that eSATA channel with:
echo - - - >/sys/class/scsi_host/hostN/scan
That woke the kernel up to the disk being gone, at which point a newly inserted replacement disk was also recognized and we could go on as we usually do when replacing dead disks.
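If you don't already know which hostN a given disk sits behind, the 'sd 4:0:0:0' in the kernel errors gives it away; the first number is the SCSI host, so here the rescan went to host4. Failing that, lsscsi (if it's installed) shows the mapping from those H:C:T:L numbers to sdX names:
lsscsi
and /sys/class/scsi_host/ itself lists all of the hostN entries you can poke.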
I'm going to have to remember these two different failure modes in the future. We clearly can't assume that all disk failures will be nice enough to make the disk disappear from the system, so we can't assume that all visible disks are actually working (and thus 'the system is showing N drives present, as expected' is not a full test).
(This particular backend has now been up for 632 days, and as a result of this glitch we are considering perhaps rebooting it. But reboots of production iSCSI backends are a big hassle, as you might imagine.)
2014-01-27
Things that affect how much support you get from a Linux distribution
Recently commentators have noted (here and here) that you may get less support than you expect from distributions like Debian and Ubuntu because both split packages into multiple sections and may apply different support policies to different sections. In my view, there are several factors that affect how much support you get in practice in this situation.
The obvious first factor is what the official support policies are for the different sections of the package repository. This can be a little bit hard to find out, but Debian's is here and Ubuntu's seems to be more or less here (assuming that Ubuntu LTS doesn't change the support picture, just lengthens it). Both fully support only their main section and seem to leave support for other sections up to the community.
(Debian is explicit about this; Ubuntu seems to imply it.)
The next question is what packages are in what section. In Debian
and Ubuntu there are two ways to determine this. First, you can use
command line queries such as 'apt-cache policy <package>' or
'apt-cache madison <package>'. It's a little bit tedious to do
this for a lot of packages so if you want to do this en masse you're
probably better off fetching the Packages file for eg Ubuntu
12.04 64-bit and
processing it directly. This will let you see the full list of
what's in the fully-supported Ubuntu main section; you can then
correlate it against the packages that you care about.
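As a sketch of the en-masse version, assuming the standard Ubuntu archive layout for 12.04 on amd64 (adjust the URL for your own release and architecture):
# everything in the fully supported main section
wget http://archive.ubuntu.com/ubuntu/dists/precise/main/binary-amd64/Packages.gz
zcat Packages.gz | awk '/^Package:/ {print $2}' | sort -u >main-packages
# then check the packages you care about against the list, eg:
grep -x openssh-server main-packages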
Finally, what really matters is what happens in practice. A large part of this is whether the normal community maintainers of packages that turn out to have security issues step up to fix them and whether the security team then pushes out the resulting security updates. Beyond that, we might see people stepping in to push security fixes into packages that they don't normally maintain (including people who normally maintain core packages or perhaps even the security team) for various reasons. This is something that can only be assessed as it happens or on a historical basis, ie by going back to look at what security releases got made for non-core packages.
(A thorough assessment would look both at whether non-core packages got security updates and whether known security issues in non-core packages weren't fixed. For that matter you should look to see if fully supported packages got prompt security updates or if some things took a while or dropped through the cracks.)
I don't have any particular answers here, since generating them for
even our own systems would take more work than I care to put in
right now. I did take an informal look through what Ubuntu packages
are in what section and almost everything I care significantly
about is in main, which Ubuntu theoretically fully supports.
(I looked at things like OpenSSH, Apache, Dovecot, Exim, and Samba.)
Sidebar: Another place to look
Ubuntu's security repository segments things by section, which means
that you can get a list of package security updates for universe,
multiverse, and so on. Checking shows me that Ubuntu has released
at least some 12.04 security updates for packages outside of main.
I have no idea how comprehensive these updates are (ie, how many
packages outside of main have security issues but no updates).
(See eg here for 12.04.)
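A quick and dirty way to see what's actually been published there, again assuming the standard archive layout (here 12.04 universe on amd64):
wget -qO- http://security.ubuntu.com/ubuntu/dists/precise-security/universe/binary-amd64/Packages.gz | zcat | grep -c '^Package:'
This just counts package entries; matching them up against known security issues is the hard part.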
Debian probably does the same thing but I haven't checked.
2014-01-16
Debian does not have long term support
Every so often someone says that Debian's stable releases have a long support period. Unfortunately this is what one would call 'wrong', and for at least two reasons.
(For now, let us define 'support' here as 'gets security fixes'.)
First, it is wrong as a plain matter of fact. No Debian release has ever been supported for more than a sliver over four years, and only one release has hit that mark (Debian 3.0 'woody', released mid 2002 and supported through mid 2006, per here). Every release since 2005 has been supported for only about three years. Ubuntu LTS manages five years; Red Hat Enterprise Linux goes even longer.
Second, it is wrong as a philosophical matter because Debian doesn't promise any particular support period. Debian doesn't promise to support a release for X years, just more or less to support it for a year after the next release comes out. If the next release comes out in roughly two years (as has been the case since 2005), you get three years. If the next release comes out in a year, you get two years. And so on. The only way you get long support periods is if Debian is painfully slow to make releases.
This has two consequences. First, Debian support periods are unpredictable. If you install a machine with Debian, you have no sure idea how long you'll have support for (although you can often make an informed guess). Second, the real support period for a machine can be as low as a year, if you have to install a machine shortly before the next release comes out.
(In theory the minimum period is even lower, but this would likely require Debian to do two releases in a year. This seems, well, unlikely.)
Real long term support involves three things. First, you must commit up front to a definite support period (as Ubuntu and Red Hat do). Second, you must actually have a long support period (which is always shorter than it looks in practice); three years doesn't really cut it even if Debian committed to that for releases (which they are not going to do). Third, you need a significant support overlap between the current release and the previous release because of the real support period issue.
Debian does none of these, which is fair enough; Debian doesn't claim it has long term support. I just wish people would stop claiming that it did on its behalf.
2014-01-15
SELinux fails again (Fedora 20 edition)
I've always run SELinux on my laptop; it's how Fedora installs things, it's worked without problems, and despite all the bad things I've said about it I sort of consider SELinux to be the right thing to do so I wanted to keep with it. And unlike my other machines, my laptop is a completely stock setup at the system level and I don't do anything unusual on it. Then I upgraded to Fedora 20 with yum and things exploded.
(Some of the problems were fixed by 'restorecon -R /', which is not
unreasonable given that I upgraded with yum.)
The major thing SELinux did was it prevented NetworkManager from setting up IPSec for L2TP VPNs. This is a new failure in Fedora 20 (it used to work in Fedora 19). This is actually a very bad failure because for whatever reason NetworkManager was willing to keep on going and set up a L2TP 'VPN' connection without the IPSec encryption, giving me less a Virtual Private Network and more a Virtual Plaintext Network. So let me emphasize this:
SELinux significantly reduced my security in practice.
I would have been much better off without SELinux because then I would have had a VPN that was actually encrypted instead of one that I just thought was encrypted and that was instead allowing any random bystander to snoop my wireless traffic.
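One lesson I'm taking away is to verify the encryption instead of trusting the 'VPN is up' indicator. A rough check, assuming the L2TP VPN is using kernel IPSec the way these setups normally do, is:
ip xfrm state
If that lists ESP security associations for the VPN peer, the IPSec layer is actually there; if it comes back empty while the tunnel claims to be up, your traffic is going out in the clear.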
(This is where some clever person blames NetworkManager instead for being willing to continue setting up a L2TP VPN without IPSec. No. Wrong. The simple fact is that things worked securely without SELinux and they didn't work with SELinux. Ergo, SELinux is the party that broke things. Arguing that it is not SELinux's fault is not solving the actual security problem here. SELinux made my system less secure, regardless of exactly how that happened. If you argue that this doesn't matter you are not interested in security, you are interested in mathematics. Please stay away from my systems.)
Now let us talk about SELinux's bad user interface failures. In the
process of going back and forth with SELinux and my Fedora 20 upgrade, I
did a bunch of flailing around. I followed the directions of the SELinux
alert widget to add some new policies with audit2allow to try to fix
things in a relatively graceful way, I silenced some alerts when I was
running in permissive mode before I found restorecon and thought it
had solved all of my problems, and so on. What I would like to do now is
clear all of that away and revert to a stock Fedora 20 SELinux setup so
that I can dutifully report all of these policy problems.
I haven't been able to find out how to do so.
I am a relatively experienced sysadmin. I can read manpages, scan Python code, grep everything in sight, and so on. I have utterly failed to find out how to revert to a stock policy or to un-silence various alerts so I can use the nice alert program to report them as bugs. At this point it appears to be literally impossible for me to do this without installing Fedora 20 from scratch, and that's not going to happen.
(I'm relatively sure that it isn't literally impossible and that there is some magic incantation somewhere.)
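For the record, the 'add some new policies' part was roughly the standard audit2allow dance, something like this ('mylocal' is a made-up module name here):
grep denied /var/log/audit/audit.log | audit2allow -M mylocal
semodule -i mylocal.pp
It's walking all of that and the silenced alerts back afterwards that I can't see how to do.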
As mentioned, I'm a sysadmin. If I can't figure out how to do this, what chance does a regular user have? In fact, what chance does a regular user have to make SELinux work in general when something like this happens? Even adding a policy exemption takes manual cut and paste work (and knowing what certain Unix documentation conventions are). Real security absolutely must be usable. SELinux is not and this is its largest failure.
The golden rule is what I said on Twitter: people use their computers to get things done. If your security system gets in the way of getting things done, people will remove it. If they can't figure out how to remove it, they will remove the entire system. If Linux is lucky, this will involve installing Ubuntu instead of Fedora. If Linux is not lucky, this results in another user saying 'well, Linux doesn't work, I guess it's time for Windows'.
(Have I filed bugs about this? Of course not. I can't. See above. To file bugs on anything apart from 'you have a massive UI fail here' I would have to install Fedora 20 from scratch, overwriting all of my customization and setup work.)
2014-01-13
Sadly, we're moving away from Red Hat Enterprise Linux
There are two versions of this story. In one version I'd start by noting that RHEL never really caught on here because it wasn't enough better than Ubuntu LTS to be mostly worth the hassle, mention our iSCSI backends as the exception, and then explain that as part of turning over that entire infrastructure we decided that we might as well run Ubuntu LTS on the backends instead of RHEL. This is somewhat more sensible than it might look; our backends are essentially appliances that we almost never update, we don't allow access to them, and they run custom kernels and software anyways so the actual distribution is doing almost nothing.
This version of the story is true but it is not the real story, or at least not the full story. To start the full story I have to say that when I say 'RHEL' here I really mean RHEL, not CentOS, because for a long time the university has had an (inexpensive) site-wide RHEL license. With that background, there are three reasons we're moving away from RHEL.
First, this year we had major, multi-month problems getting Red Hat to renew our site license and for a while it looked like we would not be able to afford to renew it at all. This was quite disruptive, a lot of people are unhappy with Red Hat (especially the people who actually pay for the license), and the future of our site license is now both uncertain and precarious. One way or another it's probably not lasting more than a year or two more. This means moving away from genuine RHEL to, say, CentOS.
Second, I'm not enthused about RHEL 6, and the timing for adopting it is not great now that it's been three years since it was released. The upshot is that I have no real desire to put RHEL 6 on anything at this point.
What we'd like to use is RHEL 7, but there are two problems that together make up the third reason. Not only is RHEL 7 not out yet, but our forced switch to CentOS means that we need to wait not just for the RHEL 7 release but for CentOS to (re)build a CentOS 7 from it. This is a risky wait because the rebuild is a lot of work and is almost certainly going to take CentOS a not-insignificant amount of time. Between the two delays, it could be a year before CentOS 7 is available, and we need new iSCSI backends long before then.
(When I looked into RHEL 6 and said 'let's wait until RHEL 7', as covered in an earlier entry, it was pretty early in 2013 and the first and the third thing hadn't happened yet. So I expected RHEL 7 before the end of 2013 and did not know we would hit major license issues.)
I'm sad about this. I genuinely like RHEL and generally prefer it to Ubuntu LTS. But through this combination of factors it is effectively dead here.
Sidebar: the risks of CentOS
I've argued before that the risks of using CentOS instead of RHEL are relatively small once CentOS has put out an initial major release (eg CentOS 7). However, 'small' is not 'zero' and yes, this factor is in the back of my mind when I consider using CentOS here.
(The recent Red Hat acquisition of CentOS does not reduce this. If anything it increases it, since Red Hat has been very bland about why it's doing this and whether (and how) CentOS is going to change as a result of it. As a result plenty of people are speculating about potential results and changes, some of which are things we would very much not want.)
2014-01-11
Why I am not enthused about Red Hat Enterprise 6
I have to admit straight off the bat that this is mostly an uninformed prejudice. I have not actually run RHEL 6 machines (our few RHEL machines are RHEL 5); all that I've ever done with it is install it a couple of times to test some things. So part of this is based on general knowledge of what it has and part of this is based on those install experiences.
My impression of the install experience was not positive. Although I don't remember details, it struck me as generally less functional and more annoying than the RHEL 5 equivalent. I could make it work but I didn't like it. And of course the capstone of the install experience is that it uses NetworkManager (in a situation that NM is not good for) and then leaves your networks down when you boot the installed system. This means that the very first post-install thing we'd have to do with RHEL 6 is to reconfigure all the networking, ripping NM out and putting the old ways back in.
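The ripping out isn't hard, just tedious; on a RHEL 6 style system it's roughly the following (a sketch from memory, not a tested recipe), plus setting NM_CONTROLLED=no and ONBOOT=yes in the relevant /etc/sysconfig/network-scripts/ifcfg-* files:
chkconfig NetworkManager off
service NetworkManager stop
chkconfig network on
service network start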
Beyond that, RHEL 6 is simply built on an awkwardly transitional base because it was done at a bad time. It's based on Fedora 12 plus chunks of 13 and 14, which puts it just before Fedora made a number of important changes such as moving from upstart to systemd, a major change in hotplug device handling, and so on. You get some changes from the old ways of RHEL 5 but they are by and large the wrong changes, ones that would later be abandoned. And you get other changes that were only half-baked at the time of Fedora 12, such as NetworkManager (especially on servers). All of this leaves me unenthused.
We are not big RHEL users here, so my general plan (to the extent that I had any) was to skip RHEL 6 entirely and wait for RHEL 7 for, say, new iSCSI backends. Various things have gone wrong with that, but that's another entry.
2014-01-08
The good and bad of Linux's NetworkManager
I have a conflicted relationship with NetworkManager that gives me rather divided opinions of it. The short version is that sometimes it's good and other times it's terrible, depending mostly on what sort of machine you're using it on (by choice or otherwise).
Where NetworkManager is good is on graphical machines with relatively simple network configurations, especially if they move between networks. This is typical for 'plug it into the network and do DHCP' desktops and for laptops in general. In most distributions, NM is going to be by far the easiest way to manage roving between wired networking and one or more wireless networks, possibly with VPNs on top of them. Although there are rough edges, especially in Gnome 3 and derived desktops, everything is generally easily discoverable and manageable without hassle or long reads of manual pages.
(I'm sure you can build a suite of tools that work just as well as NM. The great advantage of NM on a graphical machine like this is that someone has already done all of the work for you.)
Where NetworkManager is bad is on servers or on machines with complex networking configurations (where by this I mean things like bridged VLANs with policy based routing and per-network firewall rules, nailed up IPSec tunnels, and so on). On servers without graphics and with static network configurations, NetworkManager is overkill, over-complication at boot time, and hard to manage. While I'm not intrinsically opposed to setting up networks through commands instead of configuration files, NetworkManager's command line programs come across very much as underdocumented and incomplete afterthoughts.
(Also, defaults like 'interfaces are not enabled until someone logs in' are completely wrong for servers and apparently too hard-coded for people like Red Hat to change.)
Given that people are using NetworkManager by default on distributions aimed at servers, I find its current limitations there to be extremely irritating. It wouldn't take all that much work to make NetworkManager fully usable in a server environment and with the right features it could actually offer some interesting capabilities that you can't get easily today.
(For instance NetworkManager knows right away when link status goes away or comes back on an Ethernet interface. There are server environments where that would be very handy to know and to react to.)
There are also simple things that NetworkManager could add to make itself much more useful in a complex server environment. The easiest one is the option to run a command for you when a specific network comes up or goes down, which would give people with complex needs a hook where they could take care of things that NetworkManager can't handle itself (a sketch of what I mean is at the end of this entry). I don't think that NetworkManager will ever add something like that, though, because its goal seems to be to completely own and control the machine's networking inside itself.
(Arguably this is the core problem with NetworkManager in sophisticated environments; in them it's never, ever going to be the sole arbiter over all networking. If it insists on all or nothing in practice the answer must be 'nothing'.)
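To make the hook idea concrete, here's a purely hypothetical sketch of the kind of script I'd want to attach to a specific connection, assuming NetworkManager handed it the interface name and an 'up' or 'down' action (it offers no such per-connection option, so everything about the interface here is made up):
#!/bin/sh
# hypothetical per-connection hook: $1 = interface, $2 = up or down
case "$2" in
up)   ip rule add from 172.16.10.0/24 table 100 ;;
down) ip rule del from 172.16.10.0/24 table 100 ;;
esac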