2016-11-23
Sometimes a little change winds up setting off a large cascade of things
(This is a sysadmin war story.)
We have a password master machine, which runs some version of Ubuntu LTS like almost all of our machines. More specifically, it currently runs Ubuntu 12.04 and we need to upgrade it to Ubuntu 16.04. Naturally upgrading our master machine for passwords requires testing, which is a good thing because I wound up running into a whole cascade of interesting issues in the process. So today I'm going to walk through how one innocent change led to one thing after another.
Back in the Ubuntu 12.04 days, we set our machines up so that /bin/sh was Bash. I don't think this was the Ubuntu default for 12.04, but it was the default in the Ubuntu LTS version we started with, and we're busy sysadmins.
In 2014, we changed our Ubuntu 14.04 machines from Bash to the default of dash as /bin/sh (after finding issues with Bash) but left the 12.04 machines alone for various reasons.
(This change took place in stages, somewhat prompted by Shellshock, and we were fixing up Bashisms in our scripts for a while. By the way, Bashisms aren't necessarily a bug.)
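I don't have a record of which Bashisms were involved, but as a purely illustrative example, here is a typical construct that runs fine when /bin/sh is Bash but falls over when it's dash:

    #!/bin/sh
    # '[[' is a Bash builtin, not part of POSIX sh; when /bin/sh is
    # dash, this fails with something like 'sh: 1: [[: not found'.
    if [[ $1 = admin* ]]; then
        echo "$1 looks privileged"
    fi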
Our password change process works in part by using a PAM module to run a script that does important things like push the changed password to Samba on our Samba servers (possibly there is a better way to do this with PAM today, as in the aside below, but there is a lot of history here and it works). This script was written as a '#!/bin/sh' script, but it turns out that it was actually using some Bashisms, which had gone undetected until now because this was the first time we'd tried to run it on anything more recent than our 12.04 install. Since I didn't feel like hunting down all of the issues, I took the simple approach: I changed the script to start with '#!/bin/bash' and resumed testing.
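(As an aside, the 'better way to do this with PAM today' might be the stock pam_exec module, which can run a program during password changes. A hypothetical /etc/pam.d line, illustrative only and not our actual setup:

    # pam_exec runs the given program on password changes; with
    # expose_authtok it feeds the new password on standard input.
    password optional pam_exec.so expose_authtok seteuid /root/passwd-postprocess
)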
I was immediately greeted by a log message to the effect that Bash couldn't run /root/passwd-postprocess because of 'permission denied'. It took quite a lot of iterating around before I found the real cause: our PAM module was running the script directly from the setuid passwd program, so only its effective UID was root, and it turned out that both Bash and dash (as /bin/sh) were freaking out over this, although in different ways. Well, okay, I could fix that by telling Bash that everything was okay by using '#!/bin/bash -p'.
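What's going on here is documented behavior: if Bash starts up with its effective UID not equal to its real UID and it wasn't given -p, it resets the effective UID back to the real one, and dash does its own version of the same privilege dropping. A little script along these lines (a hypothetical test harness, not our actual code) makes the difference visible when it's run from something setuid:

    #!/bin/bash -p
    # When started from a setuid-root program, a plain '#!/bin/bash'
    # version of this prints your own UID twice, because Bash has
    # already thrown away the effective UID of 0 by this point. With
    # -p, the effective UID survives and the two values differ.
    echo "real uid:      $(id -ur)"
    echo "effective uid: $(id -u)"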
Things still failed, this time later on, when our passwd-postprocess script tried to run another shell script; that second shell script needed root permissions, but because it started with only '#!/bin/sh', its shell freaked out about the effective UID mismatch and immediately dropped privileges, causing various failures. At this point I saw the writing on the wall and changed our PAM module to run passwd-postprocess as root via setuid() (in the process I cleaned up some other things).
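The reason this works where scattering '-p' around wouldn't have is that calling setuid() while the effective UID is 0 changes the real UID as well, so everything further down the chain starts with both UIDs equal to 0 and neither Bash nor dash finds anything it wants to drop. In hindsight, a guard at the top of such scripts would at least have made the failure loud instead of mysterious; a sketch of what I mean (not from our actual scripts):

    #!/bin/sh
    # Bail out early and visibly if the shell has already dropped the
    # root privileges this script needs (for example because it was
    # started with an effective UID of 0 but a different real UID).
    if [ "$(id -u)" != 0 ]; then
        echo "$0: must be run as root (real UID 0)" >&2
        exit 1
    fi
    # ... privileged work would go here ...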
So that's the story of how the little change of switching /bin/sh from Bash to dash caused a cascade of issues that wound up with me changing how our decade-old PAM module worked. Every step of the way from the /bin/sh change to the PAM module modifications is modest and understandable in isolation, but I find the whole cascade rather remarkable, and I doubt I would have predicted it in advance even if I'd had all of the pieces in my mind individually.
(This is sort of related to fragile complexity, much like performance issues.)
We may have seen a ZFS checksum error be an early signal for later disk failure
I recently said some things on Twitter about our experience with ZFS checksums, and it turns out I have to walk one bit of it back a little. And in that lies an interesting story about something that may or may not be a coincidence.
A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected, and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from being completely fine to reporting some problems to having so many issues that ZFS faulted it, all within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.
Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.
(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)
This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.
(We've now had another disk failure, this time an HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)
PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleaned from it.
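Something along these lines is roughly what I have in mind, although the details are only a sketch (the zpool status parsing is approximate, and real device names will vary, especially with iSCSI backends):

    #!/bin/sh
    # Rough sketch: save SMART output for any pool device that
    # 'zpool status' shows a non-zero checksum (CKSUM) count for.
    # CKSUM is the fifth column of the device lines.
    zpool status | awk '$5 ~ /^[1-9]/ {print $1}' | while read dev; do
        smartctl -a "/dev/$dev" >"/var/tmp/smart-$dev.$(date +%Y-%m-%d)" 2>&1
    done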