Wandering Thoughts archives

2016-11-23

Sometimes a little change winds up setting off a large cascade of things

(This is a sysadmin war story.)

We have a password master machine, which runs some version of Ubuntu LTS like almost all of our machines. More specifically, it currently runs Ubuntu 12.04 and we need to upgrade it to Ubuntu 16.04. Naturally upgrading our master machine for passwords requires testing, which is a good thing because I wound up running into a whole cascade of interesting issues in the process. So today I'm going to walk through how one innocent change led to one thing after another.

Back in the Ubuntu 12.04 days, we set our machines up so that /bin/sh was Bash. I don't think this was the Ubuntu default for 12.04, but it was the default in the Ubuntu LTS version we started with and we're busy sysadmins. In 2014, we changed our Ubuntu 14.04 machines from Bash to the default of dash as /bin/sh (after finding issues with Bash) but left the 12.04 machines alone for various reasons.

(This change took place in stages, somewhat prompted by Shellshock, and we were fixing up Bashisms in our scripts for a while. By the way, Bashisms aren't necessarily a bug.)
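
To give a concrete flavour of what I mean by Bashisms, here are a few hypothetical examples (not lines from our actual scripts) that run fine when /bin/sh is Bash but break when /bin/sh is dash:

    #!/bin/sh
    # Hypothetical illustrations only, not our real code.
    if [[ "$1" == root ]]; then       # dash reports '[[: not found'
        echo "special handling for root"
    fi
    newname="${1//-/_}"               # dash reports 'Bad substitution'
    echo -e "done\n"                  # dash's echo prints the '-e' literally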

Our password change process works in part by using a PAM module to run a script that does important things like push the changed password to Samba on our Samba servers (possibly there is a better way to do this with PAM today, but there is a lot of history here and it works). This script was written as a '#!/bin/sh' script, but it turns out that it was actually using some Bashisms, which had gone undetected before now because this was the first time we'd tried to run it on anything more recent than our 12.04 install. Since I didn't feel like hunting down all of the issues, I took the simple approach; I changed it to start '#!/bin/bash' and resumed testing.

I was immediately greeted by a log message to the effect that bash couldn't run /root/passwd-postprocess because of permission denied. It took quite a lot of iterating around before I found the real cause; our PAM module was running the script directly from the setuid passwd program, so only its effective UID was root and it turned out that both Bash and dash (as /bin/sh) were freaking out over this, although in different ways. Well, okay, I could fix that by telling Bash that everything was okay by using '#!/bin/bash -p'.
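
The Bash behaviour involved here is documented: if Bash starts with an effective UID that doesn't match its real UID and -p isn't supplied, it resets the effective UID back to the real one. As a minimal sketch of the idea (not our actual script, and you'd have to run it from something setuid for the two UIDs to differ):

    #!/bin/bash -p
    # With -p (privileged mode) Bash keeps the effective UID it was
    # started with; without -p, the effective UID is silently reset
    # to the real UID.
    echo "real uid: $(id -ru), effective uid: $(id -u)"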

Things still failed, this time later on when our passwd-postprocess script tried to run another shell script; that second shell script needed root permissions, but because it started with only '#!/bin/sh', its shell freaked out about the same effective UID mismatch and immediately dropped privileges, causing various failures. At this point I saw the writing on the wall and changed our PAM module to run passwd-postprocess as root via setuid() (in the process I cleaned up some other things).
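
(For the record, dash documents the same default behaviour and has its own -p option to disable it. Making the PAM module setuid() to root before it runs passwd-postprocess means the real UID is root too, so neither shell has anything to reset. A purely hypothetical illustration of the sort of thing that was failing:)

    #!/bin/sh
    # Hypothetical second-stage script. Started with only an effective
    # UID of root, dash resets the effective UID back to the real UID,
    # so a root-only operation like this fails.
    cat /etc/shadow >/dev/null || echo "lost root privileges" >&2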

So that's the story of how the little change of switching /bin/sh from Bash to dash caused a cascade of issues that wound up with me changing how our decade-old PAM module worked. Every step of the way from the /bin/sh change to the PAM module modifications is modest and understandable in isolation, but I find the whole cascade rather remarkable and I doubt I would have predicted it in advance even if I'd had all of the pieces in my mind individually.

(This is sort of related to fragile complexity, much like performance issues.)

sysadmin/LittleChangeCascadeStory written at 23:01:59; Add Comment

We may have seen a ZFS checksum error be an early signal for later disk failure

I recently said some things about our experience with ZFS checksums on Twitter, and it turns out I have to take one part of it back a bit. And in that lies an interesting story about what may be a coincidence and may not be.

A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from everything being fine to reporting some problems to having so many issues that ZFS faulted it, all within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.

Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.

(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)
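
For anyone who doesn't deal with ZFS regularly: per-device read, write, and checksum error counts show up in 'zpool status', and a scrub re-reads everything and verifies it against the checksums, which is how I could tell the error didn't repeat. The pool name here is made up:

    zpool status -v tank    # per-device READ / WRITE / CKSUM error counts
    zpool scrub tank        # re-read all data and verify it against checksums
    zpool status tank       # did the scrub turn up any new errors?
    zpool clear tank        # optionally reset the error counters afterwards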

This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.

(We've now had another disk failure, this time an HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)

PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleaned from it.
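
(What I have in mind is nothing more elaborate than capturing smartctl output somewhere when we notice a checksum error; the device name and output path here are made up:)

    # smartctl comes from smartmontools; -x dumps all available SMART
    # and device information.
    smartctl -x /dev/sdb >/root/smart-records/sdb-$(date +%Y-%m-%d).txt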

solaris/ZFSChecksumErrorMaybeSignal written at 00:29:49; Add Comment

