Sometimes a little change winds up setting off a large cascade of things

November 23, 2016

(This is a sysadmin war story.)

We have a password master machine, which runs some version of Ubuntu LTS like almost all of our machines. More specifically, it currently runs Ubuntu 12.04 and we need to upgrade it to Ubuntu 16.04. Naturally, upgrading our password master machine requires testing, which is a good thing because I wound up running into a whole cascade of interesting issues in the process. So today I'm going to walk through how one innocent change led to one thing after another.

Back in the Ubuntu 12.04 days, we set our machines up so that /bin/sh was Bash. I don't think this was the Ubuntu default for 12.04, but it was the default in the Ubuntu LTS version we started with and we're busy sysadmins. In 2014, we changed our Ubuntu 14.04 machines from Bash to the default of dash as /bin/sh (after finding issues with Bash) but left the 12.04 machines alone for various reasons.

(This change took place in stages, somewhat prompted by Shellshock, and we were fixing up Bashisms in our scripts for a while. By the way, Bashisms aren't necessarily a bug.)

Our password change process works in part by using a PAM module to run a script that does important things like push the changed password to Samba on our Samba servers (possibly there is a better way to do this with PAM today, but there is a lot of history here and it works). This script was written as a '#!/bin/sh' script, but it turns out that it was actually using some Bashisms, which had gone undetected until now because this was the first time we'd tried to run it on anything more recent than our 12.04 install. Since I didn't feel like hunting down all of the issues, I took the simple approach; I changed it to start with '#!/bin/bash' and resumed testing.

I was immediately greeted by a log message to the effect that bash couldn't run /root/passwd-postprocess because of permission denied. It took quite a lot of iterating around before I found the real cause; our PAM module was running the script directly from the setuid passwd program, so only its effective UID was root (its real UID was still that of the user changing their password), and it turned out that both Bash and dash (as /bin/sh) were freaking out over this mismatch, although in different ways. Well, okay, I could fix that by telling Bash that everything was okay by using '#!/bin/bash -p' (without -p, Bash resets its effective UID to the real UID when the two differ).
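To make the real versus effective UID distinction concrete, here is a minimal sketch in Python (chosen only for illustration; it is not how our PAM module or scripts are written). The function name uid_mismatch is made up for this example. Run from an ordinary login it reports no mismatch; inside a setuid-root program like passwd, the effective UID would be 0 while the real UID stays that of the invoking user.

```python
import os


def uid_mismatch():
    """Return (real_uid, effective_uid, mismatch) for the current process.

    'mismatch' is True when the process is running with an effective UID
    that differs from its real UID, as happens inside a setuid program.
    """
    real = os.getuid()
    effective = os.geteuid()
    return real, effective, real != effective


if __name__ == "__main__":
    real, effective, mismatch = uid_mismatch()
    print(f"real UID {real}, effective UID {effective}")
    if mismatch:
        # This is the situation a shell notices at startup; Bash without -p
        # responds by setting its effective UID back to the real UID.
        print("real and effective UIDs differ")
```

This is the condition that a shell checks at startup, and it is inherited by anything the setuid program runs unless something changes the real UID first.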

Things still failed, this time later on when our passwd-postprocess script tried to run another shell script; that second shell script needed root permissions, but because it started with only '#!/bin/sh', its shell freaked out about the effective UID mismatch and immediately dropped privileges, causing various failures. At this point I saw the writing on the wall and changed our PAM module to run passwd-postprocess as real root (real UID as well as effective UID) by calling setuid() (in the process I cleaned up some other things).
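The general shape of that fix can be sketched as follows. This is an illustration under assumptions, not our actual PAM module code (which is not shown here): the function name run_with_matched_uid is hypothetical, and Python's os module is used only because it wraps the same fork(), setuid(), and exec calls. The idea is that the child process makes its real UID match its effective UID before running the script, so no shell further down the chain sees a mismatch.

```python
import os


def run_with_matched_uid(argv):
    """Fork, make the child's real UID match its effective UID, exec argv.

    Returns the child's exit status, or -1 if it did not exit normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child. In a setuid-root process the effective UID is 0, and
        # setuid(0) with root privileges sets the real, effective, and
        # saved UIDs all to 0. Using geteuid() keeps this sketch a
        # harmless no-op when it is run unprivileged.
        try:
            os.setuid(os.geteuid())
            os.execv(argv[0], argv)
        except OSError:
            pass
        os._exit(127)  # only reached if setuid() or execv() failed
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1


if __name__ == "__main__":
    # Stand-in command; the real thing would be the post-processing script.
    print(run_with_matched_uid(["/bin/sh", "-c", "exit 0"]))
```

Because the change happens once, before the first script runs, every later '#!/bin/sh' script inherits matching real and effective UIDs and none of them need a '-p' workaround. A real module would presumably also want to deal with group IDs and the environment; that is left out of this sketch.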

So that's the story of how the little change of switching /bin/sh from Bash to dash caused a cascade of issues that wound up with me changing how our decade-old PAM module worked. Every step of the way from the /bin/sh change to the PAM module modifications is modest and understandable in isolation, but I find the whole cascade rather remarkable and I doubt I would have predicted it in advance even if I'd had all of the pieces in my mind individually.

(This is sort of related to fragile complexity, much like performance issues.)


