Wandering Thoughts archives

2016-11-25

Why we don't and can't use the pam_exec PAM module

Yesterday I mentioned that we have a locally written PAM module that runs a shell script to do various post-password-change things. If you're reasonably familiar with PAM modules, you may be reminded of the pam_exec module, and you might even be wondering why we don't just use it instead of having our own module. That's actually a good question, and when I was working on this recently I wondered it myself and went as far as setting it up and testing it to see if we could use it.

Sadly, it turns out that the answer fits in a Tweet:

pam_exec(passwd:chauthtok): expose_authtok not supported for type password

That's the sound of my clever PAM idea going down in flames.

Pam_exec has an expose_authtok option that sends the user's password to your script on standard input, which is exactly what we need in order to do things like propagate the new password into our Samba servers. Except it unfortunately isn't supported when you're changing passwords. I don't know why. If the issue is that expose_authtok isn't really right for the password change case, I don't know why there isn't a similar option specifically for exposing the new password.

(No doubt the PAM people have their reasons, and this is arguably sort of documented because the option is described with the phrase 'during authentication'.)
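For illustration, here's roughly the sort of line I was testing, written as an /etc/pam.d/passwd entry; the script path is our local one, and the 'optional' control flag is just one plausible choice rather than a recommendation:

  password  optional  pam_exec.so  expose_authtok /root/passwd-postprocess

This is exactly the kind of configuration that produces the 'expose_authtok not supported' complaint above the moment you actually try to change a password through it.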

This may be the first time I've looked at pam_exec, but if so it probably shouldn't have been. Pam_exec dates back to 2006 (according to the git history of the current linux-pam repo), while our PAM module only dates to 2010, so pam_exec was available at the time (even on the Ubuntu LTS version we would have been using). It's possible that the version of pam_exec that we had available at the time lacked the expose_authtok option, which would have made it obviously unsuitable.

(The option was added in 2009, but in early 2010 when we set up our PAM module we were using Ubuntu 8.04, which almost certainly would not have backported that into the 8.04 version.)

We next revisited our PAM module at the end of 2012, when we upgraded our password master machine to Ubuntu 12.04. 12.04 has a version of pam_exec with the expose_authtok option, so it would have been worth trying if I'd noticed it (and then I'd have found out it didn't work). Instead, I think I didn't bother looking to see if there now was a standard module that would work; I just recompiled and tested our custom module.

Will I look again at pam_exec in the future? Maybe. Writing this entry makes it more likely, but said future is four years away (when Ubuntu 16.04 stops being supported) and my memory is likely to have faded by then. And anyways, I suspect that it still won't have any way of feeding our script the user's new password. If the PAM people haven't done that by now, they probably feel they have a good reason for not having that functionality.

(For all I know, how our module operates is a hack that only works in a subset of PAM environments. My six-year-old memory is that how you write PAM modules and get at things like the user's new password is somewhat underdocumented, with the inevitable result.)
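To illustrate what I mean by getting at the user's new password, here is a rough sketch of the shape of such a module's password-changing entry point. This is a simplified approximation, not our actual module; it has minimal error handling and it assumes that an earlier module in the password stack (pam_unix, say) has already set PAM_AUTHTOK to the new password:

  #define PAM_SM_PASSWORD
  #include <security/pam_modules.h>

  /* PAM calls this for the 'password' stack, once for a preliminary
     check and once to actually update the password. */
  int pam_sm_chauthtok(pam_handle_t *pamh, int flags, int argc, const char **argv)
  {
      const char *user = NULL;
      const void *newpass = NULL;

      (void)argc; (void)argv;

      /* Only act in the real update phase, after the password change. */
      if (!(flags & PAM_UPDATE_AUTHTOK))
          return PAM_SUCCESS;

      if (pam_get_user(pamh, &user, NULL) != PAM_SUCCESS || user == NULL)
          return PAM_IGNORE;

      /* PAM_AUTHTOK is only visible to modules, not applications, and
         it is only set if an earlier module in the stack put it there. */
      if (pam_get_item(pamh, PAM_AUTHTOK, &newpass) != PAM_SUCCESS || newpass == NULL)
          return PAM_IGNORE;

      /* At this point you would run your postprocessing script and feed
         it the new password, much as pam_exec's expose_authtok would. */
      return PAM_SUCCESS;
  }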

linux/PamExecWhyNot written at 02:11:01; Add Comment

2016-11-23

Sometimes a little change winds up setting off a large cascade of things

(This is a sysadmin war story.)

We have a password master machine, which runs some version of Ubuntu LTS like almost all of our machines. More specifically, it currently runs Ubuntu 12.04 and we need to upgrade it to Ubuntu 16.04. Naturally upgrading our master machine for passwords requires testing, which is a good thing because I wound up running into a whole cascade of interesting issues in the process. So today I'm going to walk through how one innocent change led to one thing after another.

Back in the Ubuntu 12.04 days, we set our machines up so that /bin/sh was Bash. I don't think this was the Ubuntu default for 12.04, but it was the default in the Ubuntu LTS version we started with and we're busy sysadmins. In 2014, we changed our Ubuntu 14.04 machines from Bash to the default of dash as /bin/sh (after finding issues with Bash) but left the 12.04 machines alone for various reasons.

(This change took place in stages, somewhat prompted by Shellshock, and we were fixing up Bashisms in our scripts for a while. By the way, Bashisms aren't necessarily a bug.)

Our password change process works in part by using a PAM module to run a script that does important things like push the changed password to Samba on our Samba servers (possibly there is a better way to do this with PAM today, but there is a lot of history here and it works). This script was written as a '#!/bin/sh' script, but it turns out that it was actually using some Bashisms, which had gone undetected before now because this was the first time we'd tried to run it on anything more recent than our 12.04 install. Since I didn't feel like hunting down all of the issues, I took the simple approach; I changed it to start '#!/bin/bash' and resumed testing.

I was immediately greeted by a log message to the effect that bash couldn't run /root/passwd-postprocess because of 'permission denied'. It took quite a lot of iterating around before I found the real cause: our PAM module was running the script directly from the setuid passwd program, so only its effective UID was root, and it turned out that both Bash and dash (as /bin/sh) freak out over a mismatch between the real and effective UID, although in different ways. Well, okay, I could fix that by using '#!/bin/bash -p', which tells Bash not to drop the effective UID when it doesn't match the real one.

Things still failed, this time later on when our passwd-postprocess script tried to run another shell script; that second shell script needed root permissions, but because it started with only '#!/bin/sh', its shell saw the same effective UID mismatch and immediately dropped privileges, causing various failures. At this point I saw the writing on the wall and changed our PAM module to run passwd-postprocess as root via setuid() (in the process I cleaned up some other things).
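The change amounts to something like the following sketch; this is the general shape of the approach, not our module's literal code, and the error handling and the business of feeding the script the new password are deliberately glossed over:

  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Run the postprocessing script with the real UID set to root as
     well, so that shells it starts don't drop privileges over a real
     versus effective UID mismatch. Since the PAM module runs inside
     the setuid passwd program, the effective UID is already root,
     which is what lets setuid(0) succeed here. */
  static int run_postprocess(void)
  {
      pid_t child = fork();
      if (child < 0)
          return -1;
      if (child == 0) {
          if (setuid(0) != 0)
              _exit(127);
          execl("/root/passwd-postprocess", "passwd-postprocess", (char *)NULL);
          _exit(127);   /* only reached if the exec failed */
      }
      int status;
      if (waitpid(child, &status, 0) < 0)
          return -1;
      return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
  }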

So that's the story of how the little change of switching /bin/sh from Bash to dash caused a cascade of issues that wound up with me changing how our six-year-old PAM module worked. Every step of the way from the /bin/sh change to the PAM module modifications is modest and understandable in isolation, but I find the whole cascade rather remarkable and I doubt I would have predicted it in advance even if I'd had all of the pieces in my mind individually.

(This is sort of related to fragile complexity, much like performance issues.)

sysadmin/LittleChangeCascadeStory written at 23:01:59; Add Comment

We may have seen a ZFS checksum error be an early signal for later disk failure

I recently said some things about our experience with ZFS checksums on Twitter, and it turns out I have to take one bit of it back a bit. And in that lies an interesting story about what may be a coincidence and may not be.

A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from all being fine to reporting some problems to having so many issues that ZFS faulted it within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.

Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.

(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)

This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.

(We've now had another disk failure, this time an HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)

PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleaned from it.

solaris/ZFSChecksumErrorMaybeSignal written at 00:29:49; Add Comment

2016-11-21

Link: RFC 6919: Further Key Words for Use in RFCs to Indicate Requirement Levels

If you read RFCs, you may know about the standard use of the MUST, SHOULD, and so on key words that come from RFC 2119. RFC 6919, issued April 1st 2013, adds some useful additional key words like "MUST (BUT WE KNOW YOU WON'T)", "REALLY SHOULD NOT", and the like.

By itself this would be amusing and interesting. But what really makes RFC 6919 rewarding to read is that it shows usage examples for each of its new key words that are drawn from existing RFCs. If you have much exposure to how RFCs are actually implemented in the field, this will make you alternate laughter and sad sighs. To quote myself from when I first saw it:

RFC 6919 makes me laugh but it's sad laughter. Such honesty had to be published Monday.

(I was reminded of RFC 6919 by @whitequark's tweet, and was actually surprised to discover that I've never linked to it here on Wandering Thoughts. So now I'm fixing that.)

links/RFC6919MoreKeywords written at 22:11:39; Add Comment

What I'd like in Illumos/OmniOS: progressive crash dumps

One of our fileservers had a kernel panic today as we were adding some more multipathed iSCSI disks to it. This was unfortunate but not fatal; we caught the panic almost right away and fixed things relatively fast. Which is unfortunate in its own way and brings me to my wish.

You see, this was perhaps our most important and core fileserver. Everything depends on it and everything eventually goes out to lunch if and while it's down. And in our experience, in our environment, making an OmniOS crash dump takes ages and may not succeed at the end of that (we've sat through over half an hour of the process only to have it fail). There was absolutely no way we could afford to let this fileserver sit there for minutes or tens of minutes to see if maybe it could successfully write out a crash dump this time around, so we forced a power cycle on it in order to get it back into service. The result is that we got nothing out of the panic; we don't even have the stack backtrace (it doesn't seem to have gotten written anywhere durable).

So now what I wish OmniOS had is what I'll call progressive crash dumps. A progressive crash dump would proceed in layers of detail. First it would write out very basic details (like the panic stack dump or the kernel message log) in a compact form, right away; this should hopefully take almost no time. After that had been pushed to the dump device, it would write another layer with more information that takes some more time (maybe a complete collection of various core kernel data tables, like all kernel stacks and the process table). As time went on it would write out more and more data with more and more layers of detail; if you had enough time, it would end up writing out the full crash dump that you get today.
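To make the idea a little more concrete, here is a purely hypothetical sketch of what a self-describing layer header on the dump device might look like. None of these names or fields come from any real dump format, and a real design would need much more (versioning, compression, and so on); the point is only that each layer can be validated and recovered on its own even if everything after it is garbage:

  #include <stdint.h>

  /* Hypothetical per-layer header for a progressive crash dump.
     Layers are written in increasing order of cost: the panic message
     first, kernel stacks and core tables next, full memory last. */
  struct pdump_layer_hdr {
      uint64_t magic;       /* marks a valid layer header */
      uint32_t layer;       /* 0 = panic message, 1 = kernel stacks, ... */
      uint32_t flags;
      uint64_t length;      /* bytes of layer data that follow */
      uint64_t data_cksum;  /* checksum of the layer data */
      uint64_t hdr_cksum;   /* checksum of this header itself */
  };

  /* A recovery tool would scan the dump device for valid headers and
     keep every layer whose data checksums correctly, ignoring the rest. */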

(Dumpadm's -c argument doesn't have enough granularity to help, especially on fileservers where almost all the memory is already being consumed by kernel pages instead of user pages.)

Progressive crash dumps would ensure that even if you had to reboot the machine early you would get some information; the longer you could afford to wait, the more information you'd get. And if the overall dump wound up failing or hanging, at least you would recover however many layers had been written intact (and hopefully the very basic layers would be good, simply because they are basic and so should be easy and reliable to dump).

(This is a complete blue sky wish. It would likely take a completely new dump format, new kernel dump code, and significant changes to get all of the dump tools to deal with it, all of which adds up to a lot of new code in an area that has to be extremely reliable under extreme conditions and that most people don't use very much anyways. Even if we had the money to help fund this sort of thing, there would be much higher priority Illumos things we'd care about, like our 10G Ethernet issues.)

solaris/WantingProgressiveCrashDumps written at 21:42:32; Add Comment

I've wound up feeling tentatively enthusiastic about Python 3

I know myself, so I know that I'm prone to bursts of enthusiasm for things, bursts that start abruptly and then later wear off into more moderate and sensible views (or all the way down to dislike). In the past I've been quite down on Python 3, and even recently I was only kind of lukewarm on it, but for no really good reason I've lately wound up feeling pretty enthused about working in it.

Part of this is certainly due to my recent positive experience with it (and also), but I think it was building even before then. There was definitely a push from Eevee's Why should I use Python 3?, which left me feeling that there really were a number of interesting things in Python 3 that I'd kind of like to actually use; it may be the first thing that really sold me on Python 3 as having genuine attractions, instead of just being something that I'd have to put up with in the future.

I call this a tentative enthusiasm because it could burn out, not because I feel very tentative about it. Although I may be talking myself into it here, if I was starting a new Python program now I'd probably try to do it in Python 3 if that was practical (ie, if it didn't have to run on our OmniOS machines). If everywhere that DWiki ran had modern versions of Python 3, it'd be tempting to start a serious project to port it to Python 3 (going beyond my quick bring-up experiment to handle the tough issues, like bytes to Unicode conversions in the right places).

Unfortunately for my enthusiasm, I don't see much need for new Python code around here in the near future. I'm a sysadmin, not a programmer, and beyond that we mostly prefer to write shell scripts. I tend to write at most a handful of new Python programs a year. I suppose that I could take some of my personal Python sysadmin programs and convert them to Python 3 for the experience, but that feels sort of like make-work; there's no clear advantage to a straight conversion.

(The reason to convert DWiki itself to Python 3 is partly for longevity, since I already know I have to do it sometime, partly because I'd gain a lot of practical experience, and to be honest partly because it seems like an interesting challenge. Converting little utility programs is, well, a lot less compelling.)

PS: Part of this new enthusiasm is likely due to my slow shift into an attitude of 'let's not fight city hall, it takes too much work', as seen in my shift on Python indentation. Python 3 is the future of Python, so I might as well embrace it instead of bitterly clinging to Python 2 because I'm annoyed at the shift.

(Partly I'm writing this entry as a marker, so that I can later look back to see how I felt about things right now and maybe learn something from that.)

python/Python3NewEnthusiasm written at 01:01:21; Add Comment

2016-11-20

Is it time to not have platform-dependent integer types in languages?

I recently linked to an article on (security) vulnerabilities that got introduced into C code just by moving it from 32-bit to 64-bit platforms. There are a number of factors that contributed to the security issues that the article covers very well, but one of them certainly is that the size of several C types varies from one platform to another. Code that was perfectly okay on a platform with 32-bit int could blow up on a platform with a 64-bit int. As an aside at the end of my entry, I wondered aloud if it was time to stop having platform-dependent integer types in future languages, which sparked a discussion in the comments.
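To make that concrete, here is a deliberately simplified illustration of the kind of code the article is talking about; it isn't taken from any of the affected projects, but it shows the basic pattern:

  #include <stdlib.h>
  #include <string.h>

  /* On a 32-bit platform, int and size_t are both 32 bits and strings
     anywhere near 2GB are impossible anyway, so this looks fine. On a
     64-bit platform, size_t is 64 bits: an attacker-supplied string a
     bit over 4GB long wraps 'len' around to a small number, the
     allocation ends up far smaller than the data, and the strcpy()
     then writes far past the end of it. */
  char *copy_string(const char *input)
  {
      int len = strlen(input);      /* silent size_t -> int conversion */
      char *out = malloc(len + 1);
      if (out == NULL)
          return NULL;
      strcpy(out, input);
      return out;
  }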

So, let me talk about what I mean here, using Go as an example. Go has defined a set of integer types of specific sizes; they have int8, int16, int32, and int64 (and unsigned variants), all of which mean what you think they mean. Go doesn't explicitly specify a number of platform dependent issues around overflow and over-shifting variables and so on, but at least if you use a uint16 you know that you're getting exactly 16 bits of range, no more and no less, and this is the same on every platform that Go supports.

A future hypothetical language without platform-dependent integer types would have only types of this nature, where the bit size was specified from the start and was the same on all supported platforms. This doesn't mean that the language can't add more types over time; for example, we might someday want to add an int128 type to the current set. Such a language would not have a generic int type; if it had something called int, it would be specified from the start as, for example, a 32-bit integer that was functionally equivalent to int32 (although not necessarily type-equivalent).

(As such a language evolves it might also want to deprecate some types because they're increasingly hard to support on current platforms. Even before such types are formally deprecated, they're likely to be informally avoided because of their bad performance implications; important code will be rewritten or translated to avoid them and so on. However, this may not be a good answer in practice, and certainly even source-level rewrites can open up security issues.)

The counterpoint is that this is going too far. There are a lot of practical uses for just 'a fast integer type', however large that happens to be on any particular platform, and on top of that most new languages should be memory-safe with things like bounds-checked arrays and automatic memory handling. Explicit integer sizes don't save you from assumptions like 'no one can ever allocate more than 4 GB of memory', either.

(You might also make the case that the enabling C thing to get rid of is the complex tangle of implicit integer type conversion rules. Forcing explicit conversions all of the time helps make people more aware of the issues and also pushes people towards harmonizing types so they don't have to keep writing the code for those explicit conversions.)
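A tiny made-up example of those implicit conversion rules at work (not one from the article): because strlen() returns an unsigned size_t, the comparison below quietly converts -1 into a huge unsigned value first, so the branch most people would expect is never taken. Compilers can warn about this if you ask, but the language itself is happy to do it silently.

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* The usual arithmetic conversions turn -1 into SIZE_MAX here,
         so this comparison is false on every common platform. */
      if (-1 < strlen("hello"))
          printf("what most people expect\n");
      else
          printf("surprise: -1 is not less than 5 here\n");
      return 0;
  }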

programming/VariableSizeIntegersMaybeNot written at 01:25:03; Add Comment

2016-11-19

Why I don't think subscription-based charging gets you stability

In a comment on my entry on the problem of stability and maintenance (which talks about how this is a problem even in commercial software), Christopher Barts wrote in part:

It's an argument for subscription-based software, if anything: If the only guaranteed revenue stream for a piece of software is selling new versions, the software will never be done. If people pay simply to use the software, the software can be done when it's done, and the company isn't losing anything.

My impression is that this is a relatively common viewpoint and I certainly see the attraction of it, but as I sort of alluded to in my entry I don't think it's actually going to work in many situations and with many companies.

The core problem is that most companies are looking for growth, not just (revenue) stability or near stability. Growth almost always requires bringing in new customers; heck, even revenue stability requires some amount of that, since you always lose some amount of old customers over time. If you want to grow significantly, you obviously need a fairly decent number of new customers.

It's almost inevitable that your software offering will stop being quite state of the art almost as soon as you release it. The result is that the core attractiveness of an unchanging piece of software generally decreases over time; if you haven't changed your software for a few years, it's almost certainly not as attractive as other people's current offerings. To some degree you can make this up with good support and promises of long term stability and so on, but I don't think this lasts forever. So if you don't change your offering, it becomes less and less attractive to new customers as time goes by. In short, when you stand still while other people are moving forward, you fall behind.

(To a certain extent your offering also becomes less attractive even to your existing customers; some of them will want the new features they can get elsewhere, even at the cost of some change and so on, and will leak away over time if you do nothing.)

You can see where this is going. If you want growth, you need new customers. If you want new customers, you must change your software to keep it up to date with the alternatives so that new customers will choose you instead of those alternatives. And that goes straight against the desires of your current subscribers who just want stability and no changes apart from bugfixes. Most companies want growth (or in some cases absolutely need it, like startups), so they're going to opt for making enough changes to hopefully attract new customers without irritating too many of their existing subscribers.

(In theory companies could maintain two versions, one stable one that only gets bugfixes for existing subscribers, and one that gets new features to attract new customers. In practice you soon wind up maintaining N versions, one for each generation of customers who you promised that this version would be stable. This can be done, but it generally increases the costs more and more so those subscriptions aren't going to be cheap.)

A subscription based approach does have the great advantage that you aren't compelled to put out new releases (necessarily with new changes) every so often just to get any revenue; you get an ongoing revenue stream just from people continuing to like your software. With subscriptions, new 'releases' are no longer an existential necessity, but they're still important for most companies and for most software.

(There probably is some commercially appealing software that can basically be 'finished' and need no further significant development to remain attractive.)

tech/GrowthVersusSubscriptions written at 00:09:13; Add Comment

2016-11-18

Unix shells and the problem of too-smart autocompletion

Unix shells have been doing basic file-based autocompletion for decades, but of course basic filename and command completion isn't enough. Modern Unix shells increasingly come with support for intelligent autocompletion that can be programmed to be command- and context-sensitive. A modern shell will not just complete filenames for less; it will (perhaps) complete hostnames for ssh commands, automatically know what arguments some command options take, and so on. The degree to which it can do this is limited only by the ingenuity and diligence of the crowd of people who write all the autocompletion rules.

All of this is great when it works, but the problem with smart autocompletion is that it can be too smart, so smart that it wraps back around to 'non-functional'. When shells allowed command-specific smart autocompletion, they necessarily allowed people to break autocompletion for some or all of a command. This is unfortunately very frustrating to experience; from the user's perspective, your conveniently smart shell suddenly reverts to not even having filename completion for no readily apparent reason. When you're used to having it, losing even filename completion is really irritating.

(If you are not really aware of the fine details of how command line completion works, this may be an extremely hard to understand experience. From your perspective, your shell works some of the time and fails some of the time and it may seem random as to just why.)

You might think that such bugs would be promptly stamped out. Well, it apparently depends on just what the problem is. If there's an outright bug in the autocompletion rules (or code) for a command, something that spits out errors and large-scale failures, it will indeed probably get fixed fairly rapidly. But the more subtle failure is for the autocompletion rules to simply not handle some of the command's command-line options; if such an option is used, the rules just bail out and its argument gets no autocompletion. If the option is relatively infrequently used, this issue can remain unfixed for years.

In related news, people using Bash and the reasonably popular bash-completion package of Bash command-specific completions cannot normally complete the very long filename for the PKCS#11 shared library that you need to give to 'ssh-add -e' and 'ssh-add -s'. As far as I can tell, this issue has been present since 2012 when the ssh-add completion was added. Probably you're going to write a script anyways, just because who wants to even autocomplete the full path to shared libraries on any sort of routine basis.

(I believe the issue is that the completion function simply bails out when it's completing something following a -t, -s, or -e argument.)

PS: It turns out that in Bash you can force a filename-based completion to be done with M-/. I'm not sure I'll remember this the next time I need it (since TAB almost always works), but the Bash people did think ahead here.

PPS: It does kind of impress me that you can write sophisticated autocompletion functions in Bash code and have them run fast enough to make the whole thing work well. Computers really have gotten powerful over the years, which is honestly great to see.

unix/TooSmartShellAutocompleteFailure written at 01:49:02; Add Comment

2016-11-17

Link: Twice the bits, twice the trouble: vulnerabilities induced by migrating to 64-bit platforms

Adrian Colyer's Twice the bits, twice the trouble: vulnerabilities induced by migrating to 64-bit platforms (via) is a very readable and very interesting summary of an ultimately depressing academic paper on the security vulnerabilities that get induced in C code simply by migrating from 32-bit platforms to 64-bit platforms.

In theory I sort of knew about all of this, but it's one thing to vaguely have heard about it and another thing to see handy comparison charts and examples and so on of how relatively innocent C code introduces real vulnerabilities simply when you rebuild it on 64-bit machines and then expose it to attackers.

Here's a depressing quote from the academic paper to finish up with and motivate reading at least Colyer's summary of the whole thing:

Finally, we make use of this systematization and the experience thus gained to uncover 6 previously unknown vulnerabilities in popular software projects, such as the Linux kernel, Chromium, the Boost C++ Libraries and the compression libraries libarchive and zlib -- all of which have emerged from the migration from 32-bit to 64-bit platforms.

That's, well, unfortunate. But not unexpected, I suppose. Maybe all future languages should not have any limited-range numeric types that can have different sizes on different platforms, even if it's theoretically an attractive idea for 'optimization'.

(I don't know what Rust does here, but Go does have int and uint, which are either 32 or 64 bits depending on the platform.)

links/C64BitMigrationVulnerabilities written at 14:01:52; Add Comment

