Documentation should explain why things are security issues, at least briefly
In my discussion of Apache suexec I
mentioned the apache2-suexec-custom Debian package, which allows
you to change suexec's idea of its docroot and thus use suexec
to run virtual host CGIs that aren't located under /var/www. If
you're using suexec-custom, one of the obvious questions is what
it's safe to set the suexec docroot to. If you read the manpage,
you will hit this paragraph:
Do not set the [suexec] document root to a path that includes users' home directories (like /home or /var) or directories where users can mount removable media. Doing so would create local security issues. Suexec does not allow to set the document root to the root directory /.
This is all that the manpage has to say about this. In fact, this is all of the documentation you get about the security issues involved, period.
Perhaps the people who wrote this documentation felt that the
security issues created here are obvious to everyone. If so, they
were wrong. I at least have no idea what specifically makes including
user home directories dangerous. It seems unlikely to be that users
can create new executables, because if you're doing virtual hosting
and using suexec, you're presumably already giving all of those
different virtual hosting UIDs write access to their subdirectories of
/var/www so they can set up their own CGIs. After all, suexec
explicitly requires all of those CGIs and their containing
directories to be owned by the target user, not you. And after that,
what is there that applies to user home directories but not to /var/www?
(It can't be that suexec will run arbitrary programs under user
home directories, because suexec has to be run through Apache and
you should not be telling Apache 'treat anything at all under this
entire general directory hierarchy as a CGI through these URLs'. If
you tell Apache that your CGI-BIN directory is somewhere under people's
home directories or the like, you have already made a horrible mistake.)
This is a specific example of what is a general failing, namely not explaining why things are security issues. When you don't explain why things are a security problem, you leave people uncertain about what's safe and what isn't. Here, I've been left with no idea about what the important security properties of suexec's docroot actually are. The authors of the manpage have in mind some dangers, but I don't know what they are and as a result I don't know how to avoid them. It's quite possible that this will result in me accidentally configuring Apache and suexec in a subtly insecure way.
The explanation of why things are a security issue doesn't have to be deep and detailed; I don't demand, say, an example of how to exploit an issue. But it should be detailed enough that an outsider can see clearly what they need to avoid and broadly why. If you say 'avoid this general sort of setup', you need to explain what makes that setup dangerous so that people can avoid accidentally introducing a dangerous bit in another setup. Vagueness here doesn't help anyone.
(As a corollary, if you say that a general sort of setup is safe, you should probably explain why that's so. Otherwise you risk people making some small, harmless looking variant of the setup that is in fact not safe because it violates one of the assumptions.)
By the way, all of this applies to local system setup documentation too. If you know why something has to be done or not done in a particular way to preserve security, write it down in specific (even if it seems obvious to you now). Future readers of your documentation will thank you for being clear, and as usual this may well include your future self.
PS: It's possible that you don't know of any specific issues in your program but feel that it's probably not safe to use outside of certain narrow circumstances that you've considered in detail. If so, the documentation should just say this outright. Sysadmins and other people who care about the security properties of your program will appreciate the honesty.
A thought on the apparent popularity of new static typing languages
Like [Elben Shira], I've noticed that despite the fact that there have been an enormous number of new programming languages coming out recently, the overwhelming majority of them are statically typed. [...]
I'm an outsider bystander on all of this, but it strikes me that one possible contributing reason for lots of people creating new statically typed languages is that there is a significant body of academic research that has not yet made it into a popular, mainline programming language. Here I'm thinking primarily of sophisticated type systems and type inference. As long as this situation exists, people tempted to create languages have a clear void that their new modern statically typed language might possibly fill (at least in theory).
(And then they can mingle in some degree of immutability and this and that from other academic research. There's a lot of academic work on statically typed languages that hasn't gotten into popular languages yet. There's also a bunch of people who are grumpy about this lack of popularity, which is another crucial ingredient for creating new languages; see, for example, all of the people who are unhappy at Go for having such a simple and 'primitive' type system in the face of much more powerful ones being out there.)
(Arguably you're not very likely to get significant traction for an advanced statically typed language when so many other ones before yours have not been hits, but that's somewhat different in that hope springs eternal. It's the same impulse that keeps people writing new Lisp-like languages that they want to be popular.)
PS: I could be totally wrong on this in that maybe there's a pile of good academic research on dynamic languages that's begging to be implemented and made popular. I'd actually like that; it'd mean we have the prospect of significantly better and nicer dynamic languages.
Some notes on Apache's suexec
We've recently been wrestling with suexec in an attempt to get it to do something that it seemed that suexec would do. As a result of that learning experience, I feel like writing down some things about suexec. You may wish to also read the official Apache documentation on suexec, but note that you may have to pay close attention to some of the things that it says (and a few things appear to be outright wrong).
Suexec has two modes:
- Running user /~<user>/... CGIs as the particular user involved.
This needs no special extra configuration for suexec and simply just
happens. Per-user CGIs must be located under a specific subdirectory
of the user's Unix home directory, by default public_html; suexec
documentation calls this subdirectory name the userdir.
- Running CGIs for a virtual host as a particular user and group.
This must be configured with the SuexecUserGroup directive. All
virtual host CGIs must be located under a specific top level
directory, by default often /var/www; suexec documentation calls
this directory the docroot.
(Suexec also does various ownership and permissions checks on the CGIs and the directory they are directly in. Those are beyond the scope of these notes.)
The first important thing here is that the suexec docroot and
userdir are not taken from your Apache configuration settings;
instead, they're hard coded into suexec itself. Any time that
suexec logs errors like 'command not in docroot', the docroot
it means is not the Apache DocumentRoot you've configured. It
pretty much follows that if your Apache settings do not match the
hardcoded suexec settings, suexec will thumb its nose at you.
(Also, the only form of UserDir directive that will ever work
with suexec is 'UserDir somename'. You cannot use either
'UserDir /some/dir' or 'UserDir /some/*/subdir' with suexec. The
suexec documentation notes this.)
The second important thing is that Apache and suexec explicitly
distinguish between the two modes based on the incoming request
itself, not the final paths involved, and these two modes are
exclusive. If you make a request for a CGI via a /~user/... URL,
the only thing that matters is if the eventual path is under the
user's home directory plus the suexec userdir. If you make a
request to a virtual host with a
SuexecUserGroup directive, the
only thing that matters is if the eventual path is under the suexec
docroot. In particular, you cannot configure a virtual host for
a user, point its
DocumentRoot to that user's userdir, and have
suexec run CGIs. This path would be perfectly acceptable if the
CGIs were invoked via /~user/... URLs, but when invoked for a plain
virtual host, suexec will reject these requests because the paths
aren't under its docroot.
(Mechanically, Apache prefixes the user name it passes to the suexec
binary with a
~ if it is a UserDir request. This is undocumented
behavior reverse engineered from the code, so you shouldn't count
on it.)
The third important thing is that suexec ignores symlinks in all
of this checking; it uses only the 'real' physical paths, after
symlinks have been traversed. As a result you cannot fool suexec
by, for example, putting symlinks to elsewhere under what it considers
its docroot. However it is fine for user home directories themselves
to include symlinks (as we do); suexec will not be upset by that.
Normally the suexec docroot and userdir are set when suexec
is compiled and are fixed afterwards, which obviously creates some
problems if you need something different. Debian and Ubuntu provide
a second version of suexec that can look these up at runtime from
a configuration file (this is the apache2-suexec-custom package).
Failing this, well, you'll be arranging (somehow) for all of your
virtual hosts to appear under
/var/www (or at least all of the
ones that need CGIs).
(You can determine the userdir and docroot settings for your
suexec with 'suexec -V' as root; look for the AP_DOC_ROOT and
AP_USERDIR_SUFFIX values.)
Sidebar: what 'command not in docroot' really means
The suexec error 'command not in docroot' is actually generic and is used for both modes of requests. So what suexec means by 'docroot' here is either the actual docroot, for a virtual host request, or the user's home directory plus the userdir subdirectory, for a /~user/... request. Unfortunately you cannot tell from suexec's log messages whether it was invoked for what it thought was a user home directory request or for a virtual host request; that has to be obtained from the Apache logs.
The check is done by a simple brute force method: first, chdir()
to the CGI's directory and do a getcwd(); then chdir() to either
the docroot or the user's home directory plus the userdir and do
another getcwd(). Compare the two directory paths and fail if
the first is not underneath the second. Because it uses getcwd(),
all symlinks involved in either path will wind up getting fully
resolved.
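To make this concrete, here is a sketch of the same check in Python (suexec itself is C, and the function names here are mine, not suexec's). Resolving both directories to their physical paths has the same symlink-flattening effect as suexec's chdir() plus getcwd() dance:

```python
import os
import os.path

def physical_path(directory):
    # Equivalent in effect to chdir() + getcwd(): fully resolves
    # all symlinks in the path.
    return os.path.realpath(directory)

def in_docroot(cgi_dir, docroot):
    # The brute force comparison: the CGI's (resolved) directory must
    # be the (resolved) docroot itself or live somewhere underneath it.
    cgi_real = physical_path(cgi_dir)
    root_real = physical_path(docroot)
    return cgi_real == root_real or cgi_real.startswith(root_real + os.sep)
```

For a /~user/... request, you would call this with the user's home directory plus the userdir in place of the docroot; the comparison itself is identical, which is why the error message is shared between the two modes.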
Why speeding up (C)Python matters
The simple answer to why speeding up Python matters is that speed creates freedom.
Right now there are unquestionably a lot of things that you can't do in (C)Python because they're too slow. Some of these are large macro things, like certain sorts of programs; others are more micro things, like using complicated data structures or doing your own versions of things that are currently implemented in C. Speeding up Python creates the straightforward freedom to do more and more of these things.
In turn, this creates another freedom, the freedom to write code without having to ask yourself 'is this going to be too slow because of Python?'. Again, this is both at the macro level of what problems you tackle in Python and the micro level of how you attack problems. With speed comes the freedom to write code in whatever way is natural, to use whatever data structure is right for the problem, and so on. You don't have to contort your code to optimize for what your Python environment makes fast (in CPython, C level modules; in PyPy, whatever the JIT can do a good job recognizing and translating).
There is a whole universe of Python code that reasonably experienced CPython programmers know you simply don't write because it would be too slow, create too much garbage, and so on. Speed creates the freedom to use more and more of that universe too, in addition to the Python universe that we already program in. This freedom matters even if none of the programs we write are ever CPU-bound to any particular degree.
(Admittedly there's a peculiar kind of freedom that comes from having a slow language, but that's another entry.)
(This is not particularly novel and I've orbited the core of this thought in earlier entries. I just feel like writing it down explicitly today after writing about why I care about how fast CPython is, partly because a rejoinder to that entry is 'why do we care if CPython isn't really fast?'.)
PC laptop and desktop vendors are now clearly hostile parties
You may have heard of Lenovo's SuperFish incident, where Lenovo destroyed HTTPS security on a number of their laptops by pre-installing root certificates with known private keys. Well, now Dell's done it too, and not just on consumer laptops, and it turns out not just one bad certificate but several. One could rant about Dell here, but there's a broader issue that's now clear:
PC vendors have become hostile parties that you cannot trust.
Dell has a real brand. It sells to businesses, not just consumers. Yet Dell was either perfectly willing to destroy the security of business oriented desktops or sufficiently incompetent to not understand what they were doing, even after SuperFish. And this was not just a little compromise, where a certificate was accidentally included in the trust store, because a Dell program that runs on startup puts the certificate back in even when it's removed. This was deliberate. Dell decided that they were going to shove this certificate down the throat of everyone using their machines. The exact reasons are not relevant to people who have now had their security compromised.
If Dell can do this, anyone can, and they probably will if they haven't already done so. The direct consequence is that all preinstalled vendor Windows setups are now not trustworthy; they must be presumed to come from a hostile party, one that has actively compromised your security. If you can legally reinstall from known good Microsoft install media, you should do that. If you can't, well, you're screwed. And by that I mean that we're all screwed, because without trust in our hardware vendors we have nothing.
Given that Dell was willing to do this to business desktops, I expect that sooner or later someone will find similar vendor malware on preinstalled Windows images on server hardware (if they haven't already). Of course, IPMIs on server hardware are already an area of serious concern (and often security issues all on their own), even before vendors decide to start equipping them with features to 'manage' the host OS for you in the same way that the Dell startup program puts Dell's terrible certificate back even if you remove it.
(Don't assume that you're immune on servers just because you're running Linux instead of Windows. I look forward to the grim meathook future (tm jwz) where server vendors decide to auto-insert their binary kernel modules on boot to be helpful.)
Perhaps my gloomy cloud world future without generic stock servers is not so gloomy after all; if we can't trust generic stock servers anyways, their loss is clearly less significant. Smaller OEMs are probably much less likely to do things like this (for multiple reasons).
Why I care about how fast CPython is
Python is not really a fast language. People who use Python (me included) more or less accept this (or else it would be foolish to write things in the language), but still we'd often like our code to be faster. For a while the usual answer to this has been that you should look into PyPy in order to speed up Python. This is okay as far as it goes, and PyPy certainly can speed up some things, but there are reasons to still care about CPython's speed and wish for it to be faster.
The simple way of putting the big reason is to say that CPython is
the universal solvent of Python. To take one example, Apache's
mod_wsgi uses CPython; if you want to use it to deploy a WSGI
application in a relatively simple, hands-off way, you're stuck
with however fast CPython is. Another way that CPython is a universal
solvent is that CPython is more or less everywhere; most Unix systems
have a /usr/bin/python, for example, and it's going to be some
version of CPython. Finally, CPython is what most people develop
with, test against, write deployment documentation for, and so on;
this is both an issue of whether a package will work at all and an
issue of whether it's doing things that defeat much of PyPy's speedups.
Thus, speeding up CPython speeds up 'all' Python in a way that improvements to PyPy seem unlikely to. Maybe in the future PyPy will be so pervasive (and so much a drop in replacement for CPython) that this changes, but that doesn't seem likely to happen any time soon (especially since PyPy doesn't yet support Python 3 and that's where the Python world is moving).
(Some people will say that speeding up things like Django web apps is unimportant, because web apps mostly don't use CPU anyways but instead wait for IO and so on. I disagree with this view on Python performance in general, but specifically for Django web apps it can be useful if your app uses less CPU in order to free up more of it for other things, and that's what 'going fast' translates to.)
I should find some good small packages for templates and forms for CGIs
I recently wrote about how we shouldn't be making up new (web) templating packages but instead should just use existing ones. This leads me to admit that I have a little problem here with Python, in that I don't currently have one of these that I'd use. Well, sort of.
If I was to write another relatively substantial web application, I'd unquestionably use Django (and thus I'd use Django's templating system and form handling). It's what I know, it works, and we already have one Django web app so adding a second is relatively low overhead. But the problem with Django (and other big systems) is that their deployment story is kind of heavyweight and a pain in the ass. We have a number of small, simple things that are easiest to handle as just CGIs, and we'll probably get more.
(This may be over-estimating the hassles of deploying a second mod_wsgi based thing when we already have one, but wrangling Django kind of makes me grumpy in the first place.)
This means that I should really find an existing Python templating system that is right-sized for use in CGIs, meaning that it is not too big itself, does not have big dependencies, starts up relatively fast, and ideally can take its templates from embedded strings in the program instead of loading them from the filesystem. I haven't previously looked in this area (partly because Django met what I thought of as all of my needs here), so I'm not familiar with what's available.
(For simple CGI-like things, embedding the templates in the CGI's Python code makes for easier deployments, by which I mean that the whole CGI will be just one file and I'll copy it around.)
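As an illustration of what I mean by embedded-string templates, even the standard library's string.Template is enough to sketch the idea for a one-file CGI; a real templating package would add HTML escaping, conditionals, loops, and so on. The page content here is made up for illustration:

```python
#!/usr/bin/python
# Minimal sketch of a one-file CGI with its template embedded as a
# string, using only the standard library.  A real templating package
# would add HTML escaping, conditionals, and loops on top of this.
from string import Template

PAGE = Template("""Content-Type: text/html

<html><head><title>$title</title></head>
<body><h1>$title</h1><p>$body</p></body></html>
""")

def render(title, body):
    return PAGE.substitute(title=title, body=body)

if __name__ == "__main__":
    print(render("Status", "Everything is fine."), end="")
```

The point is not that string.Template is the right choice (it has no escaping, which matters for anything touching user input), but that 'the whole CGI is one file you copy around' is a perfectly reachable deployment story.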
In the same spirit of 'don't roll your own half-solution when other people have done better ones', I should also get an existing package to handle forms for my CGIs. This is likely to interact with the choice of templating system, since you often want to use your template system to automatically render forms in clever ways instead of basically embedding fixed HTML for them.
Probably I want to start my investigation with, say, the Python wiki's entry on web frameworks. I'm going to not mention any project names here for now, since what comes to mind is basically influenced by the project's PR (or lack of it), not any actual knowledge on my part.
PS: This walks back from my views on templates versus simple HTML. I still believe them, but I also believe that it's probably not worth fighting city hall in most cases. A full template system may be vast overkill for many things, but there's a great virtue in standardizing on a single solution that can be learned once and is then familiar to people.
(Templates also have the virtue that the HTML and the fixed text content are both obvious, so it's easy for people to find them and then make simple brute force modifications because, for example, we've decided to change some names we use to refer to our systems in order to confuse users less. You don't even really need to know the templating system for this; you just have to be able to find the text.)
What I think I want out of autocompletion in GNU Emacs (for Go coding)
I mentioned a while back that I had set up autocompletion in GNU Emacs for Go, using gocode and the auto-complete Emacs package. I also mentioned that I wasn't sure if I liked autocompletion and was going to stick with it. Well, the verdict is in for now; I found it too annoying and I wound up turning it off. However, I still kind of miss it. Thinking about what I miss and what made me hate it enough to turn it off has led me to what I think I want out of autocompletion.
Why I turned autocompletion off is that it kept stealing my keystrokes (in order to do the wrong autocompletion); cursor keys, the return key, and I think even sometimes the space bar. I type fast and I type ahead, so I absolutely, utterly hate having the sequence of what I'm typing be derailed because autocompletion decided to grab a cursor motion or a return or whatever. Unless I go out of my way, I want what I type at the keyboard to actually be what shows up in the file I'm editing. At the same time, the prompting and information that autocompletion gave me was genuinely useful; it was a great way to not have to remember the full names of things in Go packages and so on.
Given that I liked the information display, I don't want all of (auto)completion to be deferred until I use a special key sequence like C-M-i. If I spent a lot of time in GNU Emacs I might be able to train myself to hit that by reflex, but with my more casual use it'd just ensure that I mostly never used completion at all. But I don't want any actual completing of things to happen until I hit a key to start it (and once I hit the key, it's fine if autocompletion steals my cursor keys and return key and so on).
So in short what I want from autocompletion is immediate information on possible completions coupled with deferred actual completion until I take some active step to start the completion process. This is fairly similar to the completion model I'm happy with in Unix shells, where nothing starts getting filled in until you hit TAB.
(Deferring only actual completion doesn't appear to be possible in auto-complete. I can't entirely blame the package, because what I'm calling an information display is what it thinks of as a completion menu and completion prompt.)
Part of my irritation with autocompletion is specific to the Go
autocompletion mode provided by gocode. For instance, in Go I
don't want to have completion happen when I'm typing in language
keywords like func; I find it both distracting and
not useful. Completion is for things that I might have to look up;
if I'm typing a keyword, that is not the case.
(This completion of keywords is especially irritating because it's
blind to context. If I start typing 'pa' on a new line in a function
body, I'll still get offered '
package' as a possible completion
despite that clearly not being correct or even valid. Gocode is context
aware in general, in that it does things like offer local variables
as possible completions.)
PS: That some of my issues are with gocode itself suggests that even switching to vim wouldn't entirely help.
What modern version control systems are
If you read about new version control systems these days, it's very common to see them put forward as essentially the expression or manifestation of mathematics. Maybe it's graph theory, maybe it's patch theory, but the basic idea is that you build up some formal model of patching or version control and then build a VCS system that implements it. This is not restricted to recent VCSes, either; version control as a whole has long had a focus on formally correct operations (and on avoiding operations that were not formally correct).
It is my new belief that this is a terrible misunderstanding of the true role of a VCS, or at least a usable VCS that is intended for general use. Put simply, in practice a VCS is the user interface to the formal mathematics of version control, not the actual embodiment of those mathematics. The job of a good VCS is to sit between the fallible, normal user (who does not operate in the domain of formal math) and the underlying formal math, working away to convert what the user does to the math and what the math says to what the user can understand and use.
As a user interface, a VCS must live in the squishy world of human factors, not the pure world of mathematics. That's its job; it's there to make the mathematics widely usable. This is going to frequently mean 'compromising' that mathematical purity, by which we really mean 'translating what the user wants to do into good mathematics'. I put 'compromise' in quotes here because this is only a compromise if you really think that the user should always directly express correct mathematics.
(We know for sure that users will not always do so, so the only way to pretend otherwise is to spit out error messages any time what the user attempts to do is incorrect mathematics (including the error message of 'no such operation').)
Does this mean that the mathematics is unimportant? Not at all, any more than your skeleton is unimportant in determining your shape. The underlying mathematics can and should shape the user experience that the VCS puts forward (and so different formal models of version control will produce VCSes with different feels). After all, one job of a UI is to steer users into doing the right thing by making it the easy default, and the 'right thing' here is partly determined by the specific math.
PS: The exception to this view of VCSes is a VCS written as an academic exercise to prove that a particular set of version control mathematics can actually be implemented and work. This software is no more intended (or suitable) for general use than any other software from academic research.
VCS bisection steps should always be reversible
So this happened:
@thatcks: I think I just ruined my bisect run with one errant 'hg bisect --bad', because I can't see a way to recover from it in the Mercurial docs.
This is my extremely angry face. Why the hell won't Mercurial give me a list of the bisect operations I did? Then I could fix things.
Instead I appear to have just lost hours of grinding recompilation to a UI mistake. And Mercurial is supposed to be the friendly VCS.
VCS bisection is in general a great thing, but it's also a quite mechanical, repetitive process. Any time you have a repetitive process that's done by people, you introduce the very real possibility of error; when you do the same thing five times in a row, it's very easy to accidentally do it the sixth time. Or to just know that you want the same command as the time before and simply recall it out of your shell's command history except that nope, your reflexes were a bit fast off the mark there.
(It's great when bisection can be fully automated but there are plenty of times when it can't because one or more of the steps requires human intervention to run a test, decide if the result is correct, or the like. Then you have a human performing a series of steps over and over again but they're supposed to do different things at the end step. We should all know how that one goes by now.)
So inevitably, sooner or later people are going to make a mistake during the bisection process. They're going to reflexively mark the point under testing as good when it's actually bad, or mark it as bad when they just intended to skip it, or all of the other variants. It follows directly that a good bisection system that's designed for real people should provide ways to recover from this, to say 'whoops, no, I was wrong, undo that and go back a step' (ideally many steps, all the way back to the start). Bisection systems should also provide a log, so that you can see both what you did and the specific versions you marked in various ways. And they should document this clearly, of course, because stressed out people who have just flubbed a multi-hour bisection are not very good at carefully reading through three or four different sections of your manual and reasoning out what bits they need to combine, if it's even possible.
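To illustrate how little machinery this takes, here is a sketch of a bisection state that keeps an append-only log of markings and can undo any of them. This is purely illustrative (it is not how Mercurial, git, or any real VCS stores bisect state), but it shows that both the log and the undo fall out almost for free once you record what the user did:

```python
# Sketch of a reversible bisection: every mark is appended to a log,
# and undo() simply drops the log's last entry.  The current bounds
# are always recomputed by replaying the log, so any prefix of the
# history is a valid state to return to.

def bounds(log, low, high):
    # Replay the log to compute the current good/bad bounds.
    for rev, verdict in log:
        if verdict == "good" and rev > low:
            low = rev
        elif verdict == "bad" and rev < high:
            high = rev
    return low, high

class Bisection:
    def __init__(self, first_good, first_bad):
        self.start = (first_good, first_bad)
        self.log = []          # append-only history of (rev, verdict)

    def current(self):
        # The next revision to test, from the current bounds.
        low, high = bounds(self.log, *self.start)
        return (low + high) // 2

    def mark(self, verdict):
        self.log.append((self.current(), verdict))

    def undo(self):
        # Recovering from an errant mark is just dropping the last entry.
        if self.log:
            self.log.pop()
```

The log doubles as exactly the record I wanted when things went wrong: which revisions I marked, in what order, and with what verdict.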
Of course, this sort of thing is not strictly speaking necessary. Bisection works just fine without it, provided that people don't make mistakes, and if people make mistakes they can just redo their bisection run again from the start. A bisection system with no log and no undo has a pleasantly mathematical sort of minimalism. It's just not humane, as in 'something that is intended to be used by actual humans and thus to cope with their foibles and mistakes'.
Overall, I suppose I shouldn't be surprised. Most version control systems are heavily into mathematical perfection and 'people should just do it right' in general.
(This is a terrible misunderstanding but that's another entry.)
Increasingly, I no longer solidly and fully know Python
There once was a time when I felt that I more or less solidly knew and understood Python. Oh, certainly there were dark corners that I wasn't aware of or that I knew little or nothing about, but as far as the broad language went I felt that I could say that I knew it. Whether or not this was a correct belief, it fed my confidence in both writing and reading Python code. Of course the wheels sort of started to come off this relatively early, but still I had that feeling.
For various reasons, it's clear to me that those days are more and more over. Python is an increasingly complicated language (never mind the standard library) and I have not been keeping up with its growth. With Python 3 I'm basically hopeless; I haven't been attempting to follow it at all, which means that there's whole large areas of new stuff that I have no idea about. But even in Python 2 I've fallen out of touch with new areas of the language and core areas of the standard library. Even when I know they're there, I don't know the detailed insides of how they work in the way that I have a relatively decent knowledge of, say, the iterator protocol.
(One example here is context managers and the "with" statement.)
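For the record (and as a note to my future self), the protocol behind "with" is small: a class needs only __enter__ and __exit__ methods. A minimal illustration:

```python
# A minimal context manager, showing the protocol the "with" statement
# drives: __enter__ runs on entry (its return value is what "as" binds),
# and __exit__ always runs on the way out, even when an exception is
# raised in the body.

class Tracked:
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False   # False means: do not suppress any exception

with Tracked() as t:
    t.events.append("body")

# t.events is now ["enter", "body", "exit"]
```

Knowing this much is a long way from knowing the detailed insides, of course, which is rather my point.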
I'm pretty sure that this has been part of my cooling and increased ambivalence with Python. Python has advanced and I haven't kept up with it; there's a distance there now that I didn't feel before. Python 3 is an especially large case because it feels that lots has changed and learning all about it will be a large amount of work. Part of me wonders if maybe Python (at least Python 3) is now simply too large for me to really know in the way that I used to.
You might ask if actually knowing a language this way is even important. My answer is that it is for me, because knowing a language is part of how I convince myself that I'm not just writing Fortran in it (to borrow the aphorism that you can write Fortran in any language). The less I know a language, the less I'm probably writing reasonably idiomatic code in it and the more I'm writing bad code from some other language. This especially matters for me and Python 3, because if I'm going to write Python 3 code, I want to really write in Python 3; otherwise, what's the point?
(I don't have any answers here, in part because of a circular dependency issue between my enthusiasm for writing stuff in Python and my enthusiasm for coming back up to full speed on Python (2 or 3).)