2012-11-10
Why Unix doesn't have user-changeable namespaces
Today I was reading Plan 9 mounts and dependency injection, and in a footnote ran across this:
For the longest time, Linux did not provide per-process mount namespaces, and even today this feature is not available to unprivileged users — Plan 9, in contrast, had this feature available from the very beginning to all users.
As it happens there's an excellent reason why Unix (not just Linux) doesn't support this and why Plan 9 does, the same reason that chroot() is a privileged system call. It's this: Plan 9 does not have setuid, and Unix does.
Imagine that Unix had this feature and still had setuid, and you would like root privileges. No problem; make a custom namespace for /etc that has versions of /etc/shadow, /etc/group, and /etc/sudoers with known passwords and with you listed as authorized. Now run sudo. Done.
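The attack can be sketched with a toy model. The Python below is not real kernel or sudo code: Namespace, bind, and sudo_allows are hypothetical names invented for illustration, and a dictionary stands in for the filesystem. It only shows that a privileged check is worthless once the caller chooses what /etc/sudoers resolves to.

```python
class Namespace:
    """Toy per-process namespace: a base file tree plus user-made bindings."""

    def __init__(self, base_files):
        self.base_files = base_files      # path -> contents, the "real" system
        self.bindings = {}                # path -> contents bound on top

    def bind(self, path, contents):
        # Hypothetical unprivileged operation: shadow a path in this namespace.
        self.bindings[path] = contents

    def read(self, path):
        # Bindings win over the underlying system files.
        if path in self.bindings:
            return self.bindings[path]
        return self.base_files[path]


def sudo_allows(user, namespace):
    """Toy setuid check: trusts whatever /etc/sudoers it can see."""
    authorized = namespace.read("/etc/sudoers").split()
    return user in authorized


system_files = {"/etc/sudoers": "root admin"}

honest = Namespace(system_files)
attacker = Namespace(system_files)
attacker.bind("/etc/sudoers", "root admin mallory")

print(sudo_allows("mallory", honest))    # checked against the real file
print(sudo_allows("mallory", attacker))  # checked against mallory's own copy
```

The first call prints False and the second prints True, even though the system's own copy of the file was never modified.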
In Unix, the practical security of setuid programs relies on control of the filesystem. There is a huge raft of ways to subvert almost all setuid programs if you can control the contents of all files that they access, and that is exactly what unprivileged, per-process namespaces give you. While it might be theoretically possible to still be secure in this environment (with a tiny bit of kernel support), no setuid program today is going to be, for the simple reason that a setuid Unix program is entirely entitled to believe that /etc/shadow is under the system administrator's control, because that is the Unix permission model.
(I'm far from convinced that it's even theoretically possible for setuid programs to be secure in this environment, but I'm not totally sure it's impossible so I'm being conservative. What is sure is that any Unix system with user-controlled namespaces would need to totally rewrite all setuid programs.)
You could try to fix this by saying that setuid processes (and all processes that they start) don't see the user's customized namespace but some sort of standardized system-wide namespace. But this is a terrible solution in practice, one that's going to endlessly surprise users, because it means that setuid programs can wind up seeing an entirely different view of the system than you do. Since users do not necessarily know what's setuid and what isn't (and it changes over time anyways), they're going to experience an environment where some things work and others don't, apparently at random.
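Whether a particular program is setuid is at least easy to check by hand. A minimal sketch using Python's standard os and stat modules; the helper names mode_is_setuid and is_setuid are made up for this example, and the listed paths are only common locations that may not exist (or may not be setuid) on any given system.

```python
import os
import stat


def mode_is_setuid(mode):
    """Return True if a st_mode value has the setuid bit set."""
    return bool(mode & stat.S_ISUID)


def is_setuid(path):
    """Return True if the file at `path` has the setuid bit set."""
    return mode_is_setuid(os.stat(path).st_mode)


if __name__ == "__main__":
    # Assumed example paths; adjust for the system at hand.
    for candidate in ("/usr/bin/sudo", "/usr/bin/passwd", "/bin/ls"):
        if os.path.exists(candidate):
            print(candidate, is_setuid(candidate))
```

Few users ever run a check like this before typing a command, which is why a split view of the system would look random from their side.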
The end result of all of this is that all Unix system calls that rearrange the namespace that programs see must be privileged, because they allow you to compromise system security.
Plan 9 doesn't have setuid (because it has a real distributed filesystem), so it has none of these problems. You can rearrange your namespace freely because there's nothing you can subvert with a rearranged namespace.
Sidebar: the direct way to subvert almost all setuid programs with this
Modifying /etc/sudoers or /etc/shadow is simple but it's not the most direct way to subvert setuid programs. Almost all setuid programs on a modern system are dynamically linked, and a dynamically linked program loads code from outside files (both the initial runtime loader and various libraries). So build your own hacked version of the runtime loader that ignores the program and does whatever you want, change the namespace to put it on top of the normal loader's filename, and run a setuid program. Any precautions the program itself takes are now completely irrelevant; you have control before its code even starts executing.
This is why I say you need a tiny bit of kernel support to have even a theoretical chance of secure setuid programs in the face of such changeable namespaces; the kernel has to block this somehow for setuid programs or the game is over before it even starts.
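A toy model shows both the attack and one possible shape for that kernel support: for setuid programs, the loader is looked up in the system's namespace instead of the caller's. Everything here is invented for illustration (exec_program, LOADER_PATH, the kernel_guard flag, dictionaries standing in for namespaces); it is not how any real kernel works.

```python
LOADER_PATH = "/lib/ld.so"   # stand-in name for the runtime loader


def exec_program(program, user_ns, system_ns, kernel_guard=False):
    """Toy exec(): find the loader, run it, and report whose code ran."""
    # With the guard on, setuid programs ignore the caller's namespace
    # when the kernel looks up the runtime loader.
    if kernel_guard and program["setuid"]:
        namespace = system_ns
    else:
        namespace = user_ns
    loader = namespace[LOADER_PATH]
    # The loader runs first and decides what happens next.
    return loader(program)


def real_loader(program):
    return "ran " + program["name"]


def hacked_loader(program):
    # Ignores the program entirely; its own precautions never execute.
    return "ran attacker code with the privileges of " + program["name"]


system_ns = {LOADER_PATH: real_loader}
user_ns = dict(system_ns)
user_ns[LOADER_PATH] = hacked_loader      # bind our loader over the real one

passwd = {"name": "passwd", "setuid": True}

print(exec_program(passwd, user_ns, system_ns))
print(exec_program(passwd, user_ns, system_ns, kernel_guard=True))
```

Without the guard the hacked loader runs; with it, the real loader does. The guard only covers the loader lookup, so every other file the program opens would still come from the user's namespace.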
Sidebar: shooting down a potential limited solution
One potential bandaid to preserve some unprivileged namespace changing would be to say that you can only change the binding of 'mount points' that you own; you can change what $HOME/bin is, but not /etc. The argument for this is that you can already change $HOME/bin around.
However, I'm not convinced that this is secure in this form. The problem with a straightforward implementation is that you can cause files owned by other people to appear at arbitrary points under your own control (even if those files are not on the same filesystem as your directories and files). This is exactly the kind of thing that Linux has been busy preventing lately, which is a strong argument that it is dangerous in practice.
The apparently safe, fully restricted version requires you to own both the source and the target of the namespace operation. I'm not convinced that this is particularly useful.
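The two rules differ by a single ownership test, which a small sketch makes concrete. The function may_bind and its parameters are hypothetical, and ownership is reduced to plain user names; this is a policy illustration only, not a real system call.

```python
def may_bind(user, source_owner, target_owner, require_source=False):
    """Toy policy check for an unprivileged bind of source onto target."""
    if target_owner != user:
        return False                      # never rebind what you don't own
    if require_source and source_owner != user:
        return False                      # fully restricted variant
    return True


# Straightforward rule: only the mount point has to be yours.
print(may_bind("alice", source_owner="root", target_owner="alice"))
# Fully restricted rule: the source has to be yours as well.
print(may_bind("alice", source_owner="root", target_owner="alice",
               require_source=True))
# /etc is owned by root, so it is off limits under either rule.
print(may_bind("alice", source_owner="alice", target_owner="root"))
```

This prints True, False, False. The first case is the risky one: someone else's files show up at a path the user controls.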