2014-03-16
You don't have to reboot the system if init dies
One of the things that makes PID 1 special on many systems is that if it ever exits or dies for any reason, the system will reboot. This behavior was introduced by BSD Unix (V7 ignored the possibility) and makes a certain amount of sense; init is crucial both for reaping orphan processes and restarting serial port logins. If it goes away, rebooting the system is an easy way to hopefully fix the situation.
However, this behavior is not set in stone. There are several
alternatives. The first would be to simply have the kernel cope with
no PID 1, handling and reaping orphan processes itself internally in
some way (and possibly providing some special way for user level to
restart a new PID 1). The second is for the kernel to re-exec init as
PID 1 if necessary. If PID 1 exits, the kernel would not tear down its
process but instead act as if it had done an exec. Ideally this
would be accompanied by some way for init to store and then reload
important state. Done right this actually provides a great way for init
to transition itself into a new version; just record the current state,
exit, and let the kernel re-exec the new init binary.
Perhaps the second behavior sounds odd and crazy. Then I should probably tell you that this is current Solaris behavior and nothing seems to have exploded as a result. In other words we already have an existence proof that it's possible to change the semantics of PID 1 exiting, so we could adopt it elsewhere if desired.
Apart from the innate conservatism of Unixes, I think one reason that other Unixes haven't done this is that it's almost never necessary anyways. Since init not exiting is so crucial today people have devoted a lot of engineering effort to make sure that it doesn't happen and have been quite successful at it. Even radically different and complex systems like Upstart and systemd have been extremely stable this way in practice.
(Also, this 're-exec init on failure' behavior needs cooperation from your init, both so that init doesn't always start trying to boot the system when it's executed and so that it journals state periodically so that a new init can pick it up again. This makes it easier to add in certain sorts of Unixes, ie the ones where one team can control both kernel changes and init changes.)
2014-02-14
The good and bad of the System V init system
The good of System V init is that it gave us several big improvements
over what came before in V7 and BSD Unix. First
and largest, it modularized the boot process; instead of a monolithic
shell script (or two, if you counted /etc/rc.local) you had a
collection of little ones, one for each separate service. This alone is
a massive win and enabled all sorts of things that we take for granted
today (for example, casually stopping or starting a service).
The other big change is that System V init turned the entire work of
init from a collection of hacks into a systematic and generalized
thing. It formally defined runlevels and runlevel transitions and
created in /etc/inittab a general mechanism for specifying all of the
work init did, from booting to running gettys on serial lines (or
running anything) to how to reboot the system. System V init removed
the magic and hardcoding in favour of transparency. Things like reboot
stopped killing processes and making special system calls and turned
into 'tell init to go into runlevel ...', and then /etc/inittab and
runlevel transitions said what to do so that this actually rebooted the
machine. In the process it added a way to specify how services shut
down.
(Simply defining runlevels formally meant that other systems could now tell what state the system was in and behave differently between eg single user mode and multiuser mode.)
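To make the /etc/inittab mechanism concrete, here is an illustrative fragment (a sketch only; the field details, runlevels, and program paths vary between System V derived systems and the names here are made up):

```
# /etc/inittab format: id:runlevels:action:process

# Default runlevel to enter on boot.
id:3:initdefault:
# Run once at boot, before anything else.
si::sysinit:/etc/rc.sysinit
# On entering runlevel 3, run the rc script and wait for it to finish.
l3:3:wait:/etc/rc3
# Keep a getty running on the console; respawn it whenever it exits.
co:2345:respawn:/sbin/getty console
```

The 'wait' and 'respawn' actions are what replaced the old hardcoded behavior: everything from running boot scripts to restarting gettys becomes just another line in this table.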
The very general and high level view of the bad of the System V init system is that fundamentally all it does is blindly run shell scripts (and that only when the runlevel changes). This creates all sorts of lower-level consequences:
- SysV init doesn't know what services are even theoretically running
right now, much less which ones of them might have failed since
they were started.
- It doesn't know what processes are associated with what services.
Even individual init scripts don't know this reliably, especially
for modern multi-process services.
- Even init scripts themselves can't be certain what the state of
their service is. They must resort to ad hoc approaches like PID
files, flag files for 'did someone run <script> start at some time
this boot', checking process listings, and so on. These can
misfire.
- Services are restarted in a different environment than how they
are started on boot. Often
contamination leaks into a restarted service (in the form of
stray environment variables and other things).
- Output from services being started is not logged or captured in any
systematic way. Many init scripts simply throw it away and there's
certainly no official proper place to put it.
- The ordering of service starts is entirely linear, by explicit
specification and guarantee. System V init explicitly says 'I
start things in the following order'. There is no parallelism.
- Services are only started and stopped when the runlevel changes.
There is no support for starting services on demand, on events,
or when their prerequisites become ready (or stopping them when
a prerequisite is being shut down).
- System V init has no idea of dependencies and thus no way for
services to declare 'if X is restarted I need to be restarted too'
or 'don't start me until X declares itself ready'.
- There is no provision for restarting services on failure.
Technically you can give your service a direct
/etc/inittab entry (if it doesn't background itself) but then you
move it outside of what people consider 'the init system' and lose
everything associated with a regular init script.
- Since init scripts are shell scripts, they're essentially
impossible for programs to analyse to determine various things about
them.
- It's both hard and system-dependent to write a completely correct
init script (and many init scripts are mostly boilerplate). As a
result it's common for init scripts to not be completely correct.
- Init scripts are not lightweight things in general, either in reading them to understand them or in executing them to do things.
In theory you can try to fix many of these issues by adding workarounds in your standard init script functionality. Your 'standard' init script utilities would capture all daemon output in a documented place and way, start everything in cgroups (on Linux) or containers to track processes reliably, have support for restarting services on failure, carefully scrub every last bit of the environment on restarts, monitor things even after start, et cetera et cetera, and then you would insist that absolutely every init script use your utilities and only your utilities. In practice nothing like this has ever worked (people always show up with init scripts that have bugs, take shortcuts, or do not even try to use your complex 'standard' init utilities) and the result would not particularly be a 'System V init system' except in a fairly loose sense.
(It would also make each init script do even more work and run even more slowly than they do now.)
2014-02-13
Init's (historical) roles
Historically (by which I mean since at least V6 Unix), init aka PID 1 has had three and then four roles:
- It inherits orphan processes, ie processes that have had their regular
parent exit. Doing this almost certainly simplified a bunch of V7
kernel code because it meant that every process has a parent process.
- Starting up the user level of Unix on boot. Originally this was done
by running a monolithic shell script, as can still be sort of
seen in OpenBSD. System V init modularized and generalized it
into the multi-file form.
- Starting, managing, and restarting the
getty processes for the console and (other) serial lines. System V
init generalized this so that init started and restarted whatever
you told it to via entries in /etc/inittab.
- Shutting down the user level of the system and rebooting. This
role first appeared in System V init, using the modularity that
it had introduced for booting. Modern BSDs also give
init responsibility for rebooting (and it will run a shell script
as part of this), but as late as 4.4 BSD reboot(8) did almost all
of the work itself and there was no concept of running a shell
script to bring services down in any orderly way; reboot(8) just
killed everything in sight.
(Really. You can read the 4.4 BSD reboot(8) source
if you want, it's not long. The violence starts at the 'kill(-1,
SIGTERM)'.)
Putting the three (and then four) roles on the shoulders of a single
process is likely due to both conservation of resources in early
Unixes (given that they ran in very limited environments they likely
didn't want to take up memory with extra programs) and the simple
path of least complexity and effort. Once you had init as the inheritor of
orphan processes you might as well make it do all the other roles
since it was already there. Why throw in additional programs without
a good need? It probably helped that even in V7 the other two roles
were pretty simple and minimal, per eg the V7 /etc/rc.
As a historical note, it was BSD Unix that decided that init was
so crucial that the system should be rebooted if it ever exited.
V7 Unix will probably get into an odd state if init ever exits
but as far as I can tell from the kernel source PID 1 is not treated
specially as far as exiting goes; as a practical matter V7 Unix
just assumes it will never happen. Even what happens if /etc/init
can't be executed on boot is not strictly a kernel thing in V7.
(In the initial environment of BSD, this decision was probably doubly
correct. Even if you never have to deal with any orphaned processes or
the kernel cleaned them up itself (let's wave our hands aggressively
here), losing init means that getty processes will not be restarted on
serial lines when people log out, which over time makes it impossible
for anyone to log in. Of course in the modern era of networked machines
this is no longer such an issue and you probably care a lot more about
sshd than about gettys.)
Some modern init systems have split some or most of these roles out
from PID 1. Solaris, for example, moved everything except the first
role to separate processes (the SMF stuff runs in svc.startd et
al and getty processes are handled through ttymon and sac).
2014-01-29
One cause of Linux's popularity among Unixes
Regardless of what you feel about it, I think that most people can agree that Linux is winning whatever is left of the Unix wars. It isn't the only Unix left but for a fairly long time now it's been the leading one, often the default choice. You can attribute this to good PR if you want to, but I happen to think that that's a mistake. Linux has attracted people partly because it has genuine attractions.
In light of my rant about the waste inherent in building packages yourself, it has struck me that one such advantage has been Linux's general wide availability of packages. As I mentioned, system administrators really appreciate not having to spend their limited time compiling ordinary things and Linux is very good at that; most major Linux distributions will give you a precompiled version of almost any standard Unix program you could want (or at least a precompiled program to do almost any standard job). I don't think it's an accident that one of the long term favorite distributions is Debian, which has one of the biggest package archives going.
(Prepackaged software is not good enough if you need a specific version of something compiled in a specific way. But for many Unix machines you just need a working and reasonably current version of whatever. And there are a lot of packages on many machines where the exact details are not crucial.)
At this point I have to mention FreeBSD's ports collection, which even comes in precompiled packages; logically one would expect this to be just as good a selling point for FreeBSD as a Linux distribution with a similar package selection. However, I'm not convinced that it is in practice, and for why I'll point at the name: 'ports'. Well, more what the name means or is perceived as meaning.
Debian's vast collection of packages are all Debian together. Some of
them are more important than others, but they are all part of the Debian
whole. The dividing line between really important and less and less
important is both relatively opaque to outsiders and somewhat subject
to debate; it can get pushed back and forth if people want. By contrast
at least to an outsider FreeBSD has a relatively sharp dividing line;
you have FreeBSD core and then you have ports. Ports is clearly not the
same and to drive the point home they install things into /usr/local
instead of /usr.
FreeBSD is probably at least as committed to ports as, say, Ubuntu
is to packages in universe.
But I'm not convinced that non-FreeBSD sysadmins who are looking
at the situation really believe down in their guts that FreeBSD is
as committed to ports as Ubuntu is to main (even if it is, and I
don't believe it's that committed to all ports). And I think that
that makes a difference.
(I am talking about non-FreeBSD sysadmins here because these are often the people who are making decisions about whether or not to use FreeBSD. Also, that's the situation I'm in myself, so I don't know how it looks from the inside but I can talk with at least a little bit of authority about how it feels from the outside.)
PS: I haven't mentioned commercial Unixes here because oh boy package availability on commercial Unixes, that's a funny joke. Provided by third parties at best. Red Hat Enterprise Linux is sort of in the same boat but at least they woke up and I think started doing something with EPEL.
2014-01-25
The origin of RCS (the version control system)
Let's start with the tweets:
@johnregehr: quick! who remembers a revision control system before RCS?
@thatcks: SCCS. Bonus trivia: RCS exists because getting SCCS required paying AT&T extra money and universities don't have that.
I've mentioned this before in passing, but I might as well tell the full story (or at least folklore) here. The disclaimer is that this is the story as I heard it, not definitive history. A lot of Unix history goes around as folklore, or went around in the days when people were still passing Unix history around.
Version 7 Unix didn't ship with any version control system at all. SCCS, the first Unix version control system, first started appearing in AT&T's PWB. PWB and things unique to it were not covered by AT&T's generous V7 university source code licenses; where you could get them at all (and I'm not sure outsiders could in any form before System III) AT&T apparently wanted extra money for them. Universities of course did not feel like paying extra money for niceties and anyways they weren't using PWB, they were using V7 or later BSD and so would have had to port anything they wanted from PWB to BSD. All of this meant that if you were using Unix at a university in the late 1970s and early 1980s you could look wistfully on SCCS from afar but you almost certainly could not get a copy.
Which is where RCS comes from. As the Wikipedia entry helpfully mentions, Walter Tichy wrote the initial version of RCS while at Purdue, where he had access to V7 and BSD but I assume not PWB. As the folklore goes, he wanted version control, could not get SCCS, and so wrote his own. As one did with Unix programs in an academic environment at the time, he released it for general use. Since quite a lot of universities were in a similar position of wanting some sort of version control on their Unix systems but not having SCCS, it got widely adopted.
As I mentioned before RCS required
(V7) Unix source code (specifically for diff), which might strike
you as odd if Tichy wrote it from scratch. As I remember the story,
RCS required some additional diff features and in an early
1980s university environment with V7 and BSD source code the easiest
way for Tichy to get that was to modify the BSD diff to support
what RCS needed. The reason you can get RCS widely today is GNU
diff, which is both free and directly supports the features that
RCS needs without any patching.
(I think that one of the diff features RCS needed was a three-way diff.
I believe it may have also wanted a somewhat different format of diff
output, given GNU diff's -n argument.)
(The official RCS home page has some early RCS papers online.)
2013-12-31
Two uses of fmt
The venerable fmt program is not something that I normally think of as
a command that I use in my scripts or on the fly pipelines; I usually
think of it more as, say, something that I use to reflow paragraphs in
vi. But it has quietly been making its way into an increasing number
of them because it turns out that fmt is the easy and lazy way to do
two symmetrical things: to word-split lines and to merge 'one word per
line' output back to a single line.
Word splitting is straightforward:
somecmd | fmt -1
Every version of fmt that I've seen accepts this and does the right
thing; you get one word per line after fmt.
Word joining is much more annoying and it's all because of GNU
coreutils. Real versions of fmt accept very large format widths:
/bin/ls -1 | fmt -9999
Unfortunately the GNU coreutils version of fmt has a maximum width
of 2500 characters. Worse, it has a 'goal' width that defaults to 93%
of the actual width, so if you're worried about getting close to that
limit you need to use:
/bin/ls -1 | fmt -2500 -g2500
In practice I usually use 'fmt -999' in command pipelines because my
actual output line is going to be nowhere near 999 characters to start
with.
(Usually when I'm doing word merging it's because I'm going to take the one line that results and paste it to something else, which imposes a relatively modest line length limit in practice.)
What this points out is that fmt is not really the ideal solution
to this (and in fact the FreeBSD version of fmt also has oddities,
such as the man page's
description of behavior around the -n switch). The traditional Unix
solution to these problems is tr, using it to either turn spaces to
newlines or newlines to spaces. The problem for me in practice is that
to use tr I need to remember or re-derive the octal value of newline
(it's \012, by the way) and that is just a bit too much hassle. So I
use fmt, vague warts and all.
(The other drawback of tr is that 'tr "\012" " "' will leave a
trailing space and no final newline. Usually this is not a big deal.)
Actually in writing this I've discovered that I'm behind the times. All
of the versions of tr that I use today will accept \n instead of
the octal literal. Either there was a day when this wasn't true or I
just never read far enough in the tr manpage (and had it stick) to
notice that particular feature. (I'm probably still going to keep on
using fmt, though.)
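As a compact side-by-side sketch of both directions (the example words are invented, and the tr forms use the \n escape rather than the octal literal):

```shell
# Word splitting: turn one line into one word per line.
printf 'alpha beta gamma\n' | fmt -1
printf 'alpha beta gamma\n' | tr ' ' '\n'   # the tr equivalent

# Word joining: turn one-word-per-line input back into a single line.
printf 'alpha\nbeta\ngamma\n' | fmt -999
# The tr version; note it leaves a trailing space and no final newline.
printf 'alpha\nbeta\ngamma\n' | tr '\n' ' '
```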
2013-12-30
My growing entanglement into vi
It started with vi becoming my sysadmin's editor, the editor that I used for quick
edits because it was everywhere, worked in minimal environments, and
it started fast. But of course it didn't stop there. Any good tool has
a virtuous circle where more use makes you more familiar with it and
thus makes it easier to use so you use it more; vi goes well beyond
that in terms of rewarding extended use. Vi's march into my
editing life has not been fast but it's feeling more and more relentless
as time goes by, especially when I do things like specifically configure
git to use vi instead of my normal default.
I'm not using vi pervasively quite yet, but increasingly my major
holdout (writing email in my full email environment) feels a little bit
awkward.
(My normal default $EDITOR is a script that tries to intelligently
pick the editor to use based on my current environment based on
things like whether or not I have X available.)
This has not fundamentally changed my view of vi as a whole (it
remains not my favorite editor). I am simply being seduced by
convenience and familiarity, and running into the limits and issues
in my major other editor. Not that vi
is bad (rather the contrary), but I still miss things from my other
editors and often would sort of prefer to be using them.
(Possibly this attachment to my major other editor is just emotion speaking.)
While I've been learning additional vi (well, vim) features slowly
over time, I still have not really attempted to become solidly familiar
with Vim's advancements over the core vi editing commands (I'm going
to wave my hands about the reasons why, but see above about vi still
not being my favorite editor). If I get more seriously into vi, and it
seems inevitable that I will, I should probably change that. My obvious
weak areas are the areas where vi itself is weak: working fluidly
with multiple files and also with split screens for editing two files
simultaneously. Mastering doing this in Vim would remove one significant
reason that I revert to other editors.
(I will probably always edit Python, C, and Go code in GNU Emacs when I have a choice. But there is a fair amount of other multi-file work that would at least be more convenient if I knew what I was really doing in Vim.)
I know that Vim has a universe of advanced text movement and text manipulation commands but I'm honestly not sure that I feel much interest in learning them. The mere fact that there is a universe of them is kind of daunting and I have no confidence that they'd speed up the sort of editing work that I do very much. Probably some of them would, so I suppose I should at least glance over the list to see if anything stands out.
(This has come out more rambling and thinking aloud than I thought it
would. I do think that there's something interesting about how vi has
wormed its way into my computing life as more and more the editor I
reach for, but I don't have the words for it right now.)
2013-12-15
Making large selections in xterm (and urxvt and Gnome Terminal)
Suppose that you have a large chunk of output in a terminal window,
specifically more than a full screen's worth, and you want to copy it
into an email message, text file, or however else you may be logging it
for the record. As I knew vaguely but had never really read up on or
used until very recently, it turns out that there is a convenient way to
do this in xterm. Specifically, this is what the right mouse button is
for; it extends the selection from where it is until the current point.
So in xterm what you do to select a huge selection is select a bit
right at one end (the start or the end), scroll to the other end, and
carefully hit the right mouse button at where you want the selection to
end. The selection is instantly extended. You can do this several times
if you want, extending the selection each time. Odder and less easily
controlled things happen if you hit the right mouse button somewhere
inside the selection.
This doesn't work in Gnome Terminal. Instead what you have to do
is start the selection with the left mouse button and while making
it, drag the mouse cursor to the edge of the window (or outside the
window). G-T will scroll things for you, extending the selection in
the process. G-T's scrolling is sufficiently rapid that this is a
reasonably convenient and intuitive process, arguably better than
xterm's.
Urxvt gives you both options; you can extend the selection explicitly
with the right mouse button or let urxvt scroll things for you in the
same way as Gnome Terminal. The one drawback is that urxvt by default
scrolls inconveniently slowly (and there doesn't seem to be any way to
control this from what I can see in the manual). You can scroll with a
mouse scrollwheel and it works reasonably well although a bit jumpily in
my quick test.
(Xterm doesn't scroll at all if you drag the mouse out of the window while you make a selection.)
In a brief test, KDE's Konsole works the same way as Gnome Terminal. I
suspect that this is going to be the common behavior of more or less all
modern 'smart' terminal emulators because it makes the most sense and
it's relatively discoverable (unlike the right mouse button in xterm).
2013-12-06
The three levels of read-only NFS mounts
It's sometimes useful to understand that there are three ways that an NFS mounted filesystem can be 'read-only'. Let's call them three levels:
- You can mount the NFS filesystem read-only on the client. The client
kernel will then enforce this, disallowing write actions and so on.
These days this is generally handled in high level VFS code, since it's
common behavior across filesystems.
As with all remote filesystems, this read-only status is purely local to your client machine. Your machine doesn't get to order the NFS server not to make any changes on the filesystem (that would be laughable) so the NFS server is perfectly entitled to allow the filesystem to change underneath you and to have other clients mount it read-write (and write to it). If NFS is working right, you will see those changes at some point.
- The server can export the NFS filesystem read-only (either to you
or just in general). The NFS server code will then disallow all
write actions that clients send it, returning an appropriate 'read
only filesystem' error to errant clients (if any). Even if the NFS
mount is exported read-only to all clients, it's still valid for the
exported filesystem to be changed locally on the NFS server.
(As far as I know, whether or not the NFS export is read-only is invisible to the client. It's purely something internal to the server and can even change on the fly.)
- On the server you can mount the exported filesystem read-only (or
otherwise set it that way). On competent NFS servers this disallows
all writes to the filesystem, regardless of whether they're NFS
or local and regardless of whether the filesystem was exported
read-only by the NFS server.
(On competent NFS servers, all NFS server operations on the exported filesystem go through the VFS et al and so have the standard handling of read-only mounts applied to them automatically.)
These can certainly be stacked on top of each other (a read-only server filesystem, NFS exported as read-only and mounted as read-only on clients) but they don't have to be. For instance you can NFS export filesystems as read-only but mount them read-write on clients (we do this here for complex reasons).
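As a concrete sketch of the three levels (the hostnames, device, and paths are invented, and the exports syntax shown is Linux-style):

```
# Level 1: the client mounts the filesystem read-only (client fstab):
fileserver:/export/data  /data  nfs  ro  0 0

# Level 2: the server exports it read-only (Linux /etc/exports style):
/export/data  client.example.com(ro)

# Level 3: the server's own mount of the filesystem is read-only:
mount -o ro /dev/sdb1 /export/data
```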
Now let's talk about atime and atime updates. In NFS, atime updates are the responsibility of the server, not the clients. More specifically they are generally the responsibility of the underlying server filesystem code or VFS, not specifically the NFS server code, and as such they can happen when you read data through a read-only NFS mount or even a read-only NFS export. The NFS client asks to read data, the NFS server code makes a general VFS 'get me data' call, and as a side effect of this the VFS or the filesystem updates the atime (if atime updates are enabled at all).
(This implies that not all client reads necessarily update the server atime, because a client may satisfy a read from its own file cache instead of going to the server.)
If you think about it this is actually a feature. If you have atime enabled on a read-write filesystem mount, you have told the (server) kernel that you want to know when people read data from the filesystem and lo, this is exactly what you are getting. The read-only NFS export is just to tell the NFS server that it should not allow people to do 'write' VFS operations.
(Since you can export the same filesystem read-write to some clients and read-only to others, suppressing atime updates on read-only NFS exports could also produce odd effects. Read a file from client A and the atime updates, read the file from client B and it doesn't. And all because you didn't trust client B enough to let it actually make (filesystem level) changes to your valuable filesystem.)
Sidebar: NFS exporting of read-only filesystems
You might think that the NFS export process should notice when it's
exporting a read-only filesystem as theoretically read-write and
silently change it to read-only for you. One of the problems with this
is that on many systems it's possible to switch filesystems back and
forth between read-only and read-write status through various mechanisms
(not just mount). In practice you might as well let the NFS server
accept the write operations and have the VFS then reject them; the
outcome is the same while the system is simpler and behaves better in
the face of various things happening.
2013-12-03
The three faces of sudo
For reasons beyond the scope of this entry I've recently been thinking
about my attitudes towards sudo. Sudo is a complex program with a lot
of options and several different ways of using it, and in the process
of my thinking I've realized that for me it's effectively three things
in one (and I feel differently about each facet). So here are my three
faces of sudo:
- sudo as a replacement for having specific setuid programs. You're
using it to give (passwordless) access to something for some group
of people (or everyone); instead of writing a setuid program you
use sudo to run a non-setuid program or script with the necessary
privileges. Often you may want to wrap the sudo invocation up in a
cover script so you can tell people 'just run /some/script'.
- sudo as a way of giving non-sysadmin staff limited and guarded
access to especially privileged and dangerous operations. This is
the traditional 'operators are allowed to run reboot' situation,
which I'll summarize as 'restricted root powers'. Here the people
using sudo are not full sysadmins and are not trusted to hold
unrestricted root privileges.
- sudo as the way you access unrestricted root privileges, where use
of sudo replaces su. You're encouraged to use sudo to run specific
commands (even a bunch of commands) instead of using it to just get
a root shell and then doing stuff from there.
(In practice, use of sudo this way temporarily turns your current
shell session into a peculiar privileged hybrid environment where
you can use root powers casually by prefixing a command with sudo.)
I think that there are lots of uses for sudo as a replacement for
setuid programs. Setuid programs are hard to write securely and can
only be written in a few languages. Using sudo lets you more or
less safely write 'setuid' programs in, say, shell scripts or Perl
or the like. Invocation of them is a bit funny (you have to say
'sudo <program>') but that can be hidden by a cover script. We use
this here for a number of things (eg) and it works great.
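As an illustration of this pattern, a sudoers entry plus a cover script might look like this (the group name, script names, and paths are all invented for the example):

```
# /etc/sudoers fragment: let members of group 'staff' run one
# specific script as root, without a password.
%staff ALL = (root) NOPASSWD: /usr/local/sbin/rotate-logs

# Cover script /usr/local/bin/rotate-logs, so that people never
# have to remember the sudo invocation themselves:
#!/bin/sh
exec sudo /usr/local/sbin/rotate-logs "$@"
```

The important property is that /usr/local/sbin/rotate-logs itself is an ordinary non-setuid script; all of the privilege escalation is concentrated in one auditable sudoers line.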
I'm less sanguine about sudo as a way to give out restricted root
powers, especially if you let people run ordinary programs instead of
only custom-designed scripts. Unless you're very careful it's easy
to accidentally give people a way out of your restricted jail, since
programs are generally not designed to enforce a restricted environment
and contain all sorts of odd holes. For instance, if you allow people to
run 'vi /some/file' as root you've just given them full root access if
they want it. The whole area is a massive minefield if you're faced with
an attacker.
(This doesn't require your operators to be malicious. Unfortunately you've turned compromising an operator account into a great path towards root access.)
My feelings about sudo as a replacement for su are sufficiently
complicated that they don't fit in this entry. The short version is that
I think you're likely to be creating a different security model with
different risks; how different they are depends on how you configure
sudo. The more you make the risks of sudo match the risks of su,
the more you turn sudo into su.