What an init system needs to do in the abstract
I've talked before about what
init does historically, but that's not the same thing as what an
init system actually needs to do, considered abstractly and divorced
from the historical paths that got us here and still influence how
we think about init systems. So, what does a modern init system in
a modern Unix need to do?
At the abstract level, I think a modern init system has three jobs:
- Being the central process on the system. This is both the modest
job of being PID 1 (inheriting parentless processes and reaping
them when they die) and the larger, more important job of supervising
and (re)starting any other components of the init system.
- Starting and stopping the system, and also transitioning it
between system states like single user and multiuser. The second
job has diminished in importance over the years; in practice most
systems today almost never transition between runlevels or the
equivalent except to boot or reboot.
(At one point people tried to draw a runlevel distinction between 'multiuser without networking' and 'multiuser with networking' and maybe 'console text logins' and 'graphical logins with X running' but today those distinctions are mostly created by stopping and starting daemons, perhaps abstracted through high level labels for collections of daemons.)
- Supervising (daemon) processes to start, stop, and restart them on
demand or need or whatever. This was once a sideline but has
become the major practical activity of an init system and the
reason people spend most of their time interacting with it. Today this
encompasses both regular getty processes (which die and restart regularly) and a whole collection of daemons (which are often not expected to die and may not be restarted automatically if they do).
You can split this job into two sorts of daemons, infrastructure processes that must be started in order for the core system to operate (and for other daemons to run sensibly) and service processes that ultimately just provide services to people using the machine. Service processes are often simpler to start, restart, and manage than infrastructure processes.
In practice modern Unixes often add a fourth job, that of managing the appearance and disappearance of devices. This job is not strictly part of init but it is inextricably intertwined with at least booting the system (and sometimes shutting it down) and in a dependency-based init system it will often strongly influence what jobs/processes can be started or must be stopped at any given time (eg you start network configuration when the network device appears, you start filesystem mounts when devices appear, and so on).
The first job mostly or entirely requires being PID 1; at a minimum your PID 1 has to inherit and reap orphans. Since stopping and starting daemons and processes in general is a large part of booting and rebooting, the second and third jobs are closely intertwined in practice although you could in theory split them apart and that might simplify each side. The fourth job is historically managed by separate tools but often talks with the init system as a whole because it's a core dependency of the second and third jobs.
(Booting and rebooting is often two conceptually separate steps in that first you check filesystems and do other initial system setup then you start a whole bunch of daemons (and in shutdown you stop a bunch of daemons and then tear down core OS bits). If you do this split, you might want to transfer responsibility for infrastructure daemons to the second job.)
The Unix world has multiple existence proofs that all of these roles do not have to be embedded in a single PID 1 process and program. In particular there is a long history of (better) daemon supervision tools that people can and do use as replacements for their native init system's tools for this (often just for service daemons), and as I've mentioned Solaris's SMF splits the second and third role out into a cascade of additional programs.
Systemd's fate will be decided by whether or not it works
I have recently been hearing a bunch of renewed grumbling about systemd, probably provoked by the release of RHEL 7 (with a contributing assist from the Debian decision for it and Ubuntu's decision to go along with Debian). There are calls for a boycott or moving away from systemd-using Linuxes, perhaps to FreeBSD, for example. My personal view is that such things misread the factors that will drive both sides of the decision about systemd, that will sway many people either passively for or actively against it.
What it all comes down to is that operating systems are commodities and this commodification extends to the init system. For most people, the purpose of an OS, a Linux distribution, a method of configuring the network, and an init system is to run your applications and keep your system going without causing you heartburn (ideally all of them will actually help you). For (some) management and organizations, an additional purpose is making things not their fault. Technical specifics are almost always weak influences at best.
(It is worth saying explicitly that this is as it should be. The job of computer systems is to serve the needs of the organization; they can and must be judged on how well they do that. Other factors are secondary. Note that this doesn't mean that other factors are irrelevant; in a commodity market, there may be many solutions that meet the needs and so you can choose among them based on secondary factors.)
This cuts both ways. On the one hand, it means that generally no one is really going to care if you run FreeBSD instead of Linux (or Linux X instead of Linux Y) because you want to, provided that everything keeps working or at most that things are only slightly worse from their perspective. On the other hand, it also means that most sysadmins don't care deeply about the technical specifics of what they're running provided that it works.
You can probably see where this is going. If (and only if) systemd works, most people won't care about it. Most sysadmins are not going to chuck away perfectly good RHEL 7, Debian, or Ubuntu systems on the principle that systemd is icky, especially if this requires them to step down to systems that are less attractive apart from not having systemd. In fact most sysadmins are probably only vaguely aware of systemd, especially if things just work on their systems.
On the other hand, if systemd turns out to make RHEL 7, Debian, or Ubuntu machines unstable we will see a massive pushback and revolt. No amount of blandishment from any of the powers that be can make sysadmins swallow things that give them heartburn; a glowing example of this is SELinux, which Red Hat has been trying to push on people for ages with notable lack of success. If Red Hat et al cannot make systemd work reliably and will not replace it posthaste, people will abandon them for other options that work (be those other Linuxes or FreeBSD). And if systemd works well only in some environments, only people in those environments will have the luxury of ignoring it.
That is why I say that systemd's fate will be decided by whether or not it works. If it does work, inertia means that most sysadmins will accept it because it is part of a commodity that they've already picked for other reasons and they likely don't care about the exact details of said commodity. If it doesn't work, that's just it; people will move to systems that do work in one way or another, because that's the organizational imperative (systems that don't work are expensive space heaters or paperweights).
Sidebar: The devil's advocate position
What I've written is only true in the moderate term. In the long term, systemd's fate is in the hands of both Linux distribution developers in general and the people who can write better versions of what it does. If those people are and remain dissatisfied with systemd, it's very likely to eventually get supplanted and replaced. Call this the oyster pearl story of Linux evolution, where people not so much scratch an itch (in the sense of a need) as scratch an irritation.
The kernel should not generate messages at the behest of the Internet
Here is a kernel message that one of my machines logged recently:
sit: Src spoofed 184.108.40.206/2002:4d4d:4d07::4d4d:4d07 -> 220.127.116.11/2002:8064:333::1
Did I say 'a message'? Actually, no, I meant 493 messages in a few days (and it would be more if I had not used iptables to block these packets). Perhaps you begin to see the problem here. This message shows two failures. The first is that it's not usefully ratelimited. This exact message was repeated periodically, often in relatively close succession and with no intervening messages, yet it was not effectively ratelimited and suppressed.
(The kernel code uses
net_warn_ratelimited() but this is
clearly not ratelimited enough.)
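To illustrate what 'effectively ratelimited' might mean, here is a sketch of a generic token-bucket limiter in Python. This is the general technique only, not how the kernel's net_warn_ratelimited() is actually implemented:

```python
import time

class RateLimiter:
    """A token-bucket message limiter: allow at most `burst` messages
    at once, refilling at `rate` messages per second. A sketch of the
    general idea, not the kernel's actual implementation."""

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum bucket size
        self.tokens = float(burst)   # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        # Refill the bucket based on elapsed time, capped at burst.
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With something like this in front of the log call, a flood of identical messages would be capped at roughly `rate` per second after the initial burst instead of all of them reaching the logs.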
The second and more severe failure is that the kernel should not
default to logging messages at the behest of the Internet. If you have a
sit tunnel up for 6to4, anyone on the
Internet can flood your kernel logs with their own version of this
message; all they have to do is hand-craft a 6to4 packet with the
wrong IPv6 address. As we've seen here, such packets can probably
even be generated by accident or misconfiguration or perhaps funny
routing. Allow me to be blunt: the kernel should not be handing
this power to people on the Internet. Doing so is a terrible idea
for all of the usual reasons that giving Internet strangers any
power over your machine is a bad idea.
These messages should not be generated by default (at any logging level, because there is no logging level that means 'only log messages that are terrible ideas'). If the kernel wants to generate them, doing so can and should be controlled by a sysctl or a sysfs option or the like that defaults to off. People who really really want to know can then turn it on; the rest of us will leave it off in our usual great indifference to yet another form of Internet badness.
(Since I haven't been quite this harsh on kernel messages before, I'll admit it: my attitude on kernel messages has probably steadily gotten stricter and more irritated over time. I should probably write down my thoughts on good kernel messages sometime.)
Sidebar: what this message means
A 6to4 encapsulated packet has two addresses: the outer IPv4 address and the inner IPv6 address. The kernel insists that the inner IPv6 address be the 6to4 address corresponding to the outer IPv4 address. Here the outer source address doesn't match the IPv4 address embedded in the inner 2002::/16 6to4 source address. You can get a similar message if the destination address has a mismatch between the IPv4 address and the 6to4 IPv6 address.
(To decode the 6to4 IPv6 address, take off the leading 2002: bit and then convert the next four hex bytes to decimal; each byte is one component of the dotted-quad IPv4 address. So the source claims to be hex 4d.4d.4d.07, aka 77.77.77.7. We can follow the same procedure for the destination address, getting hex 80.64.03.33, aka 128.100.3.51 in decimal, which matches the outer IPv4 address.)
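This decoding procedure is mechanical enough to sketch in a few lines of Python (the function name here is mine, purely for illustration):

```python
def sixto4_embedded_ipv4(addr):
    """Decode the IPv4 address embedded in a 2002::/16 (6to4) IPv6
    address: the second and third 16-bit groups hold the four bytes
    of the IPv4 address."""
    groups = addr.split(":")
    if groups[0] != "2002":
        raise ValueError("not a 6to4 address: " + addr)
    # Restore any suppressed leading zeros in the two groups.
    hexbytes = groups[1].zfill(4) + groups[2].zfill(4)
    return ".".join(str(int(hexbytes[i:i + 2], 16))
                    for i in range(0, 8, 2))

print(sixto4_embedded_ipv4("2002:4d4d:4d07::4d4d:4d07"))  # 77.77.77.7
print(sixto4_embedded_ipv4("2002:8064:333::1"))           # 128.100.3.51
```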
A DTrace script to help figure out what process IO is slow
I recently made public a dtrace script I wrote, which gives you per file descriptor IO breakdowns for a particular process. I think it's both an interesting, useful tool and probably not quite the right approach to diagnose this sort of problem, so I want to talk about both the problem and what it tells you. To start with, the problem.
Suppose, not entirely hypothetically, that you have a relatively complex multi-process setup with data flowing between the various processes and the whole thing is (too) slow. Somewhere in the whole assemblage is a bottleneck. Basic monitoring tools for things like disk IO and network bandwidth will give you aggregate status over the entire assemblage, but they can only point out the obvious bottlenecks (total disk IO, total network bandwidth, etc). What we'd like to do here is peer inside the multi-process assemblage to see which data flows are fast and which are slow. This per-data-flow breakdown is why the script shows IO on a per file descriptor basis.
What the DTrace script's output looks like is this:
s fd  7w: 10 MB/s  waiting ms: 241 / 1000  ( 10 KB avg * 955)
p fd  8r: 10 MB/s  waiting ms:  39 / 1000  ( 10 KB avg * 955)
s fd 11w:  0 MB/s  waiting ms:   0 / 1000  (  5 KB avg * 2)
p fd 17r:  0 MB/s  waiting ms:   0 / 1000  (  5 KB avg * 2)
s fd 19w: 12 MB/s  waiting ms: 354 / 1000  ( 10 KB avg * 1206)
p fd 21r: 12 MB/s  waiting ms:  43 / 1000  ( 10 KB avg * 1206)
  fd 999r: 22 MB/s  waiting ms:  83 / 1000  ( 10 KB avg * 2164)
  fd 999w: 22 MB/s  waiting ms: 595 / 1000  ( 10 KB avg * 2164)
IO waits: read: 83 ms  write: 595 ms  total: 679 ms
(This is a per-second figure averaged over ten seconds, and file
descriptor 999 is for the total read and write activity. Something
like pfiles can be used to tell what each file descriptor is
connected to if you don't already know.)
Right away we can tell a fair amount about what this process is
doing; it's clearly copying two streams of data from inputs to
outputs (with a third one not doing much). It's also spending much
more of its IO wait time writing the data rather than waiting for
there to be more input, although the picture here is misleading
because it's also making
pollsys() calls and I wasn't tracking
the time spent waiting in those (or the time spent in other syscalls).
(The limited measurement is partly an artifact of what I needed to diagnose our problem.)
What I'm not sure about this DTrace script is if it's the most useful and informative way to peer into this problem. Its output points straight to network writes being the bottleneck (for reasons that I don't know) but that discovery seems indirect and kind of happenstance, visible only because I decided to track how long IO on each file descriptor took. In particular it feels like there are things I ought to be measuring here that would give me more useful and pointed information, but I can't think of what else to measure. It's as if I'm not asking quite the right questions.
(I've looked at Brendan Gregg's Off-CPU Analysis; an off-cpu flamegraph analysis actually kind of pointed in the direction of network writes too, but it was hard to interpret and get too much from. Wanting some degree of confirmation and visibility into this led me to write fdrwmon.d.)
Some uses for SIGSTOP and some cautions
If you ask, many people will tell you that Unix doesn't have a
general mechanism for suspending processes and later resuming them.
These people are correct in general, but sometimes you can cheat
and get away with a good enough substitute. That substitute is
SIGSTOP, which is at the core of job control.
Although processes can catch and react to other job control signals,
SIGSTOP is a non-blockable signal like
SIGKILL (aka '
kill -9'). When a process is sent SIGSTOP, the kernel
stops the process on the spot and suspends it until the process
receives a SIGCONT (more or less). You can thus pause processes and
continue them by manually sending them these signals as
appropriate and desired.
(Since it's a regular signal, you can use a number of standard
mechanisms to send
SIGSTOP to an entire process group or all of
a user's processes at once.)
There are any number of uses for this. Do you have too many processes banging away on the disk (or just think you might)? You can stop some of them for a while. Is a process saturating your limited network bandwidth? Pause it while you get a word in edgewise. And so on. Basically this is more or less job control for relatively arbitrary user processes, as you might expect.
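As a minimal sketch of doing this by hand from a program (with sleep standing in for some arbitrary worker process):

```python
import os
import signal
import subprocess
import time

# 'sleep' stands in here for an arbitrary long-running worker process.
p = subprocess.Popen(["sleep", "60"])

os.kill(p.pid, signal.SIGSTOP)   # suspend it on the spot
time.sleep(0.1)
# While suspended, Linux shows state 'T' (stopped) in /proc/<pid>/stat;
# ps will similarly report the process as stopped.

os.kill(p.pid, signal.SIGCONT)   # let it run again

p.terminate()
p.wait()
```

The same thing from the shell is just `kill -STOP <pid>` and later `kill -CONT <pid>`.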
Unfortunately there are some cautions and limitations attached to
SIGSTOP on arbitrary processes. The first one is
straightforward: if you
SIGSTOP something that is talking to the
network or to other processes, its connections may break if you
leave it stopped too long. The other processes don't magically know
that the first process has been suspended and so they should let
it be, and many of them will have limits on how much data they'll
queue up or how long they'll wait for responses and the like. Hit
the limits and they'll assume something has gone wrong and cut your
suspended process off.
(The good news is that it will be application processes that do
this, and only if they go out of their way to have timeouts and
other limits. The kernel is perfectly happy to leave things be for
however long you want to wait before a SIGCONT.)
The other issue is that some processes will detect and react to one
of their children being hit with a
SIGSTOP. They may SIGCONT
the child or they may kill the process outright; in either case
it's probably not what you wanted to happen. Generally you're safest
when the parent of the process you want to pause is something simple,
like a shell script. In particular,
init (PID 1) is historically
somewhat touchy about
SIGSTOP'd processes and may often either
SIGCONT them or kill them rather than leave them be. This is
especially likely if
init inherits a
SIGSTOP'd process because
its original parent process died.
(This is actually relatively sensible behavior to avoid
having a slowly growing flock of orphaned SIGSTOP'd processes.)
These issues, especially the second, are why I say that SIGSTOP
is not a general mechanism for suspending processes. It's a mechanism
and on one level it always works, but the problem is the potential
side effects and aftereffects. You can't just
SIGSTOP an arbitrary
process and be confident that it will still be there to be continued
ten minutes later (much less over longer time intervals). Sometimes
or often you'll get away with it but every so often you won't.
Some other benefits of using non-HTTP frontend to backend transports
A commentator left a very good comment on my entry on why I don't like HTTP as a frontend to backend transport that points out the security benefits of using a simple protocol instead of a complex one. That makes a good start to talking about the general benefits of using a non-HTTP transport, beyond the basic lack of encapsulation I talked about in my entry.
The straightforward security benefit (as noted by the commentator) is that a simple protocol exposes less attack surface and will lead to implementations that are easier to audit. Very related to this is that full HTTP is a very complicated protocol with many dark corners, especially if people start trying tricky exploits against you. In practice, HTTP used as a transport is unlikely to use full HTTP (and hopefully many of the perverse portions will be sanitized by the frontend); however, just what subset of HTTP it uses is going to be somewhat unpredictable, generally undocumented, and variable between frontends. As a result, if you're implementing HTTP for a backend you have a problem; put simply, where do you stop? You probably don't want to try to implement the warts-and-all version of full HTTP, if only because some amount of the code you're writing will never get used in practice, but you don't necessarily know where it's safe to stop.
(Related to that is how do you test your implementation, especially its handling of errors and wacky requests? In a sense, real HTTP servers have a simple life here; you can simply expose them on the Internet and see what crazy things clients send you and expect to work.)
A transport protocol, even a relatively complicated one like FastCGI, gives you a definite answer to this. The protocol is much simpler than HTTP and much more bounded; you know what you have to write and what you have to support.
(There is a devil's advocate take on this that I will get to in a sidebar.)
Another pragmatic advantage is that using a separate transport protocol imposes a strong separation between the HTTP URL, host, port, and so on of the original request and the possible TCP port and endpoint of the transport protocol. Your backend software has to work very hard to confuse the two and thus to generate URLs that use the backend information instead of the real frontend request information. By contrast software behind a HTTP reverse proxy has to take very special care to use the right host, URL, and so on; in many configurations it needs to be specifically configured with the appropriate frontend URL information instead of being able to pull it from the request. This is a perennial issue with software.
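As a sketch of the problem, here is a hypothetical WSGI-style helper. Everything specific here is an assumption for illustration: the X-Forwarded-Host header is only present if the frontend has been configured to set it, and the hostnames are made up:

```python
def external_host(environ):
    """Pick the hostname to use when building URLs for a backend that
    sits behind an HTTP reverse proxy. Hypothetical helper: it assumes
    the frontend sets X-Forwarded-Host to the real (frontend) host."""
    # Without the explicit preference for the forwarded header, the
    # backend sees only its own Host: value and will happily build
    # URLs pointing at itself instead of at the frontend.
    return environ.get("HTTP_X_FORWARDED_HOST") or \
        environ.get("HTTP_HOST", "")

env = {"HTTP_HOST": "backend.internal:8000",
       "HTTP_X_FORWARDED_HOST": "www.example.com"}
print(external_host(env))   # the frontend host, not backend.internal
```

With a separate transport protocol there is no inner Host: header to be confused by in the first place, which is exactly the separation being described.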
Sidebar: the devil's advocate view of full HTTP complexity
Using a non-HTTP transport protocol doesn't necessarily mean that you avoid all of the complexity of HTTP, because after all your application is still dealing with HTTP requests. What your backend gets to avoid is some amount of parsing the HTTP request and sanitizing it; with some frontend servers you can also avoid handling things like compressing your response in a way that the client can deal with. Even under the best of circumstances this still leaves your backend (generally your application and framework) to handle the contents of HTTP headers and the HTTP request itself. This is where a great deal of the complexity of HTTP resides and it's relatively irreducible because the headers contain application level information.
You are also at the mercy of your frontend for how much sanitization is done for you, and this may not be well documented. Is there a header size limit? Does it turn multi-line headers (if a client is perverse enough to send them) into single-line entities or does it leave them with embedded newlines? And so on.
(Let's not even ask about, say, chunked input.)
If you're dealing with the contents of HTTP headers and so on anyways, I think that you can rationally ask if not having to parse the HTTP request is such a great advantage.
Why we don't want to do any NAT with IPv6
In a comment on yesterday's entry on our IPv6 DNS dilemma, Pete suggested that we duplicate our IPv4 'private address space with NAT' solution in IPv6, using RFC 4193 addresses and IPv6 NAT. While this is attractive in that it preserves our existing and well proven architecture intact, there are two reasons I think we want to avoid this (possibly three).
The first reason is simply that NAT is a pain from a technical and administrative perspective once you're working with a heterogeneous environment (one where multiple people have machines on your networks). A firewall configuration without NAT is simpler than one with it (especially once you wind up wanting multiple gateway IPs and so on), and on top of that once you have NAT you start needing some sort of traffic tracking system so you can trace externally visible traffic back to its ultimate internal source.
(There are other fun consequences in our particular environment that we would like to get away from. For example, people with externally visible machines can't use the externally visible IP address to talk to those machines once they're inside our network, because the NAT translation is done only at the border.)
The other reason is political. To wit, the university's central networking people aren't very fond of NAT. Among other things, they want to be able to directly attribute network behavior to specific end devices and possibly to block those end devices on the campus backbone. They will be much happier with us if we directly expose end devices via distinct IPv6 addresses than if we aggregate them behind IPv6 NAT gateways, and the vastly larger IPv6 address space means that we have basically no good reason to NAT things.
(The potential third reason is how well OpenBSD IPv6 NAT works. I suspect that IPv6 NAT has not exactly been a priority for the OpenBSD developers.)
Note that in general the source hiding behavior of NAT has drawbacks as well as advantages; to put it crudely, if outsiders can't tell you apart from a bad actor you'll get lumped in with them. In our environment, avoiding this (with no NAT) would be a feature.
An IPv6 dilemma for us: 'sandbox' machine DNS
In our current IPv4 network layout, we have a number of internal 'sandbox' networks for various purposes. These networks all use RFC 1918 private address space and with our split horizon DNS they have entirely internal names (and we have PTR resolution for them and so on). In a so far hypothetical IPv6 future, we would presumably give all of those sandbox machines public IPv6 addresses, because why not (they'd stay behind a firewall, of course). Except that this exposes a little question: what public DNS names do we give them? Especially, what's the result of doing a reverse lookup on one of their IPv6 addresses?
(Despite our split horizon DNS, we do have one RFC 1918 IP address that we've been forced to leak out.)
We can't expose our internal names for these machines because they're not valid global DNS names; they live in an entirely synthetic and private top level zone. We probably don't want to not have any reverse mapping for their IPv6 addresses because that's unfriendly (on various levels) and is likely to trigger various anti-abuse precautions on remote machines that they try to talk to. I think the only plausible answer is that we must expose reverse and forward mappings under our organizational zone (probably under a subzone to avoid dealing with name collision issues). One variant of this would be to expose only completely generic and autogenerated name mappings, eg 'ipv6-NNN.GROUP.etc' or the like; this would satisfy things that need reverse mappings with minimal work and no leakage of internal names.
If we expose the real names of machines through IPv6 DNS people will start using these names, for example for granting access to things. This is fine, except that of course these names only work for IPv6. This too is probably okay because most of these machines don't actually have externally visible IPv4 addresses anyways (they get NAT'd to a firewall IP when they talk to the outside world, and of course the NAT IP address is shared between many internal machines).
(There are some machines that are publically accessible through bidirectional NAT. These machines already have a public name to attach an IPv6 address to and we could make the reverse lookup work as well.)
Overall, I think the simplest solution is to have completely generic autogenerated IPv6 reverse and forward zones that are only visible in our external DNS view and then add IPv6 forward and reverse DNS for appropriate sandboxes to our existing internal zones. This does the minimal amount of work to pacify external things that want reverse DNS while preserving the existing internal names for machines even when you're using IPv6 with them.
The fly in this ointment is that I have no idea if the OpenBSD BIND can easily and efficiently autogenerate IPv6 reverse and forward names, given that there are a lot more of them than there are in typical autogenerated IPv4 names. If it's a problem, I suppose we can have a script that autogenerates the public IPv6 names for any IPv6 address we add to internal DNS.
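For what it's worth, generating the nibble-format reverse names themselves is mechanical; for example, in Python (using the 2001:db8::/32 documentation prefix as a stand-in for a real address range):

```python
import ipaddress

# 2001:db8::/32 is the IPv6 documentation prefix, standing in here
# for whatever real range a sandbox machine would actually get.
addr = ipaddress.IPv6Address("2001:db8:1::2")

# reverse_pointer produces the nibble-format name under ip6.arpa
# where the PTR record for this address would live.
print(addr.reverse_pointer)
```

Generating matching generic forward names (eg 'ipv6-NNN' style labels) from the same addresses would be equally mechanical; the open question is doing it efficiently at scale inside the DNS server rather than in a script.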
We don't believe in DHCP for (our) servers
I understand that in some places it's popular for servers to get their IP addresses on boot through DHCP (presumably or usually static IP addresses). I understand the appeal of this for places with large fleets of servers that are frequently being deployed or redeployed; to put it one way I imagine that it involves touching machines less. However it is not something that we believe in or do in our core network, for at least two reasons.
First and lesser, it would add an extra step when setting up a machine or doing things like changing its network configuration (for example to move it to 10G-T). Not only would you have to rack it and connect it up but you'd also have to find out and note down the Ethernet address and then enter it into the DHCP server. Perhaps someday we will all have smartphones and all servers that we buy will come with machine readable QR codes of their MACs, but today neither is at all close to being true (and never mind the MACs of ports on expansion cards).
(By the way, pretty much every vendor who prints MACs on their cases uses way too small a font size.)
Second and greater, it would add an extra dependency to the server boot process. In fact it would add several extra dependencies; we'd be dependent not just on the DHCP server being up and having good data, but also on the network path between the booting server and the DHCP server (here I'm thinking of switches, cables, and so on). The DHCP server would become a central point of near total failure, and we don't like having those unless we're forced into it.
(Sure, we could have two or more DHCP servers. But then we'd have to make very sure their data stayed synchronized and we'd be devoting two servers to something that we don't see much or any advantage from. And two servers with synchronized data doesn't protect against screwups in the data itself. The DHCP server data is a real single point of failure where a single mistake has great potential for serious damage.)
A smaller side issue is that we label our physical servers with what host they are, so assigning IPs (and thus hostnames) through DHCP creates new and exciting ways for a machine's label to not match the actual reality. We also have their power feeds labeled in the network interfaces of our smart PDUs, which would create similar possibilities for exciting mismatches.
The downside of expanding your storage through bigger disks
As I mentioned recently, one of the simplest ways of expanding your storage space is simply to replace your current disks with bigger disks and then tell your RAID system, file system, or volume manager to grow into the new space. Assuming that you have some form of redundancy so you can do this on the fly, it's usually the simplest and easiest approach. But it has some potential downsides.
The simplest way to put the downsides is that this capacity expansion is generally blind and not so much inflexible as static. Your storage systems (and thus your groups) get new space in proportion to however much space (or how many disks) they're currently using, and that's it. Unless you already have shared storage, you can't reassign this extra new space from one entity to another because (for example) one group with a lot of space doesn't need more but another group with only a little space used now is expanding a lot.
This is of course perfectly fine if all of your different groups or filesystems or whatever are all going to use the extra space that you've given them, or if you only have one storage system anyways (so all space flowing to it is fine). But in other situations this rigidity in assigning new space may cause you heartburn and make you wish to reshape the storage to a lesser or greater amount.
(One assumption I'm making is that you're going to do basically uniform disk replacement and thus uniform expansion; you're not going to replace only some disks or use different sizes of replacement disks. I make that assumption because mixed disks are as much madness as any other mixed hardware situation.)