Does having a separate daemon manager help system resilience?

December 16, 2014

One of the reasons usually put forward for having a separate daemon manager process (instead of having PID 1 do this work) is that doing so increases overall system resilience. As the theory goes, PID 1 can be made minimal and extremely unlikely to crash (unlike a more complex PID 1), while if the more complicated daemon manager does crash it can be restarted.

Well, maybe. The problem is the question of how well you can actually take over from a crashed daemon manager. Usually this won't be an orderly takeover and you can't necessarily trust anything in any auxiliary database that the daemon manager has left behind (since it could well have been corrupted before or during the crash). You need to have the new manager process step in and somehow figure out what was (and is) running and what isn't, then synchronize the state of the system back to what it's supposed to be, then pick up monitoring everything.
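As a rough sketch of the resynchronization step, here is what the new manager has to compute once it has somehow worked out what is running. The function name and the 'start', 'stop', and 'adopt' categories are illustrative assumptions, not taken from any real daemon manager. The sketch also assumes that the set of observed services can be trusted, which is exactly the hard part.

```python
from typing import Dict, Set


def plan_takeover(wanted: Set[str], observed: Set[str]) -> Dict[str, Set[str]]:
    """Compare what should be running against what was observed running.

    'wanted' is the set of services the configuration says should be up;
    'observed' is whatever the new manager managed to discover.
    """
    return {
        # Should be running but isn't: start it.
        "start": wanted - observed,
        # Running but shouldn't be: stop it.
        "stop": observed - wanted,
        # Running and wanted: take over monitoring without restarting it.
        "adopt": wanted & observed,
    }


if __name__ == "__main__":
    plan = plan_takeover({"sshd", "cron", "nginx"}, {"sshd", "oldjob"})
    for action in ("start", "stop", "adopt"):
        print(action, sorted(plan[action]))
```

Only after acting on the 'start' and 'stop' sets and picking up the 'adopt' set can the new manager go back to plain monitoring.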

The simple case is a passive init system. Since the init system does not explicitly track daemon state, there is no state to recover on a daemon manager restart and resynchronization can be done simply by trying to start everything that should be started (based on runlevel and so on). We can blithely assume that the 'start' action for everything will do nothing if the particular service is already started. Of course this is not very realistic, as passive init systems generally don't have daemon manager processes that can crash in the first place.
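To make the 'just start everything' approach concrete, here is a toy model of it. The set of running services stands in for the real system, and `start_service` and `resync` are hypothetical names. The one property that matters is that the start action is idempotent.

```python
from typing import Iterable, List, Set


def start_service(name: str, running: Set[str]) -> bool:
    """Idempotent start action: do nothing if the service is already up.

    Returns True only if the service actually had to be started.
    """
    if name in running:
        return False
    running.add(name)
    return True


def resync(should_run: Iterable[str], running: Set[str]) -> List[str]:
    """Blindly call 'start' on everything that should be running.

    No saved state is consulted; idempotent start actions make this safe.
    """
    return [name for name in should_run if start_service(name, running)]


if __name__ == "__main__":
    running = {"sshd"}
    print(resync(["sshd", "cron"], running))  # only cron needed starting
    print(resync(["sshd", "cron"], running))  # a second pass changes nothing
```

Running the resynchronization twice is harmless, which is why there is nothing to recover here.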

For an active daemon manager, I think that at a minimum what you need is some sort of persistent and stable identifier for groups of processes that can be introspected and monitored from an arbitrary process. The daemon manager starts processes for all services under an identifier determined from their service name; then when it crashes and you have to start a new one, the new one can introspect the identifiers for all of the groups to determine what services are (probably) running. Unfortunately there are lots of complications here, including that this doesn't capture the state of 'one-shot' services without persistent processes. Such an identifier mechanism is of course not a standard Unix facility, so no fully portable daemon manager can do this.
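As one illustration of what such introspection could look like, the sketch below assumes a layout shaped like Linux cgroups: one directory per service, named after the service, each containing a `cgroup.procs` file with one PID per line. The `discover_services` function and the idea of passing in the root directory are assumptions made for the example. A real manager would have to cope with nested groups, races, and permissions.

```python
import os
from typing import Dict, List


def discover_services(root: str) -> Dict[str, List[int]]:
    """Map service name -> PIDs by reading each group's process list.

    Assumes 'root' holds one directory per service, each with a
    'cgroup.procs' file (as in Linux cgroups). Groups with no processes
    are treated as not running.
    """
    found: Dict[str, List[int]] = {}
    for entry in sorted(os.listdir(root)):
        procs_file = os.path.join(root, entry, "cgroup.procs")
        if not os.path.isfile(procs_file):
            continue  # not a process group directory
        with open(procs_file) as f:
            pids = [int(line) for line in f if line.strip()]
        if pids:
            found[entry] = pids
    return found
```

Note that a 'one-shot' service that ran and exited leaves an empty group behind, so this sketch reports it as not running. That is the complication mentioned above.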

It's certainly the case that a straightforward, simple daemon manager will not be able to take over from a crashed instance of itself. Being able to do real takeover requires both system-specific features and a relatively complex design and series of steps on startup, and still leaves you with uncertain or open issues. In short, having a separate daemon manager does not automatically make the system any more resilient under real circumstances. A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does.

However, I think it's fair to say that under normal circumstances a separate daemon manager process crashing (instead of PID 1 crashing) will buy you more time to schedule a system outage. If the daemon manager is only needed for starting or stopping services and all of your normal services are already started, your system may be able to run for days before you need to reboot it. If your daemon manager is more involved in system operation or is routinely required to restart services, well, you're going to have (much) less time, depending on the exact details.

Comments on this page:

I have been working with a few daemon managers in the last months and I have found myself using monit a lot. One of its main features is, it does not keep track of the state of the daemons it manages, it just probes the system for that state when it needs it (using the status command of an init script, a PID file or I think more or less anything else). In fact I have been toying with the idea of writing a minimalistic PID 1 and letting monit handle the daemons, as an exercise. ;-)

"A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does."

A crashing daemon manager should result in an orderly reboot, with enough time before KILLing all processes, so that daemons can shut down properly. A crashing PID 1 results in a kernel panic, instantly halting all processes, probably not even syncing the disks.

(We certainly agree that both should never happen, what counts now is to minimize data loss.)

I didn’t think anyone was under any illusions about whether a crashed dæmon manager is bad news; this almost seems like taking down a straw man, one that also presumes a fallaciously narrow view of resilience as the ability to survive a problem unaffected. The point is to try to contain failure modes instead of allowing them to escalate – to fail gracefully –, as catastrophic failures are almost always the result of failure cascades. In that sense, an uncrashable PID 1 quite tautologically makes a system more resilient than a crashable one.

By cks at 2014-12-17 12:58:47:

I think I've seen a certain number of people who felt that a separate daemon manager was automatically more resilient than having it in PID 1. Perhaps I just hang out in (or read) the wrong sorts of places.

In general, my long-standing view of resilience is that you must take a total system view of what's going on. If the system has to be rebooted essentially immediately after a process crashes (and you can do very little in the meantime, for example because you can't become root), you care relatively little about exactly which process crashed. A crashing daemon manager can have a better failure mode than this (and you'd sort of hope so), but not necessarily. As a result I'm quite wary of claims that one design is intrinsically better than another, because the devil is always in the implementation details.

(Systemd has provided an existence proof that PID 1 taking a segfault doesn't necessarily lead to the system rebooting on the spot, even if life is not so great afterwards. For that matter, 'reboot on PID 1 exiting' is kernel behavior that can and perhaps should be changed.)

Perhaps I am out of touch with just how bad the state of the debate really is. (I have noticed that a lot of systemd criticism amounts to some form of “it offends my beliefs in The One Unix Way”, with incantations of dogma invoked as a cargo cult of technical rationale.)

I agree that resilience can only be understood holistically. But unless PID 1 sheds all of its special kernel status, I fail to imagine any way in which non-robust PID 1 can be a positive trade-off. (Which is not the same as saying there is no way. If you can give a counterexample, please do, so I can change my mind.)

As best I can think, if you get everything else really right, then this bit being suboptimal will be of little consequence. (Or else if you get everything else very wrong, then this bit being suboptimal will likewise be of little consequence…) It’s one single design decision: clearly it cannot be a panacea.

It just seems silly to flub it for no apparent appreciable benefit.

Isn’t it?

Written on 16 December 2014.

Last modified: Tue Dec 16 23:53:56 2014