Wandering Thoughts archives

2014-12-16

Does having a separate daemon manager help system resilience?

One of the reasons usually put forward for having a separate daemon manager process (instead of having PID 1 do this work) is that doing so increases overall system resilience. As the theory goes, PID 1 can be made minimal and extremely unlikely to crash (unlike a more complex PID 1), while if the more complicated daemon manager does crash it can be restarted.

Well, maybe. The problem is the question of how well you can actually take over from a crashed daemon manager. Usually this won't be an orderly takeover and you can't necessarily trust anything in any auxiliary database that the daemon manager has left behind (since it could well have been corrupted before or during the crash). You need to have the new manager process step in and somehow figure out what was (and is) running and what isn't, then synchronize the state of the system back to what it's supposed to be, then pick up monitoring everything.
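As a rough sketch of the 'synchronize' step only, here is what a replacement manager might compute once it has somehow obtained a set of services that appear to be running. The function name `plan_takeover` and the service names are made up for illustration. The sketch assumes the `observed` set is trustworthy, which is exactly the hard part.

```python
def plan_takeover(desired, observed):
    """Work out what a freshly started manager would need to do.

    desired:  names of services that are supposed to be running
    observed: names of services that appear to be running right now
    """
    desired, observed = set(desired), set(observed)
    return {
        "start": sorted(desired - observed),    # should be running but is not
        "stop": sorted(observed - desired),     # is running but should not be
        "monitor": sorted(desired & observed),  # already fine; just resume watching
    }


plan = plan_takeover(
    desired={"cron", "sshd", "syslog"},
    observed={"sshd", "syslog", "leftover-job"},
)
print(plan)
```

The set arithmetic is the easy part. Everything interesting is in how `observed` gets built without relying on state left behind by the crashed manager.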

The simple case is a passive init system. Since the init system does not explicitly track daemon state, there is no state to recover on a daemon manager restart and resynchronization can be done simply by trying to start everything that should be started (based on runlevel and so on). We can blithely assume that the 'start' action for everything will do nothing if the particular service is already started. Of course this is not very realistic, as passive init systems generally don't have daemon manager processes that can crash in the first place.
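To make the 'just start everything again' idea concrete, here is a toy model. The `Service` class is a hypothetical stand-in for an init script, and its `running` flag fakes real process state. The whole thing rests on the assumption that 'start' does nothing for an already-started service.

```python
class Service:
    """Toy stand-in for an init script; 'running' fakes real process state."""

    def __init__(self, name, running=False):
        self.name = name
        self.running = running
        self.launches = 0  # how many times we really launched something

    def start(self):
        # Idempotent start: a no-op if the service is already running.
        if self.running:
            return False
        self.running = True
        self.launches += 1
        return True


def resync(services):
    """Blindly 'start' everything; return the names that actually got launched."""
    return [svc.name for svc in services if svc.start()]


services = [
    Service("cron", running=True),
    Service("sshd"),
    Service("syslog", running=True),
]
first = resync(services)   # only the service that was down gets launched
second = resync(services)  # running it again changes nothing
print(first, second)
```

If 'start' is not really idempotent for some service, this approach launches duplicates, which is why the assumption matters so much.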

For an active daemon manager, I think that at a minimum what you need is some sort of persistent and stable identifier for groups of processes that can be introspected and monitored from an arbitrary process. The daemon manager starts processes for all services under an identifier determined from their service name; then when it crashes and you have to start a new one, the new one can introspect the identifiers for all of the groups to determine what services are (probably) running. Unfortunately there are lots of complications here, including that this doesn't capture the state of 'one-shot' services without persistent processes. Such identifiers are of course not a standard Unix facility, so no fully portable daemon manager can do this.
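Here is a minimal sketch of the recovery side of this, under some loud assumptions. The `svc:` naming scheme is invented, and the `snapshot` dictionary stands in for whatever system-specific facility lets an arbitrary process see which group each process belongs to. Nothing here is a real API.

```python
from collections import defaultdict

PREFIX = "svc:"  # hypothetical naming scheme for groups the manager creates


def group_for(service_name):
    """The identifier a manager would start a service's processes under."""
    return PREFIX + service_name


def recover_services(process_groups):
    """Rebuild 'service name -> pids' from a snapshot of pid -> group identifier.

    Processes in groups that the manager did not create are ignored.
    """
    running = defaultdict(list)
    for pid, group in process_groups.items():
        if group.startswith(PREFIX):
            running[group[len(PREFIX):]].append(pid)
    return {name: sorted(pids) for name, pids in running.items()}


# Assumed snapshot; a real one would come from the operating system.
snapshot = {101: "svc:sshd", 102: "svc:sshd", 230: "svc:cron", 400: "user:someone"}
recovered = recover_services(snapshot)
print(recovered)
```

A one-shot service that ran and exited has no processes left in its group, so it never shows up in the recovered mapping. That is the complication with one-shot services in miniature.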

It's certainly the case that a straightforward, simple daemon manager will not be able to take over from a crashed instance of itself. Being able to do real takeover requires both system-specific features and a relatively complex design and series of steps on startup, and still leaves you with uncertain or open issues. In short, having a separate daemon manager does not automatically make the system any more resilient under real circumstances. A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does.

However, I think it's fair to say that under normal circumstances a separate daemon manager process crashing (instead of PID 1 crashing) will buy you more time to schedule a system outage. If the only thing that needs the daemon manager running is starting or stopping services and you already have all normal services started up, your system may be able to run for days before you need to reboot it. If your daemon manager is more involved in system operation or is routinely required to restart services, well, you're going to have (much) less time, depending on the exact details.

DaemonManagerResilience written at 23:53:56

2014-12-14

How init wound up as Unix's daemon manager

If you think about it, it's at least a little bit odd that PID 1 wound up as the de facto daemon manager for Unix. While I believe that the role itself is part of the init system as a whole, this is not the same thing as having PID 1 do the job and in many ways you'd kind of expect it to be done in another process. As with many things about Unix, I think that this can be attributed to the historical evolution Unix has gone through.

As I see the evolution of this, things start in V7 Unix (or maybe earlier) when Research Unix grew some system daemons, things like crond. Something had to start these, so V7 had init run /etc/rc on boot as the minimal approach. Adding networking to Unix in BSD Unix increased the number of daemons to start (and was one of several changes that complicated the whole startup process a lot). Sun added even more daemons with NFS and YP and so on and either created or elaborated interdependencies among them. Finally System V came along and made everything systematic with rcN.d and so on, which was just in time for yet more daemons.

(Modern developments have extended this even further to actively monitoring and restarting daemons if you ask for that. System V init could technically do this if you wanted, but people generally didn't use inittab for this.)

At no point in this process was it obvious to anyone that Unix was going through a major sea change. It's not as if Unix went in one step from no daemons to a whole bunch of daemons; instead there was a slow but steady growth in both the number of daemons and the complexity of system startup in general, and much of this happened on relatively resource-constrained machines where extra processes were a bad idea. Had there been a single giant step, maybe people would have sat down and asked themselves if PID 1 and a pile of shell scripts were the right approach and said 'no, it should be a separate process'. But that moment never happened; instead Unix basically drifted into the current situation.

(Technically speaking you can argue that System V init actually does do daemon 'management' in another process. System V init doesn't directly start daemons; instead they're started several layers of shell scripts away from PID 1. I call it part of PID 1 because there is no separate process that really has this responsibility, unlike the situation in, e.g., Solaris SMF.)

InitDaemonManagerHistory written at 00:55:12


