Wandering Thoughts archives


Does having a separate daemon manager help system resilience?

One of the reasons usually put forward for having a separate daemon manager process (instead of having PID 1 do this work) is that doing so increases overall system resilience. As the theory goes, PID 1 can be made minimal and extremely unlikely to crash (unlike a more complex PID 1), while if the more complicated daemon manager does crash it can be restarted.

Well, maybe. The problem is the question of how well you can actually take over from a crashed daemon manager. Usually this won't be an orderly takeover and you can't necessarily trust anything in any auxiliary database that the daemon manager has left behind (since it could well have been corrupted before or during the crash). You need to have the new manager process step in and somehow figure out what was (and is) running and what isn't, then synchronize the state of the system back to what it's supposed to be, then pick up monitoring everything.

The simple case is a passive init system. Since the init system does not explicitly track daemon state, there is no state to recover on a daemon manager restart and resynchronization can be done simply by trying to start everything that should be started (based on runlevel and so on). We can blithely assume that the 'start' action for everything will do nothing if the particular service is already started. Of course this is not very realistic, as passive init systems generally don't have daemon manager processes that can crash in the first place.

For an active daemon manager, I think that at a minimum what you need is some sort of persistent and stable identifier for groups of processes that can be introspected and monitored from an arbitrary process. The daemon manager starts processes for all services under a an identifier determined from their service name; then when it crashes and you have to start a new one, the new one can introspect the identifiers for all of the groups to determine what services are (probably) running. Unfortunately there are lots of complications here, including that this doesn't capture the state of 'one-shot' services without persistent processes. This is of course not a standard Unix facility, so no fully portable daemon manager can do this.

It's certainly the case that a straightforward, simple daemon manager will not be able to take over from a crashed instance of itself. Being able to do real takeover requires both system-specific features and a relatively complex design and series of steps on startup, and still leaves you with uncertain or open issues. In short, having a separate daemon manager does not automatically make the system any more resilient under real circumstances. A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does.

However I think it's fair to say that under normal circumstances a separate daemon manager process crashing (instead of PID 1 crashing) will buy you more time to schedule a system outage. If the only thing that needs the daemon manager running is starting or stopping services and you already have all normal services started up, your system may be able to run for days before you need to reboot it. If your daemon manager is more involved in system operation or is routinely required to restart services, well, you're going to have (much) less time depending on the exact details.

unix/DaemonManagerResilience written at 23:53:56; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.