The different contexts of stopping a Unix daemon or service

September 12, 2017

Most Unix init systems have a single way of stopping a daemon or a service, and on the surface this feels correct. And mostly it is, and mostly it works. However, I've recently come around to believing that this is a mistake and an over-generalization. I now believe that there are three different contexts and you may well want to stop things somewhat differently in each, depending on the daemon or service. This is especially the case if the daemon spawns multiple and somewhat independent processes as part of its operation, but it can happen in other situations as well, such as the daemon handling relatively long-running requests. To make this concrete I'm going to use the case of cron and long-running cron jobs, as well as Apache (or the web server of your choice).

The first context of stopping a daemon is a service restart, for example if the package management system is installing an updated version. Here you often don't want to abruptly stop everything the daemon is running. In the case of cron, you probably don't want a daemon restart to kill and perhaps restart all currently running cron jobs; for Apache, you probably want to let current requests complete, although this depends on what you're doing with Apache and how you have it configured.

The second context is taking down the service with no intention to restart it in the near future. You're stopping Apache for a while, or perhaps shutting down cron during a piece of delicate system maintenance, or even turning off the SSH daemon. Here you're much more likely to want running cron jobs, web requests, and even SSH logins to shut down, although you may want the init system to give them some grace time. This may actually be two contexts: one where you want a relatively graceful stop, and one where you really want an emergency shutdown with everything screeching to an immediate halt.

The third context is stopping the service during system shutdown. Here you unambiguously want everything involved with the daemon to stop, because everything on the system has to stop sooner or later. You almost always want everything associated with the daemon to stop as a group, more or less at the same time; among other reasons this keeps shutdown ordering sensible. If you need Apache to shut down before some backend service, you likely don't want lingering Apache sub-processes hanging around just because their request is taking a while to finish. Or at a minimum you don't want Apache to be considered 'down' for shutdown ordering until the last little bits die off.
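To illustrate what stopping everything 'as a group' can look like at the process level, here is a minimal Python sketch. It assumes the service was started as the leader of its own session and process group, so that one signal reaches the main process and its sub-processes together. The helper names `start_in_own_group` and `stop_group` are hypothetical; real init systems use their own tracking mechanisms (systemd uses cgroups, for instance), so treat this as an illustration rather than how any particular init does it.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start argv as the leader of a new session (and so a new process group)."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(proc, sig=signal.SIGTERM, timeout=5.0):
    """Send sig to every process in proc's group, then wait for the leader.

    Because proc is a session leader, its process group ID equals its PID.
    Returns the leader's exit status; raises subprocess.TimeoutExpired if
    the leader has not exited within `timeout` seconds.
    """
    try:
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass  # the group no longer exists; nothing left to signal
    return proc.wait(timeout=timeout)
```

With this, `stop_group(start_in_own_group(["sh", "-c", "sleep 60 & wait"]))` signals both the shell and its background `sleep`, rather than leaving the child behind the way signalling only the main process would.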

As we see here, the first and the third context can easily conflict with each other; what you want for service restart can be the complete opposite of what you want during system shutdown. And an emergency service stop might mean you want an even more abrupt halt than you do during system shutdown. In hindsight, trying to treat all of these different contexts the same is an over-generalization. The only time when they're all the same is when you have a simple single-process daemon, at which point there's only ever one version of shutting down the daemon; if the daemon process isn't running, that's it.
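One way to picture the difference between these contexts is as a per-service table of stop policies, keyed by context. The sketch below is purely illustrative: the names `StopContext`, `StopPolicy`, and `CRON_POLICIES` are hypothetical, the grace times are made-up numbers, and no existing init system is claimed to expose this interface.

```python
import signal
from dataclasses import dataclass
from enum import Enum


class StopContext(Enum):
    RESTART = "restart"                # e.g. a package upgrade
    STOP = "stop"                      # service taken down for a while
    EMERGENCY_STOP = "emergency-stop"  # everything halts immediately
    SHUTDOWN = "shutdown"              # the whole system is going down


@dataclass(frozen=True)
class StopPolicy:
    whole_group: bool             # signal all of the service's processes, or only the main one
    first_signal: signal.Signals  # what to send first
    grace_seconds: float          # wait this long before resorting to SIGKILL


# Hypothetical policies for cron; every number here is an assumption.
CRON_POLICIES = {
    # Restart only the daemon itself and leave running cron jobs alone.
    StopContext.RESTART: StopPolicy(False, signal.SIGTERM, 30.0),
    # Stop the jobs too, but give them some grace time.
    StopContext.STOP: StopPolicy(True, signal.SIGTERM, 30.0),
    # No grace at all.
    StopContext.EMERGENCY_STOP: StopPolicy(True, signal.SIGKILL, 0.0),
    # Everything stops as a group so that shutdown ordering stays sensible.
    StopContext.SHUTDOWN: StopPolicy(True, signal.SIGTERM, 10.0),
}


def policy_for(context, policies=CRON_POLICIES):
    """Look up how a service should be stopped in the given context."""
    return policies[context]
```

For a simple single-process daemon, every row of such a table would collapse to the same entry, which is the one case where a single 'stop' operation loses nothing.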

(As you might suspect, these thoughts are fallout from our Ubuntu shutdown problems.)

PS: While not all init systems are supervisory, almost all of them include some broad idea of how services are stopped as well as how they're started. System V init is an example of a passive init system that still has a distinct and well defined process for shutting down services. The one exception that I know of is original BSD, where there was no real concept of 'shutting down the system' as a process; instead reboot simply terminated all processes on the spot.


Comments on this page:

By Ewen McNeill at 2017-09-12 17:37:32:

It seems to me that your three "contexts" are fairly equivalent, in concept, to the three types of Unix signal conventions for services:

  • The first context is analogous to HUP ("restart with new config"; although here you want an actual restart, not just "reload config" that happens with some daemons)

  • The graceful stop in the second context is TERM ("please stop")

  • The third context is KILL ("you will cease running, now")

Trying to use a single command (service FOO stop) to do all three of those isn't ideal, particularly to do the last one. In theory the first is "restart" (service FOO restart), but at least some of the time "restart" just does "stop"/"start" under the hood. There may be more combinations, and really they should have their own verbs. Apache httpd has "graceful" and "restart" for the two variations of the second context, for example.

On reboot/shutdown, older Linux init systems (eg, Debian sysvinit) would send TERM signals to things, wait a couple of minutes, send KILL signals to things, and then just carry on anyway. It appears systemd's belief in the power of Dependency Based Resolution is sufficiently strong to ignore such pragmatic steps towards "progress must be made" and wait "indefinitely"; it also appears that we're only about half way through the process of discovering and listing all the dependencies required (shutting down the network before the network file systems is obviously poorly thought out...).

Ewen
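(The "send TERM, wait, send KILL" sequence described in the comment above can be sketched for a single process roughly as follows. This is a minimal Python illustration, not the actual sysvinit or systemd logic; the function name `term_then_kill` and the `grace_seconds` parameter are hypothetical.)

```python
import signal
import subprocess
import sys


def term_then_kill(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "TERM" if the process exited within the grace period and
    "KILL" if it had to be forcibly killed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "TERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()  # reap the process so it does not linger as a zombie
        return "KILL"
```

(A process that ignores SIGTERM is only guaranteed to go away once the grace period runs out, which is why the length of that period matters so much to a human waiting on it.)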

A well-expressed argument, as I've come to expect here.

`apachectl -k graceful` makes more sense than my first understanding of it. You don't strictly need support for it in init, but it could make things less confusing. It's interesting that the Debian LSB script relegated graceful-restart to a non-standard verb; it is not the default.

One observation from working on systemd: both start and stop jobs have set concepts of success. If they don't reach it within a configurable timeout, they ultimately fire the SIGKILL cannon. Restart is pretty much the same. So you get a shiny new process (or failure) within a given time, even if the process had locked up...

OTOH the default timeout is 90 seconds. By then, my human timeout would have expired, and I'd probably be looking at killing the process myself. Ideally after trying to work out what it's doing - and I wouldn't want to be bounded by an automatic SIGKILL timeout. The timeouts make most sense for system shutdown.

So I don't think I have a reason there that the other cases need to be treated the same as system shutdown. Other than avoiding proliferating cases if possible, for ease of analysis. Arguably systemd followed "New Jersey style" here, awaiting arguments (and resources) that the model absolutely needs to become more complex.

Implementing `service stop ssh` as terminating ssh sessions sounds cool to me. I'm guessing it's not a model the unported upstream ssh supports though.

In my head, cron jobs have this annoying problem that some people use them to run updates. At the same time package managers tend to recommend not killing them, i.e. when you stop cron at the same time it's starting an update. Particularly the Red Hat package manager. In systemd, `dnf` registers an inhibitor with logind that blocks system shutdown. If you make sure to call dnf through packagekitd, that's even reasonably thorough... It seems the implication is more that package managers need to be init-aware, like PK is, once you have inits that track and kill child processes. And (/or?) that rpm really ought to be crash-safe, but that was obvious anyway. So I don't have an argument here either.

Written on 12 September 2017.

Last modified: Tue Sep 12 01:12:41 2017