A clever way of killing groups of processes

September 23, 2017

While reading parts of the systemd source code that handle late stage shutdown, I ran across an oddity in the code that's used to kill all remaining processes. A simplified version of the code looks like this:

void broadcast_signal(int sig, [...]) {
   [...]
   kill(-1, SIGSTOP);

   killall(sig, pids, send_sighup);

   kill(-1, SIGCONT);
   [...]
}

(I've removed error checking and some other things; you can see the original here.)

This is called to send signals like SIGTERM and SIGKILL to everything. At first the use of SIGSTOP and SIGCONT puzzled me, and I wondered if there was some special behavior in Linux if you SIGTERM'd a SIGSTOP'd process. Then the penny dropped; by SIGSTOPing processes first, we're avoiding any thundering herd problems when processes start dying.

Even if you use kill(-1, <signal>), the kernel doesn't necessarily guarantee that all processes will receive the signal at once before any of them are scheduled. So imagine you have a shell pipeline that's remained intact all the way into late-stage shutdown, and all of the processes involved in it are blocked:

proc1 | proc2 | proc3 | proc4 | proc5

It's perfectly valid for the kernel to deliver a SIGTERM to proc1, immediately kill the process because it has no signal handler, close proc1's standard output pipe as part of process termination, and then wake up proc2 because now its standard input has hit end-of-file, even though either you or the kernel will very soon send proc2 its own SIGTERM signal that will cause it to die in turn. This and similar cases, such as a parent waiting for children to exit, can easily lead to highly unproductive system thrashing as processes are woken up unnecessarily. And if a process has a SIGTERM signal handler, the kernel will of course schedule it to wake up and may start it running immediately, especially on a multi-core system.
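The end-of-file wakeup is easy to reproduce in isolation. The following is a minimal sketch, not anything from systemd: `read_after_writer_killed` is a made-up name, the child stands in for proc1 and the parent for proc2, and error handling is reduced to assertions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns what read() reports on a pipe once the only
 * writer has been killed by SIGTERM (0 means end-of-file). */
ssize_t read_after_writer_killed(void)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);

    pid_t writer = fork();
    assert(writer >= 0);
    if (writer == 0) {
        /* Child plays proc1: it holds the write end and then blocks. */
        signal(SIGTERM, SIG_DFL);
        close(fds[0]);
        /* One byte tells the parent that setup is finished. */
        if (write(fds[1], "x", 1) != 1)
            _exit(1);
        for (;;)
            pause();
    }

    /* Parent plays proc2: keep only the read end, so the child is the
     * sole writer on the pipe. */
    close(fds[1]);

    char buf[16];
    ssize_t n = read(fds[0], buf, 1);   /* wait for the child's ready byte */
    assert(n == 1);

    kill(writer, SIGTERM);

    /* This read blocks until the kernel has closed the dead writer's end
     * of the pipe; that close is what wakes the reader up. */
    n = read(fds[0], buf, sizeof buf);

    close(fds[0]);
    waitpid(writer, NULL, 0);
    return n;
}
```

The parent never receives a signal here. It is woken purely as a side effect of the child's death, which is the extra scheduling the pipeline example describes.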

Sending everyone a SIGSTOP before the real signal completely avoids this. With all processes suspended, all of them will get your signal before any of them can wake up from other causes. If they're going to die from the signal, they'll die on the spot; if they're not going to die (because you're starting with SIGTERM or SIGHUP and they block or handle it), they'll only get woken up at the end, after most of the dust has settled. It's a great solution to a subtle issue.
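Here is a rough sketch of the same stop, signal, continue sequence that is safe to run as an ordinary user. It is an illustration under assumptions rather than systemd's code: it targets one process group with kill(-pgid, ...) instead of everything with kill(-1, ...), and both `stop_signal_continue` and `demo_group_kill` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Freeze a process group, send `sig` to every member, then thaw it. */
int stop_signal_continue(pid_t pgid, int sig)
{
    if (kill(-pgid, SIGSTOP) < 0)
        return -1;
    int rc = kill(-pgid, sig);
    /* Always thaw the group, even if sending `sig` failed. */
    if (kill(-pgid, SIGCONT) < 0)
        return -1;
    return rc;
}

/* Hypothetical demo: fork `n` idle children into one process group, run
 * the sequence with SIGTERM, and count how many died from SIGTERM. */
int demo_group_kill(int n)
{
    pid_t pids[16];
    pid_t pgid = 0;
    assert(n > 0 && n <= 16);

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        assert(pid >= 0);
        if (pid == 0) {
            /* For the first child pgid is 0, meaning "use my own pid". */
            setpgid(0, pgid);
            for (;;)
                pause();
        }
        if (pgid == 0)
            pgid = pid;
        /* Set the group from the parent too, so the child is certain to
         * be in the group before any signal is sent. */
        setpgid(pid, pgid);
        pids[i] = pid;
    }

    int rc = stop_signal_continue(pgid, SIGTERM);
    assert(rc == 0);

    int killed = 0;
    for (int i = 0; i < n; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
            killed++;
    }
    return killed;
}
```

None of the children install a SIGTERM handler, so all of them should be reported as killed by SIGTERM; a child that handled or blocked it would only get to react after the SIGCONT.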

(If you're sending SIGKILL to everyone, most or all of them will never wake up; they'll all be terminated unless something terrible has gone wrong. This means this SIGSTOP trick avoids ever having any of the processes run; you freeze them all and then they die quietly. This is exactly what you want to happen at the end of system shutdown.)


Comments on this page:

POSIX specifies (http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html):

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Therefore what you describe isn't necessary. In your chain example the processes are each guaranteed to receive the TERM signal (unless they happen to have another signal pending) immediately. AFAIK Linux conforms to this. I suspect the author of the code you highlight simply didn't understand this semantic.

From 193.219.181.253 at 2017-09-23 11:38:03:

   Therefore what you describe isn't necessary. In your chain example the processes are each guaranteed to receive the TERM signal (unless they happen to have another signal pending) immediately. AFAIK Linux conforms to this. I suspect the author of the code you highlight simply didn't understand this semantic.

There's a more practical reason for not simply using kill(-1, sig). Part of the code that was omitted from the short example is that some processes might be intentionally excluded by the killall loop (e.g. network storage daemons with @ in argv[0][0]) – they were started from initramfs, they intend to stick around even when everything is being SIGKILLed, and will go down with the ship when the system powers off. (Or, alternatively, they'll be cleaned up after systemd-shutdown pivots back to the initramfs.)
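The exclusion test described here could look roughly like the sketch below. It is only an assumption about the shape of such a check, not systemd's actual code: `cmdline_is_excluded` and `pid_is_excluded` are hypothetical names, it relies on Linux's /proc/<pid>/cmdline, and it does nothing special for kernel threads (whose command line is empty).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>

/* Pure check: does a raw /proc/<pid>/cmdline buffer start with '@',
 * i.e. is argv[0][0] == '@'? */
int cmdline_is_excluded(const char *cmdline, size_t len)
{
    return len > 0 && cmdline[0] == '@';
}

/* Returns 1 if the process asks to be left alone, 0 if it may be
 * signalled, and -1 if its command line could not be read. */
int pid_is_excluded(pid_t pid)
{
    char path[64];
    char buf[256];

    snprintf(path, sizeof path, "/proc/%ld/cmdline", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    /* Only the first byte matters, but reading a chunk is harmless. */
    size_t len = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return cmdline_is_excluded(buf, len);
}
```

A kill loop would call something like `pid_is_excluded` for each candidate pid and skip those that return 1, which is exactly what a blanket kill(-1, sig) cannot do.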

By Davin at 2017-09-23 14:02:26:
   There's a more practical reason for not simply using kill(-1, sig)

Ah right, I missed the subtlety that it's killing a selected group of pids rather than all of them. If you loop through a bunch of pids and SIGTERM them individually then, yes, you potentially have some of them wake up and process, say, a SIGPIPE which was ultimately caused by the death of another process in the set. This isn't likely to result in any sort of thundering herd issue though: simply sending the SIGTERM to one process isn't going to cause other processes to be scheduled immediately, generally speaking, even if they end up receiving SIGPIPE; it's more likely that the SIGTERM signals all get queued successfully before any of the terminated processes get scheduled at all (well, give or take).

So, I would think it's more likely that the STOP/CONT pair is designed to create a stable process tree which can then be walked to build up a list of processes which actually need to be killed. By STOPping all other processes you prevent them from forking or, worse, dying and having their process IDs reused.
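Walking the process tree on Linux usually means scanning /proc for numeric directory names. Below is a minimal sketch of that enumeration step only, assuming a mounted /proc; `is_pid_name` and `collect_pids` are hypothetical names, and the freezing itself is left out so the code can be run safely.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns 1 if `name` is non-empty and consists only of digits, which is
 * how process directories in /proc are named. */
int is_pid_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name != '\0'; name++)
        if (!isdigit((unsigned char)*name))
            return 0;
    return 1;
}

/* Collects up to `cap` pids from /proc into `out`. Returns the number
 * collected, or -1 if /proc could not be opened. */
int collect_pids(pid_t *out, int cap)
{
    DIR *d = opendir("/proc");
    if (d == NULL)
        return -1;

    int n = 0;
    struct dirent *de;
    while (n < cap && (de = readdir(d)) != NULL) {
        if (is_pid_name(de->d_name))
            out[n++] = (pid_t)strtol(de->d_name, NULL, 10);
    }
    closedir(d);
    return n;
}
```

Without the preceding SIGSTOP, any pid in the returned list may have exited, and even been reused, by the time it is signalled. That is the race the freeze is meant to close.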

By cks at 2017-09-23 15:31:37:

There's another problem with what POSIX specifies here. I'll emphasise a bit:

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread [...]

The sending process is the process calling kill(). What POSIX requires here is that if you send yourself a signal, and no other thread can handle the signal (including because you're not using threading and there are no other threads), the signal must be delivered to you and handled before kill() returns. In other words, if you signal yourself it is a synchronous operation, not an asynchronous one; you call kill(), your signal handler runs, and then kill() returns and your code carries on.

(The rationale section of the POSIX page discusses this and also why some of the odd language here is necessary.)

Since this is about sending a signal to yourself, it puts no requirement on implementations to make sending signals to other processes into a synchronous operation. As far as I can see, nothing in the rest of the description does either.
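The self-signal case is simple to check. This is a small sketch assuming a single-threaded process with SIGUSR1 unblocked; the handler and function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran = 0;

static void on_usr1(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/* Sends SIGUSR1 to the calling process and returns the value of the
 * handler's flag as seen immediately after kill() returns. */
int self_signal_flag_after_kill(void)
{
    struct sigaction sa, old_sa;
    sigset_t unblock, old_mask;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old_sa);

    /* Make sure the signal is not blocked in this thread. */
    sigemptyset(&unblock);
    sigaddset(&unblock, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &unblock, &old_mask);

    handler_ran = 0;
    kill(getpid(), SIGUSR1);
    /* Under the POSIX wording quoted above, the handler has already run
     * by this point (single thread, signal unblocked). */
    int seen = handler_ran;

    /* Restore the previous mask and disposition. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return seen;
}
```

Nothing comparable can be written for signalling a different process: there, kill() returning says nothing about whether the target has run its handler or died yet.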

By cks at 2017-09-23 15:37:07:

Also, Davin is right about the usefulness of STOP'ing all processes to create a stable process tree; that's something I overlooked completely. In my case everything that made it to late stage shutdown was frozen and not doing anything, but you can certainly have processes that are still active and fork()ing and so on. Freezing them all in place so you can catch them all is obviously necessary for reliability.

I don't know what the original intention of the systemd people was about this code, so maybe they just wanted the freeze-in-place advantage and got thundering herd avoidance as a side benefit. The whole thing makes an interesting commentary on what you see when you see code in isolation.

From 193.219.181.253 at 2017-09-24 06:08:16:

   So, I would think it's more likely that the STOP/CONT pair are designed to create a stable process tree which can then be walked to build up a list of processes which actually need to be killed. By STOPping all other processes you prevent them from forking or worse, dying and the process ID being re-used.

Ah, I'd completely forgotten about that. Yes, I think that was one of the main goals.

---

When the Linux Plumber's Wishlist v3 was published in 2012, it specifically included a mention of race-free killing of a whole cgroup so that `systemctl stop` or `systemctl kill` would be able to deal with misbehaving daemons.

Eventually the kernel did implement a "pids" cgroup controller which can be used for this purpose, but as far as I can see, so far systemd only uses it as a runtime safety net (TasksMax=), not as a stop/shutdown helper...

Written on 23 September 2017.

Last modified: Sat Sep 23 02:42:54 2017