Killing (almost) all processes on Linux is not recoverable

March 20, 2014

Suppose that you have an at least semi-hung system and you're taking drastic measures to get it at least semi-alive again; for example, you might use Magic Sysrq's options to send a SIGTERM or a SIGKILL to all processes except init (Sysrq-e and Sysrq-i respectively). If you do this, it's quite possible that your system will stagger dazedly around for a bit and then seem to come back to life. Oh, sure, maybe you need to restart a few daemons, but it can easily look like you can keep going without actually rebooting the machine. You can, right?

Based on painful experience, let me answer the question simply: no.

In practice there is no even vaguely easy way to recover a modern Linux system to full functionality after you've killed almost all processes. You can get something back that looks like it's working, but what you really have is a partial zombie. You can spend quite literally months finding things in the corners that are not working; if you're lucky, they will be not working in some noisy way and diagnosing them will be obvious. It's quite possible to not be lucky.

So if you are ever in a situation like this with Magic Sysrq or the like, reboot your system after using drastic actions to wake it up even if it seems okay afterwards. Things like Sysrq-e and Sysrq-i are for temporary diagnostics (to answer questions like 'is this hang probably because of a user-level process doing bad things'), not for cures. The cure is a reboot.
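If you want to know ahead of time whether the keyboard can even trigger Sysrq-e and Sysrq-i on a machine, the kernel exposes a bitmask in /proc/sys/kernel/sysrq. What follows is a minimal sketch of decoding it, assuming the bit values given in the kernel's sysrq documentation; the function names are mine, not anything standard. As far as I know this mask only restricts the keyboard path, so root writing to /proc/sysrq-trigger is not held back by it.

```python
# Bit values as listed in the kernel's sysrq documentation for
# /proc/sys/kernel/sysrq. The function names here are illustrative only.
SYSRQ_BITS = {
    2: "console log level control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}


def decode_sysrq(value):
    """Return the names of the Sysrq function groups enabled by 'value'."""
    if value == 0:
        return []  # Sysrq is completely disabled
    if value == 1:
        return list(SYSRQ_BITS.values())  # exactly 1 means everything
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def signalling_enabled(value):
    """True if Sysrq-e and Sysrq-i (signalling processes) are allowed."""
    return value == 1 or bool(value & 64)


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/sysrq") as f:
            current = int(f.read().strip())
    except (OSError, ValueError):
        print("could not read /proc/sys/kernel/sysrq")
    else:
        print("enabled:", ", ".join(decode_sysrq(current)) or "nothing")
        print("Sysrq-e/i allowed from keyboard:", signalling_enabled(current))
```

Knowing that the signalling group is enabled doesn't change the advice, of course; it just tells you whether these particular drastic measures are available from the console.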

Another way to wind up in this state is an accidental 'kill -SIGNAL -1' as root for some signal that your init survives. As an interesting example, it appears that systemd does not die on SIGHUP, so the traditional accidental 'kill -1 -1' as root might do this on a systemd system. After something like this your system may look fine, especially after you restart some daemons, but it is not. Reboot. Really. It's simpler and much less painful over the long run, and you're going to wind up doing it sooner or later anyways.
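If you're curious how your init (or any other process) treats a particular signal, /proc/<pid>/status has SigIgn and SigCgt lines, which are hex bitmasks where bit N-1 stands for signal number N. Here is a rough sketch of decoding them. The function name is my own invention and the sample status text in the test is made up for illustration; on a real system you would feed it the contents of /proc/1/status.

```python
import signal


def signal_disposition(status_text, signum):
    """Classify a signal as 'ignored', 'caught', or 'default', based on
    the SigIgn and SigCgt masks found in /proc/<pid>/status text."""
    masks = {}
    for line in status_text.splitlines():
        key, sep, val = line.partition(":")
        if sep and key in ("SigIgn", "SigCgt"):
            masks[key] = int(val.strip(), 16)  # the masks are hexadecimal
    bit = 1 << (signum - 1)  # signal N is bit N-1 of the mask
    if masks.get("SigIgn", 0) & bit:
        return "ignored"
    if masks.get("SigCgt", 0) & bit:
        return "caught"
    return "default"


if __name__ == "__main__":
    try:
        with open("/proc/1/status") as f:
            text = f.read()
    except OSError:
        print("could not read /proc/1/status")
    else:
        print("init and SIGHUP:", signal_disposition(text, signal.SIGHUP))
```

Either 'ignored' or 'caught' means that the process will not simply die from the signal, which is the case that matters here. None of this helps you afterwards, though. It only tells you whether init itself stays up, not what state everything else is in.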

PS: as I found out in the same incident, turn up the console log level immediately when you're using Magic Sysrq (Sysrq-0 through Sysrq-9 set it).

Written on 20 March 2014.


Last modified: Thu Mar 20 00:17:49 2014