The importance of killing processes in the right order

December 30, 2007

In the old days, we had a mail system with two problems: receiving SMTP mail took a couple of processes per connection, and the SMTP server had no timeout. The result was that our central mail server would slowly accumulate more and more idle SMTP session processes, each waiting on a zombie PC that had just yanked the connection away, all of them chewing up memory, swap space, and so on; at the height of things, we might have more than a thousand idle SMTP connections.

(Specifically, there was a process per connection that handled the SMTP conversation, and then a separate, fairly heavyweight program to verify addresses; the SMTP server process started the address verification router process when necessary and communicated with it through pipes. Sometimes the router process spawned its own children, for extra fun.)

Once upon a time, I took it upon myself to clean up this situation. This being a Solaris machine, I did:

# kill -9 `pgrep smtpserver`

The machine promptly exploded; we had to force boot it from the serial console to recover it. What had happened was this:

In the idle state, the SMTP server processes were waiting on network input and the router processes were waiting on input from the pipes connected to the SMTP server processes, and everyone was swapped out. When I killed all of the SMTP server processes, all of those pipes suddenly saw end of file, so the kernel woke up all of the router processes and immediately started trying to swap them all back into memory in order to run them. Since a thousand-odd router processes did not even remotely fit into memory, the machine immediately started thrashing itself to death.
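The wakeup described above is easy to reproduce on a small scale. The following is only a sketch, with a named pipe standing in for the pipes between the SMTP server and the router; the temporary directory, the file names, and the use of sleep and cat as stand-ins are all assumptions made for illustration, not anything from the original mail system.

```shell
#!/bin/sh
# Sketch: a reader blocked on a pipe wakes up when its writer is killed.
# 'sleep' plays the idle SMTP server, 'cat' plays the idle router.
dir=$(mktemp -d)
mkfifo "$dir/pipe"

# Reader: blocks in read() until every writer has closed the pipe.
( cat "$dir/pipe" >/dev/null
  echo "reader woke up: end of file on pipe" >"$dir/log" ) &
reader=$!

# Writer: holds the write end open and then does nothing at all.
sleep 30 >"$dir/pipe" &
writer=$!

sleep 1              # give both sides time to open the pipe
kill -9 "$writer"    # the only writer dies, so the pipe sees EOF
wait "$writer" 2>/dev/null
wait "$reader"       # the reader has to run before it can finish

result=$(cat "$dir/log")
echo "$result"
rm -rf "$dir"
```

The reader never receives a signal of its own; it becomes runnable purely because the other end of its pipe went away. Multiply that by a thousand swapped-out readers and you get the thrashing described above.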

This makes a great illustration of the need to kill processes in the right order when recovering an overloaded system. You need to kill processes in the order that will produce as little system activity as possible; as this example shows, the last thing you want to do is kill one bunch of processes only to cause this to wake up another bunch of previously idle processes.

(Since kill does not kill all the processes on the command line at once, 'kill -9 `pgrep smtpserver router`' is not an entirely safe approach; you are betting that kill will get everything before the kernel interrupts it to start paging router processes back into memory.)
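One way to write the ordering down is to kill each group in its own step, readers before writers. This is a minimal sketch rather than what we actually ran at the time: kill_in_order is a hypothetical helper, and it assumes the process names really are 'router' and 'smtpserver' and that pgrep's -x (exact name match) option is available.

```shell
#!/bin/sh
# Hypothetical helper: kill groups of processes by exact name, one group
# at a time, in the order given on the command line.
kill_in_order() {
    for name in "$@"; do
        # -x matches the whole process name, not just a substring.
        pids=$(pgrep -x "$name")
        if [ -n "$pids" ]; then
            # $pids is deliberately unquoted so each PID is its own argument.
            kill -9 $pids
        fi
    done
    return 0
}

# Idle readers (the routers) first, then the processes they were waiting
# on (the SMTP servers). Left commented out so that running this file
# does not kill anything by accident:
# kill_in_order router smtpserver
```

By the time any SMTP server process dies and closes its pipe, every router that existed a moment earlier has already been sent SIGKILL. The sketch does nothing about router children or about new connections arriving in the middle, so on a live server you would probably want to stop accepting mail first.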

Written on 30 December 2007.

Last modified: Sun Dec 30 23:34:47 2007