Avoiding reboots should not become a fetish

March 22, 2014

Unix is designed so that you shouldn't normally need to reboot it to fix problems, and in most environments it's considered good practice to stick with this and not reboot Unix machines casually, or even very much at all. People have rightly mocked the approach in other systems of rebooting as a routine troubleshooting step (often an early one, sometimes the first one). Unfortunately it's quite possible, and in fact not uncommon, to take this attitude too far and make not rebooting into a fetish. The symptoms of this fetish are fairly straightforward; people afflicted by it would rather do almost anything than reboot a machine, no matter how time-consuming, obscure, or difficult the alternative is. They will confidently assert that rebooting is never the right answer, or at best is a last resort to be used only after you've exhausted all other options.

Reality is a bit different. In reality, sometimes rebooting is the right answer even if it is not, strictly speaking, necessary (by which I mean 'essential'). In pragmatic system administration, rebooting can be easier, more reliable, or simply more certain in the face of various forms of uncertainty. Ultimately the 'don't reboot' fetish has confused a means with an end.

The real goal is avoiding user and service disruption, or at least minimizing it. Not rebooting machines is a means to this end, since rebooting disrupts everything for a while. Conversely, sometimes rebooting actually is the best means to this end because it's the approach that will result in the shortest disruption. For one example, if your system is swapping itself to death due to temporary excessive memory usage, you could wait it out (or play the slow game of 'hunt the memory hog when the system mostly isn't responding') or you could reboot. It's highly likely that rebooting will get your machine back into service the fastest, sometimes by hours.

There are many factors that play into your answer in any particular situation: how long a particular approach will take to restore the system to service, how much more disruptive it will be than the current or likely future situation, when the good and bad times for disruptions are, and whether there are additional issues like gathering information for further troubleshooting. There is no single universal right (or mostly right) answer. Like much of system administration, it's situational.
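
To make the 'means versus end' point concrete, here is a minimal sketch of weighing options by expected disruption rather than by whether they involve a reboot. Everything in it is an illustrative assumption and not something from a real tool: the Option fields, the idea of reducing disruption to 'minutes times fraction of users affected', the assumption that a failed attempt falls back to a full outage of fallback_minutes, and all of the numbers. It also ignores factors mentioned above, such as good and bad times for disruption and the value of preserving state for troubleshooting.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of handling an incident. All fields are rough estimates."""
    name: str
    minutes_to_restore: float   # estimated time until service is back
    disruption_fraction: float  # share of users/services affected meanwhile, 0..1
    success_chance: float       # estimated probability that the approach works


def expected_disruption(opt: Option, fallback_minutes: float) -> float:
    """Rough expected disruption, in minutes of 'everything is down'.

    Assumption: if the approach fails, we have already paid its cost and
    then fall back to something (typically a reboot) that costs
    fallback_minutes of full outage.
    """
    cost = opt.minutes_to_restore * opt.disruption_fraction
    return cost + (1.0 - opt.success_chance) * fallback_minutes


def least_disruptive(options, fallback_minutes):
    """Pick the option with the lowest expected disruption."""
    return min(options, key=lambda o: expected_disruption(o, fallback_minutes))


# Hypothetical numbers for a machine that is swapping itself to death.
swapping = [
    Option("wait it out", minutes_to_restore=120, disruption_fraction=0.9,
           success_chance=0.7),
    Option("reboot", minutes_to_restore=10, disruption_fraction=1.0,
           success_chance=0.99),
]

# Hypothetical numbers for a single misbehaving daemon.
one_daemon = [
    Option("restart daemon", minutes_to_restore=1, disruption_fraction=0.1,
           success_chance=0.95),
    Option("reboot", minutes_to_restore=10, disruption_fraction=1.0,
           success_chance=0.99),
]

print(least_disruptive(swapping, fallback_minutes=10).name)
print(least_disruptive(one_daemon, fallback_minutes=10).name)
```

With these made-up estimates the reboot wins in the first case and loses in the second, which is the whole point: the answer comes from the estimates for your situation, not from a rule about rebooting.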

(In fact sometimes rebooting servers randomly is the right approach. But that's not a common environment, or at least not what I think of as a common environment.)

PS: In the spirit of honesty I must admit that this entry was sparked by my feelings about some reddit reactions to a recent entry. Probably I should have heeded the classic xkcd lesson.

Sidebar: rebooting versus going to single-user mode

As a side note, to say that rebooting a server is terrible and that you should avoid it by bringing the server down to single-user mode and then returning it to multi-user mode is to miss the forest for the trees. Going to single-user mode is almost always just as disruptive as rebooting a server, since you terminate all user processes, bring down all services, stop network routing, and so on.

It's also probably significantly more reliable to reboot a server than to bring it down to single-user mode and then back up to multi-user mode. The code paths for bringing services up in a just-booted environment are tested all the time, while the code paths for bringing services down and back up in a transition from multi-user mode to single-user mode and back again are tested very, very rarely. Are you absolutely confident that everything cleaned up after itself and fully reset all state when going into single-user mode? I'm not.

(If you are confident, I hope that you've tested this extensively and carefully in your particular environment. I certainly don't think that your test results can be generalized.)

Written on 22 March 2014.
