Finally, a good reason to periodically reboot servers

August 20, 2006

Recently, we had an interesting fire drill that wound up being the first decent argument I've seen for periodically rebooting good servers.

We have a number of very important servers, the kind of important servers that are active all the time and that have their downtimes carefully scheduled well in advance. Recently, we (and by this I mean 'a co-worker') had to patch the OS on a couple of them in the pack. This went fine.

As part of patching the OS, you have to reboot the machines. This did not go fine; both servers refused to come up, puking up obscure error messages. Once pored over and decoded (partly by the vendor's hardware people), the error messages on both machines boiled down to more or less 'the configuration NVRAM is corrupt'.

(The configuration NVRAM had not been touched by the OS patching process.)

This was, naturally, a big problem. A disruptive problem. Emergency bandaids were slapped into place, things were postponed, and hardware maintenance was summoned (and duly fixed the problem).

Of course, the only time anything looks at the configuration NVRAM (and cares that it's corrupted) is when the system is booted. Since these systems are almost never rebooted, we have very little idea how long ago the NVRAM got zapped; it could have been months. Since both systems failed, we're also somewhat nervous about the state of the rest of the pack, which have more or less identical hardware and haven't been rebooted recently. Do they have corrupt configuration NVRAM too?

(Test reboots are now being scheduled.)

Thus, the first decent argument for periodic precautionary reboots: there are bits of hardware that only get exercised when the machine reboots (and that vendors don't expose for testing). If something has gone wrong with one of them, it is better to find out during a scheduled time than as an unpleasant surprise.

This does have an important consequence: because tests can fail, you had better have a plan for what to do if the server won't come up after its precautionary reboot. (For the important machines I'm responsible for, the answer is 'fail over to the backup'; naturally, one should never reboot the primary and the backup at the same time.)

As a corollary, it is probably better to schedule precautionary reboots for somewhat before the start of the workday on a weekday morning, so that if something goes wrong all of your vendor's people will soon be around.
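
(For illustration, here is a minimal sketch of how one might lay out such a schedule. Everything in it is an assumption of mine rather than something we actually run: the function names, the hostnames, the 07:00 start hour, the one-machine-per-morning rule, and the choice to reboot backups before primaries. It hands each machine its own weekday morning slot, so a primary and its backup are never down for a test reboot at the same time.)

```python
from datetime import date, datetime, time, timedelta


def weekday_mornings(start, hour=7):
    """Yield a datetime at `hour`:00 for each Monday-Friday on or after `start`."""
    day = start
    while True:
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            yield datetime.combine(day, time(hour))
        day += timedelta(days=1)


def schedule_reboots(pairs, start, hour=7):
    """Assign one precautionary reboot per weekday morning.

    `pairs` is a list of (primary, backup) hostnames (hypothetical names).
    Each machine gets its own slot, so failover partners never share one.
    Backups go first (an assumed policy): a backup that fails to boot is
    discovered while its primary is still up and serving.
    """
    slots = weekday_mornings(start, hour)
    plan = []
    for _primary, backup in pairs:
        plan.append((next(slots), backup))
    for primary, _backup in pairs:
        plan.append((next(slots), primary))
    return plan


if __name__ == "__main__":
    pack = [("db1", "db1-standby"), ("mail1", "mail1-standby")]
    for when, host in schedule_reboots(pack, date(2006, 8, 21)):
        print(when.strftime("%a %Y-%m-%d %H:%M"), host)
```

One machine per morning is deliberately conservative; with a large pack it stretches the test reboots over weeks, which may or may not be acceptable if you are nervous about the rest of the hardware.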

(Naturally, we did the OS patching process in the evening, to have a margin of error for software problems.)

(My apologies to my co-workers if I've mangled the story in my retelling of it.)

Written on 20 August 2006.

Last modified: Sun Aug 20 23:47:14 2006