Maybe we should explicitly schedule rebooting our fleet every so often
We just got through a downtime where we rebooted basically everything in our fleet, including things like firewalls. Doing such a reboot around this time of year is something of a tradition for us, since we have the university's winter break coming up and in the past some of our machines have had problems that seem to have been related to being up for 'too long'.
Generally we don't like to reboot our machines, because it's disruptive to our users. We're in an unusual sysadmin environment where people directly log in to or use many of the individual servers, so when one of them goes through a reboot, it's definitely noticed. There are behind-the-scenes machines that we can reboot without users particularly noticing, and some of our machines are sort of generic and could be rebooted on a rolling basis, but that's not true of our login servers, our general compute servers, our IMAP server, our heavily used general-purpose web server, and so on. So our default is to not reboot our machines unless we have to.
The problem with defaults is that it's very easy to go with them. When the default is to not reboot your machines, this can result in machines that haven't been rebooted in a significant amount of time (and with it, haven't had work done on them that would require a reboot). When we were considering this December's round of precautionary, pre-break rebooting, we realized that this had happened to us. I'm not going to say just how long many of our machines had gone without a reboot, but it was rather long, long enough to feel alarming for various reasons.
We're not going to change our default of not rebooting things, but one way we could work within it is to decide in advance on a schedule for reboots. For example, we could decide that we'll reboot all of our fleet at least three times a year, since that conveniently fits into the university's teaching schedules (we're on the research side, but professors teaching courses may have various things on our machines). We probably wouldn't schedule the precise timing of this mass reboot in advance, but at least having it broadly scheduled (for example, 'we're rebooting everything around the start of May') might get us to do it reliably, rather than just drifting with our default.
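One way to keep a broad schedule like this honest is to periodically check how long each machine has been up against the longest gap the schedule allows. The following is only a minimal sketch of that idea, not something we actually run. It assumes Linux machines (where the first field of /proc/uptime is the uptime in seconds), it reads 'at least three times a year' as 'nothing should be up for more than about a third of a year', and the names MAX_UPTIME, parse_proc_uptime and overdue are made up for illustration. How you collect uptimes from the fleet is left out entirely.

```python
from datetime import timedelta

# Assumption: rebooting "at least three times a year" means no machine
# should stay up longer than roughly a third of a year.
MAX_UPTIME = timedelta(days=365 / 3)


def parse_proc_uptime(text: str) -> timedelta:
    """Parse the contents of Linux's /proc/uptime.

    The first whitespace-separated field is the uptime in seconds.
    """
    return timedelta(seconds=float(text.split()[0]))


def overdue(uptimes: dict[str, timedelta],
            limit: timedelta = MAX_UPTIME) -> list[str]:
    """Return the hosts whose uptime exceeds the limit, longest-up first."""
    late = [host for host, up in uptimes.items() if up > limit]
    return sorted(late, key=lambda host: uptimes[host], reverse=True)
```

Given a mapping of (hypothetical) host names to uptimes gathered however you like, overdue() returns the machines that have drifted past the schedule, which is at least a nudge to go plan the next mass reboot.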