Maybe we should explicitly schedule rebooting our fleet every so often

December 9, 2024

We just got through a downtime where we rebooted basically everything in our fleet, including things like firewalls. Doing such a reboot around this time of year is something of a tradition for us, since we have the university's winter break coming up and in the past some of our machines have had problems that seem to have been related to being up for 'too long'.

Generally we don't like to reboot our machines, because it's disruptive to our users. We're in an unusual sysadmin environment where people directly log in to or use many of the individual servers, so when one of them goes through a reboot, it's definitely noticed. There are behind the scenes machines that we can reboot without users particularly noticing, and some of our machines are sort of generic and could be rebooted on a rolling basis, but not our login servers, our general compute servers, our IMAP server, our heavily used general purpose web server, and so on. So our default is to not reboot our machines unless we have to.

The problem with defaults is that it's very easy to go with them. When the default is to not reboot your machines, this can result in machines that haven't been rebooted in a significant amount of time (and with it, haven't had work done on them that would require a reboot). When we were considering this December's round of precautionary, pre-break rebooting, we realized that this had happened to us. I'm not going to say just how long many of our machines had gone without a reboot, but it was rather long, long enough to feel alarming for various reasons.

We're not going to change our default of not rebooting things, but one way we could work within it is to decide in advance on a schedule for reboots. For example, we could decide that we'll reboot all of our fleet at least three times a year, since that conveniently fits into the university's teaching schedules (we're on the research side, but professors teaching courses may have various things on our machines). We probably wouldn't schedule the precise timing of this mass reboot in advance, but at least having it broadly scheduled (for example, 'we're rebooting everything around the start of May') might get us to do it reliably, rather than just drifting with our default.


Comments on this page:

The problem with not rebooting is that your machines are probably also unpatched. I actually have an alert in Zabbix when uptime exceeds 2 weeks.
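As a rough illustration of that kind of check (not the commenter's actual Zabbix configuration, which isn't shown here), a minimal Python sketch could compare a Linux machine's uptime against a limit. The function names are made up for this example, and the two-week default simply mirrors the figure mentioned above.

```python
from datetime import timedelta


def parse_proc_uptime(text: str) -> timedelta:
    """Parse the contents of Linux's /proc/uptime.

    The file holds two numbers: seconds since boot, then idle seconds.
    Only the first one matters here.
    """
    return timedelta(seconds=float(text.split()[0]))


def uptime_exceeds(uptime: timedelta,
                   limit: timedelta = timedelta(weeks=2)) -> bool:
    """Return True if the machine has been up longer than the limit."""
    return uptime > limit


def local_uptime(path: str = "/proc/uptime") -> timedelta:
    """Read this machine's uptime (Linux only; the path is an assumption
    that holds on typical Linux systems)."""
    with open(path) as f:
        return parse_proc_uptime(f.read())
```

On a Linux host, `uptime_exceeds(local_uptime())` would report whether the machine has passed the two-week mark; how that result gets turned into an alert depends on whatever monitoring system is in use.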

When I worked in a university department, my general practice (and that of my peers in other departments, as far as I could tell) was to have a planned maintenance window for disruptive upgrades, reboots, etc. within a few days after grades were due. This tended to be the time when teaching and research faculty and grad students were most likely to step away for a little bit and catch their breath.

By Miksa at 2024-12-11 09:23:13:

My small group administers a large number of Linux servers in our university. A coworker has reminisced about the days before we had scheduled maintenance cycles. Trying to get permission to reboot a server was such a hassle that they decreed that every server will have a monthly window with updates and reboots, and the users shall learn to deal with it. And they have; we get very few complaints about them. If some computation server has a long-running job that won't finish in time, the users can request skipping the maintenance that month. If your service is too important to have an outage, then it's time to build a cluster.

There's a pair of SSH jump hosts I need for my work, and their maintenance is scheduled for midday. So the first thing I do after clocking in is run a script that checks what day and week number it is and sets my jump host to the one whose maintenance is further in the future. The schedules are in a format like "Second Tuesday of every month at 10-12 AM".

The last holdout is our SAP environment, which still doesn't have a set schedule, but at least nowadays people have settled on 3-4 maintenance windows per year.
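The jump host selection described in this comment could look something like the following Python sketch. This is not the commenter's script; the host names and their schedules are hypothetical, and it only handles the "n-th weekday of every month" part of the schedule format, ignoring the time of day.

```python
import calendar
from datetime import date


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Date of the n-th given weekday (Monday=0) in a month.

    n from 1 to 4 always exists; a fifth occurrence may not, in which
    case date() raises ValueError.
    """
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return date(year, month, 1 + offset + 7 * (n - 1))


def next_window(today: date, weekday: int, n: int) -> date:
    """Next maintenance date on or after today."""
    candidate = nth_weekday(today.year, today.month, weekday, n)
    if candidate < today:
        # This month's window has passed, so look at next month.
        if today.month == 12:
            year, month = today.year + 1, 1
        else:
            year, month = today.year, today.month + 1
        candidate = nth_weekday(year, month, weekday, n)
    return candidate


# Hypothetical hosts and schedules, as (weekday, n-th occurrence).
SCHEDULES = {
    "jump1.example.org": (calendar.TUESDAY, 2),   # second Tuesday
    "jump2.example.org": (calendar.THURSDAY, 4),  # fourth Thursday
}


def pick_jump_host(today: date, schedules=SCHEDULES) -> str:
    """Pick the host whose next maintenance is furthest in the future."""
    return max(schedules, key=lambda host: next_window(today, *schedules[host]))
```

Calling `pick_jump_host(date.today())` at the start of the day returns the host to use. A window that falls on today counts as upcoming, so the other host is chosen on a maintenance day.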

By Todd at 2024-12-11 14:18:52:

Yeah... it might be time to set a schedule for reboots. Better to find problems on your own schedule than to wait for an unplanned reboot to reveal them!

By Ricardo Bánffy at 2024-12-13 13:32:52:

One nice approach is to reboot if there is no one logged into the machine. As soon as the uptime crosses a threshold, a cron job can look for interactive users and, if there are none, reboot the machine.

I might just do that in my home lab - I have terminal sessions open to a couple machines all the time, but I log off when I don't need them. This would save me the hassle of finding out when it's time to reboot them (usually during an update).
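A minimal sketch of the decision such a cron job might make, written in Python, is below. The 30-day threshold is an assumption (the comment doesn't name one), the function names are invented for this example, and the sketch only reports its decision rather than rebooting anything. It treats any line of `who` output as a logged-in user, so sessions that `who` doesn't report would not be noticed.

```python
import subprocess

SECONDS_PER_DAY = 86400


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of Linux's /proc/uptime to days of uptime."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def logged_in_users(who_output: str) -> list[str]:
    """Unique user names from `who` output (first field of each line)."""
    names = {line.split()[0] for line in who_output.splitlines() if line.strip()}
    return sorted(names)


def should_reboot(days_up: float, who_output: str,
                  threshold_days: float = 30.0) -> bool:
    """Reboot only if uptime has crossed the threshold and nobody is logged in.

    threshold_days is an assumed value; pick whatever suits the machine.
    """
    return days_up >= threshold_days and not logged_in_users(who_output)


def check_this_machine() -> bool:
    """Gather real data on a Linux host and return the decision."""
    with open("/proc/uptime") as f:
        days_up = uptime_days(f.read())
    who = subprocess.run(["who"], capture_output=True, text=True, check=True)
    return should_reboot(days_up, who.stdout)
```

A cron-driven wrapper would call `check_this_machine()` and, if it returns True, trigger the reboot with whatever command the system uses. Keeping the decision in a pure function like `should_reboot()` makes it easy to test without touching a real machine.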

Written on 09 December 2024.


Last modified: Mon Dec 9 23:08:53 2024