We rebooted all of our servers remotely (more or less) and it all worked

September 26, 2020

Even under normal circumstances, we don't routinely reboot our Linux servers. Reboots are disruptive to our users (especially to the people who are logged in to the servers that reboot), and local policies require us to schedule an after-hours downtime for large-scale, user-visible things like this, which is disruptive to our lives. We do reboot them periodically, either for significant enough Ubuntu kernel security issues or just because we want to get them back on up-to-date kernels. However, all of this is under normal circumstances, when we are actually physically in the office to deal with machines that fail to go down or come up cleanly.

The current situation is not normal. We've been out of the office since early March, and even in March it had been rather a while since our systems were last rebooted (through the magic of our Prometheus metrics and dashboards, I can tell you that at the end of March 12th, the last day I was in the office, most of our systems had been up for about 259 days). Since we were out of the office, we didn't even think about rebooting for a very long time, and by early September many of our machines had been up for over 400 days without a reboot. Things reached a critical point and we (by which I mean my co-workers, as I was on vacation) decided that we should take the risk of rebooting everything, while taking some steps to mitigate the risks for very important machines.

(Said steps being that the reboot of those machines was scheduled for early morning, when a co-worker who is an extreme morning person would stop by the office.)
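(As a rough cross-check of the uptime figures above, here is a minimal sketch of the arithmetic. The boot date is a hypothetical one that I've back-computed from the 259-day figure, not something taken from our actual metrics, and `uptime_days` is just an illustrative helper. If you run the Prometheus host agent, node_exporter, the equivalent query is `time() - node_boot_time_seconds`, which gives uptime in seconds; whether that matches what our dashboards use is an assumption here.)

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def uptime_days(boot_time: datetime, now: datetime) -> float:
    """Return the uptime in days between a boot time and a reference time."""
    return (now - boot_time).total_seconds() / SECONDS_PER_DAY


# Hypothetical boot date, back-computed from "about 259 days on March 12th".
# It is an illustration, not a value from the real metrics.
boot = datetime(2019, 6, 27, tzinfo=timezone.utc)

march_12 = datetime(2020, 3, 12, tzinfo=timezone.utc)
early_september = datetime(2020, 9, 1, tzinfo=timezone.utc)

print(uptime_days(boot, march_12))         # 259.0
print(uptime_days(boot, early_september))  # 432.0
```

Under that assumed boot date, a machine that was at 259 days in mid March would be at a bit over 430 days by the start of September, which is consistent with "over 400 days".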

What happened was, well, nothing. Everything rebooted quietly, everything came back up again without problems, and I believe that the co-worker in the office didn't need to do anything. The less user-visible machines that we rebooted beforehand all worked, the user-visible machines that we rebooted during the downtime worked too, and none of our fears came to pass.

(Well, we did discover a machine or two with odd BIOS settings that caused problems, but they weren't particularly user-visible machines; they were generic machines in our SLURM cluster.)

It would be nicer to have remote power control and a KVM over IP setup for all of our machines, so that we could deal with everything from home; that would make reboots almost completely risk-free (what would remain is unexpected hardware failure, which is hard to deal with during a scheduled downtime anyway). But it's reassuring to have a positive experience even just with the basics. It will also probably encourage us to do it again, sooner than last time around.

(Also, it's nice for a potentially risky operation to just work. A quiet day is its own reward, and sometimes our small successes deserve a little celebration.)
