Perhaps it's a good idea to reboot everything periodically

October 23, 2015

Yesterday around 6pm, the performance of the department's connection to the campus backbone basically fell off a cliff. Packet loss jumped and bandwidth dropped from what you'd expect of gigabit Ethernet down to the level of a good home connection (it seems to have been running at around 16 Mbits/sec inbound, although somewhat more outbound). People started noticing this morning, which resulted in us running around and talking to the university's central NOC (who run the backbone and thus the router that we connect to).

Everything looked perfectly normal on our side of things, with no errors being logged, all relevant interfaces up at 1G, and so on. But in the process of looking at things, we noticed that our bridging firewall had been up for 450 days or so. Since we have a ready hot spare and pfsync makes shifting over relatively non-disruptive, we (by which I mean my co-workers) decided to switch the active and hot spare machines (after rebooting the hot spare). Lo and behold, all of our backbone performance problems went away on the spot.

We reboot our Ubuntu machines on a relatively regular basis in order to apply kernel updates, because they're exposed to users. But we treat many of our other machines as appliances, and as part of that we basically don't reboot them unless there's some compelling reason to do so. That's how we wind up with firewalls with 450-day uptimes, fileservers and backends that have mostly been up since they were installed a year or so ago, and so on.

Perhaps we should rethink that. In fact, if we're going to rethink things and agree to reboot machines every so often, we should actually make a relatively concrete schedule for it in advance. We don't have to schedule down to the day or week, but even something like deciding that all of the firewalls will be rebooted in March is likely to drastically increase the odds that it will actually happen.

('We should reboot the firewalls after they've been up for a while' is sufficiently fuzzy that it is at best a low priority entry in someone's to-do list, and thus easy to forget about or never get to. Adding 'in March' pushes things closer to the point where someone will put it on their calendar and then get it done.)
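To make the 'in March' idea concrete, here is a minimal sketch of what a coarse, month-level reboot schedule could look like if written down as code. Everything in it is an assumption for illustration: the group names, the months assigned to them, and the function names are all hypothetical, not anything we actually run.

```python
from datetime import date, timedelta

# Hypothetical policy: each group of machines gets one fixed reboot month.
REBOOT_MONTH = {"firewalls": 3, "fileservers": 6, "backends": 9}


def month_bounds(year, month):
    """Return (first_day, last_day) of the given month."""
    first = date(year, month, 1)
    # First day of the following month, rolling over into the next year
    # after December.
    following = date(year + (month == 12), month % 12 + 1, 1)
    return first, following - timedelta(days=1)


def next_reboot_window(group, today):
    """Return the current or next scheduled reboot month for a group."""
    month = REBOOT_MONTH[group]
    year = today.year if today.month <= month else today.year + 1
    return month_bounds(year, month)


def missed_reboot(boot_day, group, today):
    """True if the machine has been up since before the start of the most
    recent reboot month that has already fully passed."""
    month = REBOOT_MONTH[group]
    year = today.year if today.month > month else today.year - 1
    first, _ = month_bounds(year, month)
    return boot_day < first


if __name__ == "__main__":
    today = date(2015, 10, 23)
    # A firewall that has been up for 450 days as of today.
    booted = today - timedelta(days=450)
    print(next_reboot_window("firewalls", today))
    print(missed_reboot(booted, "firewalls", today))
```

The point of keeping the schedule this coarse is that it only has to answer two questions: which month is next for a given group, and whether a machine sailed through its last one without being rebooted.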

Comments on this page:

I remember that when I ran into a jiffy overflow bug, I read a suggestion that the kernel should go ahead and overflow the counter about five minutes after boot, in order to help driver developers discover bugs which otherwise manifest only after 50 days or so.
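For reference, the '50 days' figure falls out of simple arithmetic if you assume a 32-bit tick counter incremented 1000 times a second. The sketch below works that out; the function names are made up for illustration, and the tick rates are assumptions (real kernels can be built with other rates).

```python
SECONDS_PER_DAY = 86400


def wrap_days(hz, bits=32):
    """Days until a counter of the given width wraps when it starts at
    zero and is incremented hz times per second."""
    return 2**bits / hz / SECONDS_PER_DAY


def initial_value_for_early_wrap(hz, seconds_after_boot=300, bits=32):
    """Starting value that makes the counter wrap the given number of
    seconds after boot instead of weeks later."""
    return (2**bits - seconds_after_boot * hz) % 2**bits


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        print(hz, round(wrap_days(hz), 1))
    print(hex(initial_value_for_early_wrap(1000)))
```

At 1000 ticks a second this comes to about 49.7 days, and at 100 ticks a second about 497 days. The second function shows the trick from the suggestion: start the counter just short of the wrap point so that any overflow bug shows up within minutes of boot.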

I try to keep my firewalls patched ... so if I go 400 days between reboots I'm doing a poor job.



Last modified: Fri Oct 23 01:56:00 2015