Linux can be really stable under the right circumstances

October 5, 2016

We don't think about our iSCSI backends all that often. Really, we don't think about them at all. They're just kind of there, sitting quietly in racks and quietly working away. They haven't even sent in any SMART complaints about their data disks yet (although I'm sure that'll start happening in another year or two, unless we got really lucky or unlucky with these HDs).

Recently, though, we got email from the IPMI monitoring on one and as a result I wound up logging in to it. This caused me to notice just how long the production iSCSI backends have been up: from 557 days for the hot spare backend to 726 days for a pair used by one fileserver. As it turns out, this uptime is not arbitrary; it dates back to our forced switch from 10G to 1G networking, when we put 1G cards into everything in our fileserver infrastructure. They've been running untouched (and trouble-free) since then, faithfully handling what has undoubtedly been tens or hundreds of terabytes of IO by now.
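(Checking this is trivial to script. As a minimal sketch, not anything we actually run: on Linux the first field of /proc/uptime is seconds since boot, so uptime in whole days is one awk invocation away. The `uptime_days` helper name is made up for illustration.)

```shell
#!/bin/sh
# Report this machine's uptime in whole days.
# /proc/uptime's first field is seconds since boot (with a fractional part).
uptime_days() {
    awk '{ printf "%d\n", $1 / 86400 }' /proc/uptime
}

uptime_days
```

Run across a fleet with ssh, this is enough to spot which backends have quietly been up for 500-plus days.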

Of course, you can't get this kind of extreme stability if you change things like kernels, so naturally we haven't been changing them. By now there's a whole collection of CentOS 7 updates that these machines don't have, which is okay (in our view) because they are appliances. We have them working and we have them locked down, and we like them just the way they are now. Based on our past experience with the previous generation of backends, they'll probably stay like this until they're decommissioned.

(This is really the rigid tradeoff of uptime; to get a high uptime, you can't touch things even when maybe you should. We shouldn't make a fetish of high uptimes; they are merely one means of achieving a goal, and avoiding reboots can sometimes cause problems. But for these machines, not touching them (including not rebooting them) is currently the easiest way to achieve our goal of an extremely stable fileserver environment.)

With all of that said, I have to admit that there's something in me that likes seeing large uptimes, especially on Linux machines. It's been a long time since I ran anything that normally got that sort of uptime and it's nice to see it once again, even if I know the cost of getting there.

(My workstations will never get that kind of uptime any more, because getting that kind of uptime requires being well behind the times. Two or three years is a long time in software releases of things that I like.)

Comments on this page:

By Arnie at 2016-10-16 01:06:41:

What a ridiculous article. I'm running 4 instances of Linux (2 servers, 1 desktop, 1 laptop: CentOS/Fedora/Ubuntu/Mint) and they are all stable. In fact, one of the draws of Linux is that it's more stable and more secure than Windows.

By Miksa at 2016-10-19 04:15:53:

@Arnie, you seem to have better luck than me. On my Fedora 24 computer, the past couple of times I've run a normal dnf update, KDE has crashed or shut down and restarted, and I've lost all my running programs. And setting up my triple-head monitors has been a bigger hassle than it ever was on Windows.

But I believe you and Chris are talking about different kinds of stability. You mean stability in the sense of not crashing, but I understood Chris' stability as "the system continuing to behave in an expected manner for an extended period of time". And updates can certainly be a hazard for this kind of stability, based on my experience of administering hundreds of RHEL servers, and RHEL is about the most stable Linux distribution that exists.

Maybe there is an update to Java that resets some configuration setting to its default, and a service depending on Java doesn't work with that default. Or maybe there is an update to abrtd, but the new package is missing a critical dependency on some Python package. So the next time some minor app crashes and triggers abrtd, abrtd itself crashes, which triggers abrtd again, and so on, and it ends up sending thousands of emails to root. We noticed that problem when the email admins complained.

