Why we care about long uptimes

October 16, 2016

Here's a question: why should we care about long uptimes, especially if we have to get these long uptimes in somewhat artificial situations like not applying updates?

(I mean, sysadmins like boasting about long uptimes, but this is just boasting. And we shouldn't make long uptimes a fetish.)

One answer is certainly 'keeping your system up avoids disrupting users'. Of course there are many other ways to achieve this, such as redundancy and failure-resistant environments. The whole pets versus cattle movement is in part about making single-machine uptime unimportant; you achieve your user-visible uptime through a resilient environment that can deal with all sorts of failures, instead of through heroic (and artificial) efforts to keep single machines from rebooting or single services from restarting.

(Note that not all environments can work this way, although ours may be an extreme case.)

My answer is that long uptimes demonstrate that our systems are fundamentally stable. If you can keep a system up and stable for a long time, you've shown that (in your usage) it doesn't have issues like memory leaks, fragmentation, lurking counter rollover problems, and so on. Even very small issues here can destabilize your system over a span of months or years, so a multi-year uptime is a fairly strong demonstration that you don't have these problems. This matters because it means that any instability in the environment is something we introduced ourselves, which in turn means we can control it, schedule it, and so on.
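(To put rough numbers on the counter rollover case, here is a minimal sketch of the arithmetic. The counter widths and tick rates are illustrative assumptions, not measurements from our systems, and `days_until_rollover` is a hypothetical helper name.)

```python
def days_until_rollover(bits: int, ticks_per_second: float) -> float:
    """Days until an unsigned counter of `bits` bits wraps around,
    assuming it starts at zero and ticks at a constant rate."""
    seconds = (2 ** bits) / ticks_per_second
    return seconds / 86400  # 86400 seconds per day


# Assumed example configurations: (counter width in bits, ticks per second).
for bits, hz in [(32, 100), (32, 1000), (64, 1000)]:
    days = days_until_rollover(bits, hz)
    print(f"{bits}-bit counter at {hz} Hz wraps after about {days:,.1f} days")
```

Under these assumptions a 32-bit tick counter wraps after roughly 497 days at 100 Hz and after under 50 days at 1000 Hz, so a system with a bug in its rollover handling could run cleanly for well over a year before the problem ever shows up.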

A system that lacks this stability is one where at a minimum you're forced to schedule regular service restarts (or system reboots) in order to avoid unplanned or unpleasant outages when the accumulated slow problems grow too big. At the worst, you have unplanned outages or service/system restarts when the system runs itself into the ground. You can certainly deal with this with things like auto-restarted programs and services, deadman timers to force automated reboots, and so on, but it's less than ideal. We'd like fundamentally stable systems because they provide a strong base to build on top of.

So when I say 'our iSCSI backends have been up for almost two years', what I'm really saying is 'we've clearly managed to build an extremely stable base for our fileserver environment'. And that's a good thing (and not always the case).

Written on 16 October 2016.

Last modified: Sun Oct 16 23:55:29 2016