Why you should support 'reload' as well as 'restart'

October 24, 2012

Suppose that you're writing a network server with persistent connections (these days that can even describe web servers, depending on what sort of web app you're running). Your program should really support able to reload its configuration and related things on the fly, without having to be stopped and restarted. This should include not the configuration alone but anything you normally load once at startup. Especially it should include anything that can time out or expire such as, oh, SSL certificates.

The problem with only supporting configuration and other changes through a full restart is that a full restart usually breaks all of those persistent connections and breaking persistent connections often has undesirable consequences, the kind of consequences that requires system administrators to arrange scheduled downtimes. Reloads don't, not if you do them right.

(Yes, yes, of course your protocol specification says that clients should handle a server disconnect by retrying and resuming things in a way that's transparent to higher layers and the user. If you think that all actual clients do this right I have a bridge for sale that you might find attractive. If you also wrote the only client but haven't carefully and extensively tested it in the face of random disconnects, well, I still wouldn't suggest buying any bridges that you get offered.)

There are two ways to do reloads, more or less: you can wait for an explicit signal or you can simply notice changed things and automatically pick them up. Automatically picking things up is sexy, but speaking as a sysadmin I prefer explicit signals because that avoids issues with half-complete changes. No matter how fast I'm making the changes, with automatic reloads there is always a timing window with half-written files or where only part of the necessary files have been updated.

(As an example, consider updating SSL certificates. These come as two separate objects, a certificate and the private key that goes with it; for correct operation, you need either both new ones present or neither. If your program reloads its configuration partway through you get a mismatched key and certificate.)

Supporting reloading in servers with non-persistent connections is appreciated if you want to do it. No matter how fast a stop and restart sequence is, there's always a time window where the server is not actually running and sometimes this matters.

Supporting reloading is unquestionably more work; the great appeal of 'restart the server to make configuration changes' is that it needs no additional code (you already needed code to shut down cleanly and load the configuration on startup). But it's an important part of creating a system that's manageable and resilient. Real systems have configurations that change over time and they should stay up and available through it.

(This entry has been brought to you by the process of updating SSL certificates across our various systems.)

Sidebar: configuration changes in the face of on the fly reloads

It's possible to make configuration changes reliable in the face of servers that do on the fly reloads. What you have to do is treat everything except the very top level configuration file as immutable once created and activated. If you need to migrate to a new SSL certificate, for example, you don't replace the current certificate file with a new certificate; instead you put the new certificate in a new file and prepare a new version of the top level configuration that refers to that new file instead of the old one. The new configuration is activated by moving the new top level configuration into place (which can be done reliably as a single operation).

(This ought to look familiar. It's the same general approach used by other no-overwrite things such as filesystems and also as one way to both enable and deal with various layers of caching on the web.)

If you have more than one top level configuration file and they aren't totally independent and unrelated, you're out of luck and up the creek. As a corollary, if you're writing a program and are tempting to make it do on the fly reloads please make sure it has only one top level configuration file.


Comments on this page:

From 72.200.82.252 at 2012-10-25 10:58:34:

There are a few other actions that I'd like to see supported in services (or at least the scripts that wrap them), like a "backup/restore" that could work on service configuration files or data.

Those scripts know (or can derive) this information much better and more granularly than whole directory backups of /etc, and it would allow easier change automation and service migration between systems.

Written on 24 October 2012.
« The problem of simulating random IO
Always make sure you really understand what your problem is »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Oct 24 01:41:16 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.