One problem with testing system changes

June 7, 2010

One of the strange things about system administration as compared to development is the general lack of testing that sysadmins do. I believe that one reason for this is that sysadmins have a hard time testing changes, especially on a budget.

Now, I will admit that I have a biased viewpoint on this; I work in a relatively complex environment (although one that's fairly small by the standards of large systems). As is common in multi-machine environments, we effectively have hierarchies of machines and systems, with a small number of core machines and then more and more machines as you move outwards.

In order to do system-level testing, you need test machines. More than test machines, you need a test environment, something where your changes can be isolated from your production environment. Testing changes at the periphery of our hierarchies is generally easy, because nothing depends on peripheral machines (or services) and thus changes only affect them and only have to be tested on them; you can easily set up a test machine, make a change just on it, and see if it works.

(Well, in theory. In practice even peripheral machines can be quite complex in their own right, offering what is in effect many services.)

But the more interesting and dangerous changes are usually nearer the center and thus have downstream effects on the systems 'below' them. To thoroughly test these changes, you need not just a test machine that duplicates your production machine but also a duplicate of the downstream environment. The more central the service you're changing, the more infrastructure you need to duplicate, even if you miniaturize it (with fewer machines than in your production environment).
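(As a rough illustration of how the amount of infrastructure to duplicate grows as you move toward the center, here is a minimal Python sketch. The machine names and the dependency map are entirely hypothetical, and it assumes you can write down which systems sit directly 'below' each machine; it simply walks that map to collect everything a change could reach.)

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each key is a machine or service, and its value lists the systems
# directly 'below' it, i.e. the ones that depend on it.
DOWNSTREAM = {
    "fileserver": ["login1", "login2", "mailhost"],
    "mailhost": ["webmail"],
    "login1": [],
    "login2": [],
    "webmail": [],
}


def footprint_to_duplicate(changed, downstream=DOWNSTREAM):
    """Return the changed system plus everything transitively below it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        # Systems missing from the map are treated as having nothing below them.
        for dependent in downstream.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    for name in ("webmail", "mailhost", "fileserver"):
        affected = footprint_to_duplicate(name)
        print(name, "->", len(affected), "system(s):", sorted(affected))
```

In this made-up map, a change to the peripheral webmail machine only involves that one machine, while a change to the central fileserver pulls in every system listed.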

(By the way, I'm not convinced that virtualization answers all of the problems here. Hardware differences do affect how systems behave, and virtualized hardware is different from real hardware (even once we set aside speed and load testing issues).)

In the extreme, fully testing changes before deploying them requires a somewhat miniaturized but faithful test version of your entire infrastructure, so that the test environment is good enough to really catch problems before you go live. This is, at a minimum, a pain.

(There is also a tension here. For sysadmins, every difference between the production environment and the test environment is a chance for uncaught errors to creep in, yet making the two too similar (even on peripheral machines) can complicate attempts to share elements of the overall infrastructure between them. The classic case of this is testing firewall changes.)

(This is a very slow reaction to 'On revision control workflows', which was itself a reaction to an entry of mine.)
