One problem with testing system changes

June 7, 2010

One of the strange things about system administration as compared to development is the general lack of testing that sysadmins do. I believe that one reason for this is that sysadmins have a hard time testing changes, especially on a budget.

Now, I will admit that I have a biased viewpoint on this; I work in a relatively complex environment (although one that's fairly small by the standards of large systems). As is common in multi-machine environments, we effectively have hierarchies of machines and systems, with a small number of core machines and then more and more machines as you move outwards.

In order to do system-level testing, you need test machines. More than test machines, you need a test environment, something where your changes can be isolated from your production environment. Testing changes at the periphery of our hierarchies is generally easy, because nothing depends on peripheral machines (or services) and thus changes only affect them and only have to be tested on them; you can easily set up a test machine, make a change just on it, and see if it works.

(Well, in theory. In practice even peripheral machines can be quite complex in their own right, offering what are in effect many services.)

But the more interesting and dangerous changes are usually nearer the center and thus have downstream effects on the systems 'below' them. In order to thoroughly test these changes, you need not just a test machine that duplicates your production machine but also a duplicate of the downstream environment. The more central the service you're testing a change to, the more infrastructure you need to duplicate, even if you miniaturize it (with fewer machines than in your production environment).

(By the way, I'm not convinced that virtualization answers all of the problems here. Hardware differences do affect how systems behave, and virtualized hardware is different from real hardware (even once we set aside speed and load testing issues).)

In the extreme, fully testing changes before deploying them requires a somewhat miniaturized but faithful test version of your entire infrastructure, in order to make the test environment good enough that you will really catch problems before you go live. This is, as a minimum, a pain.

(There is also a tension here: for sysadmins, every difference between the production environment and the test environment is a chance for uncaught errors to creep in, yet too much similarity between them (even on peripheral machines) can complicate attempts to share elements of the overall infrastructure. The classic case of this is testing firewall changes.)

(This is a very slow reaction to On revision control workflows, which was itself a reaction to an entry of mine.)


Comments on this page:

From 143.48.3.57 at 2010-06-07 12:54:03:

Since I came to system administration from the other side of the fence, I've always had a keen fascination with the similarities and differences in the way that software developers and system administrators (and the project managers who herd them) go about their jobs. In particular, I find it amazing that neither role generally has a good grasp of how the other functions, and how they can better work together.

I think that an interesting part of this problem is that software developers have a much easier time of testing things than system administrators do. For everyone to understand my viewpoint, I need to qualify it by saying that when a system administrator needs to test a new release of software before deploying it to a production system, it's generally not to make sure that any new features introduced in the software are bug-free, because it's simple enough to just document the problems and not use those features until they've stabilized. Rather, the issue is that we need to identify regressions in pieces of code that used to work fine and are now broken.

In the software industry, this is what unit testing is for. Unit testing allows developers to provide a comprehensive set of test cases for a particular function and make sure that it works properly for all of them, returning the expected results. Many agile developers believe in writing tests first, then code, and aiming for 100% test coverage to minimize unintended regressions from rapid code changes.
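(For concreteness, here's a minimal sketch of what that looks like with Python's standard unittest module; parse_size() and its expected behaviour are invented purely for illustration, not taken from any real project.)

    # test_parse_size.py -- hypothetical example; parse_size() and its
    # expected behaviour are made up for illustration.
    import unittest

    def parse_size(text):
        """Convert strings like '512', '10K', or '2M' into a byte count."""
        units = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}
        if text and text[-1].upper() in units:
            return int(text[:-1]) * units[text[-1].upper()]
        return int(text)

    class ParseSizeTests(unittest.TestCase):
        # In a test-first workflow these cases are written before
        # parse_size() itself and rerun on every subsequent change.
        def test_plain_number(self):
            self.assertEqual(parse_size('512'), 512)

        def test_kilobytes(self):
            self.assertEqual(parse_size('10K'), 10 * 1024)

        def test_megabytes_lowercase(self):
            self.assertEqual(parse_size('2m'), 2 * 1024 * 1024)

        def test_bad_input(self):
            self.assertRaises(ValueError, parse_size, 'oops')

    if __name__ == '__main__':
        unittest.main()

Running the file reruns the whole suite after every change, which is exactly what catches the kind of "used to work, now broken" regressions described above.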

I'm not recommending that system administrators should automate testing other people's software, because there's no standardized model for business requirements. However, I do think that a little transparency into the development model of our upstream developers would help us to figure out where testing is and isn't necessary.

While it's not adopted across the whole software industry, unit testing is very popular in many rapid development scenarios, and has become more or less institutionalized in certain developer communities like CPAN. If you're a developer, or at least if you develop software without gluing together huge numbers of third-party libraries, it's pretty simple to gauge regressions in your own software, because you know (or can easily find out) what the test coverage is for your own project. If you have really thorough unit test coverage and your test cases are properly written, you shouldn't have any function- or method-level regressions slipping into production code when there's an update. This doesn't give developers much insight into the complex problems, like integration-level or system-level issues, but it at least provides basic assurance that no minor and insidious issues are creeping up the chain and causing undetected problems.

The problem with unit testing is that the developers run the tests, and they run them on their own systems. This methodology can lead to some really bothersome problems for other people.

When you're a system administrator, and especially if you're a system administrator who deals with a lot of proprietary, closed-source software, it becomes very difficult to understand the development methodologies of every single piece of software you plan to update. There's a certain amount of trust that goes into your Linux vendor's ability to not break things like glibc that aren't easily tested. I think the ability to trust a vendor's stability track record is a wonderful thing, but it's something that shouldn't be necessary for system administrators. We should be able to validate the correctness of code on our systems.

There's a constant impedance mismatch and a constant communication gap between developers and sysadmins that needs to be bridged. Software developers need to understand that most sysadmins aren't developers, and we need an easy way to perform basic correctness validation on the software we install, especially if we install it from the developer's packages and aren't running a "make test" or similar during the install process. We need to understand what's being tested, we need to understand the significance of the test coverage, and we need to be able to figure out what does and doesn't warrant further testing. As it stands, all the validation that developers are (or aren't) doing is lost on us, because we don't get a warm-and-fuzzy from tests that someone else is running that we'll probably never get to see.

--Jeff

From 70.113.127.15 at 2010-07-06 14:50:01:

Chris -

I'm a big advocate for virtualization, but it's worth noting that virtualization adds another layer of core services that may need testing during upgrades, etc.

--
Scott Hebert
http://slaptijack.com/
