The downside of automation versus the death of system administration

February 18, 2012

Back in AutomationDownside I discussed how one downside of automation was that either you had to spend time learning all of the extra layers it introduced or you'd become a push-button monkey. There's a consequence of this that I didn't mention in that entry.

This push-button monkey status is the silent downside of the future death of system administration that I sketched out recently. All of those sysadmin-less developers doing their own deployments from canned recipes aren't going to know what's really going on in all of the layers if something goes wrong. This is fine as long as everything works, but when things go off the rails, well, you have issues.

(This is not just an issue of plain lack of knowledge, either, or to put it another way the lack of knowledge is a feature. One point of this is to save the developers from having to spend the time to learn all of the specialized knowledge that's needed to understand the full stack.)

I wouldn't count on this to save your regular sysadmin job, though. If this future comes to pass, things are going to work most of the time and most of the time when they don't work the developers are going to be able to figure it out on their own fast enough (even if it's not as fast as a sysadmin would). Many fewer places are going to be big enough that things are going wrong so frequently that a full-time 'sysadmin' who understands the full deployment stack makes sense. Especially in the constrained environment of a small company, people will make do and if things blow up every so often that's okay as long as they don't blow up too badly.

(You might question the idea that canned automation will work right most of the time, but I think that it will in specific environments such as deploying to a given cloud setup. And to a large extent the degree that my sketch of the death of system administration comes to pass depends on how routinely reliable such pre-written recipes are.)

Traditional sysadmins will probably be horrified at the mistakes that will result from people not knowing all of the fine details and charging ahead anyways. But on a pragmatic level most of the resulting problems won't and don't matter very much over the long run (although they'll be awkward and embarrassing at the time, just as they are today for the companies that run into them). Especially in a future where automation mostly works, you'll need a real long tail event to seriously damage an otherwise sound company.

(Perhaps people should care more about the possibility of long tail events. But it's a hard argument to make, especially when a company is having to choose between a sysadmin to alleviate a rare risk and another developer to accelerate their growth.)

(I have more thoughts on this area circling in my head, but trying to write some of them down has made it clear that they're not clear yet.)

Sidebar: clarifying what I mean by the push-button monkey stuff

Taken from a comment on AutomationDownside:

Or you could be in a situation where all you need to know to configure Apache is Apache configuration, but just do it at this particular host/path, and the changes will be pushed out to the web server/s in question.

This is exactly the situation where you've been reduced to a push-button monkey. You don't actually understand what's going on; you just know how to achieve certain results. What turns people into push-button monkeys isn't that they don't know what to do, it's that they don't know enough about how things really work to do anything other than push the buttons. In particular, they don't know enough to troubleshoot problems except by rote.

Suppose you put a new version of the configuration into the magic host/path spot but the change you wanted isn't appearing on the web servers (or isn't appearing on some web servers). Unless you understand the automation that distributes the files, you don't know where to start looking for problems or even what problems there might be.

(Well, you might have a troubleshooting checklist that someone has prepared for you. But if it's a problem that hasn't been foreseen, you are once again up the creek.)
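As a concrete (if toy) illustration, here is a sketch of the kind of check you can only write once you know the automation amounts to "files get copied from a master spot to each web server". All the host and path names are invented, and local directories stand in for the remote machines so the logic can actually run; on real hosts each comparison would be an ssh or an rsync --dry-run.

```shell
#!/bin/sh
# Toy version of the stale-config scenario: compare the master copy's
# checksum against each "web server"'s copy to localize a failed push.
# Directories under a temp dir stand in for hosts (all names invented).
set -e
base=$(mktemp -d)
mkdir -p "$base/master" "$base/websrv1" "$base/websrv2"
echo "ServerName new.example.com" > "$base/master/foo.conf"
cp "$base/master/foo.conf" "$base/websrv1/foo.conf"           # push worked
echo "ServerName old.example.com" > "$base/websrv2/foo.conf"  # push silently failed

want=$(md5sum "$base/master/foo.conf" | cut -d' ' -f1)
stale=""
for h in websrv1 websrv2; do
    got=$(md5sum "$base/$h/foo.conf" | cut -d' ' -f1)
    if [ "$got" = "$want" ]; then
        echo "$h: up to date"
    else
        echo "$h: STALE copy"
        stale="$stale $h"
    fi
done
```

The point is not the five lines of checksum comparison; it's that without knowing the distribution mechanism, you wouldn't know this comparison was the right question to ask.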


Comments on this page:

From 70.30.136.128 at 2012-02-20 09:40:51:

Or you could be in a situation where all you need to know to configure Apache is Apache configuration, but just do it at this particular host/path, and the changes will be pushed out to the web server/s in question.

This is exactly the situation where you've been reduced to a push-button monkey. You don't actually understand what's going on; you just know how to achieve certain results.

Writer of the comment here: IMHO, it's reducing to push-button monkey mechanics what is essentially monkey mechanics anyway.

If I have a bunch of web servers, I don't want to have to SSH into each one and 'cd' to the same directory on 'n' machines for the same changes. If my NTP server changes, I don't want to have to log into every machine to update /etc/ntp.conf. Configuring a service is interesting the first one or two times as you debug it, but after that it becomes boring and a waste of time.

So there are two parts to this: (1) the figuring out of what to put in Apache's config, and (2) getting that setting onto the web servers. Why waste time typing "ssh websrvN; cd /etc/...; vi foo.conf; [stuff here] ; apache2ctl configtest; apache2ctl graceful" a whole bunch of times?
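Part (2) can be sketched as a short shell loop. The host names, paths, and config file here are hypothetical, and DRYRUN=echo makes the loop print its plan instead of touching any machine:

```shell
#!/bin/sh
# Sketch of mechanizing the repeated per-host sequence above.
# DRYRUN=echo prints each command rather than running it; set DRYRUN=""
# (and real host names) to actually execute.
DRYRUN=echo
pushed=0
for n in 1 2 3; do
    h="websrv$n"
    $DRYRUN scp foo.conf "$h:/etc/apache2/conf.d/foo.conf"
    $DRYRUN ssh "$h" "apache2ctl configtest && apache2ctl graceful"
    pushed=$((pushed + 1))
done
```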

When you write:

Suppose you put a new version of the configuration into the magic host/path spot but the change you wanted isn't appearing on the web servers (or isn't appearing on some web servers). Unless you understand the automation that distributes the files, you don't know where to start looking for problems or even what problems there might be.

You appear to be assuming that the person editing the config file in the "magic" host/path spot doesn't understand what is happening? Furthermore:

Well, you might have a troubleshooting checklist that someone has prepared for you. But if it's a problem that hasn't been foreseen, you are once again up the creek.

Why would one be up the creek? Even if they don't know the exact details, I would hope that you would have hired people that were smart enough to reverse engineer the behaviour of the system from first principles: which daemon is it (ps -ef)? where is its configuration (lsof)? what ports is it listening to (lsof; netstat)? what is the client doing (strace; tcpdump)?
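A runnable miniature of that first-principles sequence, using a background sleep as a stand-in daemon (the real tools named above — lsof, strace, tcpdump — need a live service and often root, so this sketch sticks to ps and /proc):

```shell
#!/bin/sh
# Miniature of the inspection sequence: a background 'sleep' stands in
# for the mystery daemon.  On a real system you would follow up with
# lsof -p, netstat, strace -p, and tcpdump as described above.
sleep 60 &
pid=$!
# which daemon is it?  (the ps -ef step, narrowed to our pid)
name=$(ps -p "$pid" -o comm=)
echo "pid $pid is running '$name'"
# what does it have open?  (the lsof step; /proc works without lsof installed)
nfds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
echo "it has $nfds file descriptors open"
kill "$pid"
```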

It appears that many of your objections about things becoming "magical" through automation are really a concern about the people on the team not understanding what is going on. I think that these concerns can be assuaged by decent documentation, cross-training of the team, and having people that are intelligent enough to figure out what's going on from general first-principles debugging techniques.

Lacking one of the above items, things can still be okay, but if you lack two (or all three) then I think one has bigger problems than push-button monkey behaviour.

By cks at 2012-02-21 15:16:32:

You appear to be assuming that the person editing the config file in the "magic" host/path spot doesn't understand what is happening?

Not quite that. My view is that you have two options: either that person needs to take the extra time to learn not only the config file but also your automation mechanism, or they don't spend that extra time but become a button-pushing monkey.

As you fairly note, it's possible to defer the learning time if and until it becomes necessary; however, this is going to delay people's response time when problems come up. And the more complex your automation is the more they will have to learn (sooner or later) in this model.

In short, there is no magic way to have all of automation, no need to spend time learning the automation, and people who are fully capable of troubleshooting problems. If you have automation you get to pick one of the two other options.

This doesn't mean that automation is intrinsically bad. To get back to my point from AutomationDownside, you can look at the time savings from having automation versus not having automation. As you note, in a situation with a bunch of web servers it's quite likely that automation will save you time overall even once everyone has to learn the automation (and it will certainly be less tedious than having people log into a bunch of web servers or manually push files to them).

From 64.71.1.165 at 2012-02-21 16:52:32:

i think the word "automation" is overloaded the same way the word "monitoring" is.

"monitoring" really means awareness, but it's important to distinguish alerting from trending/analysis from reporting. if you get these mixed up you either have a system that's overwhelmingly noisy or critically undermeasured.

similarly, when people talk about "automation" they really mean that they don't want to have to do the same task a bunch of times or around the clock. it's unfulfilling and error prone. but the causes of repetition can vary:

repetitiousness can happen if you don't have a good deployment solution, so you do a lot of the same change in a bunch of places. this is really more of a scale problem than an automation problem. what you want is a way to copy the same code and config and restart the same services on many targets at once (or on a rolling basis)... this does not call for an automation language but just a way to target and fan out a manual sequence of copy and restart. for this, use pssh and rsync or pscp; don't try to express your efforts in an nth generation language, for all the reasons chris mentions, and because in an agile tech world deployments are often different enough across subsystems and time that the exact sequence varies, and investing in fully automating every (or any) one of them is a poor return. manual really is simpler and more technically sound as long as it can be scaled. plus it keeps you more intimate with the system. being able to function in a manual ad hoc way at scale is also useful for responding to crises, and being well rehearsed in this mode is important.
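that "manual ad hoc at scale" style can be sketched as fanning a fixed copy-and-restart sequence over a host list in rolling batches, in the spirit of pssh/pscp. hosts, paths, and batch size here are invented, and DRYRUN=echo keeps it a dry run that just prints the plan:

```shell
#!/bin/sh
# Rolling fan-out sketch: push a release and gracefully restart in
# batches, pausing between batches.  All names are hypothetical;
# DRYRUN=echo prints the plan instead of executing it.
DRYRUN=echo
batch=2
done_count=0
set -- websrv1 websrv2 websrv3 websrv4 websrv5
while [ $# -gt 0 ]; do
    i=0
    while [ $i -lt $batch ] && [ $# -gt 0 ]; do
        h=$1; shift; i=$((i + 1)); done_count=$((done_count + 1))
        $DRYRUN rsync -a ./release/ "$h:/srv/app/"
        $DRYRUN ssh "$h" "apache2ctl configtest && apache2ctl graceful"
    done
    $DRYRUN sleep 30   # let each batch settle before touching the next
done
```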

repetitiousness can also happen if a system is unstable and requires routine kicking. it happens sometimes in legacy systems that cannot be fixed or replaced. but all too often it happens because an org is not aligned for complete root-level solutions. if it's not a contributor's problem to deliver a stable system efficiently operable at scale, it becomes a supporter's problem to inefficiently pick up that slack. an org that settles for a better supporter band-aid sounds technically dysfunctional and not very fun.

all my tech chops usually end up invested in tools to track, monitor, visualize, communicate, integrate, audit, etc. maybe you could call this automation, but i've never found things like puppet to be much help to this end. i'm probably in the minority here... i'd love to be proven wrong.

Scott Dworkis

From 146.6.208.17 at 2012-02-22 16:58:24:

Your articles depress me.
