2012-02-22
How I can be wrong about the death of sysadmin jobs
In my entry on the death of 'system administration' I included a parenthetical aside where I said that I expected this death to cost a bunch of people their currently well paid jobs. Since then I've gotten some pushback to this, such as Philip Hollenback's tweet:
#devops @allspaw observes that he is hiring *more* people as automation increases: link Discuss! (hey @thatcks)
So let me discuss how I can be completely wrong in what I said about the job loss.
My model of how automation would result in fewer sysadmin jobs is what I think of as the traditional model of how automation costs jobs: automation allows companies to do the same amount of work with fewer people, and the people who are left are generally in positions that require different and higher skills than before. A bunch of robots replaces a room full of factory workers and leaves behind a couple of production engineers and a robot maintenance technician. This model implicitly assumes either that the company doesn't have anything else more advanced and productive it wants those replaced workers to do or that they can't be up-skilled to do it.
On the surface we can make this match system administration. We assume that maintaining a company's IT infrastructure requires X man-hours today, that the company has the people for those man-hours, and that automation will replace much of those man-hours with automation-hours instead. The company could do somewhat more with its infrastructure but probably not a lot more (just as the factory could expand production but probably not by much), and working on the automation mostly needs drastically different skills from the skills needed to do the non-automated IT work.
When I put it this way it's easy to see how this could be dead wrong. It's even a stereotype that typical sysadmin environments are swamped with work to the point where people are far too busy with day to day activity to step back and take on worthwhile larger-scale projects (to the point where I wrote about how this is a bad thing). These environments are not work-constrained the way the factory model is but are instead staff-constrained; automation frees up staff time, which means that they can do more.
(This happens in the factory model too but we pay much less attention to it, because it's considered a good thing; this is when better machinery increases the productivity of the factory workers.)
As for the skills of the job-losing sysadmins, well, that's clearly an assumption too; my model implicitly assumes both that their current skills aren't too useful in the new automated world and that they either can't upskill themselves given the opportunity or won't be given the opportunity by their employers (ie, the employer would prefer to fire the factory workers and hire some engineers instead of having factory workers turn themselves into engineers). Stated this way, I think that there are good reasons to be dubious about this. Depending on how fast the automation develops and what sort of automation it is, many current sysadmin skills may transfer without problems; to recast what I wrote in my entry about one downside of automation, if you already know how to configure Apache all you need to learn in the new automated world is something about Chef or Puppet.
(There are ways that this still can go off the rails, per WhatWillKillSysadmin. But even then you just need more skills.)
At a conservative minimum, this contrary viewpoint makes my initial gloom about future jobs somewhat pessimistic and overdone. In the optimistic view, system administration is now at the point in its development where the textile factory workers start getting things like electric sewing machines; our productivity is about to start really going up from having better tools and this will benefit everyone.
(I still think you should have a talk about this issue with any junior people you have. Whatever exactly comes to pass in the future, I don't think that system administration is going to go on unchanged. And I remain a strong believer in 'upskilling' into being able to program, for all sorts of reasons.)
PS: I continue to stand by my main argument that it will be a great thing when we stop having to install machines by hand, configure Apache yet again, and so on, because all of that will be reliably automated. As lots of people have said, that frees us up to do more productive and interesting work.
Sidebar: Answering 'what if everything moves to the cloud?'
One of the doom scenarios for sysadmins is companies moving all of their servers to the cloud and getting rid of most of the sysadmins that used to maintain them. However, for this to result in a reduction in sysadmin jobs instead of a transfer of jobs to cloud computing companies we need to assume that companies aren't now going to want to do more computing.
More precisely, we need to assume that the growth in extra demand for (cloud) computing is less than the efficiency gain (and thus the staff reduction) that the cloud computing vendor gains from running at large scale. (We already know that running at large scale can be done more efficiently than at a small scale; there are lots of examples of scaling up computing significantly without scaling up sysadmin staff as much.)
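This inequality is easy to make concrete. Here's a toy Python model of it; all of the numbers are made-up illustrative figures, not estimates of real cloud economics:

```python
# Toy model: do sysadmin jobs shrink when computing moves to the cloud?
# Total jobs are roughly (computing demanded) / (computing managed per admin).

def sysadmin_jobs(demand, units_per_admin):
    """Rough headcount needed to run 'demand' units of computing."""
    return demand / units_per_admin

# Before the move: 1000 units of demand, each in-house admin
# manages 50 units' worth of computing.
before = sysadmin_jobs(1000, 50)

# After: the cloud vendor's scale makes admins 4x as productive
# (200 units per admin), but cheaper computing also doubles demand.
after = sysadmin_jobs(2000, 200)

# Jobs shrink (20 down to 10) only because demand growth (2x) lagged
# the efficiency gain (4x); if demand had grown 4x or more, total
# sysadmin employment would hold steady or rise.
print(before, after)
```

The point of the sketch is just that the outcome hinges entirely on the ratio between the two growth rates, not on the move to the cloud as such.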
(There's an additional assumption embedded in the cloud gloom scenario too, but I'll let my readers find it and argue over it.)
2012-02-18
The downside of automation versus the death of system administration
Back in AutomationDownside I discussed how one downside of automation was that either you had to spend time learning all of the extra layers it introduced or you'd become a push-button monkey. There's a consequence of this that I didn't mention in that entry.
This push-button monkey status is the silent downside of the future death of system administration that I sketched out recently. All of those sysadmin-less developers doing their own deployments from canned recipes aren't going to know what's really going on in all of the layers if something goes wrong. This is fine as long as everything works, but when things go off the rails, well, you have issues.
(This is not just an issue of plain lack of knowledge, either, or to put it another way the lack of knowledge is a feature. One point of this is to save the developers from having to spend the time to learn all of the specialized knowledge that's needed to understand the full stack.)
I wouldn't count on this to save your regular sysadmin job, though. If this future comes to pass, things are going to work most of the time, and most of the time when they don't work the developers are going to be able to figure it out on their own fast enough (even if it's not as fast as a sysadmin would). Many fewer places are going to be big enough that things are going wrong so frequently that a full-time 'sysadmin' who understands the full deployment stack makes sense. Especially in the constrained environment of a small company, people will make do, and if things blow up every so often that's okay as long as they don't blow up too badly.
(You might question the idea that canned automation will work right most of the time, but I think that it will in specific environments such as deploying to a given cloud setup. And to a large extent the degree that my sketch of the death of system administration comes to pass depends on how routinely reliable such pre-written recipes are.)
Traditional sysadmins will probably be horrified at the mistakes that will result from people not knowing all of the fine details and charging ahead anyways. But on a pragmatic level most of the resulting problems won't and don't matter very much over the long run (although they'll be awkward and embarrassing at the time, just as they are today for the companies that run into them). Especially in a future where automation mostly works, you'll need a real long tail event to seriously damage an otherwise sound company.
(Perhaps people should care more about the possibility of long tail events. But it's a hard argument to make, especially when a company is having to choose between a sysadmin to alleviate a rare risk and another developer to accelerate their growth.)
(I have more thoughts on this area circling in my head, but trying to write some of them down has made it clear that they're not clear yet.)
Sidebar: clarifying what I mean by the push-button monkey stuff
Taken from a comment on AutomationDownside:
Or you could be in a situation where all you need to know to configure Apache is Apache configuration, but just do it at this particular host/path, and the changes will be pushed out to the web server/s in question.
This is exactly the situation where you've been reduced to a push-button monkey. You don't actually understand what's going on; you just know how to achieve certain results. What turns people into push-button monkeys isn't that they don't know what to do, it's that they don't know enough about how things really work to do anything other than push the buttons. In particular, they don't know enough to troubleshoot problems except by rote.
Suppose you put a new version of the configuration into the magic host/path spot but the change you wanted isn't appearing on the web servers (or isn't appearing on some web servers). Unless you understand the automation that distributes the files, you don't know where to start looking for problems or even what problems there might be.
(Well, you might have a troubleshooting checklist that someone has prepared for you. But if it's a problem that hasn't been foreseen, you are once again up the creek.)
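To make the rote-checklist situation concrete, here is a minimal sketch in Python of the kind of check such a checklist might boil down to: compare the master copy of the config against what each web server actually has. Every hostname, path, and the ssh-based fetch here is hypothetical, not any real setup:

```python
# Did the change actually reach every web server? All names and paths
# below are made up for illustration.
import subprocess

MASTER = "/srv/config/apache/site.conf"   # the magic host/path spot
WEB_SERVERS = ["web1", "web2", "web3"]    # hypothetical hostnames

def remote_config(host, path):
    """Fetch a file's contents from a remote host over ssh."""
    return subprocess.run(
        ["ssh", host, "cat", path],
        capture_output=True, text=True, check=True,
    ).stdout

def find_stale_servers(master_path, servers, remote_path, fetch=remote_config):
    """Return the servers whose deployed config differs from the master.

    'fetch' is pluggable so the comparison logic can be exercised
    without real servers."""
    with open(master_path) as f:
        want = f.read()
    return [h for h in servers if fetch(h, remote_path) != want]

# Usage (hypothetical):
#   find_stale_servers(MASTER, WEB_SERVERS, "/etc/apache2/site.conf")
```

Note what this does and doesn't give you: it tells you *which* server is stale, but not *why*. The 'why' is exactly the part that requires understanding the distribution machinery itself, which is where the push-button monkey is stuck.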
2012-02-15
A downside of automation
Right now in the sysadmin world it probably qualifies as heresy to say bad things about the idea of automating your work. But unfortunately for us, there actually are downsides to doing so even if we don't notice them a lot of the time.
The one I'm going to talk about today is that when you automate something, you increase the number of things that people in your team need to know. Suppose that you get tired of maintaining your Apache configuration files by hand, so now you put them in a Chef configuration. You've gone from a situation where all you need to know to configure your Apache is Apache configuration itself to a situation where now you need to know Apache configuration, using Chef, and how you're using Chef to configure your Apache. Any time you automate you go from just needing to know one thing, the underlying thing you're dealing with, to needing to know three or so; you still need to know the underlying thing, but now you also need to know the automation system in general and how you're using it in specific.
(You can condense this by one layer of knowledge if you're not using a general automation system, because then the last two bits condense to one. But you probably don't want to do that.)
This can of course be compounded on itself further. Are you auto-generating DHCP configurations from an asset database and then distributing them through Puppet? Well, you've got a lot of layers to know about.
Some people will say that you don't need to really know all of these layers (especially once you reach the level of auto-generated things and other multi-layer constructs). The drawback of this is that not knowing all of the layers turns you into a push-button monkey; you don't actually understand your system any more, you can just push buttons to get results as long as everything works (or doesn't go too badly wrong).
All of this suggests a way to decide when automation is going to be worth it: just compare the amount of time that it'll take for people to learn the automation system and how you're using it with how much time they would spend doing things by hand. You can also compare more elaborate automation systems to less elaborate ones this way (and new elaborate 'best practices' systems to the simple ones you already have).
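As a back-of-envelope version of this comparison (all of the hours below are made-up numbers that you'd replace with your own estimates):

```python
# When does automation pay for itself in time? Compare the up-front
# cost of learning the automation system (and your use of it) against
# the ongoing time saved. All figures here are illustrative guesses.

def automation_pays_off(learn_hours, manual_hours_per_month,
                        automated_hours_per_month, months):
    """True if the time saved over 'months' exceeds the learning cost."""
    saved = (manual_hours_per_month - automated_hours_per_month) * months
    return saved > learn_hours

# Say learning Chef plus your local usage of it costs 80 hours per person,
# hand-maintaining Apache configs takes 6 hours/month, and the automated
# version takes 1 hour/month.
assert automation_pays_off(80, 6, 1, months=24)       # pays off over two years
assert not automation_pays_off(80, 6, 1, months=12)   # but not over one year
```

The same function lets you compare an elaborate system against a simple one: plug in each system's learning cost and per-month overhead and see which comes out ahead over your planning horizon.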
(One advantage of using a well known automation system such as Chef or Puppet is that you can hope to hire people who already know the automation system in general, cutting out one of the levels of learning. This is also a downside of having your own elaborate automation system; you are guaranteed that new people will have to learn it.)
By the way (and as is traditional), the people who designed and built the automation system are in a terrible position to judge how complex it is and how hard it is to learn, or even to see this issue. If you're the builder, you're usually not going to see the system as complex or hard to keep track of, because to you it isn't; you're just too close to the system and too immersed in it to see it from an outside perspective.
PS: Automation can have other benefits in any particular situation that are strong enough to overcome this disadvantage (including freeing sysadmins from drudgery that will burn them out). But it's always something to remember.
(This is closely related to the cost of automation but is not quite the same thing; in that entry I was mostly talking about locally developed automation instead of using standard automation tools.)
2012-02-09
What supporting a production OS means for me
Via Pete Zaitcev I just read FreeBSD and release engineering (lwn.net), which is about some issues people are having with the FreeBSD release schedules. That article will probably be a Rorschach test for at least non-FreeBSD people, because how you react to the ideas in it will probably depend on how you see 'production support'. Before I get into my reactions to it (in another entry), I've decided to write down how I see production support.
To me:
- Full support of a production OS release means that it gets bugfixes,
security fixes, and support for new hardware; the latter is necessary
so that you can actually install it on new servers and machines that
you can or want to buy today.
- Legacy support of a production OS release means security fixes and
major bugfixes, but you no longer need to update it for new hardware
or fix smaller bugs. That it will not necessarily install and run on
current-generation hardware is what makes it 'legacy'.
(Ie legacy support is to keep already-installed machines running, not to let you keep installing new machines.)
There should be some overlap in full support between release X and release X+1, because people are not necessarily ready to start installing new machines with release X+1 the moment it comes out.
A production release should not change the code of existing features and systems beyond genuine bug fixes and security updates (and for new hardware support if you really, really have to do so). I really do mean 'the code of', not just 'the behavior of', because any code change adds the possibility of bugs and incompatibilities. I don't trust OS vendors to not accidentally introduce either of these in code changes, so I want code changes minimized as much as possible.
I don't mind if a production release (in either full support or legacy support) gets genuinely new features and software provided that those features are optional and are written in such a way that there is no possibility that they will change or destabilize systems that do not use them. For example, if you want to add iSCSI target support as some new programs and a new loadable kernel module, knock yourself out and I don't care. But if your new iSCSI target driver requires changes to general kernel code, no, forget it; that's too dangerous.
(Another way to put this is that new features are analogous to new hardware support, except that unlike new hardware support you don't get to change anything that already exists. If you can't add it without changing existing things, you don't get to add it at all.)
I don't think that you should allow optional replacement of existing software with new software versions (eg, an optional upgrade from Apache 1.x to Apache 2.x) because it creates a combinatorial explosion in significant variations on the initial production OS release. Among other issues, you can easily get into a situation where no one actually really supports running the original Apache 1.x version any more because everyone has made the 'optional' upgrade to Apache 2.x, which effectively makes the upgrade mandatory instead.
(The extreme version of this is how Debian stable and unstable used to be, where practically no one in the Debian development community actually used the stable version any more.)
A general point about SSH personal keys
Recently I've seen a number of articles on suggested good ways to use SSH securely and other SSH tricks (unfortunately I can't find URLs to all of them, so I'm not going to try to put any here). As it happens I have a few modest suggestions on this, but before I started I wanted to make a broad meta-point about the use of personal SSH keys, aka SSH identities.
The big thing to understand about all advice about SSH personal keys is that when you choose to use personal keys for your own logins, you are deciding to balance convenience with security. After all, if security were your primary concern you would not use personal keys at all; you would use one-time passwords with two-factor authentication.
(Things are different for cron'd scripts and the like, when there is no human there to interact with the system. I'm purely talking about using SSH identities to avoid typing passwords.)
Now, everyone has different views of the amount of security that they need and the convenience that they want. People fall along a spectrum between the two poles and where you wind up is not necessarily where I do. Thus, people's security advice about personal keys is not necessarily right for you even if it's correct (in some sense). The trick is to understand your particular tradeoffs and circumstances, to figure out what irritates you and what you need, and then to pick what works for you rather than blindly following someone else's suggestions and being either frustrated or dangerously insecure (in your environment) or both.
Yes, some things will make you less secure than others but they can also be more convenient (and vice versa). Sometimes this is the right tradeoff for you and sometimes it is not (even if it's the right tradeoff for me or whoever you're reading). And yes, there are some SSH tricks that usually increase both security and convenience. These are excellent things to know when you can find them.
(Sadly, my suggestions to come are not of this nature.)
PS: as always when you consider security related issues, you want to think about not just security in the abstract but security in the concrete in your environment with your risks.
2012-02-05
My view on what will kill 'traditional' system administration
Phil Hollenback recently wrote DevOps Is Here Whether You Like It Or Not, in which he writes that traditional system administration is dying. While I sort of agree with him about the death, I don't think it's necessarily for the reasons that Phil points to.
Fundamentally, there has always been a divide between small systems and large systems. Large systems have had to automate and when that automation involved applications, it involved the developers; small systems did not have to automate, and often do not automate because the costs of automation are larger than the costs of doing everything by hand. Moving to virtualization doesn't change that (at least for my sort of system administration, which has always had very little to do with shoving actual physical hardware around); if you have only a few virtualized servers and services, you can perfectly well keep running them by hand and it will probably be easier than learning Chef, Puppet, or CFEngine and then setting up an install.
(If you're future-proofing your career you want to learn Chef or Puppet anyways, so go ahead and use them even in a small environment.)
There are two things that I think will change that, and Phil points to one of them. Heroku is not just a virtualization provider; they are what I'll call a deployment provider, where if you write your application to their API you can simply push it to them without having to configure servers directly. We've seen deployment providers before (eg Google App Engine), but what distinguishes Heroku is how unconstrained and garden-variety your API choices are. You don't need to write to special APIs to build a Heroku-ready application; in many cases, if you build an application in a sensible way it's automatically Heroku-ready. This is very enticing to developers because (among other things) it avoids lock-in; if Heroku sucks for you, you can easily take your application elsewhere.
(This has historically not been true of other deployment providers, which makes writing things to, say, the Google AppEngine API a very big decision that you have to commit to very early on.)
Deployment providers like Heroku remove traditional system administration entirely. There are no systems or services to configure, and the developers are deeply involved in deployment because a non-developer can't really take a random application and deploy it for the developers. If there is an operations group, it's one that worries about higher-level issues such as production environment performance and how to control the movement of code from development to production.
The other thing is general work to reduce the amount of knowledge you need to set up a Chef or Puppet-based environment with certain canned configurations. Right now my impression is that we're still at the stage where someone with experience has to write the initial recipe to configure all N of your servers correctly, and you might as well call that person a sysadmin (ie, they understand Apache config files, package installation on Ubuntu, etc). However it's quite possible that this is going to change over time to the point where we'll see programs shipped with Chef or Puppet recipes to install them in standard setups. At that point you won't need any special knowledge to go from, say, writing a Django-based application to installing it on the virtualization environment of your choice. This really will be the end of developers needing conventional sysadmins in order to get stuff done.
The general issue of the amount of hardware in a small business (and virtualizing the hardware) ties into a larger question of how much hardware the business of the future is going to need or want, but that's a different entry. I will just observe that the number of servers that you need for a given amount of functionality has been steadily shrinking for years.
Sidebar: what virtualization does change now
I think that plain virtualization does mark a sea change today in one way: it moves sysadmins away from a model of upgrading OSes to a model of recreating their customizations on top of a new version of the OS. Possibly it moves away from upgrading software versions in general to 'build new install with new software versions from scratch, then configure'.
This is partly because the common virtualization model is 'provide base OS version X image, you customize from there' and partly because most virtualization makes it easy to build new server instances. It's much easier to start a new Ubuntu 12.04 image than it is to find a spare server to use as your 12.04 version of whatever.
(Note that virtualization may not make it any easier to replace your Ubuntu 10.04 server with a new 12.04 server; there are a host of low level practical issues that you can still run into unless you already have a sophisticated management environment built up.)
I don't think that this is a huge change for system administration, partly because this is pretty much how we've been doing things here for years. We basically never upgrade servers in place; we always build new servers from scratch. Among other things, it's much cleaner and more reproducible that way.