2012-11-29
One reason why having metrics is important
Put simply, one of the reasons that having metrics is important is that metrics give you a backup check on changes that you think are harmless.
In many environments you can't exhaustively check the effects of every change you make. Trying to do performance checks all of the time is simply both too time-consuming and too much like monkey-work (and people tune out of monkey-work). Sooner or later you'll start deciding that some changes are safe enough that you can skip some or all of your checks, or you'll just let them slip because there's something that seems more urgent right now but you'll get to them later (honest).
(The more work your checks are to do and the more genuinely harmless changes you make, the sooner this happens. Humans have a quite strong drive to avoid useless and pointless work.)
The advantage of automatic, constant collection of metrics is that all of this is handled for you, whether or not you think you need it and whether or not you remember (and can be bothered); it happens without you having to do anything. This is in a sense not as good as explicit performance checks (which may give you more information and which you're going to look at right away, not maybe later), but it's a lot better than nothing and often this is the real choice.
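To make 'automatic, constant collection' a little more concrete, here is a minimal sketch in Python of the core loop such a system has. Everything in it (the device name, the sampling interval, the log location) is an invented example, and a real metrics system obviously does much more than append lines to a file:

    #!/usr/bin/env python3
    # Minimal, hypothetical periodic metrics collector: every INTERVAL
    # seconds, read Linux's /proc/diskstats, pick out one device, and
    # append a timestamped line to a log file.
    import time

    DEVICE = "sda"                      # hypothetical device of interest
    INTERVAL = 60                       # sample once a minute
    LOGFILE = "/var/tmp/diskstats.log"  # hypothetical log location

    def read_device_stats(device):
        """Return the raw stats fields for one device from /proc/diskstats."""
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return fields[3:]
        return None

    def main():
        while True:
            stats = read_device_stats(DEVICE)
            if stats is not None:
                with open(LOGFILE, "a") as log:
                    log.write("%d %s %s\n" % (time.time(), DEVICE, " ".join(stats)))
            time.sleep(INTERVAL)

    if __name__ == "__main__":
        main()

Once something like this has been quietly accumulating samples, answering 'did that change affect disk IO?' becomes a matter of looking at the log instead of remembering to run tests.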
I know that I'm late to the party on this, but sometimes it takes a while for things to sink through my skull.
(Perry Lorier noted effectively this in a comment on yesterday's entry.)
As an obvious side note: this is of course closely related to the benefits of automatic (and fast-running) unit and other tests for programmers. In both cases we're trying to make something automatic and cheap instead of manual and expensive so that it will get done all the time no matter what instead of being at the mercy of people's whims. What sysadmins do is less amenable to unit tests but more amenable to constant live monitoring, so we can get the same effects (especially if it's combined with alerting, as noted by Perry Lorier).
2012-11-28
When you make a harmless change, check to make sure that it is
As I mentioned before, our recent disk performance issue has been good for a number of learning experiences. One of the most painful ones (for me) comes from the fact that this was actually a self-inflicted performance problem. You see, we did not start out using two different sorts of switches on our two iSCSI networks. Initially both networks used the lower-end switches, until at one point, for various reasons, we decided to swap one out for the higher-end and highly trusted core switch. After the change everything appeared to work fine, and because we were sure it was a harmless change we didn't try to do any performance tests.
Let me repeat that, rephrased: because we thought this was a harmless change, we didn't check to make sure that it really was. And it turned out that we were very wrong about it; our harmless change of switch models led to a quiet but significant degradation in disk IO performance that lasted for, well, a rather long time. Had we looked at performance numbers before and after the change we might well have discovered the problem immediately.
(Even crude metrics might have been good enough, like the amount of time it took to do backups or ZFS pool scrubs.)
This is one of the corollaries of our inability to understand performance in advance. In real systems, we don't actually know in advance that something's going to be a harmless change that has no effect on the system. The best we can do is make educated guesses and they're sometimes wrong. So the only way to actually know that a change is harmless is to make the change and then measure to verify that no important performance characteristics have changed.
(In fact this is a narrow subset of the general rule that we can't necessarily predict the effects of changes in advance. We can come up with heavily informed theories, but per above they can be wrong even under near-ideal situations. In the end the proof is in the pudding.)
One direct lesson I'm taking from this is that any future changes to our iSCSI and fileserver infrastructure need to have a bunch of performance tests run before and afterwards. More than that, we should be periodically checking our performance even when we think that nothing has changed; as we've seen before, things can change without us doing anything.
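As an illustration of how little a crude before-and-after check needs to be, here is a sketch in Python that times the same simple sequential read a few times and reports the median. The file path and sizes are made-up examples, and a real version would need a test file large enough that the page cache doesn't hide the actual disk and iSCSI behaviour:

    #!/usr/bin/env python3
    # Crude before/after check (a sketch): time the same read workload
    # several times and report the median, so the number from before a
    # "harmless" change can be compared with the number afterwards.
    # NB: repeated reads of one file will mostly hit the page cache; a
    # real test needs a file bigger than RAM or some other way around it.
    import statistics
    import time

    TESTFILE = "/some/fs/under/test/perftest.dat"  # hypothetical large file
    CHUNK = 1024 * 1024                            # read in 1 MB chunks
    RUNS = 5

    def time_sequential_read(path):
        """Read the whole file once and return elapsed seconds."""
        start = time.time()
        with open(path, "rb") as f:
            while f.read(CHUNK):
                pass
        return time.time() - start

    def main():
        times = [time_sequential_read(TESTFILE) for _ in range(RUNS)]
        print("runs:", ["%.2fs" % t for t in times])
        print("median: %.2fs" % statistics.median(times))

    if __name__ == "__main__":
        main()
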
(There are some broader things that this is teaching me, but I'm not going to try to put them in this entry (partly because when I start to do so, the clear view I have of them in my head slides away from me, which is a good sign that they're not actually clear enough yet).)
2012-11-15
A learning experience: internal mail flow should never be allowed to bounce
The university runs a central email system for all undergraduates. Last week that system started bouncing incoming email, and in doing so it taught me an uncomfortable lesson that I now need to apply to our own mail environment.
You see, the university doesn't actually run this email system; almost all of it is outsourced to a third party email provider. While the undergraduate email domain is MX'd to university machines, they're just a relay; they immediately shuffle incoming mail off to the outside provider, who stores it and provides access to it and so on. The piece that broke down last week was the relaying step; the domain name the university relays to stopped resolving and so the relay machines started bouncing email with errors about 'unresolvable destination <blah>'.
The problem with bouncing email here is that this was not normal SMTP mail (where failure is routine and so on). This was mail flowing between two internal components using SMTP as the transport protocol and it was never supposed to fail. If some piece of your internal mail flow fails, it's an internal problem. Bouncing mail on these failures turns internal failures into external ones.
In short: failures of internal mail flows should never produce bounces, even if your internal mail flows are done by having regular mailers send messages back and forth via SMTP. If there is an internal failure, what you want to happen is for the messages involved to be preserved somehow (either frozen in place or moved out of the way). Then when the problem is resolved, you can revive the affected messages and have them continue on (just delayed).
This sounds obvious and you may all be nodding along sagely, but guess what our own mail system doesn't do? Our mail system has internal flows just like the central undergrad email system and all of them are susceptible to this problem. If something goes wrong in our internal mail flow, we too will bounce messages and lose email in the process.
(In addition parts of our email system specify the next-hop flow destination by name instead of by IP address, so we are one DNS issue away from an explosion.)
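To make the 'preserve, don't bounce' idea concrete, here is a sketch (with invented host and directory names) of the logic an internal relay step wants: on any failure, including the upstream name not resolving, the message gets written somewhere safe and the failure is treated as temporary, so nothing ever becomes a bounce. Real mailers can generally be configured to behave this way; the sketch just spells the behaviour out:

    #!/usr/bin/env python3
    # Sketch of an internal relay step that preserves messages instead of
    # bouncing them. The upstream host and holding directory are
    # hypothetical names for illustration.
    import os
    import smtplib
    import time

    UPSTREAM = "mailstore.example.com"   # hypothetical internal next hop
    HOLD_DIR = "/var/spool/held-mail"    # hypothetical holding area

    def relay_or_hold(sender, recipients, msg_bytes):
        """Relay the message upstream; on any failure, preserve it and defer."""
        try:
            with smtplib.SMTP(UPSTREAM, timeout=30) as conn:
                conn.sendmail(sender, recipients, msg_bytes)
            return "delivered"
        except (OSError, smtplib.SMTPException):
            # Covers DNS failures (socket.gaierror is an OSError), refused
            # connections, and SMTP-level errors. Write the message to the
            # holding area instead of generating a bounce; it can be revived
            # and retried once the internal problem is fixed.
            os.makedirs(HOLD_DIR, exist_ok=True)
            fname = os.path.join(HOLD_DIR, "msg-%d-%d" % (time.time(), os.getpid()))
            with open(fname, "wb") as f:
                f.write(msg_bytes)
            return "held"
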
The embarrassing thing about this for me is that this should not be a new observation. We (and by that I mean 'I') have actually fumbled the internal flow of our mail system in the past, leading to a not insignificant amount of bounced email. But the stupidity of the whole 'should-never-happen problem in the mail system internals causing user-visible bounces' situation did not strike me at the time for whatever reason.
(I think it's partly because at the time I was thinking of my failure as a general mail system configuration mistake, and it's very hard to keep significant failures there from causing bounces. Only now did I think about the specifics of a failure during an SMTP-based handoff and why this results in user-visible bounces.)
PS: to make it extremely explicit, I don't think that the people responsible for the central undergraduate email system are stupid for missing this and having email bounce on them. As I mentioned, I missed this too despite having it smack me in the face at one point. This could have been us and more or less was us in the past; that's why it's an uncomfortable lesson.
2012-11-08
Devops, the return of system programmers?
Here's a thought I've been turning over in my mind for a while now: in the right light, you can see the Devops movement partly as the return of system programmers, who have been out in the wilderness for a while due to the predictable trajectory of the field.
Now, I've got a limitation here in that as an outside observer, I'm not sure that I really understand what 'Devops' is. But part of it certainly seems to be an increased focus on tooling, among other high-level work. When Devops places have people who focus strongly on operational issues, those people seem to do relatively little traditional system administration and to spend much more of their time developing things like stats-gathering daemons, graphing dashboards, and various sorts of automation systems (of course, it's possible that this is just what they talk about in public as the interesting stuff). As system-level programming, this sort of thing is solidly in the old system programmer mold.
There's also another side of this. One view of the entire Devops movement for system administrators is that system administrators need to upskill themselves. A significant part of that upskilling is more or less explicitly a move into programming, and 'a system administrator who spends most of their time programming' is effectively a recreation of a system programmer (assuming that their programming is focused on system management; if not, they've become a straight developer).
Personally, I don't mind this at all. If nothing else Devops has made developing infrastructure cool again; as someone who likes programming system tools I'm all in favour of that.
(To be clear, Devops is not just the return of system programmers. It covers a lot more than that, including things that are directly against attitudes from the old days. If Devops is in part the return of system programmers, it is as a side effect of more important shifts.)
2012-11-03
Another go-around on the drawbacks and balances of automation
Today, Ben Cotton tweeted:
When I see a #sysadmin say "I don't use any configuration management", I mentally add "because I'm allergic to competence."
I have a complex reaction to things like this and after a small conversation I condensed some of my opinions to this:
@thatcks: My overall view on #sysadmin automation is that automation adds complexity as well as removes it. How each side balances out varies.
@thatcks: Before you automate, you manage your systems. After you automate you manage the automation (new work) and your systems (hopefully easier).
This deserves a bit more space than Twitter allows it.
All automation needs at least some management and adds some complexity. When you add automation, now you need to spend at least a little bit of time keeping the automation itself running and maintained; this was time that you did not need to spend before. This is ultimately because automation is software, not magic. But at the same time automation means that it takes less time and less work to maintain your systems (at least to the same level of quality), which is why people are driven to automation as their systems grow in size.
These two effects push against each other. How they balance out is not set in stone; it depends on the particular circumstances of your systems, on how much time managing the automation takes versus how much it reduces your system management workload. If you were previously logging in to a hundred servers to make changes on each, automation is going to reduce your workload a lot more than if you were doing this on ten servers. Or two. Similarly, automation that is a pain to keep going raises the gain threshold while easy automation lowers it. If your automation is so complex and fragile that you need a full-time person just to keep it going, you'd better be saving a lot of time with it (generally because you have lots and lots of systems). This degree of automation would be what we call 'overkill' for smaller systems, but when we say 'overkill' what we really mean is that it would take too much work to run for the system management gains it brings in practice.
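To put some entirely invented numbers on that balance, here is a trivial back-of-the-envelope sketch; the point is only the shape of the result, where automation loses badly at two systems, roughly breaks even at ten, and wins decisively at a hundred:

    #!/usr/bin/env python3
    # Back-of-the-envelope comparison of monthly time spent with and
    # without automation. All figures are illustrative assumptions,
    # not measurements.

    def monthly_hours(nsystems, changes_per_month,
                      manual_mins_per_system,
                      auto_mins_per_change, auto_upkeep_hours):
        manual = changes_per_month * nsystems * manual_mins_per_system / 60.0
        automated = changes_per_month * auto_mins_per_change / 60.0 + auto_upkeep_hours
        return manual, automated

    for nsystems in (2, 10, 100):
        manual, automated = monthly_hours(nsystems, changes_per_month=8,
                                          manual_mins_per_system=5,
                                          auto_mins_per_change=10,
                                          auto_upkeep_hours=4)
        print("%3d systems: manual %5.1f h/month, automated %5.1f h/month"
              % (nsystems, manual, automated))
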
The gains from improving system management generally scale based on how many systems you have, how often you change them, and how complex those changes are. Large, constantly changing environments are the best case; small static environments are the worst one. The cost of automation varies widely but generally not on any predictable basis, since a great deal depends on both software quality and the fit between the software and what you need to do. The small exception is that automation systems often have a target environment size that they're designed for and using them outside of a right-sized environment is extra-expensive.
(This applies both if you use a large-environment system in a small environment and if you try to make a small-environment system work in a large one.)
Sidebar: apples to apples comparisons with manual management
Some people will say that automation is automatically superior to managing your systems by hand because it ensures that everything is repeatable and documented. This may be true in practice, but it is not required in theory; in theory you can have carefully and thoroughly documented (and tested) manual management processes. It just takes more time.
The flipside of this is that if you're going to talk about how little time managing systems by hand takes, and thus how little automation would save you, you need to factor this in. It's only an apples-to-apples time comparison if you're spending that time to make your manual processes just as good as the automated ones.
(You can do this. We (mostly) do. But it undeniably takes extra time and work over plain ordinary manual system management, and it's easy to slip up.)