2009-06-30
More on why users keep mailing specific people
Perhaps unsurprisingly, most of the people who've commented on my last entry have attributed this behavior to people's desire to get their issues dealt with promptly (to some, this is jumping the queue; what you could call a 'vigorous discussion' has broken out in the comments about this, rather to my surprise). I have a couple of reactions to this view.
First, I am pretty sure that this is not why people here do it, at least for the kind of cases that I'm thinking of. We don't have a trouble ticketing system or the like, just email aliases, and generally the users email the person who was going to deal with their issue anyways; the only effective difference is what email address they use. Hence my belief that our users really do keep emailing specific people because it's easier to remember people than mail aliases.
(From our perspective it matters what they email; when you mail an alias, everyone is in the loop and we have a record of it. But these are internal process issues, not things that the users care about, at least not until the person they emailed is out sick one day. And I actually suspect that they accept that sort of thing happening, because after all they did email a specific person.)
Second, if getting prompt responses is the reason that users are mailing you directly, you have at least one problem: either your response times are perceived as too slow or the procedures for going through regular channels are too complicated. If users are also doing it to jump the queue, it is my opinion that you also have either problem users or a significantly dysfunctional organizational environment (at a minimum, one where there is vigorous disagreement over what your priorities should be).
In either case, swatting users on the nose is generally not an effective way to solve your problems (although it is a great way to make them worse). Instead, you need to deal with the root causes, the hard social problems. Sometimes this will be beyond your power; in that case I believe that you need to do the best you can and be as transparent about what is going on as possible.
(If the problem is organizational politics, the last thing you want to do is put yourself in the position of being everyone's chewtoy. Let aggravated people see that it is not your fault, so that they can take their gripes to higher powers. And if you're dealing with problem users, you really want to have management approval of what you're doing; otherwise, you may find out the hard way that the problem users have more power than you do.)
2009-06-29
A theory on why users keep mailing specific people
Like many places, we have several generic aliases that users mail about various issues. And, just like I expect happens everywhere, every so often users don't use those aliases and instead email some specific person here with their issue.
I recently came up with a theory for why this happens: it's easier to remember people (and then their email address) than it is to remember something impersonal. So people remember 'oh, I dealt with <X> last time to fix this', and they don't necessarily remember 'oh, I'm supposed to mail this random address'. And <X> gets more email.
(I am theorizing about this, but we know that humans have a fair amount of brainpower that's devoted to paying attention to other people (and we anthropomorphize like crazy), so it seems at least reasonable.)
Sysadmins may not see this as reasonable, but then I've got to point out that as successful sysadmins we are basically required to be good at memorizing computer-related trivia. Of course we can easily remember various abstract email addresses and keep them straight; we spend all day doing similar things, and to boot we see the email addresses a lot more than the typical person does, so they're more familiar to us.
Unfortunately, I can't think of anything useful to do with this theory. It does make me wonder if anyone has experimented with deliberately anthropomorphizing their generic aliases and support systems, and if so if it did any good.
2009-06-26
How not to set up your DNS (part 19)
It's been quite a while since the last installment, but today's is an interesting although simple case. Presented in the traditional illustrated format:
; sdig ns xing121.cn
dns1.dns-dns.com.cn.
dns2.dns-dns.com.cn.
; sdig a dns1.dns-dns.com.cn.
127.0.0.1
; sdig a dns2.dns-dns.com.cn.
127.0.0.1
As they say, 'I don't think so'. If you run a caching resolving nameserver that does not have 127.0.0.1 in its access ACLs, this sort of thing is a great way to have mysterious messages like this show up in your logs:
client 127.0.0.1#21877: query (cache) 'www.xing121.cn/A/IN' denied
(Guess how I noticed this particular problem.)
Judging from our logs, there seem to be a number of Chinese domains that have this problem (with the same DNS servers), assuming that it is a problem and not something deliberate.
Less straightforward is this case:
; sdig ns edetsa.com.
ns1.hn.org.
tucuman.edetsa.com.
; sdig a ns1.hn.org.
127.0.0.1
; sdig a tucuman.edetsa.com.
200.45.171.226
One possible theory is that hn.org no longer wishes to be a DNS server for edetsa.com but can't get edetsa.com's cooperation, so they've just changed the A record for that name to something that makes people go away. (hn.org has real working DNS servers of its own.)
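(If you want to look for this sort of thing yourself you don't need my sdig wrapper; plain dig will do. A quick sketch, using the first domain above as the example:)
# For each of the domain's nameservers, print any A records that point
# at loopback addresses.
domain=xing121.cn
for ns in $(dig +short NS "$domain"); do
    for addr in $(dig +short A "$ns"); do
        case "$addr" in
            127.*) echo "$domain: nameserver $ns resolves to $addr" ;;
        esac
    done
done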
2009-06-25
Patching systems versus patching appliances
We have two categories of machines here; let me call them appliances and systems. Systems are machines that users have access to or that provide complex, general services to the Internet: our login and compute servers, our IMAP server, and so on. Appliances are not user accessible, are more or less hermetically sealed, and generally only expose the specific service that they're supposed to provide.
We keep systems up to date on patches. We have to; they're exposed. But we almost never patch appliances, because not only are they hermetically sealed, they're almost always vital parts of our production environment. Once they work, we want to touch them as little as possible; applying patches is a change, and it's usually a change that doesn't fix an externally accessible vulnerability anyway. Generally this means that all we care about is OpenSSH (and OpenSSL) vulnerabilities and vulnerabilities in whatever service the machine provides.
(We don't care very much if there is a way to go from a local user account to root, for example, because everyone who can log in to the machine already has root. Now, this is slightly dangerous, since it means that someone who can get access to one of our accounts can bootstrap to compromising the entire system.)
We entirely don't care about bugfix patches for our appliances, unless we happen to be running into the bug. And generally we aren't, because we do our best to make sure that the machines work before we deploy them. (Sometimes we deploy with known issues and cross our fingers, but we try to avoid that.)
Sometimes this gets annoying. For example, our iSCSI backends are appliances, so every so often Red Hat's management tools send us plaintive emails to tell us that we are very, very out of date on patches on them. Yeah, we'll get right on that.
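(If we ever did want to check just the packages we actually care about on one of those machines, something like this would probably do; a sketch, assuming a Red Hat style system with yum:)
# Only ask about updates to the remotely exposed bits we worry about.
# yum check-update exits with status 100 if any updates are available.
yum -q check-update 'openssh*' 'openssl*'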
(Note that we don't make any particular attempt to minimize the software installed on appliances, although we do minimize the running network services. But that's another topic.)
2009-06-22
Why GNU tools are sometimes not my favorite programs
Presented in the traditional illustrated form:
; cat a
root
cks
; cat b
root
cks
abc
; comm -13 a b >/dev/null
comm: file 2 is not in sorted order
If your comm doesn't do this, don't be surprised; this behavior is new in the very latest version of comm, from coreutils 7.2 (as installed on Fedora 11; Fedora 10 didn't have it). This behavior is turned off by the new --nocheck-order option, although the manpage contains scary warnings about this not being supported.
Congratulations, GNU coreutils maintainers. You have just broken any number of scripts that were using comm to obtain differences between ordered files; all of these scripts now produce extra output, which is bad. Worse, fixing this will make the scripts unportable, since not even previous versions of GNU comm understand the new --nocheck-order option.
(Yes, yes, technically this behavior is allowed by the Single Unix Specification. But in real life that doesn't matter; the true specification is not whatever is allowed by the letter of the standards, it is what everything does and what people write to.)
Also, this is utterly the wrong way to change behavior like this. The correct way is to first introduce the necessary command line switches while leaving the default behavior alone (no warning yet), with a note that the default will change in X releases' time. Then several versions later you can start to think about changing the default, since people have had a chance to add the new options to their scripts. (You will still fail, because people don't even look at perfectly working scripts, much less update them, but at least you will have made vague motions towards doing the right thing instead of being an asshole.)
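(For what it's worth, the least-bad fixes I can see for affected scripts avoid depending on the new option entirely, or probe for it; a sketch, using the example files from above:)
# Approach one: hand comm input that is in the sort order it now insists
# on. This works with both old and new comm, although the output comes
# out in sorted order instead of the files' original order. (The <(...)
# process substitution is a bashism; use temporary files in plain /bin/sh.)
comm -13 <(sort a) <(sort b)
# Approach two: only pass --nocheck-order if this comm understands it.
NOCHECK=
if comm --nocheck-order /dev/null /dev/null >/dev/null 2>&1; then
    NOCHECK=--nocheck-order
fi
comm -13 $NOCHECK a b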
2009-06-13
One of the reasons good alerting is tough
One of the reasons that alerting is a tough problem to solve well is what I'll call the dependency problem. It goes like this: imagine that you have a nice monitoring system and it's keeping track of all sorts of things in your environment. One day you get a huge string of alerts, reporting that server after server is down. Oh, and also a network switch isn't responding.
Of course, the real problem is that the switch has died. It's being camouflaged behind a barrage of spurious alerts about all of the servers behind it, which are no longer reachable and look just like they've crashed too. This is the alerting dependency problem: the objects you're monitoring aren't independent, they're interconnected. Reporting everything as if it were independent produces results that are not necessarily very productive, especially during major failures.
The obvious but useless solution to this is that you should configure the service dependencies when you add a new thing to be monitored. This has at least two problems. First, sysadmins are just as lazy as everyone else, especially when they're busy to start with. Second, this dependency information is subject to the problem that sooner or later, any information that doesn't have to be correct for the system to work won't be. Perhaps someone will make a mistake when adding or changing things, or maybe someone will forget to update the monitoring system when a machine is moved, and so on.
(One way to look at this is that the dependency information is effectively comprehensive documentation on how your systems are organized and connected. If this is not something you're already doing, there's no reason to think that the documentation problem is going to be any more tractable when it's done through your monitoring system. If you are already doing this, congratulations.)
So, really, a good alerting system needs to understand a fair bit about system dependencies and be able to automatically deduce or infer as many as possible, so that it can give you sensible problem reports. This is, as they say, a non-trivial problem.
(Bad alerting systems descend to fault reporting.)
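(To illustrate the flavour of the problem rather than solve it: even a crude, hand-maintained dependency map changes what gets reported. A sketch, with an invented deps.txt that lists 'host switch' pairs:)
# Don't report hosts as down when the switch in front of them is itself
# unreachable; report the switch instead. (Crude: the switch gets mentioned
# once per host behind it, and deps.txt has to be kept current, which is
# exactly the hard part.)
while read host switch; do
    if ! ping -c 1 -w 2 "$switch" >/dev/null 2>&1; then
        echo "switch $switch is down; not reporting $host separately"
        continue
    fi
    if ! ping -c 1 -w 2 "$host" >/dev/null 2>&1; then
        echo "host $host appears to be down"
    fi
done < deps.txt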
2009-06-11
There are two different purposes of monitoring systems
It's worth saying this explicitly: monitoring systems have two different purposes, one of which is sort of a subset of the other (but not necessarily).
The first purpose of monitoring, and what many people initially install a system for, is alerting, letting you know when there are problems. The second purpose of monitoring is tracking, gathering ongoing data for historical analysis; this is part of the vital work of getting statistics. Put this way, it's clear that these two overlap (sometimes badly); it is useful to track what you alert on (even if it is just whether or not there was an alert), and it is all too common to alert on everything that you track.
It's tempting to say that alerting is a subset of tracking, but I maintain that this is a mistake. Alerting done well needs fundamentally different features than just tracking plus telling people when the value of the tracked object is out of range; for example, if you take alerting seriously you want to have some way of sending alerts only once.
(And this is the tip of the iceberg. Alerting is difficult to do well. To be fair, so is tracking.)
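(As a tiny illustration of the 'only once' point, here is roughly the sort of machinery you wind up needing; a sketch with a made-up state directory and mail alias:)
# Remember which problems we've already alerted on, so that each one
# produces a single email instead of one per monitoring pass.
STATEDIR=/var/tmp/alert-state
mkdir -p "$STATEDIR"
alert_once() {
    key="$1"; msg="$2"
    if [ ! -e "$STATEDIR/$key" ]; then
        touch "$STATEDIR/$key"
        echo "$msg" | mail -s "ALERT: $key" sysadmins@example.org
    fi
    # something also has to remove $STATEDIR/$key when the problem clears,
    # which is where the real complexity starts
}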
It follows that when you decide to monitor something, you should decide why you're monitoring it; do you want to track it, to alert on it, or both? Not everything makes sense to alert on, and not everything makes sense to track in detail.
2009-06-08
Monitoring systems should always be usefully informative
There is a fashion in monitoring systems that, once you have one, you start monitoring and alerting on everything that you can possibly think of, no matter what it is. If you can measure it, you do, and when it gets too big or too small your system lets people know about it.
This is, by and large, a mistake.
It is a mistake because you've created a system that isn't actually (usefully) informative, just noisy. What your monitoring system should be telling you about is real problems that you can do something about, not things that either aren't real problems or aren't problems that you can do anything about.
(Take, for example, the ever popular monitoring of free user disk space. At least around here, there is nothing we can do if a user filesystem runs out of space; we can neither go in and remove user files to get space back nor add more space that the users haven't paid for.)
The less noise your monitoring system has, the more likely people are to look at it and actually pay attention if it has trouble indicators. A monitoring system that always shows trouble indicators is about as useful as a fire alarm that is on all the time (although probably less annoying).
Yes, yes, people can learn to ignore 'known harmless' trouble indicators. The problem is that this takes mental work, which means that it takes more effort to check the monitoring system, which means people do it less often or pay less attention to it or both. It also means that you cannot look at a top level summary and get anything useful from it, because the overall system is never in 'all green' condition. And having something that you can quickly glance at to look for problems is a significant win.
Sidebar: the case for widespread monitoring
There is a case for tracking everything you can, provided that your monitoring system keeps history and can display 'measure over time' graphs or the like. Then what you're doing is getting statistics, which is vital. But if you're tracking things for statistics, you should not alert on them.
So by all means track user disk space usage, so that you can draw nice graphs about six month trends in space growth that clearly justify another shelf of disks. Just don't alert on them.
(This is one area where canned monitoring systems are your friends, because they have probably already got systems to keep lots of history of random measurements and graph them for you.)
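(Even without a canned system you can get the raw data cheaply; for instance, a crude sketch of capturing daily disk usage for later graphing, appended to a made-up log file from cron:)
# Record a timestamped snapshot of each filesystem's total and used space.
df -kP | awk -v ts="$(date +%Y-%m-%d)" 'NR > 1 { print ts, $6, $3, $2 }' >> /var/log/df-history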
2009-06-06
User perceptions (and expectations) of backups
One of the reasons I think we can go with our planned backup schedule is how I think that users will perceive it. This matters because unless you're lucky enough to have a specific backup schedule mandated from on high, part of setting up your backup system is managing user expectations of what you can deliver and making sure that they feel that it's reasonable.
(Let us take it as given that you cannot just ask the users what they want and then deliver all of it. Users don't want to have to think about that sort of stuff, and if you push them to do so anyways they will tell you that sure, they'd love to have daily backups that go back five years. Because, really, who wouldn't?)
Specifically, my sense is that the further your backups go back, the less 'precision' that people will expect, the less they will expect you to be able to get back a file as of a specific day. Intuitively, it seems far less reasonable to demand that your system people be able to restore your files as of exactly one year, three months, and two days ago than it does to demand that they be able to get back your files as of last Tuesday.
(Partly this seems intuitive because in general recent events are more vivid and more precise in people's minds than more distant ones.)
You can of course shift your users' intuitions and train them to expect your backup system to be more precise than that. Just be sure that you can deliver on it.
(We never discussed specific coverage with our users, partly because it was complicated, partly because we knew it would change over time, and partly because very few users are actually interested in that, or care enough to demand that it be officially documented and promised.)
2009-06-05
How we're planning our backup storage capacity needs
Part of the fun of backups is trying to work out how much backup storage you need in order to do a decent job of recovering from mistakes. Unless you are lucky, you will not have a specific mandate about how much to save for how long, which means that it is up to you to figure out a scheme that you can afford, that provides enough coverage, and that people will be happy with; there are no one size fits all answers.
We didn't do anything sophisticated for our old tape based backup system; we let Amanda reuse tapes in its default least-recently-used order, and just bought enough tapes that we could go a year before we cycled around at our planned backup frequency. This worked okay initially but then started falling down when we outgrew our tape backup capacity.
Moving to our disk based backup system has given us an opportunity to rethink our approach, rather than just buying a pile of disks to go along with the pile of tapes we already have. Our current plan is that we will do periodic 'checkpoint archives' to tape (using our old tape backup system and the tapes we already have), probably once every three months, and then have around a month's worth of rolling backups (so that we can go back to any day for up to a month) in the disk based system. Our goal is to have at least a year and ideally two years or more of the checkpoint archives.
One reason for this split scheme is the sort of restore requests we tend to get. When people ask us to restore something from a significant time ago, it's usually some old archival data that they cleaned up thinking that they would never need it again (only to find out otherwise now). As old archival data it tends to have been sitting there unchanged for months (or years), so a backup from anywhere in a broad time range is fine.
Or in short, for old restores we tend to get asked for 'any copy of the file from before time X (when I removed it)'. So we don't need to have lots and lots of copies (one for every full dump in Amanda's regular dump cycle), just a few.
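(Mechanically, the 'around a month of rolling backups' part of this is the easy bit; a rough sketch, assuming one directory per day of backups under a made-up /backups/daily:)
# Expire daily backup directories once they're more than about a month old;
# the quarterly checkpoint archives on tape cover anything older. Run with
# -print instead of the -exec first to see what it would remove.
find /backups/daily -mindepth 1 -maxdepth 1 -type d -mtime +31 -exec rm -rf {} +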
(Disclaimer: as before, I am just writing this down; the hard work and planning was done by my co-workers.)