Wandering Thoughts archives

2011-08-31

Who is the audience for a trouble ticket update?

One of the things that commentators brought up in response to my entry on why we don't use a trouble ticketing system is that trouble tickets have multiple uses; for example, they can be used later to look up what you did to solve a problem, and the user can use them to see how their issue is progressing. I expect that this is a common thing to cite as a virtue of a ticketing system. However, I don't think that it's as easy as it sounds.

First, let's ask an awkward question: when you write trouble ticket updates, who are you writing the update for? Because these praiseworthy goals are generally in conflict; unless you have very unusual users, you cannot write an update that is simultaneously keeping your fellow sysadmins in the loop, documenting the solution to an ongoing problem, and giving the user useful information. Each of these goals calls for a different sort of writing with different contents.

(If you can resolve the problem in a single update you can at least collapse the first two cases together, but not all problems are amenable to this. And once you have a multi-step diagnosis and fix, well, as I've written earlier lab notebooks are not changelogs and so a series of progress reports are not the same as real documentation of a solution. You can reconstruct the latter from the former, but it is a reconstruction and it takes work; what you really want is to do the reconstruction once and then write it all down neatly.)

In particular, I think that you need to decide right up front whether trouble ticket updates are for you and your fellow sysadmins or for users. If they are for sysadmins, they can contain deep-dive technical details that may well be opaque to someone who doesn't know your system environment. If the updates are for anyone else, you need to write them so that they can be understood by outsiders (with as much or as little actual technical detail as you think your users can stand); this likely means leaving out important details that your fellow sysadmins need.

(A closely related issue is something that I wrote about back in PrivateTicketing, which is that there are times when you need to have discussions that users should definitely not see. These discussions obviously can't go in a public ticketing system.)

You can resolve all of these issues with a sufficiently complex trouble ticketing system, one where you actually have several different audiences for ticket updates (commentators on PrivateTicketing pointed out ticketing systems that support this in various ways). My personal feeling is that trying to wedge all of these different jobs into a single system is going to create something that's rather ungainly, but seeing as we've never tried to use a trouble ticketing system I have to admit that I have no hard evidence for this.

Sidebar: an example of writing for users versus sysadmins

Consider the user-focused writeup of the incident I described here. As you can imagine, my internal writeup included a great many more details than this (eg, I didn't name the failed backend in the user writeup) and also omitted some things (because they were implicitly known by my fellow sysadmins). It also used more technical terminology, because technical terminology is generally faster and more precise than more general writing.

TicketingAudience written at 01:25:38

2011-08-29

Devirtualization

Not quite a year ago, I wrote about our use of virtualization here in OurVirtualizationUse. That entry is now what we call 'inoperable', because we have just finished de-virtualizing: reverting from virtual machines back to running those machines directly on real hardware. We did this ultimately because we think that it's less of a hassle to run machines on real hardware than to run them virtualized.

In retrospect, I think that part of the problem is that our virtual host machine was underconfigured in terms of RAM and disk space. It had what seemed like a decent amount of both, but virtual machines add up surprisingly fast (especially once you start worrying about backups of them and so on). And certainly it's a factor that we have plenty of spare hardware that's large enough to easily run stuff on, even a single Windows server.

But a significant issue was that managing the host environment was a hassle and managing the virtual machines on top of the host environment was another hassle. Because we only had one host machine, doing any significant maintenance on the host machine meant taking down all of the virtual machines and then starting them again. And while making the host machine an Ubuntu LTS server was useful because we already had a bunch of infrastructure for managing those, it meant that we did have to manage it and in particular we had to apply kernel updates. Every kernel update was a hassle, and it didn't help that we were using a third-party virtualization system that needed special magic hand-holding after a kernel update. Even beyond the host updates, managing the virtual machines always took remembering the special steps needed to do things like get 'console' access to them or force power cycle them. The whole experience was just annoying.

In the end, the person who built the next generation of our Windows terminal server decided that he would rather build it on real hardware, partly because it would be less of a hassle and partly so that he could easily give it a lot of disk space and memory; that decommissioned two of our three virtual machines. Keeping a virtual host machine running just to host a single virtual machine that forwards some low-priority email seemed, well, kind of stupid.

(I'm thinking of this because today was the migration from the old virtualized Windows terminal servers to the new non-virtualized one. Normally we might keep the old virtualization server running in the corner just in case, but there's an overnight building-wide power shutdown tonight, so we powered it off along with everything else and we don't plan to turn it back on.)

I definitely don't think that this makes virtualization a bad idea in general. It didn't work out for us, but I think it would in a different sort of environment and certainly it works for other people. (I have some thoughts on what sort of environment it would probably work well in, but that's for another entry.)

Devirtualization written at 22:55:22

2011-08-28

The problem with busy sysadmins

As I wrote yesterday, one of the reasons that I don't think we need a trouble ticketing system here is that we are not all that busy with requests; in fact, the average number of open requests is generally zero. I think that this is a good thing, and that frequently having more open requests than sysadmins or even having all of your sysadmins busy dealing with requests is an important danger sign.

It's a danger sign because it means that your sysadmins are too busy to do long-term work. Instead, they're either doing things that need to be automated, or spending their time running around throwing water on a succession of fires, or in general being reduced to being operations monkeys who carry out standard procedures (which is not a good way to keep good sysadmins, because frankly it's boring). In short, you're too busy dealing with the now to be building the future. This is not a good place to be in; the future always arrives sooner or later.

Sysadmins that are 'idle' from the perspective of a trouble ticketing system are sysadmins that have time to work on larger projects and to prepare for the future. And you need more than little bits of time here and there, for the same reason that developers need it; you need solid blocks of time where you can focus on a single thing instead of playing whack-a-mole with trouble tickets.

Thus I consider it an extremely healthy sign that we have so few requests that a 'level 1' mailing list setup can handle them. This is the kind of environment you need to produce things like cheap iSCSI based fileservers, upgrades of complex mail systems, and transitions of core routers in a complex network environment.

(My personal feeling is that trying to put long term development projects in a trouble ticketing system is not going to work really well. It's the kind of thing that I would only do if I had to use a TT system for everything because of some outside mandate.)

BusySysadminProblem written at 01:10:17

2011-08-26

Why we don't use a trouble ticketing system

In an aside in this entry I mentioned that we don't use a trouble ticketing system here, and a commentator asked the obvious question of why not. The short answer is that we don't think moving from our current approach to a trouble ticketing system would be useful enough to justify the amount of extra work and annoyance it would take.

For the long answer, I am going to make up a hierarchy of approaches to tracking your work, especially the work that people ask you to do:

  • level 0: people make direct contact with individual sysadmins and everyone is working more or less independently. If a sysadmin isn't available, most things they were working on get dropped or put on hold.

  • level 1: you have a central mailing list or other single point of contact for all email with people; they send email to the mailing list to ask you for things and sysadmins copy their replies to the mailing list.

  • level 2: you have an actual trouble ticketing system with automation, a website, status summaries for tickets, reporting, and all of that.

Moving from level 0 to level 1 is a big deal. It changes your work life in a very positive way; it means that everyone can stay informed about what's going on and any sysadmin can pick up some particular task from another sysadmin. A level 1 system is not perfect for the same reason that our old account request system was not perfect; reconstructing the state of any particular request can take trawling through your email archive in order to find all of the messages, and it's hard to get an overview of all of the open requests.

Moving from level 1 to level 2 has benefits but in many environments it's less of a sea change than moving from level 0 to level 1, and it comes with overhead; an actual trouble ticketing system is a form of bureaucracy and dealing with it is invariably more work than sending a quick freeform email message. My feeling is that a level 2 system is justified when you have enough request volume that you need it to keep track of things. I can think of two warning signs of this; first, when individual sysadmins start ignoring most of the mail from the central mailing list in order to survive the volume, and second, when you frequently have more pending requests than you have sysadmins (then a ticketing system helps you to easily find the next thing to work on).

(Your manager may also demand a trouble ticketing system in order to make it convenient to generate various metrics.)

Right now, we have a level 1 style system with a central mailing list and a relatively low request volume. So for us, moving to a ticketing system would add bureaucracy (and another software system to run) without having particularly compelling benefits; we simply aren't active enough to make it a necessity instead of something to play with.

WhyNotTTSystem written at 16:11:10

2011-08-25

Trouble ticketing systems and the future

Here's a theory about trouble ticketing systems: they should allow you to submit tickets that are dated in the future, or that only become active in the future.

Here's why. First, it's not uncommon to know now that you need to do something in the future (and further that you cannot do it now). Second, in many environments your trouble ticketing system is supposed to be your authoritative and official repository of known work; if something is to be done, you're supposed to put it in the TT system so that it can be tracked, monitored, and so on.

Now, you can certainly deal with known future events by using your calendar system to leave yourself a note that you need to make a trouble ticket next Tuesday. If that sounds silly when I put it like that, well, yes, it is. It's a workaround, and it also means that your TT system is no longer the complete repository of known work; some of it is floating around in people's calendars instead.

The corollary is that you can't really fix this by providing a tool that automatically submits tickets for people (and then driving it out of your calendar system). This only avoids the absurdity of manually copying information around from calendar to ticketing system; it doesn't make these future tickets actually visible in advance in the ticketing system if you want (or need) to see them.

The reverse corollary is that putting everything into the TT system and manually marking things that can only be acted on in the future is also the wrong approach (although it may be better than calendars in your situation), because it generally clutters up everyone's view with things that, well, they can't act on right now. Most of the time you only care about what you're going to do next Tuesday on next Tuesday, not before.
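To make the shape of this concrete, here's a minimal sketch in Python of what future-dated tickets could look like; the field and function names are invented for illustration and aren't taken from any real ticketing system.

  # A minimal sketch of future-dated tickets; all of the names here are
  # hypothetical, not from any real trouble ticketing system.
  from dataclasses import dataclass
  from datetime import date

  @dataclass
  class Ticket:
      title: str
      active_from: date  # the ticket only becomes actionable on this date

  def actionable_now(tickets, today):
      # The default view: only things you can act on right now, so future
      # work doesn't clutter up everyone's queue.
      return [t for t in tickets if t.active_from <= today]

  def all_known_work(tickets):
      # The 'complete repository' view: everything, including future work.
      return list(tickets)

  tickets = [
      Ticket("rotate backup tapes", date(2011, 8, 25)),
      Ticket("migrate terminal server", date(2011, 8, 30)),  # next Tuesday
  ]
  print([t.title for t in actionable_now(tickets, date(2011, 8, 25))])

The point is simply that both views come out of the same single repository of known work, instead of some of it living in people's calendars.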

(Disclaimer: we don't use a trouble ticketing system here for various reasons, so I don't know if good ones already can do this. I polled some people I know and none of them had TT systems that really supported it.)

FutureTroubleTickets written at 00:59:36

2011-08-19

Visibility: an advantage of automation

We recently automated our account request system, just in time for the influx of new graduate students that arrives at the end of every summer. In past years we've handled new graduate students through a manual process; a single person here went through the list, grouped the students by supervisor, asked each supervisor to officially confirm sponsoring accounts for their students, re-asked professors who didn't respond, kept track of the answers, asked the approved new graduate students to pick their logins, and so on. This year, the account request system has handled almost all of the work once we got the list of new graduate students (it doesn't automatically re-ask unresponsive professors).

One of the surprising things about this change has been how much more visible the state of this process has become to all of us. In the old system everything happened through email, all of which was copied to a tracking alias that everyone was on. This meant that to figure out the status of a request, you generally had to go back through your email archive to reconstruct its state. You can guess how often we even thought of doing this.

(The person doing the work kept track of the current state of everything, and she even told me once where to find the file she used for it and what all the fields meant. Did I remember this the next year? No.)

In the automated system all of this is already tracked in a database, so it was easy to put in support to show the information in various ways. Now it's trivial for any staff member to check into the system and look up, for example, which incoming graduate students have been approved but haven't picked their logins yet; all it takes is glancing at a web page. Because it's so easy, we actually do do it (or at least I do); because I do it, I now know a lot more about how the whole process is going than I did in past years.

This is essentially a necessary consequence of automating the system. For an automated system to work, it needs to be able to determine the current state of everything using its own information; once it can do that, it can tell you about it just as easily as it can use it to do its job. You could say that automation forces you to put information in an easily usable form.
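As a hedged illustration of what that 'easily usable form' buys you, here's a small Python sketch of the kind of status query involved; the table and column names are hypothetical, not our account system's actual schema.

  # A sketch of the kind of status query that becomes trivial once the
  # process state lives in a database; the schema here is hypothetical.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("""CREATE TABLE requests
                  (student TEXT, supervisor TEXT, state TEXT, login TEXT)""")
  conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", [
      ("A. Grad", "Prof. X", "approved", None),
      ("B. Grad", "Prof. Y", "approved", "bgrad"),
      ("C. Grad", "Prof. X", "awaiting-approval", None),
  ])

  # 'Which approved students haven't picked their logins yet?' is one
  # SELECT, which is also all that a status web page needs to run.
  pending = conn.execute("""SELECT student, supervisor FROM requests
                            WHERE state = 'approved' AND login IS NULL""")
  for student, supervisor in pending:
      print(student, "(sponsored by", supervisor + ") has no login yet")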

(In the manual system, even an official central place for this information wouldn't have given us quite the same effect because someone would still have had to update it by hand when email came in. The automated system's views are always completely up to date, partly by definition; if it isn't in the system, it's not official.)

AutomationVisibility written at 01:21:27

2011-08-12

One reason we install machines from checklists instead of via automation

I've been revising our install instructions for some OpenBSD servers recently, giving me an opportunity to reflect on how we set up machines here. Our general approach is to use a checklist of essentially cut and paste commands; I often go through a test run directly cutting and pasting back and forth. Given that we have the literal commands to run in the instructions, why not automate the install process by putting them all in a script?

Well, turn this question around. What would I have to do in order to transform the commands into an install script? At one level, basically nothing; I'd turn all of the commentary in the install instructions into script comments and we'd be pretty much done. And then one day something would go wrong during the install process and the script would explode spectacularly.

The drawback of automation is that there is nothing that's really checking for things going wrong. Oh, you can check for obvious errors (sometimes), like commands exiting with a failed status, but not all problems cause such obvious failures. Any number of failure modes will cause your commands to exit with a success status but either do nothing useful or badly mangle the system state.

(For instance, a ./configure 'succeeds' but fails to find all of the dependencies you expected so it builds a version of the program without features that you need.)

You can make automation more robust, of course. But it takes both work and anticipating how things may fail; a reliable, cautious automated install process is much more work than simply sticking all of the commands from the checklist in a shell script (and it's very hard to really be completely safe against problems). If we stay with a checklist that's performed by humans, we get much of the benefits of automation without having to do that work. Rather than try to code error checks, we can count on people to use their brains to notice when something's wrong.
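To give a rough sense of the extra work involved, here's a small Python sketch contrasting blindly scripting one checklist step with a more cautious version. The specific command and the dependency check are illustrative only, not from our real install instructions, and it assumes you're sitting in a source tree with a configure script.

  # A sketch of naive versus cautious automation of one checklist step.
  import subprocess
  import sys

  # Naive version: trust the exit status and barrel onwards.
  #   subprocess.run(["./configure", "--prefix=/opt/thing"], check=True)

  # Cautious version: a 'successful' configure can still silently skip a
  # dependency we need, so check the output the way a human following the
  # checklist would.
  result = subprocess.run(["./configure", "--prefix=/opt/thing"],
                          capture_output=True, text=True)
  if result.returncode != 0:
      sys.exit("configure failed outright:\n" + result.stderr)
  if "checking for libssl... yes" not in result.stdout:
      sys.exit("configure 'succeeded' but didn't find libssl; stopping here")
  print("configure found what we expected; carrying on")

Multiply that sort of checking by every step in the checklist and you can see why the cautious script is a real project, not an afternoon's work.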

(In our environment checklists are guides and aids for sysadmins, not things to be carried out by mindless rote.)

PS: there are of course situations where automation still makes sense even despite this. But that's something for another entry.

ChecklistsVsAutomation written at 00:10:24

2011-08-05

Why sysadmins don't just notify users about compromised machines

One of the possible reactions to the issue of banning the MAC addresses of compromised machines is to suggest that what sysadmins should do is not ban the machine but instead contact the machine's owner to tell them about the problem and get them to deal with it. Let me give you the sysadmin perspective on that.

To start with, let's agree that there are two sorts of compromised or infected machines that your IDS has detected: ones that are actively trying to do nasty things and ones that are just showing signs of infection, like phoning home to botnet controllers. The first sort have to be immediately quarantined when detected, so the real issue is what to do about the second sort of machines, which are mostly or entirely 'harmless' at the moment.

Ultimately, the reason that sysadmins don't just notify the machine's owner is that this rarely solves the problem. There are two aspects of this. First, there are a number of practical difficulties in getting in touch with the user:

  • while you have the identity of the person who registered the machine, this may not be the machine's current user.
  • the email address that you have for them may not be one that they check regularly, although this is really only a university issue.
  • it's possible that the malware they're infected with is filtering or otherwise intercepting their email. (I don't know if any current malware is this smart.)

Much more importantly, painful experience has shown sysadmins that if you just send people email, many machine owners either don't care at all or don't care enough to do painful but necessary things like reinstall their operating system from scratch. Even when people are compliant and willing, what they decide to do may not be anywhere near sufficient; they may just run a malware scanner or two and then declare that their machine is clean because those scanners showed nothing. You can spend a great deal of time doing what is basically nagging people and get no actual results from it, in the process wasting everyone's time and annoying everyone (assuming that people are even bothering to read your email).

Blocking machines more or less automatically has the great virtue (from a sysadmin's perspective) that it gives the machine's user no option to ignore the issue. One way or another, the machine's problem is going to get dealt with (or at least contained, if it stays off the network).

(Whether this is the right approach in general is another issue entirely, one that does not even start fitting in the margins of this entry. This entry is just about the sysadmin perspective.)

As a side note, all of this 'contact the user' stuff assumes that you know who the theoretical responsible person for a machine is. This is true in the situation in my first entry but is not necessarily true in general. This may be a peculiarity of universities, but you would be startled at how hard it can be to find out who is the technical person for a particular subnet, much less a particular machine, and then how hard it is to get in touch with them. Blocking machines and waiting for their users to speak up can be basically the only feasible way to find out who you need to talk to and get them to respond to your contact attempts.

WhyNotCompromiseNotification written at 01:54:42

2011-08-04

On banning MAC addresses

Via Hacker News I wound up reading Shenglong's Why ISPs Shouldn't Ban MAC Addresses. As it happens, I partially disagree; there are good reasons to ban MAC addresses under some circumstances. The short summary is that banning a MAC address is an ineffective way to keep a person off your network, but it is a decent way to keep a machine off your network. The question you always need to ask is where the problem is.

(There are plenty of ways for a person with a banned computer to get back on your network, ranging from simply having other devices to various levels of tricks that they can perform with their banned machine. If you want to throw someone off your network, the minimum steps are to revoke the registration of all of their other devices and block their ability to register new ones.)

Sometimes the problem is some violation of network usage policies (the typical things that people think lead to bans, like running filesharing programs or visiting 'bad' websites). Then the problem is with the person, and banning the machine is only a moderately effective way of making it stop; a determined user can still continue their activities in various ways.

(There's an argument for still doing it as a first step in dealing with this sort of problem; machine blocking may be a coarse net, but it will still catch a fair number of fish. And it's often easy and relatively low impact.)

However, sometimes the problem is actually something bad that the machine is doing, something that's most likely the result of either a compromised machine or a configuration mistake instead of deliberate, conscious user action. If the problem is with the machine you do want to quarantine the machine but you don't want to ban the person; it's a feature that they can still get on to your network through other devices, not a bug. If they're clever enough to use this to get the same machine back on the network, they're hopefully also wise enough to do this carefully.

(We've occasionally had to block machines by MAC address here for exactly this reason; the machine is clearly doing something bad, but there's no evidence that the person has become evil.)

Having said that, there are good and not so good ways to do machine blocks. As Shenglong's story shows, the problem with a plain block is that there's no indication to the person what the problem is; as far as they can tell, their machine is either malfunctioning or mysteriously banned. A better approach would be something like a registration portal, where their web browsing is redirected to a captive page that tells them that the machine's network access has been cut off because it appears to be infected. You get bonus points for having a bunch of (local) resource links for how to get disinfected and so on, and ideally something that tells them what you detected about their specific machine.

(And having written that I have to admit that we don't currently have such a setup, although when we block machines we do tell the user's Point of Contact, who will contact the user in some appropriate way.)
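As a rough sketch of the captive page idea, here's a toy Python responder that answers every web request from a blocked machine with the same notice. It assumes your network gear is already redirecting the blocked machine's HTTP traffic to it, and the wording and port number are made up for illustration.

  # A toy sketch of the captive notification page; the network gear is
  # assumed to redirect HTTP traffic from blocked machines to this server.
  from http.server import BaseHTTPRequestHandler, HTTPServer

  NOTICE = b"""<html><body>
  <h1>This machine's network access has been suspended</h1>
  <p>It appears to be infected or compromised. Please see our local
  disinfection resources or contact your Point of Contact.</p>
  </body></html>"""

  class BlockNoticeHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          # Whatever page the user tried to visit, they get the notice.
          self.send_response(200)
          self.send_header("Content-Type", "text/html")
          self.end_headers()
          self.wfile.write(NOTICE)

  if __name__ == "__main__":
      HTTPServer(("", 8080), BlockNoticeHandler).serve_forever()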

Sidebar: think twice before working around a block

If your machine ever gets blocked, I strongly suggest being very careful before working around the block and getting the machine back on the network. This isn't because it's against your local network usage policy (although if it is, that's another reason to think twice); it's because if your machine is behaving badly and the sysadmins can't keep it off the network without throwing you off the network entirely, sooner or later they will wind up doing just that.

(And if this happens, getting back on the network will likely involve some very awkward conversations.)

Thus before deploying various workarounds to regain network access, you want to be as sure as possible that your machine is in fact not compromised and not misbehaving. Unfortunately, these days this really requires checking from some outside vantage point (as noted in a comment on Hacker News).

(This is also a good time to mention my zeroth law of compromised machines.)

BanningMACAddresses written at 02:19:05

