2009-03-31
A sysadmin use for Twitter
We have some users that are interested in reading about technical system status updates and what the sysadmins are doing in general. The obvious solution is some sort of blog-like environment where sysadmins write things periodically, but the problem with an actual blog is that it takes too much time to write something, especially if we are in the middle of a semi-crisis.
Hence the attraction of Twitter. The short length of Twitter messages (and their lack of formatting) means that we simply can't write very much and in turn users can't demand very much (and can't get disappointed when we fail to turn out polished marvels of educational clarity). Twitter is also well supplied with clients, including command line clients, that are basically fire and forget; you type your message at a command line or into a text box, and you're done.
(I have seen Twitter described as micro-blogging, which is just what we want; something like a blog but much smaller and easier to deal with, and with lower social expectations from the people reading it.)
You could use a Twitter clone for this and host it yourself, but for this specific purpose I think you might as well use Twitter itself. Among other reasons, I suspect that many of the users who would be interested in this are the sort of people who already have Twitter accounts. (And if not, Twitter has syndication feeds.)
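As an illustration of just how low the overhead can be, here is a minimal sketch of a fire-and-forget posting script in Python; it assumes Twitter's simple HTTP basic-auth status update API, and the account details are obviously hypothetical:

    # A sketch of a fire-and-forget status poster, not a polished client.
    # Assumes Twitter's HTTP basic-auth update API; the account details
    # below are hypothetical.
    import sys
    import urllib
    import urllib2

    API_URL = "http://twitter.com/statuses/update.xml"
    USER, PASSWORD = "oursysadmins", "hunter2"   # hypothetical account

    def post_status(message):
        # Set up basic auth for the API endpoint and POST the status.
        mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, API_URL, USER, PASSWORD)
        opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))
        # Twitter cuts things off at 140 characters; trim explicitly so
        # we know exactly what went out.
        opener.open(API_URL, urllib.urlencode({"status": message[:140]}))

    if __name__ == "__main__":
        post_status(" ".join(sys.argv[1:]))

Type your message as the script's arguments and you're done; that is about as fire-and-forget as it gets.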
2009-03-29
Why I wrote my own bulk IO performance measurement tool
So, you might wonder why I wrote my own program to measure bulk IO performance (as mentioned in this entry). After all, there are quite a lot of programs to do this already, many of them very sophisticated and some of them (like dd) available everywhere.
The short answer is that I wanted a program that gave me just the results that I wanted, directly measuring what I wanted measured, and where I understood exactly what it was doing and how it was doing it. I did not want to measure lots of parameters or post-process something's output; I was only interested in the bandwidth of streaming bulk IO (both read and write), and I wanted to get immediately usable numbers with a minimum of fuss.
(This makes it essentially a micro-benchmark.)
The problem with all of the other benchmarking programs is exactly that all of the ones that are easy to find are the sophisticated ones. They measure a great many things, or they require a bunch of configuration, and I am not sure exactly what they are doing at a low level (and with disk IO, sometimes the low level details matter a lot; consider the rewrite issue).
(The problem with GNU dd is that it is not everywhere, especially the modern version that will tell you IO bandwidth numbers instead of making you work them out yourself.)
My program also has some additional features that have turned out to be handy. The most useful one is that it will periodically report its running IO rate, which has been very useful for spotting stuttering IO. (This happens more often than one would like, even for reads.)
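For illustration, here is a minimal Python sketch of the core idea (it is not my actual program, and the block size, total size, and reporting interval are arbitrary choices):

    # Measure streaming write bandwidth with a periodic running rate.
    # A sketch of the idea, not the real tool; all sizes are arbitrary.
    import os
    import sys
    import time

    BLOCKSIZE = 1024 * 1024        # write in 1 MB chunks
    TOTAL = 1024 * BLOCKSIZE       # 1 GB overall
    REPORT_EVERY = 5.0             # seconds between running-rate reports

    def stream_write(path):
        buf = "\0" * BLOCKSIZE
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        start = last = time.time()
        lastbytes = written = 0
        while written < TOTAL:
            os.write(fd, buf)
            written += BLOCKSIZE
            now = time.time()
            # Periodically report the rate since the last report; this
            # is what makes stuttering IO visible.
            if now - last >= REPORT_EVERY:
                rate = (written - lastbytes) / (now - last) / 1e6
                sys.stderr.write("running: %.1f MB/s\n" % rate)
                last, lastbytes = now, written
        os.fsync(fd)               # include the flush to disk in the timing
        os.close(fd)
        elapsed = time.time() - start
        print "write: %.1f MB/s" % (written / elapsed / 1e6)

    if __name__ == "__main__":
        stream_write(sys.argv[1])

The read side is the mirror image (os.read() in the same loop), and because every byte and every second is accounted for in one obvious place, you know exactly what the resulting numbers mean.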
2009-03-24
Frustration for a sysadmin (well, for me)
To amplify on a previous entry, I've found that the biggest thing that gets under my skin as a sysadmin, that really leaves me frustrated and short-tempered, is being helpless. Well, not exactly helpless in general; what I mean is being faced with a situation where there's a technical problem (something within my sphere) but there is nothing that I can do, or nothing useful (where I know I am just flailing around, taking random stabs completely in the dark).
This might seem counterintuitive to outsiders, because sysadmins spend a lot of time troubleshooting problems; if doing this frustrated me that much, you'd think that I'd have gotten into a less annoying field by now. But even with apparently mysterious system problems there are usually a lot of things that we can do to diagnose things and to move forward, to at least get closer to a solution; we have piles of tools and techniques and so on. Even if we're not solving the problem right now, we're productively working on it.
(Well, in theory. In practice it can be both tedious and mysterious to try to run down a problem as you diagnose just what's going on; I have spent more than enough time waiting for test runs to finish and the like.)
However, every so often these tools and techniques run out, or I'm stuck in a situation where they're not available; then there's no feeling of productive work, no forward motion on the problem, no nothing, and I am sitting there helpless and powerless. That's when it all gets to me, sometimes badly.
(My classic weak spot is trying to make various sorts of serial connections work; there are so many different things to go wrong, and basically no troubleshooting tools.)
2009-03-12
The problem with /var today
When /var was created, people took everything in /usr that got written to and just threw it all into one filesystem. After that, /var became the place that you put anything (besides config files and the like) that needed to change or be written to, regardless of why.
The problem is that /var has wound up with two very distinct sorts of data in it: private program data and public data. Private program data is the entire collection of caches, databases, and other tracking information that various programs use to do their jobs. Public data is everything that users and sysadmins create and look at, with things like /var/mail, /var/log, user crontabs, and so on. (On some systems this may include web pages, SQL databases, and more.)
This matters because the two differ greatly in importance and need very different sorts of handling for things like backups and operating system upgrades. Fundamentally, you don't care about private program data as long as the program works right, and you probably actively want not to preserve it when you do things like reinstall the system or roll back to a previous system snapshot. However, you absolutely must preserve public data when you do things like reinstall the system.
That the two sorts of data are aggressively commingled in /var causes all sorts of practical problems for system management. Effectively, /var has been turned into both a system filesystem and a user filesystem, and the two generally require very different and conflicting treatment. Attempts to patch this up in software are awkward.
(For example, Sun's Live Upgrade stuff goes to all sorts of contortions to try to copy some bits of your public data between various copies and snapshots of your system's /var.)
The obvious solution is to split /var into two filesystems, one for each sort of data. Unfortunately, changing Unix filesystem habits is a lot of work (and work that really needs to be done by Unix vendors in order for it to stick).
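As a concrete sketch of what such a split might look like, here is a hypothetical fstab fragment; the devices, mount points, and the exact division of directories are purely illustrative, not a recommendation for your system:

    # devices and mount points are hypothetical
    /dev/vg0/var-sys    /var        ext3  defaults  1 2   # private program data
    /dev/vg0/var-mail   /var/mail   ext3  defaults  1 2   # public data: mailboxes
    /dev/vg0/var-log    /var/log    ext3  defaults  1 2   # public data: logs

With a layout like this, reinstalling the system or rolling back a snapshot could freely recreate /var itself while leaving the separately mounted public data untouched.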
2009-03-11
Checklists versus procedures
Here is something that I have not been at all clear about: the sort of checklist usage that I've written about is specifically using checklists in order to plan and organize one-time things, such as migrating our mail storage.
I care about this because I think of it as a different sort of thing entirely than ongoing work that we do repeatedly and routinely. If you do something routinely and it is not trivial, you should have a documented procedure for it. However, that procedure may or may not involve an actual checklist that you go through, depending on what works for you.
(Arguably it is worth documenting even trivial procedures.)
Locally I would not use a checklist for most routine procedures, because how I use checklists specifically involves marking things off (in electronic form I add 'DONE' after each step in the checklist file, as illustrated below; on paper I mark things with a pen). If I tried to use literal checklists for routine procedures, I would be making a copy of the master checklist every time in order to do this marking off, and I am confident that that would get very annoying very fast.
(One way of putting this is that using an explicit checklist is additional overhead. I am willing to accept the overhead in exceptional situations, and indeed in those it may not even be overhead, but in routine ones it can rapidly descend to bureaucratic make-work.)
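To make the mechanics concrete, a partially done electronic checklist of the sort I mean might look like this (the steps are entirely made up):

    - freeze mail delivery on the old server    DONE
    - copy the mail store to the new server     DONE
    - switch clients over to the new server
    - re-enable mail delivery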
Sometimes, in the process of documenting our changes and actions, we do wind up writing what I could call an inverted checklist (where you add entries as you do things, instead of marking them off), often based on a documented procedure. But we don't do this all of the time for various reasons, including that it is often too much documentation.
2009-03-10
Why checklists work
One of the things I've been doing much more over the past couple of years is using checklists, the virtues of which I've written about before. Recently I have been thinking about why they work, and came to the obvious realization: they're a form of talking to the duck.
Before you write down a checklist, you may think that you understand everything that you need to do, but it is in your head and your head is very good at fooling you. Like explaining something out loud, writing it down shines a bright light on all of your assumptions and fuzzy thoughts and forces you to clarify them, or at least exposes them to you.
(With a checklist specifically, one of the things it shines a light on is our belief that we can keep track of more things than we actually can, which is one of the important roots of fragile complexity. Usually writing a detailed checklist shows me not only that I was fuzzy on some steps but that I was leaving some out entirely.)
Checklists have auxiliary purposes too, of course; for example, they are communication with co-workers, they are confidence boosters, and they reduce the amount of things that you have to think about so that you can focus on paying attention to the work you are doing right now. And crossing completed items off is a useful reward.
2009-03-03
Rollbacks versus downgrades
One of the alternatives to package downgrades for problems like my concerns with trying Fedora Rawhide is what I'll call 'rollbacks': whole system snapshots that you can easily revert to. However, I'm far less enthused about rollbacks than I am about downgrades, and I don't think they're as good a solution for this problem.
The problem with rollbacks is that they are too comprehensive. In real life, you don't always detect problems right away, which means that you can wind up having also made unrelated system changes, changes that you want to keep. If you roll back to a pre-upgrade snapshot, you undo the upgrade but you also undo your unrelated changes, and now you'll have to redo them (and perhaps track them down).
(At least this is my experience, but I think it's going to be true of many relatively non-minimal systems; if nothing else, you may not use certain features or programs very often. It recently took me several days to notice that I'd broken Flash in my browser, for example.)
This leaves me feeling that rollbacks are the wrong solution to this problem. They're a good way to handle something that breaks immediately (and a great way if you need really fast reversion to a working system). But they're not a focused solution in the way that package downgrades are, and so the further you are from an immediate rollback, the worse their side effects can be.
In theory various change management systems can help you by tracking and perhaps automatically re-applying your changes. In practice I think that there are two problems with this. First, they're a lot of work to set up, especially on a one-off basis (and you're probably not going to have lots of machines in this sort of situation). Second, my experience is that some of the changes I make after upgrades depend on the new packages (I am customizing a new version of a configuration file, for example); if I roll back, I definitely don't want those changes to still be applied. There are semi-solutions to this, but they add even more complexity.