2012-10-25
Always make sure you really understand what your problem is
Our recent disk performance issue has been good for a bunch of learning experiences. I'm not talking so much about stuff like becoming familiar with DTrace or discovering blktrace, but the humbling kind where you look back at things in retrospect and learn something from your mistakes. The first of these valuable lessons I've learned is simple:
Before you solve your problem, make sure you understand it.
In particular, make sure you know the cause of your problem.
In one sense, our recent adventure started back when we looked at the performance problems with our mail spool and confidently said 'clearly we are hitting the random IO seek limits of physical disks so the solution is replacing them with SSDs'. At the time this seemed perfectly logical; we knew that physical disks could only sustain so many IOPs/sec, we knew that mail spool performance got worse when the IO load on the physical disks went up, and we 'knew' that IO on the mail spool was heavily random. I'm pretty sure that I confidently made this exact assertion more than once during meetings about the problem.
We made two closely related mistakes here. The basic mistake is that we never made any real attempt to verify our theory. Since our theory was that the disks were saturating their available IOPs/sec, a very basic test would have been to actually measure the IOPs/sec we were seeing during high load conditions. In retrospect I'm pretty sure we would have found that the disks were nowhere near maxed out.
(As you might guess from the scare quotes around one 'knew', we also never made any significant attempt to verify that mail spool IO was mostly random IO.)
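For illustration, that measurement doesn't take much. Here's a minimal sketch (in Python; the sampling interval and the device filtering are my assumptions, and it's not something we actually ran) that samples /proc/diskstats twice on a Linux iSCSI backend and reports the read and write operations per second each disk is completing:

    #!/usr/bin/env python
    # Minimal sketch: sample /proc/diskstats twice and report how many read
    # and write operations per second each disk is actually completing.
    # The interval and the device filtering are assumptions.
    import time

    INTERVAL = 5  # seconds between samples

    def read_diskstats():
        """Return {device: (reads completed, writes completed)}."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # field 3 is reads completed, field 7 is writes completed
                stats[fields[2]] = (int(fields[3]), int(fields[7]))
        return stats

    before = read_diskstats()
    time.sleep(INTERVAL)
    after = read_diskstats()

    for dev in sorted(after):
        if dev not in before or dev.startswith(("loop", "ram")):
            continue
        reads = (after[dev][0] - before[dev][0]) / float(INTERVAL)
        writes = (after[dev][1] - before[dev][1]) / float(INTERVAL)
        if reads or writes:
            print("%-10s %8.1f reads/sec %8.1f writes/sec" % (dev, reads, writes))

Against the rough ceiling of one to two hundred random operations a second for a spinning disk, a few runs of something like this during a bad period would have shown whether the disks really were maxed out.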
The bigger, more advanced mistake we made is that we never attempted more than a superficial investigation of why our mail spool was performing badly when it was on normal disks. We jumped straight from the problem to a theory of the cause to the attractive conclusion of 'we can solve the problem with SSDs'. Pretty much everything I did to look into the problem could have been done right from the start, so if I'd actually looked I would have found our switch issue. It's just that we never bothered. Instead we 'solved' our problem with SSDs without taking the time to understand what it was or finding out what was causing it.
(Our initial idea of what the problem was was 'slow IO', which was wrong as we thought of it. Our real problem wasn't that IO was slow in general, it was that more than 5% of it was very, very slow.)
It's quite attractive to not look into your problems in depth because it saves you a lot of time and aggravation. It took me more than a week of full-time work to run down our problem; that's a week I wasn't working on any of our many other projects. At the least it takes a lot of willpower to drop everything else that's clamouring for your attention and put a lot of work into something where you're confident that you already know the answer and you're just making sure.
Of course, this is where starting with basic verification of our theory would have paid off. Basic verification would probably have disproven it with relatively little effort, and that might have provided enough of a reason to dig deeper. I might not have spent the week of time all in one block, but I could have nibbled away at it in bits and pieces. And I was even supposed to be looking into performance tools.
(Oh how I look back at that entry and admire the grim irony. Past me explicitly told future me that I should spend time to build up some performance tools before another crisis struck, yet I didn't listen and then exactly what past me predicted came to pass. Let's see if I can do better this time around.)
2012-10-17
Operators and system programmers: a bit of System Administrator history
I saw yet another meditation on the difference (if any) between operations and system administration make the Twitter rounds recently, which has finally pushed me over the edge to say something about this. I want to talk about the history behind this apparent division, at least as I see it from my perspective.
(My disclaimer here is that I was not around at the start of this story, only towards the end and only in an academic environment. So my perspective may be skewed.)
A long time ago, or at least back in the early 1970s, most computers were big and expensive and also required near-constant tending and physical work. If the computer was running, someone was forever shuffling punched card decks around, mounting and unmounting tapes and disk packs, collecting line printer printouts and putting new paper into the printers, and often doing manual steps to prepare the mainframe's OS for the next job (or clear out the previous one, or both). The people doing all of this were (mainframe) operators, and for obvious reasons this was considered a relatively low-skill, low-prestige job. Mostly you needed to be able to follow instructions in big procedures manuals.
(This is a bit of a stereotype.)
At the same time there were also system programmers, because the mainframe was so expensive that it was worth it to pay expensive programmers to keep it running smoothly. System programmers generally worked on writing and fixing OS-level components and utilities, say components of your job control and batch management system; where application programmers might work in COBOL, system programmers might work in System/360 assembler. Being a system programmer was a high prestige job and was considered more challenging than being just an application programmer. A Geoff Collyer quote from a later era captures the feel rather well:
This is what separates us system programmers from the application programmers: we can ruin an entire machine and then recover it, they can only ruin their own files and then get someone else to restore them.
System programmers also had the job of actually installing and setting up mainframe operating systems, applying vendor bug fixes, and so on, because back in those days none of this was what you could call 'user friendly' or 'easy'.
The original Unix systems were not big mainframes but they didn't exactly come with detailed, easy to follow procedures manuals; my strong impression is that setting up and operating early Unix systems pretty much required a programmer. As a result most or all of the people who ran early Unix systems were effectively system programmers out of necessity. If you ran a sufficiently big and important Unix system you might have some 'operator' helpers to do physical work like changing tapes, especially if you got to drop your Unix machine into a machine room that already had operators to tend to its mainframes, but the core people who kept the machine running were all system programmers.
As Unix systems matured a bit, increasing amounts of the work necessary to keep them running didn't need actual system programming, just (dangerous) root permissions and some judgement; this meant things like adding and removing users, kicking printer queues around, and so on. Places that already had operators might have their (Unix) system programmers create user-friendly menu systems or the like so that the operators could safely have this work delegated to them, but places without operators started hiring junior people to handle this. Not infrequently these junior people were some sort of programmers and were expected to upskill themselves into full system programmers over time. As Unix systems became more and more mature, running one in routine situations had less and less need for actual system programmers, as opposed to these 'system administrators' (for example, pretty soon you could install and run a Unix system without having to understand the insides of configuring and compiling a custom kernel for your specific hardware). That this happened is unsurprising; it's basically the standard trajectory of a field, as what was new and unknown becomes routine and then mechanized.
(In addition Unix systems also had their 'operator' level work to be done. No matter what particular operating system you're running on what hardware, someone has to change the printer paper. This level of work on a Unix system was no more prestigious or better paid than being a mainframe operator, and for much the same reasons. Still, you could aspire to move up.)
During this slow change, system programmers and a system programmer mindset hung around many old and well-established Unix shops for a very long time (and some vestiges of it linger even today in places). This was the era when it was not surprising for a 'senior system administrator' to be able to write serious system programs like a mailer, a Usenet news server, a nameserver, or a language, and it was kind of expected that every senior sysadmin worth the name was really a system programmer; you can see this fairly vividly in LISA proceedings, especially older ones. By extension this attitude somewhat pervaded the general Unix system administration culture of the time (and it didn't hurt that scripting is genuinely useful on Unix).
What we are left with today is a confusing blend of a situation. A significant amount of the cultural background of Unix system administration comes from system programming, while the actual work required to keep machines running spans a vast gamut. But we often expect that one person or a small team will cover the whole gamut and for historical reasons we default to applying the label 'system administration' to the whole result. All of this makes it hard to break things up when we want to talk about how to do things in a large scale environment, one where we will have people with different skill levels who will work on significantly different things.
(In short, no one agrees on what to call things and 'system administration' is too ambiguous and too broad.)
This has wound up being rather more thinking aloud than I was expecting when I started writing.
Sidebar: why mainframes were different
My view is that mainframes were always big enough (and expensive enough) that you needed a decent sized organization to run one. Mini and microcomputers have been blessed with much smaller sizes and much lower costs, so you could acquire one and tell a single person or a very small group that they were in charge of everything to do with it.
(In the early days of Unix this was often 'whichever grad student was closest and sufficiently slow moving'. To be honest, it's still that way around here every so often.)
Sidebar: a little bit of local trivia
As you might expect the university has a long history on the system programmer side of things, a history that still lingers in various ways. One of them is that most system administrators here were officially 'System Software Programmers', a job classification that was considered superior to (and better paid than) mere 'Application Programmers'.
(I'm out of touch with official job classification fun here, so this may still be true.)
2012-10-15
The anatomy of a performance problem with our mail spool
This is a sysadmin war story.
One of the filesystems on our fileservers is our mail spool (/var/mail, where everyone's inboxes live; other folders live in their home directories). For years, we've known that the mail spool was very close to the edge of its performance envelope, with the filesystem barely able to keep up. A message to one of the department-wide mailing lists would routinely drastically spike the load on our IMAP server, for example, and it was very sensitive to any significant extra IO load on the physical disks (which it shared with some of our other ZFS pools).
Our solution was to move the mail spool to special SSD-based iSCSI backends; we felt that the mail spool was an ideal case for SSDs, since both mail delivery and mail reading involve a lot of random IO. For reasons beyond the scope of this entry the project moved quite slowly until very recently, when it became clear that the mail spool's performance was even closer to the edge than we'd previously realized. Two weeks ago we finally made the move and had our mail spool running on SSDs. Much to our unhappy surprise, the performance problems did not go away. A week ago, it became clear that the performance problems were if anything worse on the SSDs than they had been on hard drives. Something needed to be done, so investigating the situation became a high priority.
Because I don't want this entry to be an epic, I'm going to condense the troubleshooting process a lot. We started out looking at basic IO performance numbers, which showed a mismatch between SSD performance on Linux (2-3 milliseconds all the time) and Solaris 'disk' performance (20-30 milliseconds under moderate load, 40-60 or more milliseconds when the problem was happening). A bunch of digging with DTrace into the Solaris iSCSI initiator turned up significant anomalies; what had looked like somewhat slow IO was instead very erratic IO, with a bunch of it SSD-fast but a significant amount very slow, slow enough to destroy the user experience. Also, we actually saw the same problem on all of the fileservers, it's just that the mail spool had it worst.
(Blktrace showed that the actual disks didn't have any erratically slow responses.)
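To make 'very erratic IO' concrete: the tell-tale signature is a latency distribution with a long tail, where the average looks merely somewhat slow but a chunk of requests are disastrously slow. Here's a small hypothetical sketch of the kind of summary that exposes this, assuming you've already extracted per-request service times in milliseconds (from DTrace output, say) into a file with one number per line:

    # Hypothetical sketch: summarize per-request service times in
    # milliseconds, one number per line in a file, as extracted from
    # (for example) DTrace output.
    import sys

    latencies = sorted(float(line) for line in open(sys.argv[1]) if line.strip())

    def percentile(data, pct):
        # nearest-rank percentile of an already-sorted list
        return data[min(len(data) - 1, int(pct / 100.0 * len(data)))]

    print("requests: %d" % len(latencies))
    print("average:  %.2f ms" % (sum(latencies) / len(latencies)))
    for pct in (50, 90, 95, 99):
        print("p%d:      %.2f ms" % (pct, percentile(latencies, pct)))
    print("max:      %.2f ms" % latencies[-1])
    # An average of a few milliseconds combined with a p95 or p99 in the
    # tens of milliseconds is exactly the 'mostly SSD-fast, sometimes very
    # slow' pattern described above.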
Fortunately we got a lucky break: we could reproduce the long IOs with a copy of the mail spool on our test fileserver and test backends. This let me hack the iSCSI target software's kernel module to print things out about slow iSCSI requests. This showed that the problem appeared to be on the Linux backend side and that it looked like a network transmit problem; the code was spending a lot of time waiting for socket send buffer space. I figured out how to increase the default send buffer size but it didn't do any good; while the Linux code wasn't reporting slow requests, the Solaris DTrace code was still seeing them. So I hacked more reporting code into the Linux side, this time to dump information about the network path that the slow replies were using. And this is when I found the cause.
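(An aside on the send buffer step, since it's a knob other people may run into: on Linux the system-wide defaults and limits come from the net.core.wmem_default and net.core.wmem_max sysctls, and a program can also ask for a bigger buffer per socket. The sketch below is purely illustrative and in Python, nothing like the kernel-level iSCSI target code; it just demonstrates the per-socket knob and the way Linux reports back double what you asked for.)

    # Illustrative only: how an ordinary user-level program asks for a
    # larger socket send buffer. This is not what the iSCSI target's
    # kernel code does; it just demonstrates the knob involved.
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("default send buffer: %d bytes"
          % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))

    # Ask for 1 MByte. The request is silently capped at net.core.wmem_max,
    # and Linux reports back twice what it granted because it counts its
    # own bookkeeping overhead as part of the buffer.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024 * 1024)
    print("send buffer now:     %d bytes"
          % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))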
As discussed here, our fileservers and backends are connected together over two different iSCSI 'networks', really just a single switch for each network that everything is plugged into. For reasons beyond the scope of this entry, we use a different model of switch on the two networks. It turned out that all of our delays were coming from traffic over one network and we were able to conclusively establish that the problem was that network's switch. Among other things, simply changing the mail spool fileserver to not use that network any more made an immediate and drastic change for the better in mail spool performance, giving us the SSD-level response times that we should have had all along.
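Reduced to its essence, the step that cracked it was simple aggregation: take every slow reply, note which iSCSI network it went out over, and see whether the slowness is spread evenly or all on one side. Here's a hypothetical sketch of that analysis; the per-reply log format and the network prefixes are made up for illustration and aren't what my hacked-up reporting code actually printed:

    # Hypothetical sketch: group slow iSCSI replies by which network they
    # went out over. The per-reply log format ('<latency-ms> <local-ip>')
    # and the two network prefixes are made-up assumptions.
    import sys
    from collections import Counter

    SLOW_MS = 20.0
    NETWORKS = {"10.1.": "iscsi net 1", "10.2.": "iscsi net 2"}

    slow, total = Counter(), Counter()
    for line in open(sys.argv[1]):
        latency, local_ip = line.split()
        for prefix, net in NETWORKS.items():
            if local_ip.startswith(prefix):
                total[net] += 1
                if float(latency) >= SLOW_MS:
                    slow[net] += 1

    for net in sorted(total):
        print("%s: %d slow replies out of %d" % (net, slow[net], total[net]))
    # If essentially all of the slow replies are on one network, that
    # network (and thus its switch) becomes the prime suspect.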
Ironically, this switch was the higher-end model of the two switches and a model that we had previously completely trusted (it's used throughout our network infrastructure for various important jobs). It works great for almost everything, but something about it just really doesn't like our iSCSI traffic. Our available evidence points to flow control issues and we have a plausible theory about why, but that'll take another entry.
One of the startling things about this for me is just how indirect the cause of the problem was from the actual symptoms. Right up until I identified the actual cause I was expecting it to be a software issue in either the Linux target code or the Solaris software stack (and I was dreading either, because both would be hard to fix). A switch problem with flow control was not even on my radar, so much so that I didn't even look at the iSCSI networks beyond verifying that we weren't coming close to saturating them (and I didn't consider it worth dumping information about what network connection the problem iSCSI requests were using until right at the end).
The good news is that this story has a very happy ending. Not only were we able to fix the mail spool performance problems, but at a stroke we were able to improve performance for all of our fileservers. And the fix was easy; all we had to do was swap the problem switch for another switch (this time, using the same model of switch as on the good iSCSI network).
(The other good news is that this problem only took two weeks or so to diagnose and fix, which is a big change from the last serious mail spool performance problem I was involved with.)