The anatomy of a performance problem with our mail spool
This is a sysadmin war story.
One of the filesystems on our fileservers is our mail spool (where
everyone's inboxes live; other folders live in their home directories).
For years, we've known that the mail spool was very close to the edge of
its performance envelope, with the filesystem barely able to keep up.
A message to one of the department-wide mailing lists would routinely
drastically spike the load on our IMAP server, for example, and it was
very sensitive to any significant extra IO load on the physical disks (which it shared with some of our
other ZFS pools).
Our solution was to move the mail spool to special SSD-based iSCSI backends; we felt that the mail spool was an ideal case for SSDs, since both mail delivery and mail reading involve a lot of random IO. For reasons beyond the scope of this entry the project moved quite slowly until very recently, when it became clear that the mail spool's performance was even closer to the edge than we'd previously realized. Two weeks ago we finally did the move and had our mail spool running on SSDs. Much to our unhappy surprise, the performance problems did not go away. A week ago, it became clear that the performance problems were if anything worse on the SSDs than they had been on hard drives. Something needed to be done, so investigating the situation became a high priority.
Because I don't want this entry to be an epic, I'm going to condense the troubleshooting process a lot. We started out looking at basic IO performance numbers, which showed a mismatch between SSD performance on Linux (2-3 milliseconds all the time) and Solaris 'disk' performance (20-30 milliseconds under moderate load, 40-60 or more milliseconds when the problem was happening). A bunch of digging with DTrace into the Solaris iSCSI initiator turned up significant anomalies; what had looked like somewhat slow IO was instead very erratic IO, with a bunch of it SSD-fast but a significant amount very slow, slow enough to destroy the user experience. We also saw the same problem on all of the fileservers; it's just that the mail spool had it worst.
(Blktrace showed that the actual disks didn't have any erratically slow responses.)
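To illustrate why the averaged numbers were misleading, here is a minimal Python sketch with made-up data (not our actual measurements). It assumes a hypothetical 100 ms cutoff for a 'very slow' IO and a sample of mostly SSD-fast IOs mixed with a few very slow ones; the function name and the sample values are purely illustrative.
```python
from statistics import mean


def summarize_latencies(latencies_ms, slow_threshold_ms=100.0):
    """Summarize IO latencies: mean, median, max, and fraction above a cutoff.

    slow_threshold_ms is an assumed cutoff for illustration only.
    """
    ordered = sorted(latencies_ms)
    n = len(ordered)
    slow = [x for x in ordered if x > slow_threshold_ms]
    return {
        "count": n,
        "mean_ms": mean(ordered),
        "median_ms": ordered[n // 2],
        "max_ms": ordered[-1],
        "slow_fraction": len(slow) / n,
    }


# Made-up sample: 90 SSD-fast IOs at 2 ms plus 10 very slow IOs at 250 ms.
sample = [2.0] * 90 + [250.0] * 10
summary = summarize_latencies(sample)
for key, value in summary.items():
    print(key, value)
```
With this invented sample the mean comes out at 26.8 ms, which looks like uniformly 'somewhat slow' IO, while the median is still 2 ms and one IO in ten takes 250 ms. Looking at the distribution instead of the average is what exposes the erratic behaviour.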
Fortunately we got a lucky break: we could reproduce the long IOs with a copy of the mail spool on our test fileserver and test backends. This let me hack the iSCSI target software's kernel module to print things out about slow iSCSI requests. This showed that the problem appeared to be on the Linux backend side and that it looked like a network transmit problem; the code was spending a lot of time waiting for socket send buffer space. I figured out how to increase the default send buffer size but it didn't do any good; while the Linux code wasn't reporting slow requests, the Solaris DTrace code was still seeing them. So I hacked more reporting code into the Linux side, this time to dump information about the network path that the slow replies were using. And this is when I found the cause.
As discussed here, our fileservers and backends are connected together over two different iSCSI 'networks', really just a single switch for each network that everything is plugged into. For reasons beyond the scope of this entry, we use a different model of switch on the two networks. It turned out that all of our delays were coming from traffic over one network and we were able to conclusively establish that the problem was that network's switch. Among other things, simply changing the mail spool fileserver to not use that network any more made an immediate and drastic change for the better in mail spool performance, giving us the SSD-level response times that we should have had all along.
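The kind of breakdown that pointed at one network can be sketched in a few lines of Python. This is not the reporting code I actually added to the Linux side; the record format (a network label plus a latency), the 'net-a' and 'net-b' labels, the 100 ms cutoff, and the sample values are all assumptions made up for illustration.
```python
from collections import Counter


def slow_requests_by_network(requests, slow_threshold_ms=100.0):
    """Return {network: (slow_count, total_count)} for (network, latency_ms) pairs.

    slow_threshold_ms is an assumed cutoff for illustration only.
    """
    total = Counter()
    slow = Counter()
    for network, latency_ms in requests:
        total[network] += 1
        if latency_ms > slow_threshold_ms:
            slow[network] += 1
    # Reading a missing Counter key yields 0, so networks with no slow
    # requests still show up with a slow count of zero.
    return {net: (slow[net], total[net]) for net in total}


# Made-up records: every slow request happens to be on 'net-b'.
records = [
    ("net-a", 2.0),
    ("net-a", 3.0),
    ("net-b", 2.5),
    ("net-b", 300.0),
    ("net-b", 180.0),
    ("net-a", 2.2),
]
breakdown = slow_requests_by_network(records)
for net, (slow_count, total_count) in sorted(breakdown.items()):
    print(net, slow_count, "slow of", total_count)
```
If the slow requests were spread evenly over both networks, this sort of tally would point back at the software on one end or the other; having all of them land on one network is what shifts suspicion to that network's hardware.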
Ironically, this switch was the higher-end model of the two switches and a model that we had previously completely trusted (it's used throughout our network infrastructure for various important jobs). It works great for almost everything, but something about it just really doesn't like our iSCSI traffic. Our available evidence points to flow control issues and we have a plausible theory about why, but that'll take another entry.
One of the startling things about this for me is just how far removed the cause of the problem was from the actual symptoms. Right up until I identified the actual cause I was expecting it to be a software issue in either the Linux target code or the Solaris software stack (and I was dreading either, because both would be hard to fix). A switch problem with flow control was not on my radar at all, so much so that I didn't look at the iSCSI networks beyond verifying that we weren't coming close to saturating them (and I didn't consider it worth dumping information about what network connection the problem iSCSI requests were using until right at the end).
The good news is that this story has a very happy ending. Not only were we able to fix the mail spool performance problems, but at a stroke we were able to improve performance for all of our fileservers. And the fix was easy; all we had to do was swap the problem switch for another switch (this time, using the same model of switch as on the good iSCSI network).
(The other good news is that this problem only took two weeks or so to diagnose and fix, which is a big change from the last serious mail spool performance problem I was involved with.)