Wandering Thoughts archives

2012-10-23

The problem of simulating random IO

Due to recent events, we're now rather interested in being able to measure, characterize, and track our disk IO performance. This, it turns out, presents some problems in the modern world.

Of course the gold standard thing to measure is your actual observed performance in production (and thanks to some work with DTrace, I can now actually do that). However, the problem with production performance is that so many things influence it that it's hard to know what changes in it mean. In particular, if we observe production performance slowing down we don't know if it's because we've got more load, different IO patterns than before, or a genuine problem that we can do something about. So we need to be able to do controlled tests that measure real disk performance, in detail. This means defeating prefetching, which means what we really want to measure is random IO performance.

In theory this is simple enough, given an existing large file: pick a block size, generate or get some high-quality random numbers, seek to those offsets (in blocks), read or write something, and repeat. In practice there is a significant problem with using genuinely random numbers to drive random IO: repeatability (and its closely related cousin, consistency). If I run my test today and then run it again in a month and get different results, is that difference because our IO system genuinely changed or because I got different random numbers?
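
For concreteness, here's a minimal sketch of this sort of random read test in Python; the file name, block size, and read count are made-up parameters, and a fixed seed is one simple way to get the same sequence of offsets on every run (although it does nothing about the prefetching problem below).

    import os
    import random
    import time

    PATH = "/data/largefile"   # hypothetical existing large file
    BLOCKSIZE = 8192           # block size to read at
    NREADS = 10000             # how many random reads to do
    SEED = 42                  # fixed seed: the same offsets every run

    def random_read_test(path, blocksize, nreads, seed):
        rng = random.Random(seed)
        nblocks = os.path.getsize(path) // blocksize
        latencies = []
        fd = os.open(path, os.O_RDONLY)
        try:
            for _ in range(nreads):
                offset = rng.randrange(nblocks) * blocksize
                start = time.monotonic()
                os.pread(fd, blocksize, offset)
                latencies.append(time.monotonic() - start)
        finally:
            os.close(fd)
        return latencies

    lat = sorted(random_read_test(PATH, BLOCKSIZE, NREADS, SEED))
    print("median %.2f ms, 99th percentile %.2f ms"
          % (lat[len(lat) // 2] * 1000, lat[int(len(lat) * 0.99)] * 1000))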

So what I really want is a fixed sequence of non-sequential IO that defeats operating system prefetching (at least). The problem with this in the modern world is that OSes and filesystems are getting disturbingly superintelligent about detecting patterns in your IO. Sequential forward and backwards? That's easy (everyone does at least sequential forwards). Forwards and backwards with a stride? That too. Multiple streams of any of the above, interleaved? There are filesystems that detect it (and I happen to be dealing with one of them). Coming up with a sequence of IO that defeats all of this is what they call an interesting problem.

(And one that I don't have a solution for.)

Which brings me to a small request for filesystem designers: please provide a way to turn off your superintelligent prefetching for specific IO. Yes, it's great, but sometimes people want to do real IO right through to the disks. Having this feature also turn off all caching (and turn off placing the data in the cache) is optional but probably appreciated. I suggest that you borrow the Linux O_DIRECT flag to open() rather than invent your own different interface.
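
As an illustration, here's a rough sketch of what that looks like from Python on Linux today (this assumes Linux and a filesystem that supports O_DIRECT; the file name and offset are made up, and the mmap buffer is there because O_DIRECT requires aligned buffers and IO sizes).

    import mmap
    import os

    BLOCKSIZE = 128 * 1024   # must be a multiple of the device sector size

    # os.O_DIRECT is Linux-specific and the filesystem must support it.
    fd = os.open("/data/largefile", os.O_RDONLY | os.O_DIRECT)
    f = os.fdopen(fd, "rb", buffering=0)

    # An anonymous mmap gives us a page-aligned buffer, which satisfies
    # O_DIRECT's alignment requirements.
    buf = mmap.mmap(-1, BLOCKSIZE)

    f.seek(10 * BLOCKSIZE)    # some block-aligned offset
    n = f.readinto(buf)       # a read that bypasses the OS page cache
    f.close()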

(Providing a filesystem-wide or system-wide flag is not good enough. I don't want to turn off all prefetching on a production filesystem or fileserver so that I can accurately measure disk IO performance; that cure is worse than the disease.)

RandomIOProblem written at 00:43:04

2012-10-20

The issue with measuring disk performance through streaming IO

Suppose, not entirely hypothetically, that you are interested in measuring the performance of your disks. Of course you understand that averages are misleading and that latency is important, so you fire up a handy tracing IO performance tester that does streaming reads and dumps timing traces for each IO operation.

This might sound good, but I feel that using streaming IO for this is generally a mistake. It isn't a fatal one, but you are potentially throwing away information on latency and making it harder to be sure of any odd results you see. The problem is what prefetching does to your true timing information.

(Your streaming IO will be prefetched by any number of layers, right down to the disk itself. You may be able to turn off some of them, but probably not all.)

There are two cases, depending on how fast the rest of your program runs. If your program is comparatively slow, perhaps because you wrote it in an interpreted language for convenience, prefetching can completely destroy the real latency information. If a prefetched IO completes before your program gets around to asking for it, that's it; you don't know anything more about its latency than that (and you may not know how slow your program is unless you think to measure it). It could have taken 5 milliseconds or 500; to a sufficiently slow program, either looks the same. But you probably care very much about the difference.

If your program is sufficiently fast that it is not the limiting factor, you're going to outrun prefetching. Prefetching is not magic, so if you can consume IO results faster than the bandwidth from the disks to your program, your program will wind up waiting for IO completion and so seeing latencies that are probably more or less typical. But I'm not convinced that you'll necessarily see the real details of unusually slow IOs, and if there are patterns in IO latencies, prefetching may well blur them together. It's possible that you don't actually lose any information here, but if so it's something that I'd have to think through very carefully. The need to do that makes me cautious, so I think it's undesirable to use even this full-speed streaming IO when measuring latency.

So my conclusion: if you want to measure latency, you need to somehow avoid prefetching.

(The exception is if what you care about actually is latency during streaming IO. Or just long-term bandwidth during streaming IO, and you don't care about latency outliers and brief IO stalls.)

Sidebar: checking to see if your program is fast enough

This one is simple: look at the read bandwidth your program is getting, as compared to your typical simple brute-force bandwidth micro-benchmark. The closer your program comes to the maximum achievable bandwidth, the faster it is. If you hit full bandwidth, your code is not the bottleneck and any detailed latency information you get is as trustworthy as possible under the circumstances.
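
A minimal sketch of the measurement side (with a made-up file name); compare the number this prints against what your bandwidth micro-benchmark gets on the same hardware:

    import time

    BLOCKSIZE = 1024 * 1024   # read in 1 Mbyte chunks

    def streaming_bandwidth(path):
        total = 0
        start = time.monotonic()
        with open(path, "rb") as f:
            while True:
                data = f.read(BLOCKSIZE)
                if not data:
                    break
                total += len(data)
        elapsed = time.monotonic() - start
        return total / elapsed / 1e6    # Mbytes/sec

    print("%.1f Mbytes/sec" % streaming_bandwidth("/data/largefile"))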

StreamingAndIOPerfMeasurement written at 02:04:54

2012-10-16

Switch flow control and buffering: what we think was wrong in our iSCSI network

We have a theory about what was wrong with our problematic iSCSI switch. To set the scene, the problematic switch is a higher-end switch, of the sort that is generally intended as a core switch in a network backplane; this is in fact what we mostly use this model of switch for (where we've been quite happy with them). The switch that works well is a lower-end switch from the same company, with all of the basic functionality but fewer bells and whistles of various sorts. During troubleshooting, we noticed that the problem switch did not have flow control turned on while the good one did; in fact this is the default configuration for each model. Turning on flow control on the problem switch didn't solve the problem, but we've had issues before with flow control on this model of switch.

Now for the theory. Our ZFS fileservers generally issue 128 Kbyte reads; this is the default ZFS blocksize and ZFS always reads whole blocks regardless of how much you asked for. On a gigabit network, 128 Kbytes takes about a millisecond and a bit to transmit (how much more depends on the iSCSI overhead), and it's possible that an iSCSI backend will have several reads worth of data to send to the fileserver at the same time.
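
The back-of-the-envelope arithmetic behind 'about a millisecond and a bit' is simple, leaving out those overheads:

    payload_bits = 128 * 1024 * 8          # 1,048,576 bits of data
    link_bps = 1e9                         # 1 Gbit/sec link
    print(payload_bits / link_bps * 1000)  # ~1.05 milliseconds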

Suppose that a fileserver happens to issue 128 Kbyte iSCSI reads to two backends over the problematic network, and the backends get the data from the disks at about the same time and thus both start trying to transmit to the fileserver at the same time. For the duration that both are trying to dump data on the fileserver, they are each transmitting at a gigabit to the switch, for an aggregate burst bandwidth of 2 Gbits/sec; however, the fileserver only has a single 1 Gbit link from the switch. For the few milliseconds that both backends want to transmit at once, things simply don't fit and something has to give. The switch can buffer one backend's Ethernet frames, rapidly flow control one backend, or simply drop the frames it can't transmit down the fileserver's link.

At this point I was going to insert our speculation about how lower-end networking gear often has bigger buffers than higher-end gear, but it turns out I don't have to. The company that made both switches has their data sheets online and they cover switch buffer memory, so I can just tell you that the higher-end switch has 1 megabit of buffer memory, i.e. 128 Kbytes, while the lower-end switch has 2 megabytes of it. Given iSCSI, TCP, and Ethernet overheads, the higher-end switch can't even buffer one full iSCSI read reply; the lower-end switch can buffer several.
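
To see why 128 Kbytes of buffer doesn't hold even one reply, here's a rough sketch of the on-the-wire size of a single 128 Kbyte read reply once it's split into TCP segments (the overhead numbers are approximate and leave out the iSCSI PDU headers):

    payload = 128 * 1024           # bytes of data in one read reply
    mss = 1448                     # typical TCP payload per segment with timestamps
    per_frame = 14 + 20 + 32 + 4   # Ethernet + IP + TCP w/ options + FCS, roughly

    frames = -(-payload // mss)    # ceiling division: ~91 frames
    print(frames, payload + frames * per_frame)   # ~91 frames, ~137,400 bytes

That is already more than 128 Kbytes of buffer before you add the iSCSI headers.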

This explains the symptoms we saw. The problem appeared under load and got worse as the load went higher because the more IO load a fileserver was under (especially random IO from multiple sources), the higher the chance that it would send reads to more than one backend at the same time over the same network path (the fileserver used both network paths to each backend on a round-robin basis). The problem was worse on the mail spool because we put the mail spool in a highly replicated ZFS pool, which raises the chance that more than one backend would be trying to send to the fileserver at once (the disk-based pool was a four-way mirror and the SSD pool is a three-way mirror). And the relatively long network stalls were because TCP transmission on the backends was stalling out under conditions of random packet loss, which both shrank the socket send buffer size and slowed down transmission.

(And now that I've written this, I suspect that we'd have seen significant TCP error counts for things like retransmissions if we'd looked.)

Sidebar: why our problematic iSCSI switch wasn't broken

The short version is that the switch we had iSCSI problems with wasn't broken (or wasn't broken much); instead, we were using it wrong. Although we didn't know it, we needed a switch that prioritized buffers over absolute flat-out switching speed. My strong impression is that this is exactly backwards from the priorities of higher-end core backbone switches. To make a bad analogy, we were asking a Ferrari to haul a big load of groceries.

One thing I take away from this is that switches are not necessarily one size fits all, not in practice. Just because a switch works great in one role doesn't mean that it's going to drop into another one without problems.

SwitchFlowControlIssue written at 00:26:09

2012-10-10

Disk IO latency is often what matters

After recent experiences, I've become convinced that my current methods of testing and monitoring disk performance are what I'd call inadequate. Most of my testing and monitoring focuses on disk bandwidth, often streaming IO bandwidth because it's easy to be consistent with that. One problem with this is that random IO matters too, but the bigger problem is that I've come to believe that latency is what really affects your perceived performance.

Yes, you need good disk bandwidth in order to deliver decent performance. Given the things that can happen with modern disks, it's worth testing (and necessary to test), but it's not sufficient. In many cases, good bandwidth with bad latency will give you a terrible user experience, because low latency is what lets the system react promptly to user actions. Very little of what normal people do on systems is latency-insensitive. When you save a file, click on a mail message in your IMAP reader, or even run ls, how fast IO starts responding makes a huge difference to how the system feels.

I've further come to feel that what matters is not average latency but more like the 99th or 99.9th percentile points. Modern environments do a lot of disk IO, so a 'one in a hundred operations' delay happens quite frequently, probably every thirty seconds or less if you're actively using the system (after all, 100 operations in 30 seconds is less than four operations a second). And the worse these long latencies are, the larger and more annoying the stall is from the user's perspective. It doesn't take much before the system is stuttering and painful to use.
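
To put rough numbers on that (the IO rate here is invented, but modest):

    iops = 50    # a made-up but modest rate of IO operations per second
    for n in (100, 1000):
        print("a 1-in-%d slow operation: every %.0f seconds" % (n, n / iops))
    # -> every 2 seconds, and every 20 seconds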

As before, I'm certain that I've read this in various places before but getting smacked in the nose with it has made the whole thing that much more real.

(I'm going to have to think hard about how to test latencies in a useful, repeatable way. A good starting point will be the question of just what I want to measure.)

DiskLatencyImportance written at 02:12:43

2012-10-06

How averages mislead you

To follow up on my illustrated example of this, I wanted to talk about how averages mislead people. They do it in at least two different ways.

The first way that averages mislead is that they smooth out exceptions. The longer the amount of time you average across and the more activity you see, the more that an average will hide exceptional activity (well, bury it under a mass of normal activity). You generally can't do very much about the amount of activity, so if you want to spot exceptions using an average you need to look at your 'average' over very short time intervals. Our recent issue was a great example of this. Exceptionally slow disk activity that wasn't really visible in a 60-second average did sometimes jump out in a one-second average. Of course the problem with fast averages is that they generate a lot of results to go through (and they're also noisy).
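
As a small synthetic illustration (the numbers are invented), a handful of badly slow operations barely move a 60-second average:

    # 6,000 operations in a 60-second window: almost all take 5 ms,
    # but ten of them take 500 ms.
    times = [5.0] * 5990 + [500.0] * 10

    print("average: %.2f ms" % (sum(times) / len(times)))   # ~5.83 ms
    print("slowest: %.0f ms" % max(times))                  # 500 ms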

It's worth understanding that this is not a problem with averages as such. Since the purpose of averages is to smooth things out, using an average should mean that you don't care about exceptions. If you do care about exceptions you need a different metric. Unfortunately people don't always provide one, which is a problem. The corollary is that if you're designing the statistics that your system will report and you plan to only report averages, you should be really confident that exceptions either won't happen or won't matter. And you're probably wrong about both parts of that.

(Exceptional activity does affect even a long-term average, but it often doesn't affect it enough for things to be obviously wrong. Instead of saying 'this is crazy', you say 'hmm, things are slower than I was expecting'.)

The second way that averages mislead is that they hide the actual distribution of values. The usual assumption with averages is that you have a nice bell-shaped distribution centered around the average, but this is not necessarily the case. All sorts of distributions will give you exactly the same average and they have very different implications for how your system works. A disk IO system with a normal distribution centered on the average value is likely to feel very different from a disk IO system that has, say, two normal distributions superimposed on top of each other, one significantly faster than the average and one significantly slower.
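
Here's a tiny invented example of two very different latency distributions that produce exactly the same average:

    import statistics

    # One distribution clustered around 10 ms ...
    clustered = [8, 9, 10, 10, 10, 11, 12]
    # ... versus a bimodal one: mostly fast, with a distinctly slow group.
    bimodal = [2, 2, 2, 2, 2, 30, 30]

    print(statistics.mean(clustered), statistics.median(clustered))  # 10, 10
    print(statistics.mean(bimodal), statistics.median(bimodal))      # 10, 2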

(This is where my ignorance of most of statistics kicks in, because I don't know if there are some simple metrics that will give you a sense of what the actual distribution is, or if you really need to plot the distribution somehow and take a look at it.)

My illustrated example involved both ways. The so-so looking average was hiding significant exceptions and the exceptions were not random outliers; instead they were part of a distinct distribution. In the end it turned out that what looked like one distribution was in fact two distinct distributions stacked on top of each other, but that's another entry.

MisleadingAveragesII written at 02:15:33

2012-10-04

Averages mislead: an illustrated example

Over about an hour recently, while backups were running, one of our Solaris fileservers had an average iSCSI operation completion time of 33 milliseconds. Not great, especially since its iSCSI backends are SSD-based, but not terrible. This is pretty typical for this fileserver under load and while these particular numbers were gathered with DTrace, they agree with the output of things like 'iostat -zxn 60' (individual iSCSI disks could sometimes be slower over a 60-second period, but not hugely so and it fluctuated back and forth).

But averages mislead. (We all know that, right?)

That fileserver did a bit over twice as many reads as writes. Writes had an average completion time of 1 millisecond, but reads had an average completion time of 47 milliseconds. Suddenly things are not looking as great.

Still, would you have guessed that just over 6.5% of reads and writes took 100 milliseconds or more? That's about one in fifteen (and they were almost entirely reads). On the sort of good side, 2.6% took 'only' between 200 and 300 milliseconds (and 0.8% between 100 and 200 milliseconds). But the long tail is what you could politely call extended; the slowest operation took just over a whopping 3200 milliseconds (3.2 seconds). It was probably a read; 3.4% of the reads took 512 milliseconds or longer.

(There was exactly one write that was quantized into the 2048-4095 millisecond bucket, so the longest operation prize just might go to it. I was not tracking the maximum service time by operation type at the time I pulled these stats.)

Looking at the distributions of write times against read times shows even more (and DTrace let me get them). The graph of write speeds shows a big peak all the way to the left in the fastest power-of-two quantization bucket and then a rapid decay and a tiny tail of slow operations; intuitively this is sort of what I'd expect. Reads peaked slower (in the 4-7 millisecond bucket) and somewhat more evenly, but they had a significant tail of slower operations with clear secondary peaks; this I did not expect at all.

(The write peak is 83% of all writes with 7% next to it, the read peak is only 49% and has 30% more in the two buckets immediately beside it.)

All of this was hiding inside an average of 33 milliseconds.

I knew intellectually that averages were like this, that they can hide all sorts of things and that you want to look at 99th percentiles and all the rest. I've read any number of presentations and writeups about this, looked at illustrative graphs showing spikes that were hiding in broad overviews, and so on. But there turns out to be a difference between reading about something and having it smack you in the face, and I never really expected to find quite so many disturbing things under this particular rock.

(Once I started looking I could see some of this even in tools that showed averages. For example, a 1-second iostat on the fileserver periodically showed drastic average service time spikes even though they got smoothed away in my usual 60-second view.)

MisleadingAverages written at 01:57:21

