Having metrics has taught me I didn't really know how our systems behave

December 27, 2018

I'll start with a story.

I tweeted some screenshots from our Prometheus setup recently. As it happens, there's something interesting going on in the first one, which you can find a 1920x1010 'full sized' PNG of here. It's a dashboard of one of our Amanda backup servers during backups, as it writes data to the disk drives we use as 'tapes'. If you look at the 'disk write bandwidth' graph, you can see the steadily declining write rate to the yellow disk; it drops from about 125 MBytes/sec at the start of writing data to just under 100 MBytes/sec by the end. The disk is 100% utilized during these write-out periods where its bandwidth slowly drops, and over the same time the average write time also goes up. All of this is consistent; it happens on all of our backup servers and on all of our disks.

I don't know what causes any of this or what it means, or if it's important. But what I do know is we had no idea that this was happening until we looked at the metrics. And it's not the only thing about our systems that I've discovered by building dashboards and poking around in them. Over and over I've discovered things from our new metrics system, for example that our FTP area is surprisingly active.

What building a metrics system and using it has taught me is that I didn't really know how our systems behaved. I had some idea of the large scale, obvious things, and for some issues I had looked into specifics even before we built our metrics system, but beyond that I was mostly ignorant. This is not really surprising in any complex system, which most environments are. There's just too much going on, too many factors influencing each other, too many things you don't know about until you measure and study the system as it really is, not how you imagine it to be.

(This is pretty much the corollary of how you don't know what matters for performance in a complex system, which means you can't really know what your systems are doing until you measure them.)

Having a metrics system is not the only way to discover surprising, previously unknown things about your systems, but it's one of the most wide-ranging ones. By collecting data broadly, without trying to restrict either what you gather or when you gather it to what you expect to be interesting, you enable retrospective serendipity (in addition to the power of being able to explore casual questions, which can itself lead to discoveries).

Of course, our new metrics don't magically let us fully understand our systems. Beyond the limitation created by what data we collect and don't collect, which constrains what we can even possibly see into, there are two more limitations. The first is that we only collect metrics every so often, so transient events that happen faster than that are either invisible or smoothed out (depending on what sort of metrics we're collecting). The second is that most of our metrics are only averages, and even over pretty short periods of time averages can be misleading (see here and here).
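To make the second limitation concrete, here's a minimal sketch (my own illustration, with made-up numbers, not anything from our actual metrics) of how averaging over a sampling window smooths away a transient burst:

```python
# A sketch (not from the post) of how sampled averages hide transient
# behaviour. We simulate fine-grained write rates with a short burst,
# then average them the way a metrics system with a coarse scrape
# interval effectively does: one average per window.

def window_averages(samples, window):
    """Average consecutive `window`-sized chunks of `samples`."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

# 100 ticks of mostly-idle IO (5 MB/s) with a 10-tick burst at 500 MB/s.
rates = [5.0] * 45 + [500.0] * 10 + [5.0] * 45

# One coarse sample covering the whole period: the burst vanishes.
print(window_averages(rates, 100))  # [54.5] -- looks like a quiet disk
print(max(rates))                   # 500.0  -- the real peak
```

A 500 MB/s burst becomes an unremarkable 54.5 MB/s average, which is why averages alone can't tell you whether a disk saw a brief saturating spike or a steady trickle.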

To put this in concrete terms, that write IO bandwidth decline we see in the graphs of our Amanda backup servers is real, but it's probably not the full story; all of the cautions I wrote about in my entry on Prometheus disk IO stats definitely apply. If we wanted to fully and deeply understand how our backup servers were behaving here, we'd need more detailed information, either histograms or even block IO traces.

(And even with a lot more data that let us precisely characterize what was going on, we might not understand why the drives were slowing down, because modern drives are opaque black boxes with a lot of things going on inside.)

PS: Of course, if you suspect that your servers are perhaps doing something undesirable, for instance your IMAP server might be doing excessive NFS operations, then it's a good idea to turn your metrics spotlight on that specific aspect of things. But that's a slightly different case, one where you sort of have a hypothesis and you want to test it. An additional advantage of having a metrics system is that when such cases come up you're much more likely to take some easy steps to check your theories of what's wrong, instead of just assuming you're right. As we've found out, it's important to make sure you actually understand your problem.

Comments on this page:

From at 2018-12-28 03:30:52:

It's caused by the normal speed vs offset curve for hard disks (and other spinning media), for example:

It becomes visible in your graph as speed vs time because you're using the disks as tapes, writing sequentially to them from start to finish.

By Etienne Dechamps at 2018-12-28 04:55:09:

...and the physical reason for the above is because spinning disks start writing at the edge of the disk, where linear speed is maximal. As the offset increases, the head moves closer to the centre of the disk, where linear speed is minimal. Every single hard drive in existence will show this behaviour; it is extremely mundane.
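(A back-of-the-envelope sketch of that geometry, with assumed radii that are only illustrative: at constant rotational speed, sequential throughput scales with track circumference, and thus with radius.)

```python
# Illustrative numbers only; real drives use zoned bit recording, so the
# decline is stepped and the exact ratio depends on the zone layout.

outer_radius_mm = 46.0   # assumed outermost data track of a 3.5" drive
inner_radius_mm = 23.0   # assumed innermost data track

outer_rate = 125.0       # MB/s at the start of the disk (from the graph)
# With constant bits per mm of track, the innermost rate would be:
inner_rate = outer_rate * (inner_radius_mm / outer_radius_mm)
print(inner_rate)  # 62.5 MB/s if writing spanned the full platter
```

The observed drop to just under 100 MB/s is much smaller than this worst case, consistent with the backups only filling part of each disk.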

What I see is the yellow disk showing slower writes due to backpressure from IO contention. The average IO Queue size for the yellow disk is often maxed out, at 100 (Percent?), and the disk utilization is also maxed out at 100 (percent? space used?).

If your disks are clustered or raided, then the yellow disk could be failing, with higher rates of errors causing a higher need for correction.

Use smartctl to check the SMART report on the disk. Think about replacing it.

We usually monitor for SMART errors, and replace when they start showing up. This graph tells a story, but not the clearest, most direct story.

By cks at 2019-01-03 00:06:13:

The average IO queue size is in outstanding IO operations and the disk utilization is in percent. The yellow disk is not RAIDed or clustered, although there is a pair of disks somewhat visible that are in a RAID-0 (the dark blue and barely visible red that are being read from as well as written to). The writes are basically asynchronous and limited by the bandwidth to the yellow disk, and at 100% utilization the kernel is pushing write IO to disk as fast as it can.

It'd be interesting to get the actual disk service time, but we can't; Linux's disk IO stats don't provide it. All you can get is total wait time from submission to the block IO system through completion of the actual disk IO; that can rise (as it does here) because of disk service time or because of increasing queues. With flat average IO queue sizes we can sort of assume that the rise in overall write service time is due to increased disk service times, but it's not certain.
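(A sketch of the computation in question, assuming the documented Linux /proc/diskstats format: the only wait figure available is the submission-to-completion time, so "average write wait" is delta of milliseconds spent writing over delta of writes completed between two samples. The diskstats snapshot lines below are fabricated for illustration.)

```python
# /proc/diskstats fields per the kernel's iostats documentation:
# field 8 (index 7) = writes completed, field 11 (index 10) = ms writing.

def parse_diskstats(text, device):
    """Return (writes_completed, ms_writing) for `device`."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[2] == device:
            return int(fields[7]), int(fields[10])
    raise KeyError(device)

def avg_write_wait_ms(before, after, device):
    """Average submission-to-completion wait per write between snapshots."""
    w0, t0 = parse_diskstats(before, device)
    w1, t1 = parse_diskstats(after, device)
    return (t1 - t0) / (w1 - w0) if w1 > w0 else 0.0

# Two fabricated snapshots of a diskstats line for a hypothetical 'sdc':
before = "8 32 sdc 100 0 800 50 1000 0 8000 4000 0 500 4050"
after  = "8 32 sdc 100 0 800 50 1500 0 12000 7000 0 700 7050"
print(avg_write_wait_ms(before, after, "sdc"))  # 6.0 ms per write
```

Note that this number mixes queueing time and disk service time together, which is exactly the ambiguity discussed above.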
