Wandering Thoughts archives


Having metrics has taught me I didn't really know how our systems behave

I'll start with a story.

I tweeted some screenshots from our Prometheus setup recently. As it happens, there's something interesting going on in the first one, which you can find a 1920x1010 'full sized' PNG of here. This is a dashboard of one of our Amanda backup servers during backups and thus as it writes data to the disk drives we use as 'tapes'. If you look at the 'disk write bandwidth' graph, you can see the steadily declining write rate to the yellow disk; it drops from about 125 MBytes/sec at the start of writing data to just under 100 MBytes/sec by the end. The disk is 100% utilized during these write-out periods where its bandwidth slowly drops, and over the same time the average write time also goes up. All of this is consistent; it happens on all of our backup servers and on all of our disks.

I don't know what causes any of this or what it means, or if it's important. But what I do know is we had no idea that this was happening until we looked at the metrics. And it's not the only thing about our systems that I've discovered by building dashboards and poking around in them. Over and over I've discovered things from our new metrics system, for example that our FTP area is surprisingly active.

What building a metrics system and using it has taught me is that I didn't really know how our systems behaved. I had some idea of the large scale, obvious things, and there were some specific issues I had looked into even before we built our metrics system, but beyond that I was mostly ignorant. This is not really surprising for any complex system, and most environments are complex systems. There's just too much going on, too many factors influencing each other, too many things you don't know about until you measure and study the system as it really is, not how you imagine it to be.

(This is pretty much the corollary of how you don't know what matters for performance in a complex system, which means you can't really know what your systems are doing until you measure them.)

Having a metrics system is not the only way to discover surprising, previously unknown things about your systems, but it's one of the most wide-ranging ones. By collecting data broadly, without trying to restrict either what you gather or when you gather it to what you expect to be interesting, you enable retrospective serendipity (in addition to the power of being able to explore casual questions, which can itself lead to discoveries).

Of course, our new metrics don't magically let us fully understand our systems. Beyond the limitation created by what data we do and don't collect, which constrains what we can possibly see, there are two more limitations. The first is that we only collect metrics every so often, so transient events that happen faster than that are either invisible or smoothed out (depending on what sort of metrics we're collecting). The second is that most of our metrics are only averages, and even over pretty short periods of time averages can be misleading (see here and here).
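As a toy illustration of how averages over a scrape interval can mislead (my own made-up numbers, not from our dashboards): a one-second burst of 100 MBytes/sec followed by fourteen idle seconds averages out to under 7 MBytes/sec over a 15-second interval, so the scraped metric bears little resemblance to the actual peak.

```c
#include <stddef.h>

/* Average a series of per-second rates over one scrape interval.
   The point is that a single large burst nearly vanishes into the
   average once the interval is much longer than the burst. */
double interval_average(const double *per_second, size_t n)
{
    double sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += per_second[i];
    return sum / (double)n;
}
```

With a 100 MBytes/sec burst in second 0 and nothing afterwards, the 15-second average comes out to about 6.7, hiding the spike entirely.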

To put this in concrete terms, that write IO bandwidth decline we see in the graphs of our Amanda backup servers is real, but it's probably not the full story; all of the cautions I wrote about in my entry on Prometheus disk IO stats definitely apply. If we wanted to fully and deeply understand how our backup servers were behaving here, we'd need more detailed information, either histograms or even block IO traces.

(And even with a lot more data that let us precisely characterize what was going on, we might not understand why the drives were slowing down, because modern drives are opaque black boxes with a lot of things going on inside.)

PS: Of course, if you suspect that your servers are perhaps doing something undesirable, for instance your IMAP server might be doing excessive NFS operations, then it's a good idea to turn your metrics spotlight on that specific aspect of things. But that's a slightly different case, one where you sort of have a hypothesis and you want to test it. An additional advantage of having a metrics system is that when such cases come up you're much more likely to take some easy steps to check your theories of what's wrong, instead of just assuming you're right. As we've found out, it's important to make sure you actually understand your problem.

sysadmin/MetricsAndUnderstandingMore written at 20:59:02

The many return values of read() (plus some other cases)

In the Hacker News discussion on my entry on finding a bug in GNU Tar, there grew a sub-thread about the many different cases of read()'s return value. This is a good example of the potential complexity of the Unix API in practice, and to illustrate it I'm going to run down as many of the cases as I can remember. In all cases, we'll start with 'n = read(fd, buf, bufsize)'.

The simplest and most boring case is a full read, where n is bufsize. This is the usual case when reading from files, except at the end of file. However, you can get a full read in various other cases if there is enough buffered input waiting for you. If you get a full read from a TTY while in line buffered mode, the final line in your input buffer may not be newline terminated. In some cases this may even be the only line in your input buffer (if you have a relatively small input buffer and someone stuffed some giant input into it).

A partial read is where n is larger than zero but less than bufsize. There are many causes of a partial read: you may have hit end of file, you may be reading from a TTY in either regular line buffered mode or raw mode, you may be reading from the network and that's all of the network input that's currently available, or the read() may have been interrupted by a signal after it transferred some data into your buffer. There are probably other cases, especially since it's not necessarily standardized what conditions do and don't produce partial reads instead of complete failures.

(The standard says that read() definitely will return a partial read if it's interrupted by a signal after it's already read some data, but who knows if all actual Unixes behave that way for all types of file descriptors.)

If you're reading from a TTY in line buffered mode, a partial read doesn't mean that you have a full line (someone could have typed EOF on a partial line), nor does it mean that you have only one line in your input buffer instead of several (if a lot of input has built up since you last read() from standard input).
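Because of all these partial-read cases, code that needs a specific amount of data has to loop. A minimal sketch of the usual approach (the function name is my own):

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly 'want' bytes from fd, tolerating partial reads and
   EINTR. Returns the number of bytes read (short only at end of
   file) or -1 on an ordinary error read. */
ssize_t read_fully(int fd, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n > 0)
            got += (size_t)n;   /* a full or partial read */
        else if (n == 0)
            break;              /* end of file */
        else if (errno == EINTR)
            continue;           /* interrupted before any data; retry */
        else
            return -1;          /* some other error */
    }
    return (ssize_t)got;
}
```

Note that this deliberately treats EINTR as "retry"; whether that is what you want depends on why you have signal handlers in the first place.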

A zero read is where n is zero. The most common case is that you are at the end of file; alternately, you may have deliberately performed a zero-sized read() to see if you get any errors but not gotten any. However, implementations are not obliged to return available errors on a zero-sized read; the standard merely says that they 'may' do so. On both TTYs and regular files, end of file is not necessarily a persistent condition; one read() may return 0 bytes read and then a subsequent read() can return a non-zero result. End of file is guaranteed to be persistent on some other types of file descriptors, such as TCP sockets and pipes.
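A small sketch of both flavours of zero read on a pipe (the function name and bitmask encoding are my own, purely for illustration): a zero-sized read() returns 0 even when data is pending, so it cannot be used as an end-of-file test, while a genuine EOF zero read only shows up once the write end is closed.

```c
#include <unistd.h>

/* Walk through the zero-read cases on a pipe. Returns a bitmask:
   bit 0: a zero-sized read() returned 0 despite pending data
   bit 1: read() returned 0 only after the write end was closed
   Returns -1 if the pipe couldn't even be created. */
int zero_read_demo(void)
{
    int p[2], result = 0;
    char buf[8];

    if (pipe(p) != 0)
        return -1;
    (void)write(p[1], "x", 1);

    if (read(p[0], buf, 0) == 0)           /* zero-sized read: not EOF */
        result |= 1;
    (void)read(p[0], buf, sizeof buf);     /* consume the pending byte */

    close(p[1]);                           /* writer gone: EOF possible */
    if (read(p[0], buf, sizeof buf) == 0)  /* genuine zero read: EOF */
        result |= 2;

    close(p[0]);
    return result;
}
```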

What I'll call a signalling error read is where n is -1 and errno is used to signal one of a number of temporary conditions. These include the read() being interrupted by a signal (including SIGALRM), where you get EINTR, and read() being used on a file descriptor marked non-blocking when it would have blocked (EAGAIN and EWOULDBLOCK). It's very possible to get signalling error reads in situations where you don't expect them, for instance if someone passed you a file descriptor that is in non-blocking mode (this can happen with TTYs and is often good for both comedy and an immediate exit by many shells).

An ordinary error read is where n is -1 and errno is telling you about various other error conditions. Some of these error conditions may vanish if you read() again, either at the same offset (in a seekable file) or at a different offset, and some are effectively permanent. It is possible to get ordinary error reads on many sorts of file descriptors, including on TTYs (where you may get EIO under some circumstances). In practice, there is no limit to what errnos may be returned by read() under various circumstances; any attempt to be exhaustive is futile, especially if you want to do so in portable code. Official documentation on possible errno values is no more than a hint.

(Because people need to specifically recognize signalling error read errno values, they are much better documented and much more adhered to. You can be pretty confident of what EAGAIN means if you get it as an errno value on a read(), although whether or not you expected it is another matter.)

In addition to actual return values from read(), at least two additional things can happen when you perform a read(). The first is that your read() can stall for an arbitrary amount of time, up to and including days, even in situations where you expect it to complete rapidly and it normally does (for example, reading from files instead of the network). Second, in some circumstances trying to read() from a TTY will result in your program being abruptly suspended entirely (in fact, your entire process group). This happens if you're in an environment with shell job control and you aren't the foreground process group.

All of this adds up to a relatively complex API, and a significant amount of it is implicit. There are some issues that are pretty well known, such as partial reads on network input, but there are others that people may not run into outside of unusual situations.

unix/ReadManyReturnValues written at 00:05:53
