2018-12-27
Having metrics has taught me I didn't really know how our systems behave
I'll start with a story.
I tweeted some screenshots from our Prometheus setup recently. As it happens, there's something interesting going on in the first one, which you can find a 1920x1010 'full sized' PNG of here. This is a dashboard of one of our Amanda backup servers during backups and thus as it writes data to the disk drives we use as 'tapes'. If you look at the 'disk write bandwidth' graph, you can see the steadily declining write rate to the yellow disk; it drops from about 125 MBytes/sec at the start of writing data to just under 100 MBytes/sec by the end. The disk is 100% utilized during these write-out periods where its bandwidth slowly drops, and over the same time the average write time also goes up. All of this is consistent; it happens on all of our backup servers and on all of our disks.
I don't know what causes any of this or what it means, or if it's important. But what I do know is we had no idea that this was happening until we looked at the metrics. And it's not the only thing about our systems that I've discovered by building dashboards and poking around in them. Over and over I've discovered things from our new metrics system, for example that our FTP area is surprisingly active.
What building a metrics system and using it has taught me is that I didn't really know how our systems behaved. I had some idea of the large scale, obvious things, and there were some specific issues I had looked into even before we built our metrics system, but beyond that I was mostly ignorant. This is not really surprising for a complex system, which most environments are. There's just too much going on, too many factors influencing each other, too many things you don't know about until you measure and study the system as it really is, not how you imagine it to be.
(This is pretty much the corollary of how you don't know what matters for performance in a complex system, which means you can't really know what your systems are doing until you measure them.)
Having a metrics system is not the only way to discover surprising, previously unknown things about your systems, but it's one of the most wide-ranging ones. By collecting data broadly, without trying to restrict either what you gather or when you gather it to what you expect to be interesting, you enable retrospective serendipity (in addition to the power of being able to explore casual questions, which can itself lead to discoveries).
Of course, our new metrics don't magically let us fully understand our systems. Beyond the limitation created by what data we collect and don't collect, which constrains what we can even possibly see into, there are two more limitations. The first is that we only collect metrics every so often, so transient events that happen faster than that are either invisible or smoothed out (depending on what sort of metrics we're collecting). The second is that most of our metrics are only averages, and even over pretty short periods of time averages can be misleading (see here and here).
To put this in concrete terms, that write IO bandwidth decline we see in the graphs of our Amanda backup servers is real, but it's probably not the full story; all of the cautions I wrote about in my entry on Prometheus disk IO stats definitely apply. If we wanted to fully and deeply understand how our backup servers were behaving here, we'd need more detailed information, either histograms or even block IO traces.
(And even with a lot more data that let us precisely characterize what was going on, we might not understand why the drives were slowing down, because modern drives are opaque black boxes with a lot of things going on inside.)
PS: Of course, if you suspect that your servers are perhaps doing something undesirable, for instance your IMAP server might be doing excessive NFS operations, then it's a good idea to turn your metrics spotlight on that specific aspect of things. But that's a slightly different case, one where you sort of have a hypothesis and you want to test it. An additional advantage of having a metrics system is that when such cases come up you're much more likely to take some easy steps to check your theories of what's wrong, instead of just assuming you're right. As we've found out, it's important to make sure you actually understand your problem.
The many return values of read() (plus some other cases)
In the Hacker News discussion on my entry on finding a bug in GNU Tar, there grew a sub-thread about the many different cases of read()'s return value. This is a good example of the potential complexity of the Unix API in practice, and to illustrate it I'm going to run down as many of the cases as I can remember. In all cases, we'll start with 'n = read(fd, buf, bufsize)'.
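For concreteness, here's a minimal sketch of that call with the POSIX types spelled out (the buffer size and the use of standard input are arbitrary choices of mine, not anything from the discussion):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char    buf[4096];              /* arbitrary buffer size */
        size_t  bufsize = sizeof(buf);
        ssize_t n;                      /* signed, so it can hold -1 */
        int     fd = 0;                 /* standard input, as a stand-in fd */

        n = read(fd, buf, bufsize);
        printf("read() returned %zd\n", n);
        return 0;
    }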
The simplest and most boring case is a full read, where n is bufsize. This is the usual case when reading from files, except at the end of file. However, you can get a full read in various other cases if there is enough buffered input waiting for you. If you get a full read from a TTY while in line buffered mode, the final line in your input buffer may not be newline terminated. In some cases this may even be the only line in your input buffer (if you have a relatively small input buffer and someone stuffed some giant input into it).
A partial read is where n is larger than zero but less than bufsize. There are many causes of a partial read; you may have hit end of file, you may be reading from a TTY in either regular line buffered mode or raw mode, you may be reading from the network and that's all of the network input that's currently available, or the read() may have been interrupted by a signal after it transferred some data into your buffer. There are probably other cases, especially since it's not necessarily standardized what conditions do and don't produce partial reads instead of complete failures.
(The standard says that read() definitely will return a partial read if it's interrupted by a signal after it's already read some data, but who knows if all actual Unixes behave that way for all types of file descriptors.)

If you're reading from a TTY in line buffered mode, a partial read doesn't mean that you have a full line (someone could have typed EOF on a partial line), or that you have only one line in your input buffer, not several (if a lot of input has built up since you last read() from standard input).
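As a concrete illustration of coping with partial reads, here's a minimal sketch of a 'read fully' helper (my own illustration, not something from the discussion); it keeps calling read() until the buffer is full, it hits end of file, or it gets a real error, and it retries on EINTR:

    #include <errno.h>
    #include <unistd.h>

    /* Read up to 'want' bytes into buf, riding out partial reads and
     * EINTR.  Returns the number of bytes read (short only at end of
     * file) or -1 on error. */
    ssize_t read_fully(int fd, char *buf, size_t want)
    {
        size_t got = 0;

        while (got < want) {
            ssize_t n = read(fd, buf + got, want - got);
            if (n > 0)
                got += (size_t)n;   /* a full or partial read */
            else if (n == 0)
                break;              /* end of file */
            else if (errno == EINTR)
                continue;           /* interrupted before any data */
            else
                return -1;          /* an error; for simplicity we don't report partial data */
        }
        return (ssize_t)got;
    }

A loop like this deliberately erases the distinction between one full read and several partial ones; whether that's what you want depends on what you're reading from.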
A zero read is where n is zero. The most common case is that you are at the end of file; alternately, you may have deliberately performed a zero-sized read() to see if you get any errors but not gotten any. However, implementations are not obliged to return available errors on a zero-sized read; the standard merely says that they 'may' do so.
On both TTYs and regular files, end of file is not necessarily a persistent condition; one read() may return 0 bytes read and then a subsequent read() can return a non-zero result. End of file is guaranteed to be persistent on some other types of file descriptors, such as TCP sockets and pipes.
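This non-persistent end of file is what makes a 'tail -f' style loop possible; here's a hedged sketch of the idea (my illustration, assuming a regular file that something else may still be appending to):

    #include <stdio.h>
    #include <unistd.h>

    /* Follow a growing file: when read() reports end of file (0), wait
     * a bit and try again, because more data may have been appended. */
    void follow(int fd)
    {
        char buf[4096];

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0)
                fwrite(buf, 1, (size_t)n, stdout);
            else if (n == 0)
                sleep(1);       /* at EOF now, but maybe not in a second */
            else
                break;          /* an error; a real program would check errno */
        }
    }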
What I'll call a signalling error read is where n is -1 and errno is used to signal a number of temporary conditions. These include that the read() was interrupted by a signal, including SIGALRM, where you get EINTR, and that read() was used on a file descriptor marked non-blocking and would have blocked (EAGAIN and EWOULDBLOCK). It's very possible to get signalling error reads in situations where you don't expect them, for instance if someone passed you a file descriptor that is in non-blocking mode (this can happen with TTYs and is often good for both comedy and an immediate exit by many shells). At that point they are, in effect, ordinary error reads.
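One common way of dealing with signalling error reads is to retry EINTR immediately and to wait before retrying EAGAIN/EWOULDBLOCK; here's a minimal sketch using poll(), which is my illustration rather than anything canonical:

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* read() that retries the 'signalling' errors: EINTR right away,
     * EAGAIN/EWOULDBLOCK once poll() says the fd is readable again. */
    ssize_t read_retry(int fd, char *buf, size_t bufsize)
    {
        for (;;) {
            ssize_t n = read(fd, buf, bufsize);
            if (n >= 0)
                return n;
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                struct pollfd pfd = { .fd = fd, .events = POLLIN };
                (void)poll(&pfd, 1, -1);    /* block until readable (or error) */
                continue;
            }
            return -1;                      /* an ordinary error read */
        }
    }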
An ordinary error read is where n is -1 and errno is telling you about various other error conditions. Some of these error conditions may vanish if you read() again, either at the same offset (in a seekable file) or at a different offset, and some are effectively permanent. It is possible to get ordinary error reads on many sorts of file descriptors, including on TTYs (where you may get EIO under some circumstances). In practice, there is no limit to what errnos may be returned by read() under various circumstances; any attempt to be exhaustive is futile, especially if you want to do so in portable code. Official documentation on possible errno values is no more than a hint.
(Because people need to specifically recognize signalling error read errno values, they are much better documented and much more adhered to. You can be pretty confident of what EAGAIN means if you get it as an errno value on a read(), although whether or not you expected it is another matter.)
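Putting the categories together, a caller that wants to react to all of them ends up with a decision tree something like this sketch (the messages and the handling are just placeholders of mine):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Classify a single read() result into the cases discussed above. */
    void classify_read(int fd, char *buf, size_t bufsize)
    {
        ssize_t n = read(fd, buf, bufsize);

        if (n == (ssize_t)bufsize)
            printf("full read\n");
        else if (n > 0)
            printf("partial read of %zd bytes\n", n);
        else if (n == 0)
            printf("zero read (usually end of file)\n");
        else if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
            printf("signalling error read: %s\n", strerror(errno));
        else
            printf("ordinary error read: %s\n", strerror(errno));
    }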
In addition to actual return values from read(), at least two additional things can happen when you perform a read(). The first is that your read() can stall for an arbitrary amount of time, up to and including days, even in situations where you expect it to complete rapidly and it normally does (for example, reading from files instead of the network). Second, in some circumstances trying to read() from a TTY will result in your program being abruptly suspended entirely (in fact, your entire process group). This happens if you're in an environment with shell job control and you aren't the foreground process group.
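If you want to know in advance whether a read() from a terminal could stop you this way, one approach is to compare the terminal's foreground process group with your own; a sketch under the assumption that fd is your controlling terminal:

    #include <unistd.h>

    /* Could a read() from this fd stop us with SIGTTIN?  Yes, if it is
     * our controlling terminal and we are not its foreground process
     * group.  If tcgetpgrp() fails (for example with ENOTTY), fd isn't
     * a terminal we need to worry about here. */
    int read_may_suspend_us(int fd)
    {
        pid_t tpgrp = tcgetpgrp(fd);
        if (tpgrp == -1)
            return 0;
        return tpgrp != getpgrp();
    }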
All of this adds up to a relatively complex API, and a significant amount of it is implicit. There are some issues that are pretty well known, such as partial reads on network input, but there are others that people may not run into outside of unusual situations.