Monitoring tools should report timestamps (and what they're monitoring)
This is a lesson learned, not quite the hard way but close to it. What is now a fairly long time ago I wrote some simple tools to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this:
40.33 MB/s RX 56.54 MB/s TX packets/sec: 50331 RX 64482 TX
I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me.
Recently we've been looking at a NFS fileserver lockup problem, and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) that a particular set of output was from, well, I wound up having to rely on window titles.
So now I have a version of these tools with a somewhat different output format:
e1000g1 23:10:08 14.11 MB/s RX 77.40 MB/s TX packets/sec: 37791 RX 66359 TX
Now this output is more or less self identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines).
I was lucky here in that adding a timestamp plus typical interface
names still keep output lines under 80 characters. But even in cases
where adding this information would widen the output lines, well,
I can widen my
xterm windows and it's better to have this information
than to have to reconstruct it afterwards. So in the future I think
all of my monitoring tools are at least going to have an option to
add a timestamp and similar information, and they might print it all
the time if it fits (as it does here).
PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry.
PPS: This idea is not a miracle invention of mine by any means. In
fact I shamelessly copied it from how useful the timestamps printed
out by tools like
arcstat are. When I noticed how much I was using
those timestamps and how nice it was to be able to scroll back,
spot something odd, and say 'ah, this happened at ...' right away,
I smacked myself in the forehead and did it for all of the monitoring
commands I was using. Fortunately many OmniOS commands like
already have an option to add timestamps, although it's sometimes
kind of low-rent (eg
vmstat prints the timestamp on a separate
line, which doubles how many lines of output it produces and thus
halves the effective size of my scrollback buffer).