Monitoring tools should report timestamps (and what they're monitoring)

May 4, 2015

This is a lesson learned, not quite the hard way but close to it. What is now a fairly long time ago I wrote some simple tools to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this:

 40.33 MB/s RX  56.54 MB/s TX   packets/sec: 50331 RX 64482 TX

I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me.

Recently we've been looking at a NFS fileserver lockup problem, and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) that a particular set of output was from, well, I wound up having to rely on window titles.

So now I have a version of these tools with a somewhat different output format:

e1000g1 23:10:08  14.11 MB/s RX  77.40 MB/s TX   packets/sec: 37791 RX 66359 TX

Now this output is more or less self identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines).

I was lucky here in that adding a timestamp plus typical interface names still keep output lines under 80 characters. But even in cases where adding this information would widen the output lines, well, I can widen my xterm windows and it's better to have this information than to have to reconstruct it afterwards. So in the future I think all of my monitoring tools are at least going to have an option to add a timestamp and similar information, and they might print it all the time if it fits (as it does here).

PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry.

PPS: This idea is not a miracle invention of mine by any means. In fact I shamelessly copied it from how useful the timestamps printed out by tools like arcstat are. When I noticed how much I was using those timestamps and how nice it was to be able to scroll back, spot something odd, and say 'ah, this happened at ...' right away, I smacked myself in the forehead and did it for all of the monitoring commands I was using. Fortunately many OmniOS commands like vmstat already have an option to add timestamps, although it's sometimes kind of low-rent (eg vmstat prints the timestamp on a separate line, which doubles how many lines of output it produces and thus halves the effective size of my scrollback buffer).

Written on 04 May 2015.
« What I want to have in shell (filename) completion
Unix's pipeline problem (okay, its problem with file redirection too) »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon May 4 23:58:52 2015
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.