== Monitoring tools should report timestamps (and what they're monitoring) This is a lesson learned, not quite the hard way but close to it. What is now a fairly long time ago I wrote [[some ../linux/SeeingNetworkBandwidth]] [[simple tools ../solaris/SolarisNetworkBandwidth]] to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this: .pn prewrap on > 40.33 MB/s RX 56.54 MB/s TX packets/sec: 50331 RX 64482 TX I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me. Recently we've been looking at [[a NFS fileserver lockup problem ../solaris/OmniOSNFSOverloadProblem]], and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) that a particular set of output was from, well, I wound up having to rely on window titles. So now I have a version of these tools with a somewhat different output format: > e1000g1 23:10:08 14.11 MB/s RX 77.40 MB/s TX packets/sec: 37791 RX 66359 TX Now this output is more or less self identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines). I was lucky here in that adding a timestamp plus typical interface names still keep output lines under 80 characters. But even in cases where adding this information would widen the output lines, well, I can widen my _xterm_ windows and it's better to have this information than to have to reconstruct it afterwards. So in the future I think all of my monitoring tools are at least going to have an option to add a timestamp and similar information, and they might print it all the time if it fits (as it does here). PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry. PPS: This idea is not a miracle invention of mine by any means. In fact I shamelessly copied it from how useful the timestamps printed out by tools like _arcstat_ are. When I noticed how much I was using those timestamps and how nice it was to be able to scroll back, spot something odd, and say 'ah, this happened at ...' right away, I smacked myself in the forehead and did it for all of the monitoring commands I was using. Fortunately many OmniOS commands like _vmstat_ already have an option to add timestamps, although it's sometimes kind of low-rent (eg _vmstat_ prints the timestamp on a separate line, which doubles how many lines of output it produces and thus halves the effective size of my scrollback buffer).