2016-06-12
Some notes on adding exposed statistics to a (Go) program
As a slow little project for the past while, I have been adding some accessible statistics to my sinkhole SMTP server, using Go's expvar package. This has resulted in me learning lessons both about expvar in particular and about the process of adding statistics in general.
My big learning experience is going to sound fairly obvious and trite: I only really figured out what statistics I wanted to expose through experimentation. I started out with the idea that counting some obvious things would be interesting (and to a certain extent they were), but I created many of the stats by a process of looking at the current set and realizing that there was information I wanted to know or questions that I wanted answered that were not covered by the existing things I was exposing. Sometimes trying to use the initial version of a statistic showed me that it was too broad or needed some additional information in order to be useful.
The corollary to this is that what statistics you'll want depends in large part on what questions are interesting and informative for you, which depends on how you're using the program. A lot of my stats are focused on anti-spam related issues, because that's how I'm using my sinkhole SMTP server. Someone using it to collect email from a collection of nodes and tests might well want a significantly different set of statistics. This does make adding stats to a theoretically general program a somewhat tricky thing; I have no good answers to this currently.
(I have not tried to be particularly general in my current set of stats. Since this has been an experiment to play around with the idea, I've focused on making them interesting to me.)
Just exporting statistics from a program is less general than pushing
events and metrics into a full time series based metrics system,
but Go's expvar package and a few other tools like jq make it
much easier to do the former (for a start, I don't need a metrics
system). Exporting statistics is also not as comprehensive as having
an event log or the like. Since I do sort of have an event log,
I've chosen to view my expvar stats as being an on-demand summary
of it, one that I can look at without having to actively parse the
log to count things up.
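As a minimal sketch of what exporting statistics with expvar can look like (this is not the sinkhole server's actual code; the counter names and the recordSession and fetchVars helpers are invented for illustration), the following publishes two counters and then reads them back through the standard expvar handler without starting a real HTTP server:

```go
package main

import (
	"encoding/json"
	"expvar"
	"fmt"
	"net/http/httptest"
)

// Hypothetical counters; the names are invented for this sketch.
var (
	connections = expvar.NewInt("connections")
	messages    = expvar.NewInt("messages")
)

// recordSession counts one SMTP session that accepted nmsgs messages.
func recordSession(nmsgs int64) {
	connections.Add(1)
	messages.Add(nmsgs)
}

// fetchVars runs the standard expvar handler in-process and decodes
// the JSON document it produces, much as an HTTP client would see it.
func fetchVars() (map[string]json.RawMessage, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/debug/vars", nil)
	expvar.Handler().ServeHTTP(rec, req)
	vars := make(map[string]json.RawMessage)
	err := json.Unmarshal(rec.Body.Bytes(), &vars)
	return vars, err
}

func main() {
	recordSession(2)
	recordSession(1)
	vars, err := fetchVars()
	if err != nil {
		panic(err)
	}
	fmt.Printf("connections=%s messages=%s\n", vars["connections"], vars["messages"])
}
```

In a long-running program, importing expvar also registers this handler at /debug/vars on http.DefaultServeMux, so if the program serves HTTP on, say, localhost:8080 (an assumed address), something like `curl -s http://localhost:8080/debug/vars | jq .messages` would pull out a single counter.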
And on another obvious note, putting counters and so on in a hierarchical namespace is quite helpful for keeping things comprehensible and organized. To some extent a good hierarchy can make up for not being able to come up with great names for individual statistics. And sometimes you have data with unpredictable names that has to be confined to a namespace.
(For instance, I track DNS blocklist hit counts. The names of DNS
blocklists are essentially arbitrary, so I put the whole set of
stats into a dnsbl_hits namespace. And because the expvar
package automatically publishes some general Go stats on things
like your program's memory usage, I put all of my stats under a
top-level name so it's easy to pick them out.)
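One way to get this sort of hierarchy out of expvar is to nest expvar.Map values. The sketch below assumes a top-level name of "sinkhole" and made-up blocklist zones; only the dnsbl_hits name comes from my actual setup, and the helper functions are hypothetical:

```go
package main

import (
	"expvar"
	"fmt"
)

var (
	// One top-level map keeps these stats apart from the "cmdline"
	// and "memstats" variables that expvar publishes on its own.
	// The name "sinkhole" is an assumption for this sketch.
	stats = expvar.NewMap("sinkhole")
	// A nested map confines the arbitrary blocklist names.
	dnsblHits = new(expvar.Map).Init()
)

func init() {
	stats.Set("dnsbl_hits", dnsblHits)
}

// countDNSBLHit records one hit against the named blocklist; the
// key is created the first time it is seen.
func countDNSBLHit(list string) {
	dnsblHits.Add(list, 1)
}

// hits reports the current count for a blocklist, or 0 if unseen.
func hits(list string) int64 {
	if v, ok := dnsblHits.Get(list).(*expvar.Int); ok {
		return v.Value()
	}
	return 0
}

func main() {
	// Hypothetical blocklist zones.
	countDNSBLHit("bl.example.org")
	countDNSBLHit("bl.example.org")
	countDNSBLHit("zen.example.net")
	fmt.Println(hits("bl.example.org"), hits("zen.example.net"))
	// The whole subtree as JSON, as it would appear under "sinkhole".
	fmt.Println(stats.String())
}
```

Because Map.Add creates missing keys, new blocklists show up in the output without any registration step, and they can never collide with a top-level name.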
2016-06-05
My approach for inspecting Go error values
Dave Cheney sort of recently wrote 'Don't just check errors, handle
them gracefully',
where he strongly suggested that basically you should never check
the actual values of errors. This is generally a great idea, but
sometimes you don't have a choice. For example, for a long time it
was the case that the only way to tell if your DNS lookup had hit
a temporary DNS error (such as 'no authoritative servers for this
domain are responding') or a permanent one ('this name doesn't
exist') was to examine the specific error that you received. While
net.DNSError had a .Temporary() method, it didn't return true
in enough cases; you had to go digging deeper to know.
(This was Go issue 8434 and has since been fixed, although it took a while.)
When I had to work around this issue (in code that I suppose I should now remove), I was at least smart enough to try the official way first:
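// The error text that .Temporary() missed (Go issue 8434).
var serverrstr = "server misbehaving"

// isTemporary reports whether err is a temporary DNS failure.
func isTemporary(err error) bool {
	if e, ok := err.(*net.DNSError); ok {
		// Try the official way first, then fall back to the string.
		if e.Temporary() || e.Err == serverrstr {
			return true
		}
	}
	return false
}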
Checking the official way first made it so that once this issue was
resolved, my code would immediately start relying on the official
way. Checking the error string only for net.DNSError errors made
sure that I wouldn't get false positives from other error types,
which seemed like a good idea at the time.
When I wrote this code I felt reasonably smart about it; I thought
I'd done about as well as I could. Then Dave Cheney's article showed
me that I wasn't quite doing this right; as he says in one section
('Assert errors for behaviour, not type'), I should have really
checked for .Temporary() through an interface instead of just
directly checking the error as a net.DNSError. After all, maybe
someday net.LookupMX() and company will return an additional type
of error in some circumstances that has a .Temporary() method;
if that happened, my code here wouldn't work right.
(I even put some comments in musing about the idea, but then rejected
it on the grounds that the current net package code didn't do
that so there didn't seem to be any point. In retrospect that was
the wrong position to take, because I wasn't thinking about potential
future developments in the net package.)
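A sketch of what the behaviour-based version might look like follows. The temporary interface mirrors the approach in Dave Cheney's article; flakyError and dnsErr are hypothetical helpers that exist only to exercise the function, and the string fallback is kept as an assumption about how I would have carried the workaround over:

```go
package main

import (
	"fmt"
	"net"
)

// temporary is the behaviour we care about; any error type may
// implement it, not just *net.DNSError.
type temporary interface {
	Temporary() bool
}

// The workaround string from the original code.
const serverrstr = "server misbehaving"

func isTemporary(err error) bool {
	// Official way first, asked of any error that offers it.
	if t, ok := err.(temporary); ok && t.Temporary() {
		return true
	}
	// Fallback string match, still limited to DNS errors.
	if e, ok := err.(*net.DNSError); ok {
		return e.Err == serverrstr
	}
	return false
}

// flakyError is a hypothetical non-DNS error with a Temporary method.
type flakyError struct{}

func (flakyError) Error() string   { return "flaky" }
func (flakyError) Temporary() bool { return true }

// dnsErr builds a *net.DNSError with the given error text, for testing.
func dnsErr(msg string) error {
	return &net.DNSError{Err: msg, Name: "example.org"}
}

func main() {
	fmt.Println(isTemporary(flakyError{}))          // true
	fmt.Println(isTemporary(dnsErr(serverrstr)))    // true
	fmt.Println(isTemporary(dnsErr("no such host"))) // false
}
```

In Go 1.13 and later, errors.As can perform the same sort of check through wrapped errors, which did not exist when the original code was written.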
I'm conflicted over whether to type-assert to specific error types if you
have to check the actual error value in some way (as I do here). I
think it comes down to which way is safer for the code to fail. If
you check the value only through the error's .Error() string, future
changes in the code you're calling may cause you to match on things
that aren't the specific error type you're expecting. If you assert
the type first, you may instead miss a new error type that carries
the same meaning. Sometimes matching will be
the right answer and sometimes it will be the wrong one, so you
have to weigh the harm of a false positive against the harm of a
false negative.
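To make the tradeoff concrete, here is a small sketch that contrasts the two styles. The function names and futureError (standing in for an error type that some later version of the called code might start returning) are hypothetical:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

const serverrstr = "server misbehaving"

// matchByString looks only at the error text, so it accepts any
// error type whose message mentions the string.
func matchByString(err error) bool {
	return err != nil && strings.Contains(err.Error(), serverrstr)
}

// matchByType insists on *net.DNSError before comparing the string.
func matchByType(err error) bool {
	e, ok := err.(*net.DNSError)
	return ok && e.Err == serverrstr
}

// futureError is a hypothetical error type that does not exist in
// the net package; it only illustrates a later change upstream.
type futureError struct{ msg string }

func (e futureError) Error() string { return e.msg }

func main() {
	dns := &net.DNSError{Err: serverrstr, Name: "example.org"}
	fmt.Println(matchByString(dns), matchByType(dns)) // true true

	fut := futureError{"proxy: upstream server misbehaving"}
	fmt.Println(matchByString(fut), matchByType(fut)) // true false
}
```

Whether the second line shows a false positive from matchByString or a false negative from matchByType depends entirely on what the new error actually means, which is the judgment call.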