Thinking about what we probably want for monitoring, metrics, etc
The more I poke at this, the more it feels like we should almost completely detach collecting metrics from our alerting. Most of what we want to alert on aren't metrics, and most plausible ongoing metrics aren't alertable. (Machine room temperature is a rare exception.)
Partly this is because there is almost no use in alerting on high system-level metrics that we can't do anything about. Our alertable conditions are mostly things like 'host down'.
(Yes, we are an all-pets place.)
Right now, what we have for all of this is basically a big ball of mud, hosted on an Ubuntu 14.04 machine (so we have to do something about it pretty soon). Today I wound up looking at Prometheus because it was mentioned to me that they'd written code to parse Linux's /proc/self/mountstats, and I was impressed by their 'getting started' demo, and it started thoughts circulating in my head.
Prometheus is clearly a great low-effort way to pull a bunch of system level metrics out of our machines (via their node exporter). But a significant amount of what we use for alerts with our current software for is status checks such as 'is the host responding to SSH connections', and it isn't clear that status checks fit very well into a Prometheus world. I'm sure we could make things work, but perhaps a better choice is to not try to fit a square peg into a round hole.
In contemplating this, I think we have four things all smashed together currently: metrics (how fast do IMAP commands work, what network bandwidth is one of our fileservers using), monitoring (amount of disk space used on filesystems, machine room temperature), status checks (does a host respond to SSH, is our web server answering queries), and alerting, which is mostly driven by status checks but sometimes comes from things we monitor (eg, machine room temperature). Metrics are there for their history alone; we'll never alert on them, often because there's nothing we can do about them in the first place. For monitoring we want both history and alerting, at least some of the time (although who gets the alerts varies). Our status checks are almost always there to drive alerts, and at the moment we mostly don't care about their history in that we never look at it.
(It's possible that we could capture and use some additional status information to help during investigations, to see the last captured state of things before a crash, but in practice we almost never do this with our existing status information.)
In the past when I focused my attention on this area I was purely thinking about adding metrics collection along side our existing system of alerting, status checking, monitoring, and some metrics. I don't think I had considered actively yanking alerting and status checks out from the others (for various reasons), and now it at least feels more likely that we'll do something this time around.
(Four years ago I planned to use graphite and collectd for metrics, but that never went anywhere. I don't know what we'd use today and I'm wary of becoming too entranced with Prometheus after one good early experience, although I do sort of like how straightforward it is to grab stats from hosts. Nor do I know if we want to try to connect our metrics & monitoring solution with our status checks & alerting solution. It might be better to use two completely separate systems that each focus on one aspect, even if we wind up driving a few alerts from the metrics system.)