The history and background of us using Prometheus
On Prometheus and Grafana after a year, a commentator asked some good questions:
Is there a reason why you went with a "metrics-based" (?) monitoring solution like Prometheus-Grafana, and not a "service-based" system like Zabbix (or Nagios)? What (if anything) was being used before the current P-G system?
I'll start with the short answer, which is that we wanted metrics as well as alerting and operating one system is simpler than operating two, even if Prometheus's alerting is not necessarily as straightforward as something intended primarily for that. The longer answer is in the history of how we got here.
Before the current Prometheus system, what we had was based on Xymon and had been in place sufficiently long that portions of it talked about 'Hobbit' (the pre-2009 name of Xymon, cf). Xymon as we were operating it was almost entirely a monitoring and alerting system, with very little to nothing in the way of metrics and metrics dashboards. We've understood for a long time that having metrics is important and we wanted to gather and use them, but we had never managed to turn this desire into actually doing anything (at one point I sort of reached a decision on what to build, but then I never actually built anything for various reasons).
In the fall of 2018 (last year), our existing Xymon setup reached a critical point where we couldn't just let it be, because it was hosted on an Ubuntu 14.04 machine. For somewhat unrelated reasons I wound up looking at Prometheus, and its quick-start demonstration sold me on the idea that it could easily generate useful metrics in our environment (and then let us see them in Grafana). My initial thoughts were to split metrics apart from alerting and to start by setting up Prometheus as our metrics system, then figure out alerting later. I set up a testing Prometheus and Grafana for metrics on a scratch server around the start of October.
Since we were going to run Prometheus and it had some alerting capabilities, I explored if it could more or less sufficiently cover our alerting needs. It turned out that it could, although perhaps not in an ideal way. However, running one system and gathering information once (more or less) is less work than also trying to pick a modern alerting system, set it up, and set up monitoring for it, especially if we wanted to do it on a deadline (with the end of Ubuntu's support for 14.04 looming up on it). We decided that we would at least get Prometheus in place now to replace Xymon, even if it wasn't ideal, and then possibly implement another alerting system later at more leisure if we decided that we needed to. So far we haven't felt a need to go that far; our alerts work well enough in Prometheus, and we don't have all that many custom 'metrics' that really exist only to trigger alerts.
(Things we want to alert on often turn out to also be things that we want to track over time, more often than I initially thought. We've also wound up doing more alerting on metrics than I expected us to.)
Given this history, it's not quite right for me to say that we chose Prometheus over other alternative metrics systems. Although we did do some evaluation of other options after I tried Prometheus's demo and started exploring it, what it basically boiled down to was we had decent confidence Prometheus could work (for metrics) and none of the other options seemed clearly better to the point where we should spend the time exploring them as well. Prometheus was not necessarily the best, it just sold us on that it was good enough.
(Some of the evaluation criteria I used turned out to be incorrect, too, such as 'is it available as an Ubuntu package'. In the beginning that seemed like an advantage for Prometheus and anything that was, but then we wound up abandoning the Ubuntu Prometheus packages as being too out of date.)