Prometheus and Grafana after a year (more or less)
We started up our permanent production Prometheus instance on November 21st of 2018, which means that it's now been running and saving metrics for over a year (actually over 13 months by now, because I'm bad at writing entries on time). Our Prometheus and Grafana setup hasn't been static over that time, but it also hasn't undergone any significant changes from our straightforward initial setup (which was essentially the same as our current setup, just with fewer additional third party exporters).
The current state of our Prometheus setup is that it's now a quiet, reliable, and issue-free part of our infrastructure, one that we generally don't have to think about; it just sits there in the background, working as it should. Every so often we get an alert email, but not very often, because we usually don't have problems. Periodically we may look at our Grafana dashboards to see how things are going and whether there's anything we want to look at (I may do this more than my co-workers, because I tend to think the most about Prometheus).
In the earlier days of our deployment (especially the first six months), we had a bunch of learning experiences around things like mass alerts. I spent a fair amount of time working on alert rules, figuring out what to monitor and how, working out how to do clever things like reboot notifications, generating additional custom metrics in various ways, and building and modifying dashboards so they'd be useful, as well as the normal routine maintenance tasks. These days things are almost entirely down to the routine tasks of changing our lists of Prometheus scrape targets as we add and remove machines, and keeping up with new versions of Prometheus, Grafana, and other components.
(I still do like coming up with additional metrics and fiddling with dashboards and I indulge in it periodically, but I'm aware that this is something that I could tinker with endlessly without necessarily generating lots of value.)
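(As an illustration of the sort of clever thing I mean, a reboot notification can be built on node_exporter's node_boot_time_seconds metric, which jumps when a machine reboots. The following is only a sketch; the alert name, window, labels, and annotations are all illustrative choices, not our actual rule.)

```yaml
groups:
  - name: reboot-notifications
    rules:
      - alert: HostRebooted
        # node_boot_time_seconds is exported by node_exporter and changes
        # value when the host reboots, so changes() over a recent window
        # fires once shortly after a reboot and then clears on its own.
        expr: changes(node_boot_time_seconds[15m]) > 0
        labels:
          severity: notify
        annotations:
          summary: "{{ $labels.instance }} appears to have rebooted"
```

(One nice property of this approach is that the alert resolves itself once the window slides past the reboot, so it behaves as a notification rather than an ongoing alarm.)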
Overall, I'm quite happy with how our Prometheus system has turned out. It's been trouble-free to operate and it's delivered (and continues to deliver) what we want for alerts and mostly what we want as far as dashboards go (and the failings there are mine, because I'm the one putting them together). Keeping up with new versions has been easy, and they've delivered a slow and generally reliable stream of improvements, especially in our Grafana dashboards.
(I'm not as happy with the complexity of both Prometheus and Grafana, but a lot of that complexity is probably inherent in anything with those capabilities. As far as building alerts, custom metrics, and so on goes, we would probably have had to do something similar for any system. We can't expect out of the box monitoring for custom systems and environments.)
At the same time, Prometheus and Grafana have not magically illuminated all of our mysterious issues. If anything, staring at Grafana dashboards and looking at direct Prometheus metrics while mysterious things were going on has made me more aware of what information we simply don't have about our systems and what they're doing. Prometheus only gives us some visibility, not perfect visibility, and that's really as expected.
(My suspicion is that we won't be able to do much better until Ubuntu 20.04 ships with a decently usable version of the eBPF toolset.)
On the whole, Prometheus has improved our life but not revolutionized it. We have better alerts and more insight than we used to, but this hasn't solved any big issues that we had before (partly because we didn't really have big issues). In some ways the largest improvement is simply that we now have more reassurance about our environment through having more visibility into it. Our dashboards mean that we can see at a glance that no TLS certificates are too close to expiring, no machines have too high a load, the mail queues are not too large, and so on (and if there are problems, we can see where they are at a glance too).
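(To give a concrete flavor of those at-a-glance checks, the same conditions can be expressed as PromQL, whether in dashboard panels or alert rules. This is a hedged sketch: probe_ssl_earliest_cert_expiry comes from the Blackbox exporter's TLS probes and node_load1 from node_exporter, but the thresholds and alert names here are made-up illustrations, not our real configuration.)

```yaml
groups:
  - name: glance-checks
    rules:
      - alert: TLSCertExpiringSoon
        # probe_ssl_earliest_cert_expiry is a Unix timestamp from the
        # Blackbox exporter; subtracting time() gives seconds remaining.
        # The 14-day threshold is purely illustrative.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires soon"
      - alert: HighLoad
        # node_load1 is node_exporter's one-minute load average; any real
        # threshold would be tuned per machine class rather than fixed.
        expr: node_load1 > 10
        annotations:
          summary: "{{ $labels.instance }} has a high load average"
```

(The mail queue check would rest on a custom metric, since queue sizes aren't something a standard exporter provides for every mail system; that's one of the "additional custom metrics in various ways" mentioned earlier.)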