Our setup of Prometheus and Grafana (as of the end of 2019)
I have written a fair amount about Prometheus, but I've never described how our setup actually looks in terms of what we're running and where it runs. Even though I feel that our setup is straightforward and small scale as Prometheus setups go, there's some value in actually writing it down, if only to show how you can run Prometheus in a modest environment without a lot of complexity.
The main Prometheus programs and Grafana are all on a single Dell 1U server (currently a Dell R230) running Ubuntu 18.04, with 32 GB of RAM, four disks, and an Intel dual Ethernet card. Two of the disks are mirrored system SSDs and the other two are mirrored 4 TB HDs that we use to hold the Prometheus TSDB metrics data. We use 4 TB HDs not because we have a high metrics volume but because we want a very long metrics retention time; we're currently aiming for at least four years. We use all four network ports on the Prometheus server so that it can sit directly on several non-routed internal networks that have machines we want to monitor, in addition to our main internal (routed) subnet.
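To make the disk sizing reasoning concrete, here is a rough sketch of the arithmetic using the rule of thumb from the Prometheus storage documentation (disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with samples typically averaging one to two bytes). The function names are made up for illustration, and the 2 bytes per sample figure is a pessimistic assumption, not a measurement from our server.

```python
# Back-of-the-envelope TSDB sizing, following the Prometheus docs' rule of thumb:
#   needed_disk_space = retention_seconds * samples_per_second * bytes_per_sample
# All names here are hypothetical; 2 bytes/sample is an assumed worst case.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def tsdb_bytes_needed(retention_years, samples_per_second, bytes_per_sample=2.0):
    """Estimate the disk space (in bytes) needed for a given retention time."""
    return retention_years * SECONDS_PER_YEAR * samples_per_second * bytes_per_sample


def max_samples_per_second(disk_bytes, retention_years, bytes_per_sample=2.0):
    """Estimate the sustained ingestion rate that fits on a disk of this size."""
    return disk_bytes / (retention_years * SECONDS_PER_YEAR * bytes_per_sample)


if __name__ == "__main__":
    disk = 4 * 10**12  # a 4 TB disk, ignoring filesystem overhead
    rate = max_samples_per_second(disk, retention_years=4)
    print(f"about {rate:,.0f} samples/second fit in 4 TB over four years")
```

Under these assumptions, a 4 TB disk leaves room for a sustained ingestion rate in the low tens of thousands of samples per second over four years, which is far more than a modest environment is likely to generate.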
This server currently hosts Prometheus itself, Grafana, Alertmanager, Blackbox, and Pushgateway. Like almost all of our servers, it also runs the node exporter host agent. We use the upstream precompiled versions of everything, rather than the Ubuntu 18.04 supplied ones, because the Ubuntu ones wound up being too far out of date. For third party exporters, it runs the script exporter, which we use for more sophisticated 'blackbox' checks, and the Apache exporter. The web servers for Prometheus, Grafana, Alertmanager, Pushgateway, and Blackbox are all behind an Apache reverse proxy that handles TLS and authentication.
As mentioned, almost all of our Ubuntu machines run the Prometheus host agent. Currently, our mail related machines also run mtail to generate some statistics from their logs, and our nVidia based GPU servers also run a hacked up version of a third party nVidia exporter. On basically all machines running the host agent, we have a collection of scripts that generate various metrics into text files for the host agent's textfile collector. Some of these are generic scripts that run on everything (for things like SMART metrics), but some are specific to certain sorts of machines with certain services running. The basic host agent and associated scripts and /etc/cron.d files are automatically installed on new machines by our install system; other things are set up as part of our build instructions for specific machines.
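For readers who haven't used the textfile collector, here is a minimal sketch of what such a script does. This is not one of our actual scripts; the function names and the metric name in the usage note are hypothetical. The one detail that matters is writing to a temporary file and renaming it into place, so the host agent never reads a half-written file. The collector only reads files ending in '.prom', so the temporary file is ignored.

```python
import os
import tempfile


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Return one metric (with its TYPE and optional HELP lines) as text."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{escape_label(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(directory, basename, content):
    """Atomically write content to <directory>/<basename>.prom."""
    final_path = os.path.join(directory, basename + ".prom")
    # The temporary file must be in the same directory for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="." + basename, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        # mkstemp creates the file mode 0600; the host agent may run as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A cron job would call something like write_textfile(textfile_dir, "example", format_metric("example_service_up", 1, labels={"service": "imap"})), where textfile_dir is whatever directory the host agent's --collector.textfile.directory option points at.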
(I've sort of kept an eye on Grafana Loki but haven't actively looked into using it anywhere. I haven't actively explored additional Prometheus exporters; for the most part, our system level metrics needs are already covered.)
Prometheus, Alertmanager, and so on are all configured through static files, including for what targets Prometheus should scrape. We maintain all of these by hand (although they're in a Mercurial repository), because we're not operating at the kind of scale or rate of change where we need to automatically (re)generate the list of targets, our alert rules, or anything like that. We also don't try to have any sort of redundant Prometheus or Alertmanager instances; our approach for monitoring Prometheus itself is fairly straightforward and simple. Similarly, we don't use any of Grafana's provisioning features; we edit dashboards in the Grafana UI and just let it keep everything in its grafana.db file.
(Our Grafana dashboards, Prometheus alert rules, and so on are basically all locally written for our own specific needs and metrics setup. I would like to extract our Grafana dashboards into a text format so I could more conveniently version them in a Mercurial repository, but that's a someday project.)
We back up the Prometheus server's root filesystem, which includes /etc/prometheus and the grafana.db file (as well as all of the actual programs involved), but not the Prometheus TSDB metrics database, because that's too big. If we lose both mirrored HDs at the same time (or sufficiently close to it), we'll lose our database of past metrics and will have to start saving them again from the current point in time. We've decided that a deep history of metrics is nice to have but not sufficiently essential that we're going to do better than this.
We have a collection of locally written scripts and some Python programs that generate custom metrics, either on the Prometheus server itself or on other servers that are running relevant software (or sometimes have the necessary access and vantage point). For example, our temperature sensor monitoring is done with custom scripts that are run from cron on the Prometheus server and write to Pushgateway. Some of it could have been done with the SNMP exporter, but rolling our own script was the simpler way to get started. These days, a fair number of these scripts on the Prometheus server are run through the script exporter instead of from cron for reasons that need another entry. On our other machines, all of them run from cron and most of them write files for the textfile collector; a few publish to Pushgateway.
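As an illustration of the 'run from cron and write to Pushgateway' pattern, here is a minimal sketch using only the Python standard library. It is not our actual temperature script: the metric name, job name, grouping labels, and sensor readings are all hypothetical, and it assumes a Pushgateway reachable at its default port of 9091. It uses an HTTP PUT, which replaces all metrics in the given job and grouping key.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base_url, job, grouping=None):
    """Build the /metrics/job/<job>/<label>/<value>... URL for a push."""
    # Note: label values that are empty or contain '/' need Pushgateway's
    # base64 encoding, which this sketch does not handle.
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(str(value), safe="")]
    return "/".join(parts)


def temperature_body(readings):
    """Format sensor readings (name -> degrees C) in the text exposition format."""
    lines = ["# TYPE example_room_temperature_celsius gauge"]
    for sensor, celsius in sorted(readings.items()):
        lines.append(f'example_room_temperature_celsius{{sensor="{sensor}"}} {celsius}')
    # Pushgateway requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def build_push_request(base_url, job, body, grouping=None):
    """Create (but do not send) the PUT request for a push."""
    return urllib.request.Request(
        pushgateway_url(base_url, job, grouping),
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(base_url, job, body, grouping=None, timeout=10):
    """Send the metrics to Pushgateway and return the HTTP status code."""
    request = build_push_request(base_url, job, body, grouping)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron job would then do something like push("http://localhost:9091", "temperature", temperature_body({"rack1": 21.5}), grouping={"instance": "machineroom"}). Building the request separately from sending it makes the formatting easy to check without a running Pushgateway.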