Our setup of Prometheus and Grafana (as of the end of 2019)

December 28, 2019

I have written a fair amount about Prometheus, but I've never described how our setup actually looks in terms of what we're running and where it runs. Even though I feel that our setup is straightforward and small scale as Prometheus setups go, there's some value in actually writing it down, if only to show how you can run Prometheus in a modest environment without a lot of complexity.

The main Prometheus programs and Grafana all are on a single Dell 1U server (currently a Dell R230) running Ubuntu 18.04, with 32 GB of RAM, four disks, and an Intel dual Ethernet card. Two of the disks are mirrored system SSDs and the other two are mirrored 4 TB HDs that we use to hold the Prometheus TSDB metrics data. We use 4 TB HDs not because we have a high metrics volume but because we want a very long metrics retention time; we're currently aiming for at least four years. We use all four network ports on the Prometheus server in order to let the server be directly on several non-routed internal networks that we want to monitor machines on, in addition to our main internal (routed) subnet.

This server currently hosts Prometheus itself, Grafana, Alertmanager, Blackbox, and Pushgateway. Like almost all of our servers, it also runs the node exporter host agent. We use the upstream precompiled versions of everything, rather than the Ubuntu 18.04 supplied ones, because the Ubuntu ones wound up being too far out of date. Among third party exporters, it has the script exporter, which we use for more sophisticated 'blackbox' checks, and the Apache exporter. The web servers for Prometheus, Grafana, Alertmanager, Pushgateway, and Blackbox are all behind an Apache reverse proxy that handles TLS and authentication.

As mentioned, almost all of our Ubuntu machines run the Prometheus host agent. Currently, our mail related machines also run mtail to generate some statistics from their logs, and our nVidia based GPU servers also run a hacked up version of a third party nVidia exporter. On basically all machines running the host agent, we have a collection of scripts that generate various metrics into text files for the host agent's textfile collector. Some of these are generic scripts that run on everything (for things like SMART metrics), but some are specific to certain sorts of machines with certain services running. The basic host agent and associated scripts and /etc/cron.d files are automatically installed on new machines by our install system; other things are set up as part of our build instructions for specific machines.
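
As an illustration of the general pattern, a textfile collector script just writes metrics in the Prometheus text exposition format into the directory that the host agent's --collector.textfile.directory flag points at. Here is a minimal sketch in Python (many of ours are actually shell scripts); the directory, file name, and metric in it are invented for the example and are not our real setup:

    #!/usr/bin/env python3
    # Minimal textfile collector sketch; the directory, .prom file name,
    # and metric name are invented for illustration.
    import os
    import tempfile

    # Wherever the host agent's --collector.textfile.directory points.
    TEXTFILE_DIR = "/var/lib/node_exporter/textfile"

    def write_metrics(metrics):
        # Write to a temporary file in the same directory and then rename
        # it into place, so the host agent never scrapes a partial file.
        fd, tmppath = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            for name, value in metrics.items():
                f.write("%s %s\n" % (name, value))
        os.chmod(tmppath, 0o644)
        os.rename(tmppath, os.path.join(TEXTFILE_DIR, "example.prom"))

    if __name__ == "__main__":
        # A made-up metric standing in for, say, a SMART attribute.
        write_metrics({"example_smart_reallocated_sectors": 0})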

(I've sort of kept an eye on Grafana Loki but haven't actively looked into using it anywhere. I haven't actively explored additional Prometheus exporters; for the most part, our system level metrics needs are already covered.)

Prometheus, Alertmanager, and so on are all configured through static files, including for what targets Prometheus should scrape. We maintain all of these by hand (although they're in a Mercurial repository), because we're not operating at the kind of scale or rate of change where we need to automatically (re)generate the list of targets, our alert rules, or anything like that. We also don't try to have any sort of redundant Prometheus or Alertmanager instances; our approach for monitoring Prometheus itself is fairly straightforward and simple. Similarly, we don't use any of Grafana's provisioning features; we edit dashboards in the Grafana UI and just let it keep everything in its grafana.db file.
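
To give a sense of the scale involved, a hand-maintained static scrape configuration is nothing more than a list of targets in prometheus.yml. The job names and hostnames in this sketch are made up, not our actual configuration:

    # Sketch of a static, hand-maintained scrape configuration; the
    # job names and hostnames here are invented.
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              - 'server1.example.org:9100'
              - 'server2.example.org:9100'

      - job_name: 'alertmanager'
        static_configs:
          - targets: ['localhost:9093']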

(Our Grafana dashboards, Prometheus alert rules, and so on are basically all locally written for our own specific needs and metrics setup. I would like to extract our Grafana dashboards into a text format so I could more conveniently version them in a Mercurial repository, but that's a someday project.)

We back up the Prometheus server's root filesystem, which includes things like /etc/prometheus and the grafana.db file (as well as all of the actual programs involved), but not the Prometheus TSDB metrics database, because that's too big. If we lose both mirrored HDs at the same time (or sufficiently close to it), we'll lose our database of past metrics and will have to start saving them again from the current point in time. We've decided that a deep history of metrics is nice to have but not sufficiently essential that we're going to do better than this.

We have a collection of locally written scripts and some Python programs that generate custom metrics, either on the Prometheus server itself or on other servers that are running relevant software (or sometimes have the necessary access and vantage point). For example, our temperature sensor monitoring is done with custom scripts that are run from cron on the Prometheus server and write to Pushgateway. Some of it could have been done with the SNMP exporter, but rolling our own script was the simpler way to get started. These days, a fair number of these scripts on the Prometheus server are run through the script exporter instead of from cron, for reasons that need another entry. On our other machines, these scripts all run from cron; most of them write files for the textfile collector, and a few publish to Pushgateway.
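
For scripts that publish to Pushgateway, the general shape can be as simple as the following sketch using the Python prometheus_client library. The metric, job name, sensor-reading function, and Pushgateway address are all invented for the example; this is the pattern, not our actual code:

    #!/usr/bin/env python3
    # Sketch of a cron-driven script that pushes one reading to Pushgateway.
    # The metric name, job name, and read_temperature() are invented; the
    # Pushgateway address assumes the default port on the local machine.
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    def read_temperature():
        # Stand-in for however a real temperature sensor gets read.
        return 21.5

    def main():
        registry = CollectorRegistry()
        g = Gauge("machineroom_temperature_celsius",
                  "Machine room temperature in degrees Celsius",
                  ["sensor"], registry=registry)
        g.labels(sensor="example").set(read_temperature())
        # Pushes with the same job (and grouping labels) replace each other,
        # so repeated cron runs simply update the stored values.
        push_to_gateway("localhost:9091", job="temperature", registry=registry)

    if __name__ == "__main__":
        main()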
