How big our Prometheus setup is (as of January 2020)

January 26, 2020

I've talked about our setup of Prometheus and Grafana before, but what I didn't discuss then is how big it is on various measures: things like how much disk space our Prometheus database takes, how many endpoints we're monitoring, how many metrics we have, how much cardinality is involved, and so on. Today I feel like running down all of those numbers, for various reasons.

We started our production Prometheus setup on November 21st of 2018 and it's been up since then, although the number of metrics we collect has varied over time (generally going up). At the moment our metrics database is using 815 GB, including 5.7 GB of WAL. Over roughly 431 days, that's an average of about 1.9 GB a day (over the past seven days we've been growing at about 1.97 GB a day, which is more representative of our current growth rate).
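
The arithmetic here is easy to redo if you want to check it or plug in your own numbers. This is only a sketch; the 2018 start year is inferred from the 431 day figure, the 2 TB limit is a made-up example rather than our actual disk size, and it assumes growth stays constant, which ours hasn't.

```python
from datetime import date

# Figures quoted in the entry; the start year is inferred from "431 days".
start = date(2018, 11, 21)   # production start
today = date(2020, 1, 26)    # date of this entry
db_size_gb = 815             # current database size, including WAL

days_up = (today - start).days
avg_gb_per_day = db_size_gb / days_up


def days_until(limit_gb, size_gb=db_size_gb, gb_per_day=1.97):
    """Days until the database reaches limit_gb at a constant growth rate."""
    return (limit_gb - size_gb) / gb_per_day


print(days_up, round(avg_gb_per_day, 2))
# Hypothetical 2 TB (2000 GB) limit, purely for illustration.
print(round(days_until(2000)))
```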

At the moment we have 674 different targets that Prometheus scrapes. These range from Blackbox external probes of machines to the Prometheus host agent, so the number of metrics from each target varies considerably. Our major types of targets are Blackbox checks other than pings (260), Blackbox pings (199), and the host agent (108 hosts).

In terms of metrics, Prometheus's status information is currently reporting that we have 1,101 different metrics and 479,161 series in total. Our highest cardinality metrics are the host agent's metrics for systemd unit states (53,470 series) and a local set of metrics for Linux's NFS mountstats that condenses them down to only 27,146 series (if we used the host agent's native support for this information, there would be a lot more). Our highest cardinality label is 'user', which we use both for per-user disk space usage information and for VPN usage (with mostly overlapping user names). Our largest source of series is the host agent, unsurprisingly, with 449,669 of our series coming from it. The second largest is Pushgateway, which is responsible for 16,047 series. If you want to find out this per-job breakdown for your own Prometheus setup, the query you want is:

sort_desc( count({__name__!=""}) by (job) )
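
If you'd rather get this from a script than from the web interface, the same query can be sent to Prometheus's HTTP API at /api/v1/query. What follows is a rough sketch and not something we run; the base URL you pass in, the function names, and the job names in the sample response are all assumptions for illustration (the two counts are the ones quoted above).

```python
import json
import urllib.parse
import urllib.request

QUERY = 'sort_desc( count({__name__!=""}) by (job) )'


def parse_vector(payload):
    """Turn an instant-query JSON response into [(job, series_count), ...]."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    rows = [
        # Sample values arrive as strings, e.g. [timestamp, "449669"].
        (item["metric"].get("job", ""), int(float(item["value"][1])))
        for item in payload["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def series_by_job(base_url):
    """Run QUERY against the Prometheus server at base_url (not called here)."""
    params = urllib.parse.urlencode({"query": QUERY})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_vector(json.load(resp))


# A hand-written sample in the shape the API returns; job names are made up.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "pushgateway"}, "value": [1580000000, "16047"]},
            {"metric": {"job": "node"}, "value": [1580000000, "449669"]},
        ],
    },
}

for job, count in parse_vector(sample):
    print(f"{job:15s} {count:>8d}")
```

Keeping the parsing separate from the HTTP request means you can try it on a saved response without touching a live server.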

The systemd unit state reporting generates so many series because the host agent generates a metric for every unit it reports on for every systemd state the unit can be in:

node_systemd_unit_state{ ..., name="cron.service", state="activating"}   0
node_systemd_unit_state{ ..., name="cron.service", state="active"}       1
node_systemd_unit_state{ ..., name="cron.service", state="deactivating"} 0
node_systemd_unit_state{ ..., name="cron.service", state="failed"}       0
node_systemd_unit_state{ ..., name="cron.service", state="inactive"}     0

Five series for each systemd unit adds up fast, even if you only have the host agent look at systemd .service units (normally it looks at more).
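
To put a rough number on "adds up fast", you can work backward from the series count. This sketch assumes that all 108 host agent machines report systemd unit states, which may not be exactly true for us, so treat the per-host figure as a ballpark.

```python
# The five states the host agent reports for every unit.
STATES = ("activating", "active", "deactivating", "failed", "inactive")


def unit_state_series(units_per_host):
    """Series generated for hosts with the given per-host unit counts."""
    return sum(units_per_host) * len(STATES)


total_series = 53_470   # node_systemd_unit_state series, from the entry
hosts = 108             # assumption: every host agent reports unit states

unit_instances = total_series // len(STATES)
avg_units_per_host = unit_instances / hosts

print(unit_instances, round(avg_units_per_host, 1))
# Going the other way: 108 hosts at 99 units each comes out close to our total.
print(unit_state_series([99] * hosts))
```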

At the moment Prometheus appears to be adding on average about 31,000 samples a second to its database. The Prometheus process is currently reporting about 3.4 GB of resident RAM (on a 32 GB machine), although that undoubtedly fluctuates based on how many people are looking at our Grafana dashboards at any given time, as well as things like WAL compaction. It's using about 10% to 15% of a nominal single CPU (on a four-core machine with HT enabled). Outside of periodic spikes (which are probably for WAL compaction), the server as a whole runs at about 300 KB/s to 400 KB/s of writes; including all activity, the long-term write bandwidth is about 561 KB/s. The long-term incoming network bandwidth is about 345 KB/s. All of this shows that we're not exactly stressing the machine.
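
These numbers can be cross-checked against each other with a little arithmetic. This is a back-of-the-envelope sketch that assumes decimal gigabytes and treats all of the daily disk growth as sample data, which overstates things a bit because the database also holds indexes and the WAL.

```python
series = 479_161            # total series, from the entry
samples_per_sec = 31_000    # average ingestion rate, from the entry
growth_gb_per_day = 1.97    # recent disk growth, from the entry

# If every series were scraped at the same interval, this is that interval.
implied_interval = series / samples_per_sec

# Approximate on-disk cost per sample (assumes 1 GB = 1e9 bytes).
samples_per_day = samples_per_sec * 86_400
bytes_per_sample = growth_gb_per_day * 1e9 / samples_per_day

print(round(implied_interval, 1), "seconds between samples per series")
print(round(bytes_per_sample, 2), "bytes per sample on disk")
```

The implied interval is an average across all targets; any targets scraped less often pull it up.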

(The machine has 32 GB of RAM not for its ordinary needs but to deal with RAM spikes due to complex ad-hoc queries. I've run the machine out of memory before when it had 16 GB. With 32 GB, we have more headroom and have been able to raise Prometheus query limits so we can support longer time ranges in our dashboards.)
