The size of our Prometheus setup as of May 2021

May 15, 2021

At this point we've been running our Prometheus setup in production since November 21st 2018. This start date matters because one of our peculiarities is that we have yet to expire any metrics; we've kept everything back to the start of production, which means that we now have about two and a half years of accumulated metrics. We're still running Prometheus on the same hardware from the end of 2019, with its database stored on a mirrored pair of 4 TB drives. So here are some numbers about that and other aspects of our setup, partly because I want to record them now.

At the moment, we have 2.1 TB of Prometheus database (out of what 'df -h' rounds to 3.6 TB in the filesystem), and we're consuming about 2.8 GB of additional space each day. We should last at least another year at this rate, but not two. At that point we will probably upgrade the server's database disks to a pair of larger HDs; I'd guess that we'll aim for 12 TB or 14 TB HDs to have many more years of room. Hard drives have provided enough performance for us, and as usual it probably helps that we almost never look very far back in time from right now or scan large ranges.
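The 'at least another year, but not two' estimate can be checked with some quick arithmetic. This is only a rough sketch using the figures above; it assumes the growth rate stays constant and treats TB and GB as decimal units, which 'df -h' does not strictly do.

```python
# Figures taken from the text above.
used_tb = 2.1            # current size of the Prometheus database
total_tb = 3.6           # filesystem size as rounded by 'df -h'
growth_gb_per_day = 2.8  # additional space consumed each day

# Assumption: 1 TB = 1000 GB and the daily growth stays flat.
free_gb = (total_tb - used_tb) * 1000
days_left = free_gb / growth_gb_per_day
years_left = days_left / 365

print(f"{days_left:.0f} days, about {years_left:.1f} years")
```

This comes out to roughly a year and a half of room, and in practice less if the ingestion rate keeps rising the way it has so far.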

(Prometheus will query back to the start of our database without particular problems, although loading several years of data is not necessarily the fastest thing if you make a query with a big range.)

These days we're running at an ingestion rate of about 39,000 samples a second (this can be computed from the prometheus_tsdb_head_samples_appended_total metric). When we started this was around 21,000 samples a second, but it rose as we added more metrics and more machines. We currently scrape 720 sample sources (although a bunch of these are individual probes on Blackbox). Pushgateway is our largest single source of samples, currently running around 10k samples per scrape, followed by a number of host agents (almost entirely because of extra metrics we've added). Over 590 of our 720 sources generate fewer than 200 samples per scrape.

(This comes from counting up and looking at scrape_samples_scraped.)
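For anyone who wants to pull the same sort of numbers out of their own setup, here is a minimal sketch of PromQL expressions built on the two metrics mentioned above, wrapped in a small helper that builds instant-query URLs for Prometheus's HTTP API. The server address, the 5m rate window, the `topk` size, and the dictionary keys are all my assumptions for illustration, not necessarily the exact queries used for the numbers here. The first query also assumes that Prometheus scrapes its own metrics.

```python
from urllib.parse import urlencode

# Hypothetical server address; substitute your own Prometheus URL.
PROMETHEUS = "http://localhost:9090"

# Illustrative PromQL expressions (window and topk size are arbitrary choices).
QUERIES = {
    # Ingestion rate in samples per second.
    "samples_per_second": "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    # How many scrape sources (targets) there are in total.
    "scrape_sources": "count(scrape_samples_scraped)",
    # How many sources produce fewer than 200 samples per scrape.
    "small_sources": "count(scrape_samples_scraped < 200)",
    # The five largest sources by samples per scrape.
    "largest_sources": "topk(5, scrape_samples_scraped)",
}


def query_url(expr, base=PROMETHEUS):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


for name, expr in QUERIES.items():
    print(name, query_url(expr))
```

The same expressions can simply be pasted into the Prometheus web interface instead; the URL form is only useful if you want to collect the numbers from a script.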

Right now, those 720 scrape sources are mostly 119 host agents, 209 Blackbox ICMP pings, 301 other Blackbox checks, 52 script_exporter checks, and then everything else is in small quantities (although not necessarily small in samples; we only have one Pushgateway, but as mentioned it's the single largest source of samples). Every server with a host agent also gets at least two Blackbox checks (ICMP ping and an SSH connection), but we ping and check other things too as you can see from the numbers.
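To make 'everything else is in small quantities' concrete, the breakdown above can be added up. This is just arithmetic on the figures in the text; the variable names are mine.

```python
# Counts taken from the text above.
total_sources = 720
named_sources = {
    "host agents": 119,
    "Blackbox ICMP pings": 209,
    "other Blackbox checks": 301,
    "script_exporter checks": 52,
}

named_total = sum(named_sources.values())
everything_else = total_sources - named_total

# Blackbox probes (ICMP pings plus other checks) as a share of all sources.
blackbox_total = (named_sources["Blackbox ICMP pings"]
                  + named_sources["other Blackbox checks"])
blackbox_share = blackbox_total / total_sources

print(named_total, everything_else, f"{blackbox_share:.0%}")
```

So the four named categories cover 681 of the 720 sources, leaving 39 for everything else, and Blackbox probes of one sort or another are about 71% of all scrape sources.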

We currently have 111 alert rules (from summing up prometheus_rule_group_rules), all running at the default 15 second rule evaluation rate. Our rule evaluation time appears to be trivial, as I'd expect since we have almost no alert rules that look back more than a small amount of time. We don't have any recording rules; I used one or two back in the very beginning, then got rid of them once I understood more about what I was doing.
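As a rough sketch of how these rule numbers can be pulled out, the queries below sum up prometheus_rule_group_rules and look at the slowest rule group through prometheus_rule_group_last_duration_seconds, and the helper extracts the value from an instant-query JSON response. The constant names, the helper, and the sample response are all invented for illustration; only the response layout follows Prometheus's HTTP API.

```python
import json

# Total number of rules across all rule groups.
RULE_COUNT_QUERY = "sum(prometheus_rule_group_rules)"
# Duration in seconds of the slowest group's most recent evaluation.
SLOWEST_GROUP_QUERY = "max(prometheus_rule_group_last_duration_seconds)"


def first_value(response_text):
    """Return the first sample value from an instant-query JSON response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each vector sample looks like
    # {"metric": {...}, "value": [timestamp, "value as a string"]}.
    return float(result[0]["value"][1])


# Made-up response for illustration only, shaped like a vector result.
example = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1621036800,"111"]}]}}'
)
print(RULE_COUNT_QUERY, "->", first_value(example))
```

Comparing the slowest group's evaluation time against the evaluation interval is one simple way to confirm that rule evaluation really is trivial.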

PS: On our normal Ubuntu servers, between 65% and 82% of the host agent metrics are actually our metrics, not node_exporter. This is because we generate our own relatively detailed NFS client metrics for all NFS mounts from our fileservers. On a server with full NFS mounts, which is most of them, this creates about 4,000 time series. This is still a lot fewer than the node_exporter mountstats collector would create if we enabled it.

Comments on this page:

By mappu at 2021-05-15 01:56:25:

I've always thought the default infinite-retention of Prometheus' data stores to be ridiculous when coming from RRD-based monitoring solutions like Munin.

It's not useful to keep 15-second resolution indefinitely, RRD is the right thing to do, and it's a shame it's such a weird hack to get that behaviour on top of Prometheus.

By cks at 2021-05-16 00:01:33:

I've come to disagree with the idea of down-sampling metrics data by default, for reasons that I've now written up in MetricsDownsamplingNotIdeal.

Hi Chris. How much memory is this setup consuming?

By cks at 2021-05-22 00:25:15:

Jr: this is a good question but I don't have any answer that I fully trust. I've written up my best information in PrometheusMemoryUncertainty.

Written on 15 May 2021.

Last modified: Sat May 15 00:10:52 2021