The size of our Prometheus setup as of June 2024
At this point we've been running our Prometheus setup since November 21st 2018, and we still haven't expired any metrics, so we have full resolution metrics data right back to the beginning. Three years ago I wrote about how big our setup was as of May 2021, and since someone on the Prometheus mailing list was recently asking how big a Prometheus setup you could run, I'm going to do an update on our numbers.
Our core Prometheus server is still a Dell 1U server with 64 GB of RAM, because we could put that much in and it's cheap insurance against high memory usage. The Prometheus time series database (TSDB) lives on a mirrored pair of 20 TB HDDs (in 2021 we used 4 TB HDDs, but we've since run out of space and moved to bigger drives). At the moment, 'du -h' says we're using 6.3 TB of disk space. Disk space usage has been rising steadily over time; in 2019, 20 days of metrics took 35 GB, and these days they take 104 GB.
(In the two 20-day chunk directories I'm looking at, in 2019 we had 50073266852 samples for 465988 series, and in 2024 we had 130974619588 samples for 1460523 series, which we can broadly approximate as about a tripling.)
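(As an illustration of the arithmetic, here's a little Python sketch that runs those numbers; it assumes the 35 GB and 104 GB figures are the GiB-style units that 'du -h' reports. Samples on disk work out to somewhat under a byte each, courtesy of the TSDB's compression.)

    # Back of the envelope math with the 20-day chunk numbers above.
    GIB = 1024 ** 3

    samples_2019, series_2019, disk_2019 = 50_073_266_852, 465_988, 35 * GIB
    samples_2024, series_2024, disk_2024 = 130_974_619_588, 1_460_523, 104 * GIB

    print(f"sample growth: {samples_2024 / samples_2019:.2f}x")   # ~2.6x
    print(f"series growth: {series_2024 / series_2019:.2f}x")     # ~3.1x
    print(f"disk growth:   {disk_2024 / disk_2019:.2f}x")         # ~3.0x

    # bytes of disk space per stored sample
    print(f"2019: {disk_2019 / samples_2019:.2f} bytes/sample")   # ~0.75
    print(f"2024: {disk_2024 / samples_2024:.2f} bytes/sample")   # ~0.85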
These days we're running at an ingestion rate of about 73,000 samples a second, scraped from 947 different sample sources; the largest single category of those sources continues to be Blackbox probes. Our largest single source of samples is no longer Pushgateway (it's now far down the list) but instead the ZFS exporter we use to get highly detailed ZFS metrics; our most chatty ZFS fileserver generates 95,000 samples from it. Apart from that, the most chatty sample sources are the Prometheus host agents on some of our machines, which can generate up to 19,000 metrics each, primarily because some of our servers have a lot of CPUs. About 700 of our scrape sources generate fewer than 50 samples per scrape.
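(If you want to see where these sorts of numbers come from, most of them can be pulled out of Prometheus itself. Here's a rough sketch of the kind of queries involved, going through Prometheus's HTTP query API; the server URL is a stand-in, not our real one.)

    # Rough sketch: pull ingestion and per-target sample counts out of
    # Prometheus via its HTTP query API.  The URL is a placeholder.
    import requests

    PROM = "http://prometheus.example.org:9090"

    def query(expr):
        r = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
        r.raise_for_status()
        return r.json()["data"]["result"]

    # overall ingestion rate, in samples per second
    print(query("rate(prometheus_tsdb_head_samples_appended_total[5m])"))

    # how many scrape targets we have, and the most chatty ones
    print(query("count(up)"))
    print(query("topk(5, scrape_samples_scraped)"))

    # how many targets give us fewer than 50 samples per scrape
    print(query("count(scrape_samples_scraped < 50)"))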
(Our scrape rates vary. Host agents and the Cloudflare eBPF exporter are scraped every 15 seconds, we ping most machines every 30 seconds, the ZFS exporter is scraped every 30 seconds, most other Blackbox checks happen every 89 seconds, and a bunch of other scrape targets are scraped every 60 or 59 seconds (and I should probably regularize that).)
At the moment we're pulling host agent information from 143 machines, doing Blackbox ping checks against 232 different targets, performing 375 assorted Blackbox checks other than pings (a lot of them SSH checks), and scraping a smaller assortment of other Prometheus exporters for various things. Every server with a host agent also gets at least two Blackbox checks (an ICMP ping and an SSH connection), but as you can see from the numbers, we ping and check other things too.
We've grown to 158 alert rules, all running at the default 15-second rule evaluation interval. The evaluation time for all of these alert rules appears to be relatively trivial.
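(Prometheus exports metrics about its own rule evaluation, so you can check this sort of thing for your own setup. Reusing the little query() helper from the sketch above, the queries I have in mind look something like this.)

    # total number of rules across all rule groups
    print(query("sum(prometheus_rule_group_rules)"))

    # how long each rule group's last evaluation took, slowest first
    print(query("sort_desc(prometheus_rule_group_last_duration_seconds)"))

    # evaluation time as a fraction of each group's evaluation interval
    print(query("prometheus_rule_group_last_duration_seconds"
                " / prometheus_rule_group_interval_seconds"))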
The server hosting Prometheus has six CPUs and typically runs at about 3% user CPU usage. Its average inbound bandwidth is about 800 Kbytes/sec. Somewhat to my surprise, this CPU usage includes some amount of Prometheus queries (beyond rule evaluation), because it looks like some people do routinely look at Grafana dashboards and thus trigger Prometheus queries (although I believe it's almost all for recent data, with queries for historical data being relatively rare).
None of this is necessarily a guide to what anyone else could do with Prometheus, or how many resources it would take to handle a particular environment. One thing that may make our environment unusual is that since we use physical hardware, we don't have hosts coming and going on a regular basis and churning labels such as 'instance'. Using Prometheus in the cloud, with a churn of cloud host instances, might have different resource needs.
(But I do feel it's an indication that you don't need a heavy duty server to handle a reasonable Prometheus environment.)