The size of our Prometheus setup as of June 2024

June 11, 2024

At this point we've been running our Prometheus setup since November 21st, 2018, and we have still not expired any metrics, so we have full resolution metrics data right back to the beginning. Three years ago I wrote about how big our setup was as of May 2021, and since someone on the Prometheus mailing list recently asked how big a Prometheus setup you could run, I'm going to do an update on our numbers.

Our core Prometheus server is still a Dell 1U server, with 64 GB of RAM because we could put that much in and it's cheap insurance against high memory usage. The Prometheus time series database (TSDB) is on a mirrored pair of 20 TB HDDs (in 2021 we used 4 TB HDDs, but we've since run out of space and moved to larger drives). At the moment we have what 'du -h' says is 6.3 TB of disk space used. Disk space usage has been rising steadily over time; in 2019, 20 days of metrics took 35 GB, and these days they take 104 GB.

(In these two 20-day chunk directories I'm looking at, in 2019 we had 50073266852 samples for 465988 series, and in 2024 we had 130974619588 samples for 1460523 series, which we can broadly approximate as about triple.)
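
(As a rough back of the envelope sketch, assuming the 'du' figures above are binary gigabytes, those chunk directory numbers work out to well under a byte of disk space per sample:)

    # Rough on-disk cost per sample for the two 20-day chunk directories
    # above. Assumes the 'du' figures are binary gigabytes; the answer
    # barely changes if they're decimal.
    GIB = 2 ** 30

    for year, bytes_used, samples in [
        ("2019", 35 * GIB, 50_073_266_852),
        ("2024", 104 * GIB, 130_974_619_588),
    ]:
        print(f"{year}: {bytes_used / samples:.2f} bytes per sample on disk")

This comes out to roughly 0.75 bytes per sample in 2019 and 0.85 bytes per sample in 2024, which is a big part of why years of full resolution metrics still fit on a mirrored pair of HDDs.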

These days, we're running at an ingestion rate of about 73,000 samples a second, scraped from 947 different scrape targets; the largest single category of targets continues to be Blackbox probes. Our largest single source of samples is no longer Pushgateway (it's now way down the list) but instead the ZFS exporter we use to get highly detailed ZFS metrics; our most chatty ZFS fileserver generates 95,000 samples per scrape from it. Apart from that, the most chatty sample sources are the Prometheus host agents on some of our machines, which can generate up to 19,000 samples per scrape, primarily because some of our servers have a lot of CPUs. About 700 of our scrape targets generate fewer than 50 samples per scrape.

(Our scrape rates vary. Host agents and the Cloudflare eBPF exporter are scraped every 15 seconds, we ping most machines every 30 seconds, the ZFS exporter is scraped every 30 seconds, most other Blackbox checks happen every 89 seconds, and a bunch of other scrape targets are scraped every 60 or every 59 seconds (and I should probably regularize that).)
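
(As another illustrative sketch, assuming the 95,000 and 19,000 figures above are per-scrape counts, combining them with these scrape intervals gives per-second rates:)

    # Turn 'samples per scrape' plus a scrape interval into samples per
    # second, using the figures mentioned earlier (ZFS exporter scraped
    # every 30 seconds, host agents every 15 seconds). Illustrative only.
    sources = [
        ("chattiest ZFS fileserver (ZFS exporter)", 95_000, 30),
        ("largest host agent", 19_000, 15),
    ]

    for name, per_scrape, interval in sources:
        print(f"{name}: ~{per_scrape / interval:,.0f} samples/second")

That single ZFS fileserver is thus a bit over 3,000 samples a second out of our roughly 73,000 samples a second total.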

At the moment we're pulling host agent information from 143 machines, doing Blackbox ping checks for 232 different targets, performing 375 assorted Blackbox checks other than pings (a lot of them SSH checks), and scraping a smaller assortment of other Prometheus exporters for various things. Every server with a host agent also gets at least two Blackbox checks (an ICMP ping and an SSH connection), but as you can see from the numbers, we ping and check other things too.

We've grown to 158 alert rules, all running at the default 15 second rule evaluation interval. The evaluation time of all of these alert rules appears to be trivial.

The server hosting Prometheus has six CPUs and typically runs at about 3% user CPU usage. Average inbound bandwidth is about 800 Kbytes/sec. Somewhat to my surprise, this CPU usage includes some amount of Prometheus queries (outside of rule evaluation), because it looks like some people do routinely look at Grafana dashboards and thus trigger Prometheus queries (although I believe it's all for recent data, and queries for historical data are relatively rare).

None of this is necessarily a guide to what anyone else could do with Prometheus, or how much resources it would take to handle a particular environment. One of the things that may make our environment unusual is that since we use physical hardware, we don't have hosts coming and going on a regular basis and churning labels like 'instance'. Using Prometheus in the cloud, with a churn of cloud host instances, might have different resource needs.

(But I do feel it's an indication that you don't need a heavy duty server to handle a reasonable Prometheus environment.)


Comments on this page:

With 73,000 samples per second and an 800 KiB/s bandwidth, this is approximately 11 bytes per sample used for transfer, right (including all overhead like TLS etc. and probably transfer compression)? This sounds like bandwidth isn't anything to think about even for a large Prometheus setup.

The largest I've run was 3x Prometheus server pods, each using all of the 768 GiB of memory on a 24xlarge AWS EC2 instance. The HA standard there was 3x, one per AWS availability zone. EBS volumes maxed out at 16 TiB (and K8s got into an unrecoverable error if you requested more). The setup scraped metrics from microservice pods spread across ~1000 nodes with 90 days of retention. We had smaller separate environments as well; this was just prod. There was a lot of buy-in from all the teams there, each defining their own alerts for their own metrics.

We migrated first to Cortex and then to VictoriaMetrics. VM was much easier to operate and scale. The motivation for the migration was that we couldn't 2x the memory anymore if we actually needed more than 768 GiB for Prometheus.

My current place is a little more modest in scale. We run 2x Prometheus with a 2 hour retention window feeding into Thanos with 30 days of retention. The alerts are more focused on infrastructure metrics than business metrics here.

You said:

[...] most other Blackbox checks happen every 89 seconds, and a bunch of other scrape targets are every 60 seconds or every 59 seconds.

Care to comment about the x9 periods? Sounds like at some point it would be harder (but not by much) to correlate stuff if the metrics do not align.

By cks at 2024-06-13 11:55:38:

The odd times are prime numbers, which I picked so that they can't possibly wind up aligned with some every-15/30/60/etc-second thing on the target machines. A classic problem in data collection is that, for example, you do something once a minute from cron and you also collect some metric once a minute, and the two wind up always happening at the same time, so you get a distorted picture of the metric because you always measure it at the same point relative to the cron job. This is less likely in Prometheus even for regular intervals like '15 seconds', but prime numbers eliminate it entirely, and things like Blackbox checks are not high resolution anyway.
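
(To make the alignment issue concrete, here is a small sketch, under the simplifying assumption that both things start at the same moment, of how often a fixed-interval scrape lands on the same second as a once-a-minute cron job:)

    # The coincidence period of a scrape interval and a once-a-minute cron
    # job is the least common multiple of the two intervals (assuming both
    # start at the same moment, which real Prometheus scrapes don't).
    from math import lcm

    CRON_INTERVAL = 60

    for scrape_interval in (15, 30, 60, 59, 89):
        period = lcm(scrape_interval, CRON_INTERVAL)
        print(f"scrape every {scrape_interval:>2}s: "
              f"coincides with the cron job every {period}s")

A 60 second scrape would line up with the cron job every single time, while an 89 second scrape would only line up every 5340 seconds, and in practice not even then.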

(Also, two separate Prometheus checks are very unlikely to happen at the same time even if they are the same interval, because Prometheus deliberately spaces out the start time of scraping metrics from a particular target across the interval. So you'll never have perfect metric alignment anyway.)
