The size of our Prometheus setup as of May 2021
At this point we've been running our Prometheus setup in production since November 21st, 2018. This start date matters because one of our peculiarities is that we have yet to expire any metrics; we've kept everything back to the start of production, which means that we now have about two and a half years of accumulated metrics. We're still running Prometheus on the same hardware from the end of 2019, with its database stored on a mirrored pair of 4 TB drives. So here are some numbers about that and other aspects of our setup, partly because I want to record them now.
At the moment, we have 2.1 TB of Prometheus database (out of what
'df -h' rounds to 3.6 TB in the filesystem), and we're consuming
about 2.8 GB of additional space each day. We should last at least
another year at this rate, but not two. At that point we will
probably upgrade the server's database disks to a pair of larger
HDs; I'd guess that we'll aim for 12 TB or 14 TB HDs to have many
more years of room. Hard drives have provided enough performance
for us, and as usual it probably helps that we almost never look
very far back in time from right now or scan large ranges.
(Prometheus will query back to the start of our database without particular problems, although loading several years of data is not necessarily the fastest thing if you make a query with a big range.)
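The "at least another year, but not two" estimate is just arithmetic on the figures above, and can be sketched like this. This assumes growth stays constant at the current daily rate and treats 1 TB as 1000 GB; 'df -h' actually reports binary units, so the result is only approximate. The function name is made up for illustration.

```python
# Rough disk runway estimate using the figures quoted in the text.
# Assumptions: constant daily growth, and decimal units (1 TB = 1000 GB).

def runway_days(capacity_tb: float, used_tb: float, growth_gb_per_day: float) -> float:
    """Days until the filesystem fills, assuming constant daily growth."""
    free_gb = (capacity_tb - used_tb) * 1000.0
    return free_gb / growth_gb_per_day

days = runway_days(capacity_tb=3.6, used_tb=2.1, growth_gb_per_day=2.8)
years = days / 365.0
print(f"about {days:.0f} days, or {years:.1f} years")
```

With these inputs the runway works out to a bit under a year and a half. Since the ingestion rate has been rising over time, the real figure is probably somewhat shorter.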
These days we're running at an ingestion rate of about 39,000 samples a second (this can be computed from the prometheus_tsdb_head_samples_appended_total metric). When we started this was around 21,000 samples a second, but it rose as we added more metrics and more machines. We currently scrape 720 sample sources (although a bunch of these are individual probes on Blackbox). Pushgateway is our largest single source of samples, currently running around 10,000 samples per scrape, followed by a number of host agents (almost entirely because of extra metrics we've added). Over 590 of our 720 sources generate fewer than 200 samples per scrape.
(This comes from counting up and looking at scrape_samples_scraped.)
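For anyone who wants to pull the same sort of numbers from their own server, here is a minimal sketch using the Prometheus HTTP API's instant query endpoint. The PromQL expressions are my guess at one reasonable way to get these figures, not necessarily the exact queries used for this entry; the 5 minute rate window and the helper names are arbitrary choices for illustration.

```python
from urllib.parse import urlencode

# PromQL expressions for the numbers discussed above. The [5m] window is an
# arbitrary choice; sum() collapses the result to a single series.
QUERIES = {
    "samples_per_second": "sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))",
    "scrape_sources": "count(scrape_samples_scraped)",
    "small_sources": "count(scrape_samples_scraped < 200)",
}

def instant_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

def scalar_from_response(body: dict) -> float:
    """Extract the value from an instant-query result holding exactly one series."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    result = body["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected one series, got {len(result)}")
    # Each vector element's "value" is [unix_timestamp, "value as a string"].
    return float(result[0]["value"][1])
```

To use it, fetch instant_query_url("http://localhost:9090", QUERIES["small_sources"]) with whatever HTTP client you like (the server address here is an assumption), parse the body as JSON, and hand the result to scalar_from_response().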
Right now, those 720 scrape sources are mostly 119 host agents, 209 Blackbox ICMP pings, 301 other Blackbox checks, 52 script_exporter checks, and then everything else is in small quantities (although not necessarily small in samples; we only have one Pushgateway, but as mentioned it's the single largest source of samples). Every server with a host agent also gets at least two Blackbox checks (ICMP ping and an SSH connection), but we ping and check other things too as you can see from the numbers.
We currently have 111 alert rules (from summing up prometheus_rule_group_rules), all running at the default 15 second rule evaluation rate. Our rule evaluation time appears to be trivial, as I'd expect since we have almost no alert rules that look back more than a small amount of time. We don't have any recording rules; I used one or two back in the very beginning, then got rid of them once I understood more about what I was doing.
PS: On our normal Ubuntu servers, between 65% and 82% of the host agent metrics are actually our metrics, not node_exporter. This is because we generate our own relatively detailed NFS client metrics for all NFS mounts from our fileservers. On a server with full NFS mounts, which is most of them, this creates about 4,000 time series. This is still a lot fewer than the node_exporter mountstats collector would create if we enabled it.