2021-05-15
Downsampling your metrics data is a compromise (what we could call a 'hack')
In a comment on my entry on the size of our Prometheus setup, mappu raised an interesting issue:
I've always thought the default infinite-retention of Prometheus' data stores to be ridiculous when coming from RRD-based monitoring solutions like Munin.
It's not useful to keep 15-second resolution indefinitely, RRD is the right thing to do, and it's a shame it's such a weird hack to get that behaviour on top of Prometheus.
I've come to disagree with the idea of downsampling your data by default. Today, if Prometheus offered me the option, I wouldn't use it; I would keep full-resolution data for as long as possible. The core reason is the same reason that statistics should be gathered in raw form instead of as sample-to-sample deltas: you can downsample on the fly to go from high-resolution to low-resolution data, but you can never upsample. Once you discard your high-resolution data, it's gone for good. So by default you should avoid losing data for as long as possible.
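As an illustration, here's a sketch of on-the-fly downsampling in PromQL (node_load1 is just a stand-in metric, not one this entry actually uses); you can always compute this from full-resolution data, but nothing can recreate the 15-second samples from the averaged version:

    # Average full-resolution samples into five-minute points.
    avg_over_time(node_load1[5m])

    # The same idea as a subquery: a day of data reduced to one
    # averaged point every five minutes.
    avg_over_time(node_load1[5m])[1d:5m]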
The two big reasons to use downsampled data are that it doesn't need as much disk space and you don't have to scan and process as much data when you're looking at it. But both are operational issues. If you had infinite space and infinite processing capacity, neither would matter and it would work just as well to keep high resolution data. So in the pragmatic world, I believe that you should default to keeping your metrics data in its original form until you're forced to change that.
(Of course this is where someone notes that metrics data is already downsampled from the original, which is either extremely fine-grained moment-to-moment statistics or observability traces. But everything is a compromise. We use the default Prometheus scrape interval of 15 seconds for our host metrics, not because I thought it through carefully but because it's the default and it seems to work okay for us.)
I can't definitively say we've ever needed our full-resolution data from a year or two ago in order to solve a problem. But at the very least it's reassuring to me that I have that fine-grained data, if only so that if I ever need to run a detailed comparison between performance now and performance a year ago, I can be confident that I have just as good data for then as I do for now.
PS: I agree that it would be nice if Prometheus had native support for downsampling data, instead of forcing you into various hacks in order to implement it externally (hacks that I'm not sure work very well if you want to set them up after you've accumulated a lot of historical data). But I think of this as a separate issue from whether you want to downsample by default.
The size of our Prometheus setup as of May 2021
At this point we've been running our Prometheus setup in production since November 21st 2018. This start date matters because one of our peculiarities is that we have yet to expire any metrics; we've kept everything back to the start of production, which means that we now have about two and a half years of accumulated metrics. We're still running Prometheus on the same hardware from the end of 2019, with its database stored on a mirrored pair of 4 TB drives. So here's some numbers about that and other aspects of our setup, partly because I want to record them now.
At the moment, we have 2.1 TB of Prometheus database (out of what 'df -h' rounds to 3.6 TB in the filesystem), and we're consuming about 2.8 GB of additional space each day. We should last at least another year at this rate, but not two. At that point we will probably upgrade the server's database disks to a pair of larger HDs; I'd guess that we'll aim for 12 TB or 14 TB HDs to have many more years of room. Hard drives have provided enough performance for us, and as usual it probably helps that we almost never look very far back in time from right now or scan large ranges.
(Prometheus will query back to the start of our database without particular problems, although loading several years of data is not necessarily the fastest thing if you make a query with a big range.)
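If you want to project that headroom from Prometheus itself instead of from 'df', here's a rough sketch using the TSDB's own prometheus_tsdb_storage_blocks_bytes gauge (which covers the persisted blocks but not the WAL or the in-memory head, so it undercounts a little):

    # Projected on-disk block size a year from now, based on the
    # growth rate over the past week.
    predict_linear(prometheus_tsdb_storage_blocks_bytes[1w], 365 * 24 * 3600)

    # Current daily growth in GB, to compare with 'about 2.8 GB a day'.
    deriv(prometheus_tsdb_storage_blocks_bytes[1w]) * 86400 / 1e9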
These days we're running at an ingestion rate of about 39,000 samples a second (this can be computed from the prometheus_tsdb_head_samples_appended_total metric). When we started this was around 21,000 samples a second, but it rose as we added more metrics and more machines. We currently scrape 720 sample sources (although a bunch of these are individual probes through Blackbox). Pushgateway is our largest single source of samples, currently running around 10k samples per scrape, followed by a number of host agents (almost entirely because of extra metrics we've added). Over 590 of our 720 sources generate less than 200 samples per scrape.
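For reference, the ingestion rate is just the usual rate() over that counter; nothing here is specific to our setup:

    # Samples ingested per second, averaged over the past five minutes.
    rate(prometheus_tsdb_head_samples_appended_total[5m])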
(The per-source numbers come from counting up our scrape sources and looking at scrape_samples_scraped.)
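A sketch of the kind of queries involved, using only the standard per-target scrape metadata:

    # Total number of scrape sources.
    count(scrape_samples_scraped)

    # How many of them return fewer than 200 samples per scrape.
    count(scrape_samples_scraped < 200)

    # The biggest individual sources of samples.
    topk(10, scrape_samples_scraped)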
Right now, those 720 scrape sources are mostly 119 host agents, 209 Blackbox ICMP pings, 301 other Blackbox checks, and 52 script_exporter checks; everything else is in small quantities (although not necessarily small in samples; we only have one Pushgateway, but as mentioned it's the single largest source of samples). Every server with a host agent also gets at least two Blackbox checks (an ICMP ping and an SSH connection), but we ping and check other things too, as you can see from the numbers.
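One way to get this sort of breakdown is to count targets by job; a minimal sketch, with job names like 'node' and 'blackbox-ping' that are assumptions rather than our actual configuration:

    # Number of scrape targets per job ('up' exists for every target).
    count by (job) (up)

    # Just the Blackbox ICMP probes, assuming a 'blackbox-ping' job name.
    count(up{job="blackbox-ping"})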
We currently have 111 alert rules (from summing up prometheus_rule_group_rules), all running at the default 15-second rule evaluation interval. Our rule evaluation time appears to be trivial, as I'd expect, since we have almost no alert rules that look back more than a small amount of time. We don't have any recording rules; I used one or two back in the very beginning, then got rid of them once I understood more about what I was doing.
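Both numbers are easy to get from Prometheus itself; a minimal sketch using the standard rule group metrics:

    # Total rules across all rule groups (alerting plus recording,
    # which for us means all alerting).
    sum(prometheus_rule_group_rules)

    # How long each rule group took on its most recent evaluation.
    prometheus_rule_group_last_duration_seconds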
PS: On our normal Ubuntu servers, between 65% and 82% of the host agent metrics are actually our own metrics, not node_exporter's. This is because we generate our own relatively detailed NFS client metrics for all NFS mounts from our fileservers. On a server with the full set of NFS mounts, which is most of them, this creates about 4,000 time series. That's still a lot fewer than the node_exporter mountstats collector would create if we enabled it.
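A sketch of how you might measure that split, assuming our NFS client metrics share a common name prefix and the host agents are scraped under a 'node' job; both the 'cslab_nfs_' prefix and the job name here are hypothetical stand-ins, not our real names:

    # Fraction of each host's time series that come from our own NFS
    # client metrics instead of stock node_exporter collectors.
    # The 'cslab_nfs_' prefix and job="node" are assumptions.
      count by (instance) ({job="node", __name__=~"cslab_nfs_.*"})
    /
      count by (instance) ({job="node"})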