Downsampling your metrics data is a compromise (what we could call a 'hack')

May 15, 2021

In a comment on my entry on the size of our Prometheus setup, mappu raised an interesting issue:

I've always thought the default infinite-retention of Prometheus' data stores to be ridiculous when coming from RRD-based monitoring solutions like Munin.

It's not useful to keep 15-second resolution indefinitely, RRD is the right thing to do, and it's a shame it's such a weird hack to get that behaviour on top of Prometheus.

I've come to disagree with the idea of downsampling your data by default. Today, even if Prometheus offered me the possibility, I would avoid using it for as long as possible. The core reason is the same reason that statistics should be gathered in raw form instead of as sample-to-sample deltas: you can down-sample on the fly to go from high resolution to low resolution data, but you can never up-sample. Once you discard your high-resolution data, it's gone for good. So by default you should avoid losing data for as long as possible.
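To make the asymmetry concrete, here is a minimal sketch in Python. It is not anything Prometheus itself does; the `downsample` helper, the five-minute bucket width, and the sample data (including the spike) are all made up for illustration.

```python
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Hypothetical helper for illustration; averaging is only one of
    several possible ways to consolidate samples.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % bucket_seconds)
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Made-up data: five minutes of 15-second samples with one short spike.
raw = [(t, 1.0) for t in range(0, 300, 15)]
raw[7] = (105, 50.0)

# Going from high resolution to low resolution is easy to do on demand.
coarse = downsample(raw, 300)

# The raw data still shows the spike; the five-minute average has
# smeared it into a single modest value, and nothing in 'coarse'
# can tell you when the spike happened or how high it was.
print(max(value for _, value in raw))
print(coarse)
```

If you keep only `coarse`, there is no function you can write that gets `raw` back. If you keep `raw`, you can produce `coarse` (or any other resolution) whenever you want it.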

The two big reasons to use downsampled data are that it doesn't need as much disk space and you don't have to scan and process as much data when you're looking at it. But both are operational issues. If you had infinite space and infinite processing capacity, neither would matter and it would work just as well to keep high resolution data. So in the pragmatic world, I believe that you should default to keeping your metrics data in its original form until you're forced to change that.
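As a rough sketch of the scale involved, the following Python works out how many samples a single time series accumulates in a year at two scrape intervals. The five-minute interval is just a stand-in for 'downsampled', and the bytes-per-sample figure is an assumed placeholder rather than a measurement of our setup; plug in whatever your own TSDB actually reports.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Assumption for illustration only, not a measured number.
ASSUMED_BYTES_PER_SAMPLE = 2


def samples_per_year(interval_seconds):
    """Number of samples one series collects in a (non-leap) year."""
    return SECONDS_PER_YEAR // interval_seconds


full = samples_per_year(15)      # 15-second scrape interval
coarse = samples_per_year(300)   # hypothetical five-minute resolution

print(full, coarse, full // coarse)
print(full * ASSUMED_BYTES_PER_SAMPLE, "bytes per series per year (assumed)")
```

The ratio between the two is what downsampling buys you in both disk space and data scanned per query, which is why it only starts to matter once you're actually short of one or the other.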

(Of course this is where one side of things notes that metrics data is already downsampled from the original, which is either extremely fine-grained moment to moment statistics or observability traces. But everything is a compromise. We use the default Prometheus scrape interval of 15 seconds for our host metrics, not because I thought it through carefully but because it's the default and it seems to work okay for us.)

I can't definitively say we've ever required our full resolution data from a year or two years ago in order to solve a problem. But at the very least it's reassuring to me that I have that fine-grained data, if only so that if I ever need to run a detailed comparison between performance now and performance a year ago, I can be confident I have just as good data for then as I do for now.

PS: I agree that it would be nice if Prometheus had native support for downsampling data, instead of forcing you into various hacks in order to implement it externally (hacks that I'm not sure work very well if you want to set them up after you've accumulated a lot of historical data). But I think of this as a separate issue from whether you want to downsample by default.

