How far back we want our metrics to go depends on what they're for

January 9, 2024

I mentioned recently in passing (in this entry) that our Prometheus metrics system is currently set to keep metrics for 'ten years' (3650 days, which is not quite ten years given leap days) and we sort of intend to keep them forever. That got me to thinking about how sensible this is, and how much usage we have for metrics that go back that far (we're already keeping over five years of metrics). The best answer I can come up with is that it depends on what the metrics are for and about.
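For concreteness, this retention is set through Prometheus's TSDB retention option. A minimal sketch of the relevant bit of the command line, assuming the stock flag and leaving out everything else:

    prometheus --config.file=prometheus.yml \
        --storage.tsdb.retention.time=3650d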

The obvious problem with metrics about system performance is that our systems change over time. As we turn over hardware for the same hosts, their memory gets bigger, their CPUs get faster, their disks improve, their networking gets better, and so on. When we move to our new fileserver hardware with much more memory, a lot of the performance metrics for those machines will probably look different, and in any case they're going to have a different disk topology. To some extent this makes it dangerous to compare 'the server called X' with how it was two years ago, because two years ago it might have been running on different hardware (and sometimes it might have been doing somewhat different things; we move services between servers every so often).

On a broader level, it feels not too useful to compare current servers against their past selves unless we could plausibly return to their past selves. For example, we can't return to the Ubuntu 18.04 version of any of our servers, because 18.04 is out of support. If the 18.04 version of server X performed much better than the 22.04 or 20.04 version, well, we're still stuck with whatever we get on the current versions. However, there's some use in knowing that performance has gone down, if we can see that.

Some things we collect metrics for stay fixed for much longer, though; a prime example is machine room temperature metrics (we've been in our machine rooms for a long time). Having a long history of temperature metrics for a particular machine room could be useful to put numbers (or at least visualizations) on slowly worsening conditions that are only clear if we're comparing across years, possibly many. Of course there are various possible explanations for a long term worsening of temperatures, such as there being more servers in a machine room now, but at least we can start looking.
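As an illustration of the kind of comparison I mean, here is a sketch of a PromQL expression that compares a day's average temperature with the same period a year earlier (the metric name is a made-up stand-in):

    avg_over_time(machineroom_temp_celsius[1d])
      - avg_over_time(machineroom_temp_celsius[1d] offset 1y)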

Certain sorts of usage and volume metrics are also potentially useful over long time scales. I don't know if we'll ever want to look at a half-decade or longer plot of email volume, but I can imagine it coming up. This only works for some usage metrics, because with others too many things are changing in the environment around them. Will we ever have a use for a half-decade plot of VPN usage? I suspect not because so much that could affect that has changed over half a decade (and likely will change in the future).
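If we ever do want such a plot, the query side is simple enough; a sketch of per-day email volume, using a made-up counter name, would be something like:

    increase(mail_messages_received_total[1d])

graphed over however many years of range we have at that point.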

(My current feeling is that a really long metrics history isn't going to be all that useful for capacity planning for us, simply because I don't think we have anything that has such a consistent growth rate over half a decade or a decade. The past few years? Sure. The past half decade? That's getting chancy because a lot has changed in local usage patterns, never mind world events.)
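For what it's worth, the usual PromQL tool for this kind of growth extrapolation is predict_linear(), which only needs a relatively recent range of data, not years of it. A sketch, with the filesystem label as a stand-in:

    predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[30d], 365 * 24 * 3600)

This extrapolates a year ahead from the last 30 days of samples.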

All of this is irrelevant to us today, since Prometheus's current retention policies are all or nothing. If we wanted to keep only some metrics for an extended period of time, we'd have to somehow copy them off to elsewhere (possibly downsampling them in the process). But by the time we start running into limits on our normal Prometheus server, Prometheus may well have developed some additional features here.
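One standard mechanism for copying selected metrics off to elsewhere is remote_write with write_relabel_configs, which can forward just the metrics you pick out to a second system; a sketch, with a hypothetical endpoint and metric name:

    remote_write:
      - url: "http://longterm.example.org:9201/write"
        write_relabel_configs:
          - source_labels: [__name__]
            regex: "machineroom_temp_celsius"
            action: keep

This forwards data as it comes in rather than downsampling it; any downsampling would have to happen on the receiving end.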

PS: I suspect that we already have much longer Prometheus metrics retention than is at all common. Someday this may get us into trouble, as we're probably hitting code conditions that aren't well tested.


Comments on this page:

By anarcat at 2024-01-10 10:52:16:

I wonder: how do you actually implement retention now? beefy disks?

We keep only a year of samples here, and it's growing quite a bit. We only have two VMs for Prometheus right now, so we could dedicate more resources to it, but we're at 100G right now. I guess we could just shift that by an order of magnitude and hit a terabyte, but it seems rather stiff for metrics we're rarely going to look at, especially at the level of detail our current scrape interval (1 minute) provides.

My current thinking is that we might have a primary Prometheus server that handles alerting and short-term, high-frequency scraping (say 15-30s) and a secondary server that would extract those metrics at a much lower frequency (say every 5-10m) and keep those potentially eternally... have you considered such a setup?

By cks at 2024-01-10 12:21:52:

Our retention is done with beefy disks, which is easy for us because we're running Prometheus on a physical server. We started with two mirrored 4 TB HDDs and moved to two mirrored 20 TB HDDs when the 4 TB ones got full enough. We haven't considered running multiple servers at different sampling intervals, partly because that would mean finding a second server and a second set of data disks for it.

(Prometheus also can't scrape things too slowly; you need to scrape faster than every five minutes to keep samples from going stale.)
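For reference, the two-server setup being described is often done with Prometheus federation, where the secondary server scrapes the primary's /federate endpoint. A rough sketch of the secondary's scrape configuration, with a hypothetical host name and an interval kept under the staleness limit:

    scrape_configs:
      - job_name: 'federate-longterm'
        scrape_interval: 4m
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~".+"}'
        static_configs:
          - targets: ['primary-prometheus.example.org:9090']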
