How far back we want our metrics to go depends on what they're for
I mentioned recently in passing (in this entry) that our Prometheus metrics system is currently set to keep metrics for 'ten years' (3650 days, which is not quite ten years once you count leap days), and we sort of intend to keep them forever. That got me thinking about how sensible this is and how much use we have for metrics that go back that far (we're already keeping over five years of metrics). The best answer I can come up with is that it depends on what the metrics are for and about.
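(For concreteness, this retention period is set through Prometheus's command line flag for time-based retention, so our setting amounts to something like the following sketch of the server's arguments:

    --storage.tsdb.retention.time=3650d

)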
The obvious problem with metrics about system performance is that our systems change over time. As we turn over hardware for the same hosts, their memory gets bigger, their CPUs get faster, their disks improve, their networking gets better, and so on. When we move to our new fileserver hardware with much more memory, a lot of the performance metrics for those machines will probably look different, and in any case they're going to have a different disk topology. To some extent this makes it dangerous to compare 'the server called X' with how it was two years ago, because two years ago it might have been running on different hardware (and sometimes it might have been doing somewhat different things; we move services between servers every so often).
On a broader level, it doesn't feel very useful to compare current servers against their past selves unless we could plausibly return to those past selves. For example, we can't return to the Ubuntu 18.04 version of any of our servers, because 18.04 is out of support. If the 18.04 version of server X performed much better than the 20.04 or 22.04 version, well, we're still stuck with whatever we get on the current versions. However, there's some use in simply knowing that performance has gone down, if we can see that.
Some things we collect metrics for stay fixed for much longer, though; a prime example is machine room temperature metrics (we've been in our machine rooms for a long time). Having a long history of temperature metrics for a particular machine room could be useful to put numbers (or at least visualizations) on slowly worsening conditions that are only clear if we're comparing across years, possibly many. Of course there are various possible explanations for a long term worsening of temperatures, such as there being more servers in a machine room now, but at least we can start looking.
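(As an illustration of the sort of cross-year query a long history enables, here's a hypothetical PromQL expression for 'how much warmer was this past week than the same week a year ago'; the metric name machineroom_temperature_celsius is made up, since our real temperature metrics are named differently:

    avg_over_time(machineroom_temperature_celsius[1w])
      - avg_over_time(machineroom_temperature_celsius[1w] offset 1y)

)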
Certain sorts of usage and volume metrics are also potentially useful over long time scales. I don't know if we'll ever want to look at a half-decade or longer plot of email volume, but I can imagine it coming up. This only works for some usage metrics, because with others too many things are changing in the environment around them. Will we ever have a use for a half-decade plot of VPN usage? I suspect not, because so much that could affect it has changed over half a decade (and likely will change in the future).
(My current feeling is that a really long metrics history isn't going to be all that useful for capacity planning for us, simply because I don't think we have anything that has such a consistent growth rate over half a decade or a decade. The past few years? Sure. The past half decade? That's getting chancy because a lot has changed in local usage patterns, never mind world events.)
All of this is irrelevant to us today, since Prometheus's current retention policies are all or nothing. If we wanted to keep only some metrics for an extended period of time, we'd have to somehow copy them off somewhere else (possibly downsampling them in the process). But by the time we start running into limits on our normal Prometheus server, Prometheus may well have developed some additional features here.
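(If we ever did this, the conventional Prometheus building block for downsampling is a recording rule evaluated at a coarse interval, with remote write or an external system such as Thanos to hold the results for longer than the main server does. A minimal sketch of such a rule, with made-up names:

    groups:
      - name: longterm-downsample
        interval: 1h
        rules:
          - record: machineroom_temperature_celsius:avg_1h
            expr: avg_over_time(machineroom_temperature_celsius[1h])

This only creates the downsampled series; getting them 'off to elsewhere' is the part Prometheus itself doesn't do for you today.)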
PS: I suspect that we already have much longer Prometheus metrics retention than is at all common. I suspect that someday this may get us into trouble, as we're probably hitting code conditions that aren't well tested.