Using prime numbers for our Prometheus scrape intervals

June 13, 2024

When I wrote about the current size of our Prometheus setup, I mentioned that some of our Prometheus scrape intervals (how often metrics are collected from a metrics source) were unusual-looking numbers like 59 seconds and 89 seconds, instead of conventional ones like 15, 30, or 60 seconds. These intervals are prime numbers, and we use them deliberately so that our metrics collection and checks can't become synchronized with some regular process that happens, for example, once a minute.

Prometheus already scatters the start times of metrics collection within the scrape interval, so synchronization isn't necessarily very likely, but using prime numbers adds an extra level of insurance. At the same time, using prime numbers that are very close to exact times like '60 seconds' or '90 seconds' means that we have relatively good odds of periodically doing our check at exactly the start of a minute, or a 30 second interval, or the like. A 59-second interval lands one second earlier in the minute on each successive scrape, so over roughly an hour it passes through every second of the minute. If there is something that happens at :00 or :30 or the like, we'll probably observe it sooner or later (although we may not understand what we're seeing).
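The drift can be checked with a little modular arithmetic. The following is only a sketch under idealized assumptions: scrapes happen at exact whole-second times with no jitter, and the function name and the starting offset of 17 seconds are made up for illustration (Prometheus picks its own offset for each target).

```python
def offsets_within_period(interval_s, period_s, start_offset_s=0):
    """Return the distinct offsets (seconds into each period) at which scrapes land.

    Simulates period_s consecutive scrapes, which is enough to complete a
    full cycle when the interval and the period share no common factor.
    """
    return sorted(
        {(start_offset_s + n * interval_s) % period_s for n in range(period_s)}
    )


# A 60-second interval is stuck at whatever offset it started with.
fixed = offsets_within_period(60, 60, start_offset_s=17)

# A 59-second interval visits every second of the minute, including :00.
prime = offsets_within_period(59, 60, start_offset_s=17)

print(fixed)        # [17]
print(len(prime))   # 60
print(0 in prime)   # True
```

Strictly speaking, what matters here is that the interval shares no common factor with the period of the regular process. A prime interval gets that for free against any period that isn't a multiple of the prime itself.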

My feeling is that this irregularity is less important in things that provide cumulative metrics (like most of the metrics from the Prometheus host agent) and more important for 'point in time' metrics of what the current state is, which generally includes Blackbox checks. Cumulative metrics will capture both spikes and quiet periods, but point in time metrics may be distorted by only being collected at busy times (or only at quiet times).
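A toy simulation can make the difference concrete. Everything in it is hypothetical: the load pattern (a burst of 100 events in the first second of every minute, 1 event per second otherwise), the function names, and the idealized whole-second scrape times are all assumptions for illustration, not measurements from a real system.

```python
def activity(t):
    """Hypothetical load: a burst at the start of every minute, quiet otherwise."""
    return 100 if t % 60 == 0 else 1


def gauge_samples(interval_s, start_s, duration_s):
    """Point-in-time view: the instantaneous activity seen at each scrape."""
    return [activity(t) for t in range(start_s, duration_s, interval_s)]


def counter_rate(interval_s, start_s, duration_s):
    """Cumulative view: average events per second between first and last scrape."""
    times = list(range(start_s, duration_s, interval_s))

    def total(t):
        # Value of a running counter of all events before time t.
        return sum(activity(s) for s in range(t))

    return (total(times[-1]) - total(times[0])) / (times[-1] - times[0])


hour = 3600
print(set(gauge_samples(60, 0, hour)))    # only ever sees the burst
print(set(gauge_samples(60, 30, hour)))   # only ever sees the quiet periods
print(set(gauge_samples(59, 30, hour)))   # sees both, given enough time
print(counter_rate(60, 30, hour))         # about 2.65, bursts included
```

In this sketch the 60-second point-in-time samples report either constant bursts or constant quiet depending purely on where in the minute they happen to start, while the cumulative counter yields the true average rate even with a synchronized interval.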

However, our current Prometheus configuration is not particularly systematic about what has a regular collection interval (like 15, 30, or 60 seconds) and what doesn't. I should probably go back through every collection target, figure out if it falls more into the 'cumulative' category or the 'point in time' category, and set its collection interval to match. This will probably wind up moving some things from being checked every 30 seconds to being checked every 29 (and maybe some from 60 to 59).

(All of this is probably not very important in practice, since the odds of synchronization are relatively low to start with.)

