**2024-06-13**

## Using prime numbers for our Prometheus scrape intervals

When I wrote about the current size of our Prometheus setup I mentioned
that some of our Prometheus *scrape intervals* (how often metrics are
collected from a metrics source) were unusual-looking numbers like
59 seconds and 89 seconds, instead of conventional ones like 15,
30, or 60 seconds. These intervals are prime numbers, and we use
them deliberately so that our metrics collection and checks can't
become synchronized to some regular process that happens, for
example, once a minute.

Prometheus already scatters the start times of metrics collection for different targets across their scrape interval, so synchronization isn't very likely to begin with, but using prime numbers adds an extra level of insurance. At the same time, using prime numbers that are very close to round times like '60 seconds' or '90 seconds' means that we have relatively good odds of periodically doing our check at exactly the start of a minute or on a 30 second boundary. If there is something that happens at :00 or :30 or the like, we'll probably observe it sooner or later (although we may not understand what we're seeing).
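To make the arithmetic concrete, here is a minimal Python sketch of how scrape times move through the minute. It is a simplified model, not how Prometheus itself schedules scrapes: it assumes whole-second timing and a fixed starting offset, and the function names are made up for illustration.

```python
from math import gcd


def distinct_offsets(interval, period=60):
    """How many different offsets within `period` a fixed interval can ever land on."""
    return period // gcd(interval, period)


def offsets_seen(interval, period=60, start=0, scrapes=1000):
    """The set of offsets within `period` actually hit by the first `scrapes` scrapes."""
    return {(start + i * interval) % period for i in range(scrapes)}


def seconds_until_offset(interval, period=60, start=0, target=0):
    """Elapsed seconds until a scrape first lands on `target`, or None if it never does."""
    # The offsets repeat after at most `period` scrapes, so that many steps is enough.
    for i in range(period):
        if (start + i * interval) % period == target:
            return i * interval
    return None


if __name__ == "__main__":
    for interval in (30, 60, 59, 89):
        print(interval, "seconds lands on", distinct_offsets(interval), "distinct offsets")
    # A 59 second scrape that starts at :17 reaches :00 after 17 scrapes.
    print(seconds_until_offset(59, start=17))  # -> 1003
    # A 60 second scrape that starts at :17 stays at :17 forever.
    print(seconds_until_offset(60, start=17))  # -> None
```

In this model a 59 second interval slides back by one second per scrape relative to the minute, so it visits every second of the minute once over about 59 minutes, while a 60 second interval is stuck on whatever offset it started at.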

My feeling is that this irregularity is less important in things that provide cumulative metrics (like most of the metrics from the Prometheus host agent) and more important for 'point in time' metrics of what the current state is, which generally includes Blackbox checks. Cumulative metrics will capture both spikes and quiet periods, but point in time metrics may be distorted by only being collected at busy times (or only at quiet times).
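The difference between the two kinds of metrics can be sketched with a toy workload. Everything here is hypothetical: the 'busy for the first five seconds of every minute' load, the function names, and the whole-second timing are assumptions made for illustration, not anything measured from our systems.

```python
def busy(t):
    """Point in time view: 1 if the (hypothetical) system is busy at second t, else 0."""
    return 1 if t % 60 < 5 else 0


def counter(t):
    """Cumulative view: total busy seconds that have occurred in [0, t)."""
    full_minutes, remainder = divmod(t, 60)
    return full_minutes * 5 + min(remainder, 5)


def gauge_average(interval, start, scrapes):
    """Average of the point in time samples taken every `interval` seconds."""
    samples = [busy(start + i * interval) for i in range(scrapes)]
    return sum(samples) / len(samples)


def counter_rate(interval, start, scrapes):
    """Busy fraction computed from the first and last counter samples."""
    first = start
    last = start + (scrapes - 1) * interval
    return (counter(last) - counter(first)) / (last - first)


if __name__ == "__main__":
    # The true busy fraction is 5/60, about 0.083.
    print("gauge, 60s at :30 ", gauge_average(60, 30, 60))  # never sees the busy period
    print("gauge, 60s at :02 ", gauge_average(60, 2, 60))   # only sees the busy period
    print("gauge, 59s        ", gauge_average(59, 30, 60))  # sees each offset once
    print("counter, 60s      ", counter_rate(60, 30, 60))   # right despite fixed offset
```

With a 60 second interval the point in time samples report either 'never busy' or 'always busy' depending on where in the minute they happen to land, while the counter gives the right busy fraction regardless. With a 59 second interval the point in time average comes out at the true 5/60 over a full cycle.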

However, our current Prometheus configuration is certainly not being particularly systematic about what has a regular collection interval (like 15, 30, or 60 seconds) and what doesn't. I should probably go back through every collection target, figure out if it falls more into the 'cumulative' category or the 'point in time' category, and set its collection interval to match. This will probably wind up moving some things from being checked every 30 seconds to being checked every 29 (and maybe some from 60 to 59).
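As a rough sketch of what that pass could look like, the following Python picks an interval for a target based on which category it falls into. The category names and the example targets are hypothetical, and the policy (use the largest prime at or below the current interval for 'point in time' targets, leave 'cumulative' ones alone) is just one reading of the idea above.

```python
def is_prime(n):
    """Trial division; plenty for scrape intervals of a few hundred seconds."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def largest_prime_at_or_below(n):
    """Largest prime p with p <= n."""
    for candidate in range(n, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("no prime at or below %d" % n)


def suggested_interval(current, kind):
    """Suggested scrape interval in seconds for a target of the given kind."""
    if kind == "point_in_time":
        return largest_prime_at_or_below(current)
    # Cumulative metrics are left on their existing, regular interval.
    return current


if __name__ == "__main__":
    # Hypothetical targets: (name, current interval in seconds, kind).
    targets = [
        ("host-agent", 15, "cumulative"),
        ("blackbox-http", 30, "point_in_time"),
        ("blackbox-ping", 60, "point_in_time"),
    ]
    for name, current, kind in targets:
        print(name, current, "->", suggested_interval(current, kind))
```

This turns 30 into 29 and 60 into 59, and leaves intervals that are already prime (such as 59 or 89) unchanged.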

(All of this is probably not very important in practice, since the odds of synchronization are relatively low to start with.)
