
2021-03-13

Prometheus and the case of the stuck metrics

My home desktop can go down, crash, or lock up every so often (for example when it gets too cold). I run Prometheus on it for various reasons, and when this happens I not infrequently wind up looking at graphs of various things (either in Prometheus or in Grafana). Much of the time, these graphs have a weird structure around the time of the crash. The various metrics will be wiggling back and forth as usual before the crash, but then they go flat and just run on in straight lines at some level before they disappear entirely. It took me a while to work out what was going on.

These flat results happen because Prometheus will look backward a certain amount of time in order to find the most recent sample in a time series, by default five minutes. When my machine goes down, no new samples are being written to any time series, so the last pre-crash sample is returned as the 'current' sample for the next five minutes or so, resulting in flat lines (or rate-based things going to zero). Essentially the time series has become stuck at its last recorded value.
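
To make that concrete, here is a minimal sketch of the lookback rule as I understand it, in Python and emphatically not Prometheus's actual code: an instant query at time t returns the newest sample at or before t, but only if that sample is no older than the lookback window, which defaults to five minutes (the --query.lookback-delta setting).

    # A sketch of the lookback rule, not Prometheus's real implementation.
    from typing import List, Optional, Tuple

    LOOKBACK_SECONDS = 5 * 60   # matches the default --query.lookback-delta

    def instant_value(samples: List[Tuple[float, float]],
                      query_time: float) -> Optional[float]:
        """Return the 'current' value of a series for an instant query.

        samples is a list of (unix_timestamp, value) pairs in time order.
        """
        for ts, value in reversed(samples):
            if ts <= query_time:
                # The newest sample at or before the query time only counts
                # if it falls inside the lookback window.
                return value if query_time - ts <= LOOKBACK_SECONDS else None
        return None

    # If the machine crashed right after writing a sample (1000, 42.0),
    # queries at t=1060 and t=1290 still return 42.0 (the stuck value),
    # while a query at t=1400 returns None and the series vanishes.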

If you've rebooted machines you're collecting metrics from or had Prometheus collectors fail, then looked at graphs of the relevant metrics, you may have noticed that you don't see this. This is because Prometheus is smart and has an explicit concept of stale time series. In particular, it will immediately mark time series as stale under the right conditions:

If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, that time series will be marked as stale. If a target is removed, its previously returned time series will be marked as stale soon afterwards.

What this means is that if a target fails to scrape, all time series from it are immediately marked as stale. If another machine goes down or a collector fails, that target scrape will fail (possibly after a bit of a timeout), and all of its time series go away on the spot. Instead of getting stuck time series in your graphs, you get an empty void.

What's special about my home machine is that I'm running Prometheus on the machine itself, and also that the machine crashed (or at least that the Prometheus process was terminated) instead of everything shutting down in an orderly way. When the machine Prometheus is running on just stops abruptly, Prometheus doesn't see any failed targets and it doesn't have a chance to do any cleanup it might normally do in an orderly shutdown. The only way left for time series to disappear is for there to be no samples in the past five minutes, so for the first few minutes of my home machine being down, I get stuck time series.
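
One way to see the difference concretely is to make instant queries against the Prometheus HTTP API at various evaluation times. The sketch below is purely illustrative (the metric name and the timestamps are placeholders): a couple of minutes after an abrupt crash the query still returns the stuck last sample, while a query more than five minutes later comes back empty, just as a query right after a failed scrape would.

    # Illustrative only; the metric name and timestamps are placeholders.
    import requests

    PROM = "http://localhost:9090/api/v1/query"

    def result_at(query, when):
        # /api/v1/query takes an instant query plus an evaluation time
        # ('time' is an RFC 3339 timestamp or a Unix timestamp).
        r = requests.get(PROM, params={"query": query, "time": when})
        r.raise_for_status()
        return r.json()["data"]["result"]

    # Two minutes after an abrupt crash: still the last pre-crash sample.
    print(result_at("node_load1", "2021-03-13T00:32:00Z"))
    # Ten minutes after: an empty result list; the series has vanished.
    print(result_at("node_load1", "2021-03-13T00:40:00Z"))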

(It's not entirely clear to me what Prometheus does here when the main process shuts down properly. I would probably have to pull raw TSDB data with timestamps in order to be sure, and that's too much work right now.)


