When metrics disappear on updates with Prometheus Pushgateway

October 17, 2018

To simplify, Prometheus has a notion of current versus stale metrics. As you'd want, straightforward Prometheus queries (for instance, for your current CPU usage) return only current metrics. There are also a number of ways to get metrics into Prometheus from places like scripts. One is the node exporter's textfile collector, where your script writes files to a magic directory on a machine that is already running the node exporter. Another is the Pushgateway, where your script can use curl to just poke metrics into a Pushgateway that Prometheus then scrapes.

(A metric here is a combination of a metric name and a set of labels with their values.)

If you use the node exporter's textfile collector, it's pretty straightforward when your metrics stop being current. If Prometheus can't talk to the node exporter, all of your metrics go stale; if it can, any metrics that aren't there go stale. So if you remove your file entirely, all of your metrics go stale, while if you write a new version of the file that's missing some metrics, just those go stale. Basically the current state of the world is, well, current, and everything else is stale.

(However, if you write your file and let it sit for a month, those metrics are still current as far as Prometheus is concerned. The textfile collector does expose a metric for the most recent time each of those files was modified, so you can detect this yourself.)
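As a minimal sketch of the textfile collector side of this, here is one way a script might replace its metrics file in a single step, so that the node exporter never reads a half-written file. The function name, the file name and the idea of writing to a temporary file first are my assumptions for illustration; the directory is whatever your node exporter's textfile collector is configured to read, and the collector only picks up files ending in .prom.

```python
import os
import tempfile


def write_textfile_metrics(directory, filename, lines):
    """Atomically (re)write a metrics file for the textfile collector.

    'directory' and 'filename' are hypothetical; use your own collector
    directory and a name ending in '.prom'.
    """
    final_path = os.path.join(directory, filename)
    # Write to a temporary file in the same directory. The '.tmp' suffix
    # keeps the collector from reading it while it is incomplete.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(lines) + "\n")
        # mkstemp creates the file readable only by its owner; loosen that
        # in case the node exporter runs as a different user.
        os.chmod(tmp_path, 0o644)
        # Rename over the old version in one step.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Any metric that was in the old version of the file and is not in the new one simply stops being exposed, which is what makes it go stale.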

Pushgateway famously does not work this way, in that metrics pushed to it have no expiry time and will be considered current by Prometheus for as long as the Pushgateway responds. To quote from the rundown of Pushgateway pitfalls in 'When to use the Pushgateway':

  • The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.

(Whether or not this is a feature depends on your usage.)

This is true in one sense and not quite true in another. If you push metrics and then go away, it is true. But if you are regularly pushing new versions of your metrics, just as you would regularly generate new versions of your metrics file for the node exporter's textfile collector, then which metrics disappear, and when, depends on both what metrics you push (especially what metric names) and whether you push them with POST or with PUT.

Here's an example. We start by pushing the following metrics to /metrics/job/test/instance/fred on our Pushgateway (the job and instance here form what Pushgateway calls a 'grouping key'):

sensor_temp{id="1"}   23.1
sensor_temp{id="2"}   25.6
sensor_switch{id="1"} 1

Then we push to the same URL with the following new version of our metrics, which no longer mentions either sensor_temp{id="2"} or sensor_switch{id="1"}:

sensor_temp{id="1"}   24.0

If you send this with a POST, Pushgateway will remove the old sensor_temp{id="2"} metric, making it stale, but will continue to expose sensor_switch{id="1"}. If you send this with PUT, Pushgateway removes both.

If you use PUT, Pushgateway assumes that you are completely authoritative for what metrics currently exist under your grouping key; any metrics that you didn't push are removed and become stale in Prometheus. If you use POST, Pushgateway assumes that you're only authoritative for the metric names that you're using in your push. Metric names that you didn't mention might be handled by some other job, so it doesn't touch metrics from them.
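To make the rule concrete, here is a small Python model of it. This is only a sketch of the behavior as described above, not Pushgateway's actual code; the function names and the representation of a group as a dict from series text to value are my assumptions.

```python
def metric_name(series):
    """Return the metric name part of a series like 'sensor_temp{id="1"}'."""
    return series.split("{", 1)[0]


def apply_push(group, pushed, method):
    """Return the new contents of one grouping key after a push.

    'group' is what is currently stored under the grouping key and
    'pushed' is the body of the push, both as {series: value} dicts.
    """
    method = method.upper()
    if method == "PUT":
        # PUT: the push is authoritative for the whole grouping key.
        return dict(pushed)
    if method == "POST":
        # POST: the push is authoritative only for the metric names it
        # mentions; series with other metric names are left alone.
        pushed_names = {metric_name(s) for s in pushed}
        result = {s: v for s, v in group.items()
                  if metric_name(s) not in pushed_names}
        result.update(pushed)
        return result
    raise ValueError("unsupported method: " + method)


first = {
    'sensor_temp{id="1"}': 23.1,
    'sensor_temp{id="2"}': 25.6,
    'sensor_switch{id="1"}': 1.0,
}
second = {'sensor_temp{id="1"}': 24.0}

after_post = apply_push(first, second, "POST")
after_put = apply_push(first, second, "PUT")
```

With the example pushes from above, after_post holds sensor_temp{id="1"} and sensor_switch{id="1"}, while after_put holds only sensor_temp{id="1"}.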

As the Pushgateway documentation mentions but does not explicitly explain, this means that a POST with an empty body does nothing except update the push_time_seconds metric for your grouping key; since you pushed no metric names, Pushgateway doesn't touch any of the existing metrics. If you did a PUT with an empty body, in theory you would get the same effect as DELETE (but Pushgateway may consider this an error; I haven't checked).

Given this, my opinion is that you should normally use PUT when sending metrics to Pushgateway. If you actually want to have several things separately pushing to the same grouping key with POST, you need to explicitly coordinate who gets to use what metric name(s), because otherwise you will quietly have push sources stepping on each other's toes and things will probably get very confusing (as metrics become stale or current depending on who pushed last and when Prometheus scraped your Pushgateway).
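If your script is in Python instead of using curl, a sketch of explicitly choosing the method might look like the following. It only uses the standard library; the helper name, the host name and the port are assumptions for illustration, and it assumes job and instance names that contain nothing awkward like slashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, instance, metrics_text, method="PUT"):
    """Build (but do not send) a push to /metrics/job/<job>/instance/<instance>.

    'metrics_text' is the metrics in text form, one per line, with a
    newline at the end of the last line.
    """
    url = "{}/metrics/job/{}/instance/{}".format(
        base_url.rstrip("/"), quote(job, safe=""), quote(instance, safe="")
    )
    # Setting the method explicitly means we never fall back to POST
    # just because the request happens to carry a body.
    return urllib.request.Request(
        url, data=metrics_text.encode("utf-8"), method=method
    )


# Hypothetical Pushgateway address; substitute your own.
req = build_push_request(
    "http://pushgateway.example:9091", "test", "fred",
    'sensor_temp{id="1"} 24.0\n',
)
# To actually send it: urllib.request.urlopen(req)
```

Passing method="POST" with an empty metrics_text builds the 'just update the push time' request instead.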

(One use of POST is to explicitly only update the last pushed time, with no chance of touching any of the current metrics. In this use it's the Pushgateway equivalent of the Unix touch command.)

I think that it's kind of unfortunate that the Pushgateway README implicitly uses POST in its examples (by using curl with no special options). If I really wanted to try to shave this particular yak I suppose that I could always submit a pull request, although I wonder if it would be declined on the grounds of being too verbose and explaining the nominally obvious.

Sidebar: When persistence in your metrics is a feature

The short version is that I see a use for pushing metrics that basically represent general facts into Pushgateway and then letting it persist them for us. These facts are not per-host things (or at least not things that we really want to generate on the individual hosts), so while we could expose them through the Prometheus host's node exporter and textfiles, that seems a bit like a hack.

Some people would say 'don't put general facts into Prometheus metrics'. My answer is that there isn't really a better option due to the paucity of features in things like alerting rules; you get PromQL expressions and that's mostly it, so either you write lots of alert rules or put your facts where PromQL can get at them.

(Or perhaps I'm missing something.)

Written on 17 October 2018.

Last modified: Wed Oct 17 23:10:05 2018