When metrics disappear on updates with Prometheus Pushgateway
To simplify, Prometheus has a notion of current versus stale metrics. As you'd want, straightforward Prometheus queries (for instance, for your current CPU usage) return only current metrics. There are also a number of ways to get metrics into Prometheus from places like scripts, such as the node exporter's textfile collector, where your script writes files to a magic directory on a machine that is already running the node exporter, and the Pushgateway, where your script can use curl to just poke metrics into a Pushgateway that Prometheus then scrapes.
(A metric here is a combination of a metric name and a set of labels with their values.)
If you use the node exporter's textfile collector, when your metrics stop being current is pretty straightforward. If Prometheus can't talk to the node exporter, all of your metrics go stale; if it can, any metrics that aren't there go stale. So if you remove your file entirely, all of your metrics go stale, while if you write a new version of the file that's missing some metrics, just those go stale. Basically the current state of the world is, well, current, and everything else is stale.
(However, if you write your file and let it sit for a month, those metrics are still current as far as Prometheus is concerned. The textfile collector exposes metrics for the most recent times those files were modified.)
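Since regenerating the file is what retires old metrics here, how you write it matters a bit. Here is a minimal sketch in Python of the usual write-then-rename approach, so the node exporter never sees a half-written file. The function name and the directory you pass in are my own illustrative assumptions (the directory is whatever your node exporter is configured to read), and this is not something the node exporter provides.

```python
import os
import tempfile
from typing import Iterable


def write_textfile_metrics(directory: str, name: str, lines: Iterable[str]) -> str:
    """Atomically replace <directory>/<name>.prom with the given metric lines.

    Hypothetical helper for illustration; 'directory' is assumed to be the
    textfile collector directory your node exporter is configured with.
    """
    # Every line, including the last one, ends with a newline.
    body = "".join(line + "\n" for line in lines)
    final_path = os.path.join(directory, name + ".prom")
    # The temporary file lives in the same directory (so the rename stays on
    # one filesystem) and does not end in .prom, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="." + name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
        os.replace(tmp_path, final_path)  # atomic rename over the old version
    except Exception:
        os.unlink(tmp_path)
        raise
    return final_path
```

Any metric that was in the old version of the file but isn't in `lines` simply stops being exposed and goes stale in Prometheus.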
Pushgateway famously does not work this way, in that metrics pushed to it have no expiry time and will be considered current by Prometheus for as long as the Pushgateway responds. To quote from the rundown of Pushgateway pitfalls in "When to use the Pushgateway":
- The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
(Whether or not this is a feature depends on your usage.)
This is true in one sense and not quite true in another. If you push
metrics and then go away, it is true. But if you are regularly
pushing new versions of your metrics, as you would regularly generate
new versions of your metrics file for the node exporter's textfile
collector, which metrics disappear and when depends both on what
metrics you push (especially what metric names) and on whether you
push them with POST or with PUT.
Here's an example. We start by pushing the following metrics to
/metrics/job/test/instance/fred on our Pushgateway (the job and
instance here form what Pushgateway calls a 'grouping key'):
sensor_temp{id="1"} 23.1
sensor_temp{id="2"} 25.6
sensor_switch{id="1"} 1
Then we push to the same URL with the following new version of our
metrics, which no longer mentions either sensor_temp{id="2"} or
sensor_switch{id="1"}:
sensor_temp{id="1"} 24.0
If you send this with a POST, Pushgateway will remove the old
sensor_temp{id="2"} metric, making it stale, but will continue to
expose sensor_switch{id="1"}. If you send this with PUT, Pushgateway
removes both.
If you use PUT, Pushgateway assumes that you are completely
authoritative for what metrics currently exist under your grouping
key; any metrics that you didn't push are removed and become stale
in Prometheus. If you use POST, Pushgateway assumes that you're
only authoritative for the metric names that you're using in your
push. Metric names that you didn't mention might be handled by some
other job, so it doesn't touch metrics from them.
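To make the difference concrete, here is a toy model in Python of the two behaviors as I've described them, run against the example above. This is only a sketch of the semantics, not how Pushgateway is actually implemented; `metric_name` and `apply_push` are names I made up for illustration, and it ignores things like push_time_seconds.

```python
def metric_name(series: str) -> str:
    """Return the metric name of a series such as 'sensor_temp{id="1"}'."""
    return series.split("{", 1)[0]


def apply_push(group: dict, pushed: dict, method: str) -> dict:
    """Return the new contents of one grouping key after a push.

    A toy model of the behavior described in the text, not Pushgateway code.
    """
    if method == "PUT":
        # PUT: the push is authoritative for the entire grouping key.
        return dict(pushed)
    if method == "POST":
        # POST: the push is authoritative only for the metric names it uses.
        pushed_names = {metric_name(s) for s in pushed}
        result = {s: v for s, v in group.items() if metric_name(s) not in pushed_names}
        result.update(pushed)
        return result
    raise ValueError(f"unsupported method: {method}")


first = {
    'sensor_temp{id="1"}': 23.1,
    'sensor_temp{id="2"}': 25.6,
    'sensor_switch{id="1"}': 1,
}
second = {'sensor_temp{id="1"}': 24.0}

after_post = apply_push(first, second, "POST")
after_put = apply_push(first, second, "PUT")

# POST drops sensor_temp{id="2"} but keeps sensor_switch{id="1"}.
print(sorted(after_post))
# PUT leaves only sensor_temp{id="1"}.
print(sorted(after_put))
```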
As the Pushgateway documentation mentions but does not explicitly
explain, this means that a POST with an empty body does nothing
except update the push_time_seconds metric for your grouping key;
since you pushed no metric names, Pushgateway doesn't touch any of
the existing metrics. If you did a PUT with an empty body, in theory
you would get the same effect as DELETE (but Pushgateway may
consider this an error; I haven't checked).
Given this, my opinion is that you should normally use PUT when
sending metrics to Pushgateway. If you actually want to have several
things separately pushing to the same grouping key with POST, you
need to explicitly coordinate who gets to use what metric name(s),
because otherwise you will quietly have push sources stepping on
each other's toes and things will probably get very confusing (as
metrics become stale or current depending on who pushed last and
when Prometheus scraped your Pushgateway).
(One use of POST is to explicitly only update the last pushed time,
with no chance of touching any of the current metrics. In this use
it's the Pushgateway equivalent of the Unix touch command.)
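If you're pushing from a script that isn't just a curl one-liner, picking the method explicitly is easy. Here is a minimal sketch using Python's standard urllib.request; the function name, the pushgateway.example.org host, and the port are assumptions for illustration, and it does no URL-encoding of the job or instance values, so it only works for simple names like the ones in my example.

```python
import urllib.request


def build_push_request(base_url, job, instance, lines, method="PUT"):
    """Build (but do not send) a push request for one grouping key.

    Hypothetical helper; job and instance are used as-is, without escaping.
    """
    if method not in ("PUT", "POST"):
        raise ValueError(f"unsupported method: {method}")
    url = f"{base_url.rstrip('/')}/metrics/job/{job}/instance/{instance}"
    # Every metric line ends with a newline; no lines means an empty body.
    body = "".join(line + "\n" for line in lines).encode("utf-8")
    return urllib.request.Request(url, data=body, method=method)


# PUT: replace everything under the grouping key with what we send.
replace = build_push_request(
    "http://pushgateway.example.org:9091", "test", "fred",
    ['sensor_temp{id="1"} 24.0'],
)

# Empty POST: the 'touch' use, which leaves existing metrics alone.
touch = build_push_request(
    "http://pushgateway.example.org:9091", "test", "fred", [], method="POST",
)

# To actually send one of them (this needs a reachable Pushgateway):
#     with urllib.request.urlopen(replace, timeout=10) as response:
#         response.read()
```

Defaulting `method` to PUT is deliberate here; with it, a script that stops reporting a metric also stops that metric from lingering under the grouping key.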
I think that it's kind of unfortunate that the Pushgateway README
implicitly uses POST in its examples (by using curl with no special
options). If I really wanted to try to shave this particular yak I
suppose that I could always submit a pull request, although I wonder
if it would be declined on the grounds of being too verbose and
explaining the nominally obvious.
Sidebar: When persistence in your metrics is a feature
The short version is that I see a use for pushing metrics that basically represent general facts into Pushgateway and then letting it persist them for us. These facts are not per-host things (or at least not things that we really want to generate on the individual hosts), so while we could expose them through the Prometheus host's node exporter and textfiles, that seems a bit like a hack.
Some people would say 'don't put general facts into Prometheus metrics'. My answer is that there isn't really a better option due to the paucity of features in things like alerting rules; you get PromQL expressions and that's mostly it, so either you write lots of alert rules or put your facts where PromQL can get at them.
(Or perhaps I'm missing something.)