Prometheus will make persistent connections to agents (scrape targets)

November 30, 2021

The other day, I discovered something new to me about Prometheus:

Today I learned that the Prometheus server will keep persistent TCP/HTTP connections open to clients it's pulling data from if you're pulling data frequently enough (eg, every 15 seconds). This can matter if you add firewall rules that will block new connections to things.

Julien Pivotto added an important additional note:

It also [h]as an impact if you use DNS because we will only resolve when we make new connections

I discovered this when we set up some new firewall rules on a machine that should have blocked connections to its host agent (and did, for my tests with wget), but our Prometheus server seemed to find it still perfectly reachable. That was because the machine's new firewall rules allowed existing connections to continue, so the Prometheus server could still talk to the client's host agent to scrape it.

Prometheus itself doesn't do anything special to get these persistent connections. Instead, they're a feature of HTTP keep-alives and the underlying Go HTTP library, specifically of an http.Transport. Prometheus enables keep-alives in the Transport it uses, currently allows a rather large number of idle connections (in NewRoundTripperFromConfig, in http_config.go), and lets them stay idle for five minutes before they're closed (which in practice means the connections won't go idle, because targets are scraped more frequently than that).
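
As an illustration, here's a minimal Go sketch of the http.Transport knobs involved. The specific numbers are my illustrative values, not Prometheus's actual configuration; see NewRoundTripperFromConfig for the real thing.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    func main() {
        transport := &http.Transport{
            // Keep-alives are on by default in Go; leaving
            // DisableKeepAlives false is all it takes.
            DisableKeepAlives: false,
            // Allow plenty of idle connections so every scrape
            // target can keep one open (illustrative values).
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            // Idle connections are closed after five minutes.
            // A target scraped every 15 seconds never sits
            // idle that long, so its connection lives on.
            IdleConnTimeout: 5 * time.Minute,
        }
        client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

        // Two 'scrapes' of the same target; the second reuses
        // the first one's TCP connection.
        for i := 0; i < 2; i++ {
            resp, err := client.Get("http://example.com/metrics")
            if err != nil {
                fmt.Println("scrape failed:", err)
                return
            }
            // Drain and close the body so the connection goes
            // back into the idle pool for reuse.
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
        }
    }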

One of the non-obvious consequences of this is that your Prometheus server process may use more file descriptors than you expect, especially in a big environment where you scrape a lot of different agents. Roughly speaking, every separate scrape target is a persistent connection and thus a file descriptor. One exception is probes made through something like the Blackbox exporter, where each Blackbox instance will account for one or perhaps a few connections while handling many more probes.
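
If you're curious, on Linux you can count a process's file descriptors by listing /proc/<pid>/fd. Here's a trivial sketch of doing that in Go (generic, nothing Prometheus-specific about it):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
            os.Exit(1)
        }
        pid := os.Args[1] // eg the PID of your Prometheus server
        // Each entry in /proc/<pid>/fd is one open file descriptor.
        entries, err := os.ReadDir("/proc/" + pid + "/fd")
        if err != nil {
            fmt.Fprintln(os.Stderr, "error:", err)
            os.Exit(1)
        }
        fmt.Printf("process %s has %d open file descriptors\n", pid, len(entries))
    }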

(Of course, now that I look at the file descriptor usage of our Prometheus server, the file descriptor usage for persistent connections to targets is a drop in the bucket compared to the file descriptor usage for Prometheus's TSDB files.)

I don't know if Prometheus closes and reopens these connections if (and when) it reloads its scrape targets through service discovery or whatever. In blithe ignorance of what Prometheus does, I'm not certain which I'd prefer. Not closing connections for existing target hosts that are going to continue involves less resource usage, but it also postpones finding problems like new firewall rules or DNS resolution problems.

Speaking of DNS problems, this means that if your local resolving DNS stops being able to resolve your hosts, your Prometheus server will still (probably) be scraping them even if you specify targets by hostname instead of IP. You won't get a sudden wave of scrape target failures because you can't resolve their DNS names, because your Prometheus server has already connected to them while your DNS was working. For us, this is a feature, as we already explicitly monitor our DNS resolvers.
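
You can watch this connection-level behavior directly with Go's net/http/httptrace package. This little sketch (mine, not Prometheus code) makes two requests to the same host; only the first one triggers a DNS lookup, and the second reports a reused connection:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/http/httptrace"
    )

    func main() {
        trace := &httptrace.ClientTrace{
            DNSStart: func(info httptrace.DNSStartInfo) {
                fmt.Println("resolving DNS for", info.Host)
            },
            GotConn: func(info httptrace.GotConnInfo) {
                fmt.Println("got connection, reused:", info.Reused)
            },
        }
        client := &http.Client{}
        for i := 0; i < 2; i++ {
            req, err := http.NewRequest("GET", "http://example.com/", nil)
            if err != nil {
                panic(err)
            }
            req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
            resp, err := client.Do(req)
            if err != nil {
                panic(err)
            }
            // Drain and close the body so the connection goes
            // back into the idle pool for reuse.
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
        }
        // Expected output: one 'resolving DNS' line, then
        // 'reused: false' followed by 'reused: true'.
    }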

(Of course if your local DNS resolution is broken, your Alertmanager may not be able to contact anything to tell you about it.)


Comments on this page:

By roidelapluie at 2022-05-01 18:49:42:

When a scrape fails, we close the connection and ask for DNS again.
