Wandering Thoughts archives

2021-03-31

Understanding Prometheus' changes() function and what it can do for me

Recently, roidelapluie wrote an interesting comment on my entry wishing that Prometheus had features to deal with missing metrics. The comment suggested answering my question about how many times alerts fired over a time interval with a clever (or perhaps obvious) use of the changes() function:

changes(ALERTS_FOR_STATE[1h])+1

When I tried this out, I had one of those 'how does this work' moments until I thought about it more. To understand why this works as well as it does, I'll start with the documentation for changes():

For each input time series, changes(v range-vector) returns the number of times its value has changed within the provided time range as an instant vector.

If you have a continuous time series, one that has always existed within the time range, this gives you the number of times that its value has changed (which is not the same as the number of different values it's had across that time range). If this is a time series like the Blackbox's probe_success, which is either 0 or 1 depending on whether it succeeded, this will tell you how many times the probe has changed states between succeeding and failing.

(To work out how many times the probe has started to fail, it's not enough to divide changes() by two; you also need to know what the probe's state was at the start and the end of the time range.)
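As a rough illustration of why dividing by two isn't enough, here is a small Python sketch. It assumes a simplified model of changes() that just counts adjacent samples with different values; the function names and sample data are made up for this example and are not Prometheus code.

```python
def changes(values):
    """Simplified model of changes(): count adjacent samples that differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)

def failure_starts(values):
    """Count 1 -> 0 transitions, ie how many times the probe started to fail."""
    return sum(1 for prev, cur in zip(values, values[1:])
               if prev == 1 and cur == 0)

# Hypothetical probe_success samples over the same time range.
started_up = [1, 1, 0, 1, 0, 0]    # succeeding at the start, failing at the end
started_down = [0, 0, 1, 0, 1, 1]  # failing at the start, succeeding at the end

print(changes(started_up), failure_starts(started_up))      # 3 2
print(changes(started_down), failure_starts(started_down))  # 3 1
```

Both series have the same number of changes, but they differ in how many times the probe started to fail, and the only difference between them is the state at the start and the end of the range.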

If you apply changes() to a continuous metric where the values reset every so often, you will get a count of how many times the values changed and thus how many times there was a value reset. For instance, if you make DNS SOA queries through Blackbox, you will get the zone's current serial number back as a probe_dns_serial metric, and changes(probe_dns_serial[1w]) will tell you how many times you (or someone else) did zone updates over the past week (well, more or less; this is really only valid for your own authoritative DNS servers). Similarly, if you want to know how many times a host rebooted over the past week you can ask for:

changes(node_boot_time_seconds[1w])

(Well, more or less. There are qualifications if your clocks are changing.)

What this example points out is the value of having a metric with a value that's fixed when some underlying thing changes (such as the system booting), instead of changing all of the time. What the Linux kernel really provides is 'seconds since boot', but if node_exporter directly exposed that it would change on every scrape and we could not use changes() this way.
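To make the difference concrete, here is a minimal Python sketch that compares the two ways of exposing the same information across one reboot. The boot times, the 60-second scrape interval, and the simplified changes() model are all assumptions for illustration, not what node_exporter or Prometheus actually run.

```python
def changes(values):
    """Simplified model of changes(): count adjacent samples that differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)

# Hypothetical boot times (Unix seconds) and a 60-second scrape interval.
FIRST_BOOT = 1_617_000_000
SECOND_BOOT = FIRST_BOOT + 200
scrape_times = [FIRST_BOOT + 60 * i for i in range(1, 7)]

def uptime_at(t):
    """Seconds since the most recent boot at time t."""
    boot = SECOND_BOOT if t >= SECOND_BOOT else FIRST_BOOT
    return t - boot

# What the kernel provides: a 'seconds since boot' value at each scrape.
uptime_samples = [uptime_at(t) for t in scrape_times]
# What gets exposed instead: scrape time minus uptime, a fixed boot timestamp.
boot_time_samples = [t - up for t, up in zip(scrape_times, uptime_samples)]

print(changes(uptime_samples))     # 5
print(changes(boot_time_samples))  # 1
```

The uptime series changes on every scrape, so its count says nothing about reboots, while the boot timestamp series changes exactly once, for the one reboot.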

If you apply changes() to a metric that's sometimes missing, such as ALERTS, the missing sections are ignored (the actual code is literally unaware of them as far as I can tell); what matters is the sequence of values for time series points that actually exist. When the time series always has a fixed value when it exists, such as the fixed ALERTS value of '1', changes() will always tell you that there are 0 changes over the time range for every time series with points within it. This is because the values of the time series points are always the same, and changes() is sadly blind to the time series appearing and disappearing.

If you apply changes() to a non-continuous metric where the value is reset when the time series reappears, you'll get a count that is one less than the number of times that the time series appears. This is the situation for ALERTS_FOR_STATE, where its value is the starting time of an alert. If a given alert was triggered only once, there's only one timestamp value and changes() will tell you it never changed. If a given alert was triggered twice, there are two timestamp values and changes() will tell you it changed once. And so on.
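The two cases can be sketched in Python. This assumes the same simplified model of changes() (only points that exist are compared, so gaps are invisible); the firing times and point counts are invented for the example.

```python
def changes(points):
    """Simplified model of changes() over (timestamp, value) points.

    Only points that exist are compared, so gaps in the series are invisible.
    """
    values = [value for _, value in points]
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)

# Hypothetical alert that fired three times: (start time, number of points).
# Between firings the time series simply has no points.
firings = [(1000, 3), (5000, 2), (9000, 4)]

alerts = []            # modelled on ALERTS: the value is always 1
alerts_for_state = []  # modelled on ALERTS_FOR_STATE: the value is the start time
for start, npoints in firings:
    for i in range(npoints):
        ts = start + 60 * i
        alerts.append((ts, 1))
        alerts_for_state.append((ts, start))

print(changes(alerts))                # 0
print(changes(alerts_for_state) + 1)  # 3
```

The fixed '1' series reports no changes no matter how often the alert fired, while the timestamp series is off by exactly one, which is what the '+1' in the original expression corrects for.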

What all of this biases me towards is exposing some form of fixed timestamp in any situation where I may want to count the number of times something happens. This is probably so even if the underlying data is in the form of a duration ('X seconds ago'), as we saw with host boot times. If I don't have a timestamp, maybe I can come up with some other fixed number instead of just using a '1'. Of course this can be taken too far, since using a fixed '1' value has its own conveniences.

sysadmin/PrometheusChangesFunction written at 23:09:51

Systemd's NSS myhostname module surprised me recently

The other day I did a plain traceroute from my Fedora 33 office workstation (instead of my usual 'traceroute -n') and happened to notice that the first hop was being reported as '_gateway'. This is very much not the name associated with that IP address, so I was rather surprised and annoyed. Although I initially suspected systemd-resolved because of a Fedora 33 change to use it, the actual cause turned out to be the myhostname NSS module, which was listed relatively early in the hosts: line in my nsswitch.conf.

(However, it turns out that I would probably have seen the same thing if I actually was using systemd-resolved, which I'm not.)

If configured in nsswitch.conf, the myhostname module provides three services, only two of which have to do with your machine's hostname. The simplest one is that localhost and variants on it all resolve to the appropriate localhost IPv4 and IPv6 addresses, and those localhost IPv4 and IPv6 addresses resolve back to 'localhost' in gethostbyaddr() and its friends. The second one is that the exact system host name resolves to all of the IP addresses on all of your interfaces; this is the name that hostname prints, and nothing else. Shortened or lengthened variants of the hostname don't do this. As with localhost, all of these IP addresses also resolve back to the exact local host name. This is where the first peculiarity comes up. To quote the documentation:

  • The local, configured hostname is resolved to all locally configured IP addresses ordered by their scope, or — if none are configured — the IPv4 address 127.0.0.2 (which is on the local loopback) and the IPv6 address ::1 (which is the local host).

If you do a reverse lookup on 127.0.0.2, myhostname will always report that it has the name of your machine, even if you have configured IP addresses and so myhostname would not give you 127.0.0.2 as an IP address for your hostname. A reverse lookup of ::1 will report that it's called both 'localhost' and your machine's name.

The third service is that the hostname "_gateway" is resolved to all current default routing gateway addresses. As with the first two services, the IP addresses of these gateways will also be resolved to the name "_gateway", which is what I stumbled over when I actually paid attention to the first hop in my traceroute output.

The current manpage for myhostname doesn't document that it affects resolving IP addresses into names, not just resolving names into IP addresses. A very charitable person could say that this is implied by saying that various hostnames 'are resolved' into IP addresses, as proper resolution of names to IP addresses implies resolving them the other way too.

Which of these effects trigger for you depends on where myhostname is in your nsswitch.conf. For instance, if it's present at all (even at the end), the special hostname "_gateway" will resolve to your gateway IPs, and names like "aname.localhost" will resolve to your IPv4 and IPv6 localhost IPs (and probably 127.0.0.2 will resolve to your hostname). If it's present early, it will steal the resolution of more names and more IPs from DNS and other sources.

The myhostname NSS module is part of systemd and has worked like this for over half a decade (although it started out using "gateway" instead of "_gateway"). However, it's not necessarily packaged, installed, and configured along with the rest of systemd. Ubuntu splits it out into a separate package, libnss-myhostname, which isn't installed on our Ubuntu servers. Fedora packages it as part of 'systemd-libs', which means it's guaranteed to be installed, and appears to default to using it in nsswitch.conf.

(What I believe is a stock installed Fedora 33 VM image has a nsswitch.conf hosts: line of "files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns". You might think that this would make DNS results from systemd-resolved take precedence over myhostname, but in a quiet surprise systemd-resolved does this too; see the "Synthetic Records" section in systemd-resolved.service.)
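If you want to check where myhostname sits on a given hosts: line, a small Python sketch like the following will do. The function names are made up for this example, and it only looks at the order of the sources; it deliberately ignores the bracketed '[STATUS=action]' items, which also affect whether later sources are ever consulted.

```python
import re

def hosts_sources(line):
    """Return the NSS source names from a hosts: line, in order.

    Bracketed action items such as [NOTFOUND=return] are dropped.
    """
    text = line.strip()
    if text.startswith("hosts:"):
        text = text[len("hosts:"):]
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return text.split()

def consulted_before(sources, first, second):
    """True if both sources are present and 'first' is listed before 'second'."""
    return (first in sources and second in sources
            and sources.index(first) < sources.index(second))

# The hosts: line quoted above from what I believe is a stock Fedora 33 image.
FEDORA_HOSTS = ("hosts: files mdns4_minimal [NOTFOUND=return] "
                "resolve [!UNAVAIL=return] myhostname dns")

sources = hosts_sources(FEDORA_HOSTS)
print(sources)
print(consulted_before(sources, "myhostname", "dns"))  # True
```

To look at a real system you would feed this the hosts: line from /etc/nsswitch.conf instead of the constant used here.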

PS: I don't know why I never noticed this special _gateway behavior before, since myhostname has been doing all of this for some time (and I've had it in my nsswitch.conf ever since Fedora started shoving it in there). Possibly I just never noticed the name of the first hop when I ran plain 'traceroute', because I always knew what it was.

PPS: The change from "gateway" to "_gateway" happened in systemd 235, released 2017-10-06. The "gateway" feature for myhostname was introduced in systemd 218, released 2014-12-10. All of this is from systemd's NEWS file.

linux/SystemdNSSMyhostname written at 00:46:16

