Wandering Thoughts archives

2021-02-28

Dot-separated DNS name components aren't even necessarily subdomains, illustrated

I recently wrote an entry about my pragmatic sysadmin view on subdomains and DNS zones. At the end of the entry I mentioned that we had a case where we had DNS name components that didn't create what I thought of as a subdomain, in the form of the hostnames we assign for the IPMIs of our servers. These names are in the form '<host>.ipmi.core.sandbox' (in one of our internal sandboxes), but I said that 'ipmi.core.sandbox' is neither a separate DNS zone nor something that I consider a subdomain.

There's only one problem with this description; it's wrong. It's been so long since I actually dealt with an IPMI hostname that I mis-remembered our naming scheme for them, which I discovered when I needed to poke at one by hand the other day. Our actual IPMI naming scheme puts the 'ipmi' bit first, giving us host names of the form 'ipmi.<host>.core.sandbox' (as before, for the IPMI for <host>; the host itself doesn't have an interface on the core.sandbox subnet).

What this naming scheme creates is middle name components that clearly don't create subdomains in any meaningful sense. If we have host1, host2, and host3 with IPMIs, we get the following IPMI names:

ipmi.host1.core.sandbox
ipmi.host2.core.sandbox
ipmi.host3.core.sandbox

It's pretty obviously silly to talk about 'host1.core.sandbox' being a subdomain, much more so than 'ipmi.core.sandbox' in my first IPMI naming scheme. These names could as well be 'ipmi-<host>'; we just picked a dot instead of a dash as a separator, and dot has special meaning in host names. The 'ipmi.core.sandbox' version would at least create a namespace in core.sandbox for IPMIs, while this version has no single namespace for them, instead scattering the names all over.

(The technicality here is DNS resolver search paths. You could use 'host1.core.sandbox' as a DNS search path, although it would be silly.)
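
As a concrete sketch of that technicality, an /etc/resolv.conf using it might read as follows (the nameserver address is a made-up documentation one, not anything real of ours); on such a machine, looking up plain 'ipmi' would then get you ipmi.host1.core.sandbox:

search host1.core.sandbox core.sandbox
nameserver 192.0.2.53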

PS: Tony Finch also wrote about "What is a subdomain?" in an entry that's worth reading, especially for historical and general context.

SubdomainsAndDNSZonesII written at 22:39:58

2021-02-27

My pragmatic sysadmin view on subdomains and DNS zones

Over on Twitter, Julia Evans had an interesting poll and comment:

computer language poll: is mail.google.com a subdomain of google.com? (not a trick question, no wrong answers, please don't argue about it in the replies, I'm just curious what different people think the word "subdomain" means :) )

the ambiguity here is that mail.google.com doesn't have its own NS/SOA record. An example of a subdomain that does have those things is alpha.canada.ca -- it has a different authoritative DNS server than canada.ca does.

This question is interesting to me because I had a completely different view of it than Julia Evans did. For me, NS and SOA DNS records are secondary things when thinking about subdomains, down at the level of the mechanical plumbing that you sometimes need. This may surprise people, so let me provide a quite vivid local example of why I say that.

Our network layout has a bunch of internal subnets using RFC 1918 private IP address space, probably like a lot of other places. We call these 'sandbox' networks, and generally each research group has one, plus there are various other ones for our internal use. All of these sandboxes have host names under an internal pseudo-TLD, .sandbox (yes, I know, this is not safe given the explosion in new TLDs). Each different sandbox has a subdomain in .sandbox and then its machines go in that subdomain, so we have machines with names like sadat.core.sandbox and lw-staff.printers.sandbox.

However, none of these subdomains are DNS zones, with their own SOA and NS records. Instead we bundle all of the sandboxes together into one super-sized .sandbox zone that has everything. One of the reasons for this is that we do all of the DNS for all of these sandbox subdomains, so all of those hypothetical NS and SOA records would just point to ourselves (and possibly add pointless extra DNS queries to uncached lookups).
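
As an illustration, a fragment of that single combined zone file might look something like the following (the RFC 1918 addresses here are invented, not our real ones); names from different sandboxes simply sit next to each other in the one zone:

$ORIGIN sandbox.
; note: no SOA or NS records for core.sandbox or printers.sandbox themselves
sadat.core          IN A    10.88.1.10
lw-staff.printers   IN A    10.88.2.20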

I think most system administrators would consider these sandbox subdomains to be real subdomains. They are different namespaces (including for DNS search domains), they're operated by different groups with different naming policies, we update them separately (each sandbox has its own DNS file), and so on. But at the mechanical level of DNS zones, they're not separate zones.

But this still leaves a question about mail.google.com: is it a subdomain or a host? For people outside of Google, this is where things get subjective. A (DNS) name like 'www.google.com' definitely feels like a host, partly because in practice it's unlikely that people would ever have a <something>.www.google.com. But mail.google.com could quite plausibly someday have names under it as <what>.mail.google.com, even if it doesn't today. So to me it feels more like a subdomain even if it's only being used as a host today.

(People inside Google probably have a much clearer view of what mail.google.com is, conceptually. Although even those views can drift over time. And something can be both a host and a subdomain at once.)

Because what I consider a subdomain depends on how I think about it, we have some even odder cases where we have (DNS) name components that I don't think of as subdomains, just as part of the names of a group of hosts. One example is our IPMIs for machines, which we typically call names like '<host>.ipmi.core.sandbox' (for the IPMI of <host>). In the DNS files, this is listed as '<host>.ipmi' in the core.sandbox file, and I don't think of 'ipmi.core.sandbox' as a subdomain. The DNS name could as well be '<host>-ipmi' or 'ipmi-<host>', but I happen to think that '<host>.ipmi' looks nicer.

(What is now our IPMI network is an interesting case of historical evolution, but that's a story for another entry.)

SubdomainsAndDNSZones written at 00:59:34

2021-02-24

How convenience in Prometheus labels for alerts led me into a quiet mistake

In our Prometheus setup, we have a system of alerts that are in testing, not in production. As I described recently, this is implemented by attaching a special label with a special value to each alert, in our case a 'send' label with the value of 'testing'; this is set up in our Prometheus alert rules. This is perfectly sensible.

In addition to alerts that are in testing, we also have some machines that aren't in production or that I'm only monitoring on a test basis. Because these aren't production machines, I want any alerts about these machines to be 'testing' alerts, even though the alerts themselves are production alerts. When I started thinking about it, I realized that there was a convenient way to do this because alert labels are inherited from metric labels and I can attach additional labels to specific scrape targets. This means that all I need to do to make all alerts for a machine that are based on the host agent's metrics into testing alerts is the following:

# regular production scrape targets:
- targets:
    - production:9100
  [...]

# a not yet in production machine; all of its metrics (and so all
# alerts based on them) get the send="testing" label:
- labels:
    send: testing
  targets:
    - someday:9100

I can do the same for any other checks, such as Blackbox checks. This is quite convenient, which encourages me to actually set up testing monitoring for these machines instead of letting them go unmonitored. But there's a hidden downside to it.

When we promote a machine to production, obviously we have to make alerts about it be regular alerts instead of testing alerts. Mechanically this is easy to do; I move the 'someday:9100' target up to the main section of the scrape configuration, which means it no longer gets the 'send="testing"' label on its metrics. Which is exactly the problem, because in Prometheus a time series is identified by its labels (and their values). If you drop a label or change the value of one, you get a different time series. This means that the moment we promote a machine to production, it's as if we dropped the old pre-production version of it and added a completely different machine (that coincidentally has the same name, OS version, and so on).

Some PromQL expressions will allow us to awkwardly overcome this if we remember to use 'ignoring(send)' or 'without(send)' in the appropriate place. Other expressions can't be fixed up this way; anything using 'rate()' or 'delta()', for example. A 'rate()' across the transition boundary sees two partial time series, not one complete one.
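
As a hypothetical illustration of those awkward fixups (the expressions here are made-up examples, not our real alert rules):

# a binary operation needs ignoring() so that the old send="testing"
# series still matches the new series without the label:
node_load1 > ignoring(send) (2 * (node_load1 offset 1h))

# an aggregation can simply discard the label with without():
max without(send) (node_load1) > 10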

What this has made me realize is that I want to think carefully before putting temporary things in Prometheus metric labels. If possible, all labels (and label values) on metrics should be durable. Whether or not a machine is an external one is a durable property, and so is fine to embed in a metric label; whether or not it's in testing is not.

Of course this is not a simple binary decision. Sometimes it may be right to effectively start metrics for a machine from scratch when it goes into production (or otherwise changes state in some significant way). Sometimes its configuration may be changed around in production, and beyond that what it's experiencing may be different enough that you want a clear break in metrics.

(And if you want to compare the metrics in testing to the metrics in production, you can always do that by hand. The data isn't gone; it's merely in a different time series, just as if you'd renamed the machine when you put it into production.)

PrometheusHostLabelMistake written at 23:01:31

How (and where) Prometheus alerts get their labels

In Prometheus, you can and usually do have alerting rules that evaluate expressions to create alerts. These alerts are usually passed to Alertmanager and they are visible in Prometheus itself as a couple of metrics, ALERTS and ALERTS_FOR_STATE. These metrics can be used to do things like find out the start time of alerts or just display a count of currently active alerts on your dashboard. Alerts almost always have labels (and values for those labels), which tend to be used in Alertmanager templates to provide additional information alongside annotations, which are subtly but crucially different.

All of this is standard Prometheus knowledge and is well documented, but what doesn't seem to be well documented is where alert labels come from (or at least I couldn't find it said explicitly in any of the obvious spots in the documentation). Within Prometheus, the labels on an alert come from two places. First, you can explicitly add labels to the alert in the alert rule, which can be used for things like setting up testing alerts. Second, the basic labels for an alert are whatever labels come out of the alert expression. This can have some important consequences.

If your alert expression is a simple one that just involves basic metric operations, for example 'node_load1 > 10.0', then the basic labels on the alert are the same labels that the metric itself has; all of them will be passed through. However, if your alert expression narrows down or throws away some labels, then those labels will be missing from the end result. One of the ways to lose labels in alert expressions is to use 'by (...)', because this discards all labels other than the 'by (whatever)' label or labels. You can also deliberately pull in labels from additional metrics, perhaps as a form of database lookup (and then you can use these additional labels in your Alertmanager setup).
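
To make this concrete, here are two hypothetical alert expressions; the metric names are real host agent ones except for 'machine_owner_info', which is an invented info-style metric used purely for illustration:

# keeps only 'instance'; 'job', 'device', and everything else is
# discarded by the by() clause, and so is missing from the alert:
sum(rate(node_network_receive_bytes_total[5m])) by (instance) > 1e+08

# pulls an 'owner' label into the alert from a separate info-style
# metric, as a sort of database lookup:
(node_load1 > 10) * on(instance) group_left(owner) machine_owner_info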

Prometheus itself also adds an alertname label, with the name of the alert as its value. The ALERTS metric in Prometheus also has an alertstate label, but this is not passed on to the version of the alert that Alertmanager sees. Additionally, as part of sending alerts to Alertmanager, Prometheus can relabel alerts in general to do things like canonicalize some labels. This can be done either for all Alertmanager destinations or only for a particular one, if you have more than one of them set up. This only affects alerts as seen by Alertmanager; the version in the ALERTS metric is unaffected.

(This can be slightly annoying if you're building Grafana dashboards that display alert information using labels that your alert relabeling changes.)
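
For the record, the 'relabel for all Alertmanager destinations' version of this is configured with alert_relabel_configs under the alerting section of prometheus.yml. A minimal sketch (the specific relabeling here is an invented example, not our real configuration):

alerting:
  alert_relabel_configs:
    # trim 'instance' down to the bare host name:
    - source_labels: [instance]
      regex: '([^.:]+).*'
      replacement: '$1'
      target_label: instance
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']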

PS: In practice, people who use Prometheus work out where alert labels come from almost immediately. It's both intuitive (alert rules use expressions, expression results have labels, and so on) and obvious once you have some actual alerts to look at. But if you're trying to decode Prometheus on your first attempt, neither it nor its consequences are obvious.

PrometheusAlertsWhereLabels written at 00:19:23

2021-02-22

How I set up testing alerts in our Prometheus environment

One of the things I mentioned in my entry on how our alerts are quiet most of the time is that I have some Prometheus infrastructure for 'testing' alerts. Rather than being routed to everyone (via the normal email destination), these alerts go to a special destination that only reaches interested parties (ie, me). There are a number of different ways to implement this in Prometheus, so the way I picked to do it isn't necessarily the best one (and in fact it enables a bad habit, which is for another entry).

The simplest way to implement testing alerts is to set them up purely in Alertmanager. As part of your Alertmanager routing configuration, you would have a very early rule that simply listed all of the alerts that are in testing and diverted them. This would look something like this:

- match_re:
    alertname: 'OneAlert|DubiousAlert|MaybeAlert'
  receiver: testing-email
  [any other necessary parameters]

The problem with this is that it involves more work when you set up a new testing alert. You have to set up the alert itself in your Prometheus alert rules, and then you have to remember to go off to Alertmanager and update the big list of testing alerts. If you forget or make a typo, your testing alerts go to your normal alert receivers and annoy your co-workers. I'm a lazy person, so I picked a more general approach.

My implementation is that all testing alerts have a special Prometheus label with a special value, and then the Alertmanager matches on the presence of this (Prometheus) label. In Alertmanager this looks like:

- match:
    send: testing
  receiver: testing-email
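
For context, this match block sits as an early entry in the top level route's list of child routes, so the surrounding Alertmanager routing looks roughly like this sketch (simplified, with our real receivers and later routes left out; the 'regular-email' receiver name is made up):

route:
  receiver: regular-email
  routes:
    - match:
        send: testing
      receiver: testing-email
    [...]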

Then in each Prometheus alert rule, we explicitly add the label and the label value in each testing rule:

- alert: MaybeAlert
  expr: ....
  labels:
    [...]
    send: testing
  annotations:
    [...]

(We add some other labels for each alert, to tell us things such as whether the alert is a host-specific one or some other type of alert, like a machine room being too hot.)

This enables my laziness, because I only need to edit one file to create a new testing alert instead of two of them, and there's a lower chance of typos and omissions. It also has the bonus of keeping the testing status of an alert visible in the alert rule file, at the expense of making it harder to get a list of all alerts that are in testing. For me this is probably a net win, because I look at alert rules more often than I look at our Alertmanager configuration so I have a higher chance of seeing a still-in-testing rule in passing and deciding to promote it to production. And if I'm considering promoting a testing alert to full production status, I can re-read the entire alert in one spot while I'm thinking about it.

(Noisy testing rules get removed rapidly, but quiet testing rules can just sit there with me forgetting about them.)

PrometheusTestingAlerts written at 00:09:08

2021-02-09

Normal situations should not be warnings (especially not repeated ones)

Every so often (or really, too often), people with good intentions build a program that looks at some things or does some things, and they decide to have that program emit warnings or set status results if things are not quite perfect and as expected. This is a mistake, and it makes system administrators who have to deal with the program unhappy. An ordinary system configuration should not cause a program to raise warnings or error markers, even if it doesn't allow all of the things that a program is capable of doing (or that the program wants to do by default). In addition, every warning should be rate-limited in any situation that can plausibly emit them regularly.

That all sounds abstract, so let's make it concrete with some examples drawn from the very latest version (1.1.0) of the Prometheus host agent. The host agent gathers a bunch of information from your system, which is separated into a bunch of 'collectors' (one for each sort of information). Collectors may be enabled or disabled by default, and as part of the metrics that the host agent emits it can report if a particular collector said that it failed (what constitutes 'failure' is up to the collector to decide).

The host agent has collectors for a number of Linux filesystem types (such as XFS, Btrfs, and ZFS), for networking technologies such as Fibrechannel and Infiniband, and for network stack information such as IP filtering connection tracking ('conntrack'), among other collectors. All of the collectors I've named are enabled by default. Naturally, many systems do not actually have XFS, Btrfs, or ZFS filesystems, or Infiniband networking, or any 'conntrack' state. Unfortunately, of these enabled-by-default collectors, zfs, infiniband, fibrechannel, and conntrack all generate metrics reporting a collector failure on Linux servers that don't use those respective technologies. Without advance knowledge of the specific configuration of every server you monitor, this makes it impossible to tell the difference between a machine that simply doesn't have one of those things and a real collector failure on a machine that does have one and so should be successfully collecting information about it. Still, these failures only show up in the generated metrics. At least two collectors in 1.1.0 do worse, emitting actual warnings into the host agent's logs.
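
For the record, the collector failure reporting in the metrics is done through node_scrape_collector_success. This means that a natural alert rule expression for 'some collector is failing', something like the following, will fire constantly under 1.1.0 on perfectly healthy machines that merely don't have ZFS, Infiniband, Fibrechannel, or conntrack state:

node_scrape_collector_success == 0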

The first collector is for Linux's new pressure stall information. This is valuable information, but of course it's only available on recent kernels, which in practice means recent Linux distribution releases (so, for example, both Ubuntu 18.04 and CentOS 7 use kernels without this information). However, if the host agent's 'pressure' collector can't find the /proc files it expects, it doesn't just report a collector failure, it emits an error message:

level=error ts=2021-02-08T19:42:48.048Z caller=collector.go:161 msg="collector failed" name=pressure duration_seconds=0.073142059 err="failed to retrieve pressure stats: psi_stats: unavailable for cpu"

At least you can disable this collector on older kernels, and automate that with a cover script that checks for /proc/pressure and disables the pressure collector if it's not there.
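
Such a cover script only needs a few lines of shell; a sketch (with a made-up path for wherever you've put the real host agent binary) is:

#!/bin/sh
# pass --no-collector.pressure if this kernel has no /proc/pressure
extra=""
if [ ! -d /proc/pressure ]; then
    extra="--no-collector.pressure"
fi
exec /opt/prometheus/node_exporter.real $extra "$@"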

The second collector is for ZFS metrics. In addition to a large amount of regular ZFS statistics, recent versions of ZFS on Linux expose kernel information about the overall health of each ZFS pool on the system. This was introduced in ZFS on Linux version 0.8.0, which is more recent than the version of ZoL that is included in, for example, Ubuntu 18.04. Unfortunately, in version 1.1.0 the Prometheus host agent ZFS collector insists on this overall health information being present; if it isn't, the collector emits a warning:

level=warn ts=2021-02-09T01:14:09.074Z caller=zfs_linux.go:125 collector=zfs msg="Not found pool state files"

Since this is only part of the ZFS collector's activity, you can't disable just this pool state collection. Your only options are to either disable the entire collector, losing all ZFS metrics on say your Ubuntu 18.04 ZFS fileservers, or have frequent warnings flood your logs. Or you can take the third path of not using version 1.1.0 of the host agent.

(Neither the pressure collector nor the ZFS collector rate-limits these error and warning messages. Instead one such message will be emitted every time the host agent is polled, which is often as frequently as once every fifteen or even every ten seconds.)

NormalThingsNotWarnings written at 00:16:58

