Wandering Thoughts


Counting how many times something started or stopped failing in Prometheus

When I recently wrote about Prometheus's changes() function and its resets(), I left a larger scale issue not entirely answered. Suppose that you have a metric that is either 0 or 1, such as Blackbox's probe_success, and you want to know either how many times it's started failing or how many times it's stopped failing over a time interval.

Counting how many times a probe_success time series has started to fail over a time interval is simple. As explained at greater length in the resets() entry, we can simply use it:

resets( probe_success [1d] )

We can't do this any more efficiently than with resets(), because no matter what we do Prometheus has to scan all of the time series values across our one day range. The only way this could be more efficient would be if Prometheus gained some general feature to stream through all of the time series points it has to look at over that one-day span, instead of loading them all into memory.

Counting how many times a probe_success time series has started to succeed (after a failure) over the time interval is potentially more complex, depending on how much you care about efficiency. The straightforward answer is to use changes() to count how many times it has changed state between success and failure and then use resets() to subtract how many times it started to fail:

changes( probe_success [1d] ) - resets( probe_success [1d] )

But unless Prometheus optimizes things, this will load one day's worth of every probe_success time series twice, first for changes() and then again for resets().
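As a rough illustration of the arithmetic (not of how Prometheus implements it), here is a small Python sketch that imitates what changes() and resets() compute over a list of sample values. The helper names and the sample history are made up for the example, and NaN handling is ignored.

```python
def count_changes(values):
    """Count how often consecutive samples differ, in the spirit of changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


def count_resets(values):
    """Count how often a sample is lower than its predecessor, in the spirit of resets()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


# A made-up probe_success history: up, down, up, down, up.
samples = [1, 1, 0, 0, 1, 1, 0, 1]

started_failing = count_resets(samples)
started_succeeding = count_changes(samples) - count_resets(samples)
print(started_failing, started_succeeding)
```

Here the series changes state four times, two of which are drops to 0, so both counts come out as 2.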

One approach to avoiding this extra load is to count changes and divide by two, but this goes wrong if the probe started out in a different state than it finished. If this happens, changes() will be odd and we will have a fractional success, which needs to be rounded down if the probe started out succeeding and rounded up if the probe started out failing. We can apparently achieve our desired rounding in a simple, brute force way as follows:

floor( ( changes( probe_success[1d] ) + probe_success )/2 )

What this does at one level is add one to changes() if the probe was succeeding at the end of the time period. This extra change doesn't matter if the probe started out succeeding, because then changes() will be even, the addition will make it odd, and then dividing by two and flooring will ignore the addition. But if the probe started out failing and ended up succeeding, changes() will be odd, the addition will make it even, and dividing by two will 'round up'.
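The four start and end combinations can be checked by brute force. This is a minimal Python sketch under the assumption that the probe value is always exactly 0 or 1; the function names and test histories are invented for the example.

```python
def count_changes(values):
    """Count how often consecutive samples differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


def started_succeeding_exact(values):
    """Count 0 -> 1 transitions directly."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev == 0 and cur == 1)


def started_succeeding_trick(values):
    """floor((changes + value at the end of the range) / 2)."""
    return (count_changes(values) + values[-1]) // 2


cases = {
    "starts up, ends up": [1, 0, 1, 0, 1],
    "starts up, ends down": [1, 0, 1, 0],
    "starts down, ends up": [0, 1, 0, 1],
    "starts down, ends down": [0, 1, 0, 1, 0],
}
for name, values in cases.items():
    print(name, started_succeeding_exact(values), started_succeeding_trick(values))
```

In all four cases the two numbers agree, which is the property the floor() expression relies on.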

However, this has the drawback that it will completely ignore time series that didn't exist at the end of the time period. Because addition in Prometheus only produces results for label sets that are present on both sides (in effect a set intersection) and the disappeared time series aren't present in the right side set, their changes() disappears entirely. As far as I can see, there is no way out of this that avoids a second full scan of your metric over the time range. At that point you might as well use resets().
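The disappearing result can be imitated with plain dictionaries. This Python sketch models an instant vector as a mapping from a label set to a value and models '+' as keeping only label sets found on both sides; the instance names and numbers are invented, and real PromQL vector matching has more options than this.

```python
def vector_add(left, right):
    """Imitate one-to-one '+': keep only label sets present on both sides."""
    return {labels: left[labels] + right[labels] for labels in left if labels in right}


# changes() over the range still has a result for instance "b" ...
changes_1d = {("instance", "a"): 3, ("instance", "b"): 2}
# ... but the instant vector at the end of the range no longer includes it.
probe_success_now = {("instance", "a"): 1}

combined = vector_add(changes_1d, probe_success_now)
result = {labels: value // 2 for labels, value in combined.items()}
print(result)
```

Instance "b" had two state changes, but it has no entry at all in the final result.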

For dashboard display purposes you might accept the simple 'changes()/2' approach with no clever compensation for odd changes() values, and add a note about why the numbers could have a <N>.5 value. Not everything on your dashboards and graphs has to be completely, narrowly correct all of the time even at the cost of significant overhead.

(This is one of the entries I'm writing partly for my future self. I'd hate to have to re-derive all of this logic in the future when I already did it once.)

PrometheusCountOnOrOff written at 00:01:41


Vendors put varied and peculiar things in system DMI information

Recently I read A Ceph war story, which is an alarming story where the day is saved in part by having a detailed hardware inventory that included things like firmware information. This inspired me to think about how we could collect similar information for our modest fleet of Ubuntu servers, which led me to dmidecode, which reports information about your system's hardware as your BIOS describes it according to the SMBIOS/DMI standard(s). When I actually started looking at DMI information for our systems, a number of interesting things started showing up.

The DMI table that dmidecode displays is composed of a bunch of records, of a bunch of different types. For hardware inventories, the most useful types to look at seem to be 'BIOS', 'System information', 'Baseboard' (ie, motherboard), 'Chassis', 'Processor', and 'Memory Module'. The information in each record varies, but many of them have serial numbers, product names, and manufacturers. Processors and memory modules are generally the most verbose, with lots of additional data that you (we) probably want to note down.

In theory DMI is available on most any x86 system. In practice it works best on standard servers from major manufacturers, because DMI is in the BIOS but theoretically contains information about the chassis the motherboard is in and the overall system. If you build a system yourself or a reseller builds it for you from vendor parts (such as our Linux fileservers), information on the chassis and the overall system is unknown to the BIOS and so it will either leave things blank or stick random things there. This and other factors lead to a wide variety of interesting things showing up in DMI data.

It will likely not surprise you that vendors have come up with a huge variety of ways of not having serial numbers, even in the same BIOS (one uses 'System Serial Number' and 'Default string' in different DMI records, for example). Merely being a number is no guarantee that it's a real serial number; there are serial numbers in our fleet (in various records) of 0123456789, 1234567890, 123456789, and 00000000, in addition to things like 'XXXXXXXX'. Some of these also show up in the 'version' field that many records have.

(Nor are serial numbers confined to numbers and hex digits. We have some that have /s, like paths, and quite a few with dots at the start and the end.)

In other places, some BIOS vendors have left helpful instructions for their OEM partners in field values such as 'To be filled by O.E.M.'. If you guessed that the OEM did not fill in this information, you would be correct. There are also plenty of fields with generic contents like 'Chassis Manufacture' (sic) and 'System Product Name', plus obvious things like 'Not Specified' and 'NONE'.
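If you're building an inventory from this data, you'll probably want some filter for the obvious non-values. Here is a minimal Python sketch of one, seeded only with the strings mentioned above; the function name is invented, the lists are certainly incomplete, and it's a heuristic that can be fooled in both directions.

```python
import re

# Placeholder strings quoted in the text; a real fleet will need a longer list.
PLACEHOLDERS = {
    "system serial number", "default string", "to be filled by o.e.m.",
    "chassis manufacture", "system product name", "not specified", "none",
}
BOGUS_NUMBERS = {"0123456789", "1234567890", "123456789"}


def looks_real(value):
    """Heuristic guess at whether a DMI field holds real information."""
    v = value.strip()
    if not v or v.lower() in PLACEHOLDERS or v in BOGUS_NUMBERS:
        return False
    # A single repeated character, such as 00000000 or XXXXXXXX.
    if re.fullmatch(r"(.)\1+", v):
        return False
    return True


for field in ["To be filled by O.E.M.", "00000000", "XXXXXXXX", ".AB12CD3."]:
    print(repr(field), looks_real(field))
```

The last value is an invented serial with dots at both ends, which the filter deliberately leaves alone because real serial numbers can look like that.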

The DMI processor information has turned out to be interesting to handle because some vendors throw in extra processors that aren't there. This caused me to start capturing the 'Status' of each processor, which turned up other interesting cases. One machine reports that most of its processors are 'Populated, Idle' instead of 'Populated, Enabled' (despite them being in use at the time), and one machine claims that all of its processors are 'Populated, Disabled By BIOS'.

Memory modules are the things that seem to most often have real serial numbers, but even there it's not universal. Some modules don't have serial numbers, some have clearly bogus ones, and some have real looking ones that are duplicated between modules on the same system. This latter bit could be a BIOS issue, because presumably the BIOS is building part of the DMI table dynamically.

I wish that vendors would leave fields blank when they have no information, but I suspect that vendors have tried that and found it causes problems. For example, Supermicro is generally well regarded and they're one of the people who use obviously bad, all-numeric serial numbers in some records. I suspect that they have a reason for that.

(This sort of elaborates on some tweets.)

DMIVendorPeculiarities written at 00:54:34


Some uses for Prometheus's resets() function

One of the functions available in Prometheus's PromQL query language is resets(), which is described this way in the documentation:

For each input time series, resets(v range-vector) returns the number of counter resets within the provided time range as an instant vector. Any decrease in the value between two consecutive samples is interpreted as a counter reset.

resets should only be used with counters.

Up until recently I've ignored resets() because knowing when a counter had reset didn't seem particularly useful to me. This changed due to roidelapluie's comment about it on this entry (I'll get to it), which caused me to start thinking about resets() in general. But before I get to its possible uses, there's an important qualification on the documentation.

If a time series disappears for a while then reappears, the last value from before the disappearance is consecutive with the first value after it reappears. All that resets() cares about is the stream of values when a time series exists, so if the post-reappearance value is lower than the pre-disappearance value, this is counted as a reset. Much like the changes() function, the code that evaluates this is completely blind to there even being periods of time where the time series isn't there. By extension, if a counter time series disappears for a while but comes back with the same value or higher, this won't be considered a reset.

(This makes reasonable sense if you think about scrape failures. You don't really want a scrape failure or a series of them to make Prometheus declare that counters have reset.)
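The gap blindness is easy to see in a toy model. In this Python sketch each sample is a (timestamp, value) pair with invented numbers, and the counting function throws the timestamps away before it looks at anything, which is the behaviour described above.

```python
def count_resets(samples):
    """samples is a list of (timestamp, value); the timestamps are never consulted."""
    values = [value for _, value in samples]
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


# Made-up counter samples with a ten minute hole in the middle of each series.
came_back_lower = [(0, 100), (15, 110), (615, 5), (630, 9)]
came_back_higher = [(0, 100), (15, 110), (615, 250), (630, 260)]

print(count_resets(came_back_lower))   # the drop across the gap counts as a reset
print(count_resets(came_back_higher))  # the gap on its own does not
```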

As pointed out by roidelapluie, the first thing you can do with resets() is apply it to a continuous metric that's either 0 or 1, such as a success metric from Blackbox, in order to count how many times the time series has started to fail over a time interval (or more generally has gone to 0). Resets() is blind to what type the metric is, so when probe_success goes from 1 to 0 it will happily consider this a counter reset and count it for you.

(This won't work so well if the metric can take on additional non-zero values, because then every time the value goes down resets() will think it reset, even if it didn't go all the way down to 0. There are workarounds for this.)

Another thing you can do is apply resets() to non-continuous metrics that drop (or reset) when something interesting happens. For example, suppose you have a metric that is how many packets a VPN user's session has transmitted or received (with the time series for a user being absent if they have no sessions). You can be pretty certain that when a session is shut down and a new one started up, the new session will have a lower packet count than the old one. Given this, you can use resets() to count more or less how many times the user shut down their old VPN connection and made a new one over a time range, even if there was some time between the old session being shut down and the new one starting.

(As mentioned in my entry on changes(), it's probably better if you have a metric for the start time of a user's VPN session. But you may not always have that data, especially in a Prometheus metric.)

Applying resets() to a continuous metric that counts up until something happens will obviously tell you how many times that thing has happened over your time range. However, most counter metrics are associated with more directly usable indicators of things like host reboots or process restarts, and most gauge metrics are too likely to go down on their own instead of resetting.

The final use for resets() I can think of is telling you how many times a gauge metric goes down (as always, over some time range). This can be combined with changes() to let you determine how many times a gauge metric has gone up over a time range:

changes( some_metric[1h] ) - resets( some_metric[1h] )

I don't know if Prometheus will optimize this to only load the time series points for some_metric[1h] into memory once, or if it will do two loads (one for changes(), one for resets()). This might be an especially relevant thing to consider if you're using a subquery instead of a simple metric lookup.

Sidebar: resets() isn't less efficient if there are lots of resets

As far as I can tell from the code, resets() is just as efficient on metrics that have lots of resets as it is on metrics with only a few of them. So is changes(). In both cases, the code does a simple scan over all of the points in a time series and counts up the number of times it sees the relevant thing happening. The relevant Go code for resets() is small enough to just put here:

resets := 0
prev := samples.Points[0].V
for _, sample := range samples.Points[1:] {
   current := sample.V
   if current < prev {
      resets++
   }
   prev = current
}

The changes() code is the same thing with the condition changed slightly (including to account for the fact that NaNs don't compare equal to each other). You can find all of the code in promql/functions.go.
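To show why the NaN special case is needed, here is a Python sketch of a changes()-style scan. It is my reading of what the condition has to do, not a copy of the Prometheus code, and the function name and sample values are invented.

```python
import math


def count_changes(values):
    """Count value changes, treating two consecutive NaNs as 'no change'."""
    changes = 0
    prev = values[0]
    for current in values[1:]:
        # NaN != NaN is true, so a plain inequality test would overcount.
        if current != prev and not (math.isnan(current) and math.isnan(prev)):
            changes += 1
        prev = current
    return changes


nan = float("nan")
print(count_changes([1.0, nan, nan, 2.0]))
```

Without the extra NaN check, the NaN to NaN step would be counted as a third change.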

PrometheusResetsFunction written at 23:25:03


Understanding Prometheus' changes() function and what it can do for me

Recently, roidelapluie wrote an interesting comment on my entry wishing that Prometheus had features to deal with missing metrics that suggested answering my question about how many times alerts fired over a time interval with a clever (or perhaps obvious) use of the changes() function:


When I tried this out, I had one of those 'how does this work' moments until I thought about it more. To understand why this works as well as it does, I'll start with the documentation for changes():

For each input time series, changes(v range-vector) returns the number of times its value has changed within the provided time range as an instant vector.

If you have a continuous time series, one that has always existed within the time range, this gives you the number of times that its value has changed (which is not the same as the number of different values it's had across that time range). If this is a time series like Blackbox's probe_success, which is either 0 or 1 depending on whether the probe succeeded, this will tell you how many times the probe has changed states between succeeding and failing.

(To work out how many times the probe has started to fail, it's not enough to divide changes() by two; you also need to know what the probe's state was at the start and the end of the time range.)

If you apply changes() to a continuous metric where the values reset every so often, you will get a count of how many times the values changed and thus how many times there was a value reset. For instance, if you make DNS SOA queries through Blackbox, you will get the zone's current serial number back as a probe_dns_serial metric and changes(probe_dns_serial[1w]) will tell you how many times you (or someone else) did zone updates over the past week (well, more or less, this is really only valid for your own authoritative DNS servers). Similarly, if you want to know how many times a host rebooted over the past week you can ask for:

changes( node_boot_time_seconds [1w] )

(Well, more or less. There are qualifications if your clocks are changing.)

What this example points out is the value of having a metric with a value that's fixed when some underlying thing changes (such as the system booting), instead of changing all of the time. What the Linux kernel really provides is 'seconds since boot', but if node_exporter directly exposed that it would change on every scrape and we could not use changes() this way.

If you apply changes() to a metric that's sometimes missing, such as ALERTS, the missing sections are ignored (the actual code is literally unaware of them as far as I can tell); what matters is the sequence of values for time series points that actually exist. When the time series always has a fixed value when it exists, such as the fixed ALERTS value of '1', changes() will always tell you that there are 0 changes over the time range for every time series with points within it. This is because the values of the time series points are always the same, and changes() is sadly blind to the time series appearing and disappearing.

If you apply changes() to a non-continuous metric where the value is reset when the time series reappears, you'll get a count that is one less than the number of times that the time series appears. This is the situation for ALERTS_FOR_STATE, where its value is the starting time of an alert. If a given alert was triggered only once, there's only one timestamp value and changes() will tell you it never changed. If a given alert was triggered twice, there are two timestamp values and changes() will tell you it changed once. And so on.
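The difference between the two kinds of sometimes-missing metrics can be sketched in a few lines of Python. The sample values are invented; each list holds only the samples from periods when the alert existed, since the missing periods contribute nothing at all.

```python
def count_changes(values):
    """Count how often consecutive samples differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


# Three separate firings of one alert, two samples per firing.
alerts = [1, 1, 1, 1, 1, 1]                               # ALERTS is always 1
alerts_for_state = [1000, 1000, 5000, 5000, 9000, 9000]   # start time of each firing

print(count_changes(alerts))                 # no changes are ever visible
print(count_changes(alerts_for_state) + 1)   # one more than the number of changes
```

The first count is 0 no matter how often the alert fired, while the second recovers the three firings.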

What all of this biases me towards is exposing some form of fixed timestamp in any situation where I may want to count the number of times something happens. This is probably so even if the underlying data is in the form of a duration ('X seconds ago'), as we saw with host boot times. If I don't have a timestamp, maybe I can come up with some other fixed number instead of just using a '1'. Of course this can be taken too far, since using a fixed '1' value has its own conveniences.

PrometheusChangesFunction written at 23:09:51


The attractions of reading sensor information from IPMIs

Most modern servers have some sort of onboard IPMI (sometimes called a service processor), and commonly the IPMI has access to information from various sensors, which it can provide to you on demand. Usually you can query the IPMI both locally and over the network (if you've connected the IPMI to the network at all). In addition, CPUs, motherboards, and various components such as NVMe drives and GPU cards can have their own sensors, which the server's operating system can expose to you. On Linux, this is done through the hwmon subsystem, using hwmon drivers for the various accessible sensor chips and sensors. Although generally it's a lot easier to use Linux's hwmon interface than querying your IPMI (and a lot more things will automatically look at it, such as host agents for metrics systems), there are still reasons to want to get sensor information from your server's IPMI.

The first reason is that you may not have a choice. On servers, some sensors may only be reported to the IPMI and not to the main server motherboard. I think this is especially common for things like power supply information and fan RPMs, where it may be significantly more complicated to provide readings to two places. If you don't go out and talk to the IPMI, all you may get is some basic temperature information and perhaps a few voltages. As far as I can tell, this is the case for many of our Dell servers.

A big reason to read sensor information from the IPMI even if you have a choice is that unlike the kernel, the IPMI is generally guaranteed to know what sensors it actually has, what they're all called (including things like which fan RPM sensor is for which fan), and how to get correct readings from all of them. All of these are areas where Linux and other operating systems can have problems even if there are motherboard sensors. On Linux, you need a driver for your sensor chipset, then you need to reverse engineer what sensor is where (or what), and you may also need to know magic transformations to get correct sensor readings. And even under the best circumstances, sometimes kernel sensor readings can go crazy. At least in theory, the IPMI has all of the magic hardware specific knowledge necessary to sort all of this out (at least for onboard hardware; you're probably on your own for, say, an add-in GPU).

If you talk to the IPMI over the network you can get at least some sensor information even if the server has hung or locked up, or the on-host metrics agent isn't answering you (perhaps because the server is overloaded). This may give you valuable clues as to why a server has suddenly become unresponsive, or at least let you rule some things out. This can also be your only option to get sensor metrics if you can't run an agent on the host itself for some reason. Over the network IPMI sensor collection will also give you some information if the main host is powered off, although how useful this is may vary. Hopefully you'll never have to care about remotely reading the ambient temperature around a powered off server.

IPMISensorsWhyQuery written at 00:19:02


My uncertainty about swapping and swap sizing for SSDs and NVMe drives

The traditional reason to avoid configuring a lot of swap space on your servers and to avoid using swap in general was that lots of swap space made it much easier for your system to thrash itself into total overload. But that's wisdom (and painful experience) from back in the days of system-wide 'global' swapping and your swap being on spinning rust (ie, hard drives). A lot of paging evicted memory back in (whether from swap or from its original files) is random IO and spinning rust had hard limits on how many IOPs a second it could do, which often had to be shared between swapping and real IO. And with global swapping, any process could be victimized by having to page things back in, or have its regular IO delayed by swapping IO. In theory, things could be different today.

Modern SSDs and especially NVMe drives are much faster and support many more IOPs a second, especially for read IO (ie, paging things back in). Paging is still quite slow if compared to simply accessing RAM, but it's not anywhere near as terrible as it used to be on spinning rust; various sources suggest that you might see page-in latencies of 100 microseconds or less on good NVMe drives, and perhaps only a few milliseconds on SSDs. Since modern SSDs and especially NVMe drives can reach and sustain very high random IO rates, this paging activity is also far less disruptive to other IO that other programs are doing.

(The figures I've seen for random access to RAM on modern machines are on the order of 100 nanoseconds. If we assume the total delay on an NVMe page-in is on the order of 100 microseconds (including kernel overheads), that means a page-in costs you around 1,000 RAM accesses. This is far better than it used to be, although it's not fast. Every additional microsecond of delay costs another 10 RAM accesses.)
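The conversion is simple enough to do in your head, but as a quick check here is the same arithmetic in Python. The 100 nanosecond RAM figure is the rough assumption from the text, and the one millisecond case is just a round number for a slower page-in.

```python
RAM_ACCESS_NS = 100  # assumed latency of one random RAM access, in nanoseconds


def ram_accesses(delay_ns):
    """How many RAM accesses fit into a delay of delay_ns nanoseconds."""
    return delay_ns // RAM_ACCESS_NS


print(ram_accesses(100_000))    # a 100 microsecond page-in
print(ram_accesses(1_000))      # each additional microsecond
print(ram_accesses(1_000_000))  # a one millisecond page-in
```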

Increasingly, systems also support 'local' swapping in addition to system wide 'global' swapping, where different processes or groups of processes have different RAM limits and so one group can be pushed into swapping without affecting other groups. The affected group will still pay a real performance penalty for all of the paging it's doing, but other processes should be mostly unaffected. They shouldn't have their pages evicted from RAM any faster than they otherwise would be, so if they weren't paging before they shouldn't be paging afterward. And with SSDs and NVMe drives having high concurrent IO limits, the other processes shouldn't be particularly affected by the paging IO.

If you're using SSDs or NVMe drives with enough IO capacity (and low enough latency), even system-wide swap thrashing might not be as lethal as it used to be. If everything works well with 'local' swapping, a particular group of processes could be pushed into swap thrashing by their excessive memory usage without doing anything much to the rest of the system; of course they might not perform well and perhaps you'd rather have them terminated and restarted. If all of this works, perhaps these days systems should have a decent amount of swap, much more than the minimal swap space that we have tended to configure so far.

(All of this is more true on NVMe drives than SSDs, though, and all of our servers still use SSDs for their system drives.)

However, all of this is theoretical. I don't know if it actually works in practice, especially on SSDs (where even a one millisecond delay for a page-in is the same cost as 10,000 accesses to RAM, and that's probably fast for SSDs). System wide swap thrashing on SSDs seems like a particularly bad case, and our most likely case on most servers. Per-user RAM limits seem like a better case for using a lot of swap, but even then we may not be doing people any real favours and they might be better off having the offending process just terminated.

(All of this was sparked by a Twitter thread.)

SwappingOnSSDUncertainty written at 23:40:42

My experience with x2go is that it's okay but not compelling

Various people's tweets and comments on earlier entries pushed me into giving x2go a bit of a try, despite having low expectations because 'seamless windows' in X are challenging for remote desktop software. The results are somewhat mixed and my view so far is that x2go isn't compelling for me. To start with, what I want from any program like this is that it work like 'ssh -X' but perform faster. I specifically don't want to run a remote desktop; I want to run X programs remotely.

On the positive side, I was genuinely surprised by how much worked. X2go properly supported multiple programs opening multiple top level X windows that work just like regular X windows, and it even arranged for their X windows on my local display to have the correct window title, X class and resource name, icon, properties, and so on. Cut and paste worked (at least in xterm). I could even suspend a session and then resume it, with all of the windows disappearing from my display and then reappearing later. In a lot of ways, the windows displayed for the remote programs acted like they were real windows created through 'ssh -X', which definitely helped the overall experience.

(My window manager cares about the X class and resource names, for example, because it treats windows from some programs specially, especially xterms. That all worked with x2go remote xterm windows.)

Performance felt somewhat better than 'ssh -X' and my monitoring suggests that x2go was using clearly less bandwidth on my DSL link. However, some of this performance was clearly achieved by skipping updates, which could leave affected things like the desktops of VMWare machines feeling jerky. More text based things like GNU Emacs and xterm felt more like I was using them at work (or locally), although generally 'ssh -X' is already pretty good for them and I don't think there was much difference.

Unfortunately, x2go doesn't propagate your local X resource database into the remote X server that all of those remote X programs are creating windows on. Nor does it have a setting to scale up windows by 2x. This meant that remote programs weren't scaled properly for my HiDPI display and came out in tiny size (this normally requires at least some X properties). I was able to fix this by manually copying my X resources over, but the need to do this (manually or with some sort of wrapper automation) makes x2go far less friendly.

As I expected, Firefox's X remote control doesn't work over x2go because there's no 'Firefox' window on x2go's hidden X server (this may also affect the XSettings system). I think that the remote X windows are also oblivious to whether or not they've been iconified on the local X server. In another glitch, my VMWare machines couldn't change the X cursor, although regular remote windows could. And I was unable to find a way to make the overall x2go client window disappear or to automatically start a session; it appears to always require some GUI interactions.

(The x2goclient program is also very chatty to standard output.)

It's clear to me that x2go is not a good replacement for 'ssh -X' for short term disposable windows; there's simply too much fiddling around by comparison. Instead I would probably need to use x2go to establish a persistent launcher program, so that I wasn't constantly fiddling around in the GUI and re-establishing X resources and so on. Once the launcher was running under x2go, I could start up additional remote X programs from it on demand with just a mouse click or whatever, which is much closer to what I'd like.

Given all of the effort required to build and use a reliable and non-annoying x2go environment, along with the limitations and glitches, I currently don't feel that I'll use x2go very much. It would be a different matter if x2go could scale up windows by 2x, because then it would be great for VMWare (where the consoles of virtual machines are unscaled and tiny, although the VMWare GUI elements can be scaled up).

(Also, I couldn't get it to work to an Ubuntu 20.04 server, only to my office Fedora 33 desktop.)

X2goMyExperienceItsOk written at 00:47:47


I wish Prometheus had some features to deal with 'missing' metrics

Prometheus has a reasonable number of features to let you determine things about changes in ongoing metrics. For example, if you want to know how many separate times your Blackbox ICMP pings have started to fail over a time range (as opposed to how frequently they failed), a starting point would be:

changes( probe_success{ probe="icmp" } [1d] )

(The changes() function is not ideal for this; what you would really like is changes_down and changes_up functions.)

But this and similar things only work for metrics (more exactly, time series) that are always present and only have their values change. Many metrics come and go, and right now in Prometheus you can't do changes-like things with them as a result. You can probably get averages over time, but it's at least pretty difficult to get something as simple as a count of how many times an alert fired within a given time interval. As with timestamps for samples, the information necessary is in Prometheus' underlying time series database, but it's not exposed to us.

One starting point would be to expose information that Prometheus already has about time series going stale. As covered in the official documentation on staleness, Prometheus detects most cases of metrics disappearing and puts an explicit marker in the TSDB (although this doesn't handle all cases). But then it doesn't do anything with this marker except not answer queries. Perhaps it would be possible within the existing interfaces to the TSDB to add a count_stale() function that would return a count of how many times a time series for a metric had gone stale within the range.

The flipside is counting or detecting when time series appear. I think this is harder in the current TSDB model, because I don't think there's an explicit marker when a previously not-there time series appears. This means that to know if a time series was new at time X, Prometheus would have to look back up to five minutes (by default) to check for staleness markers and to see if the time series was there. This is possible but would involve more work.

However, I think it's worth finding a solution. It feels frankly embarrassing that Prometheus currently cannot answer basic questions like 'how many times did this alert fire over this time interval'.

(Possibly you can use very clever Prometheus queries with subqueries to get an answer. Subqueries allow you to do a lot of brute force things if you try hard enough, so I can imagine detecting some indirect sign of a just appeared ALERT metric with a subquery.)
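To illustrate what counting appearances would involve if you only had sample timestamps to go on, here is a hypothetical Python sketch. Nothing like count_appearances() exists in Prometheus; the name, the data, and the rule of treating any gap longer than the five minute lookback as a disappearance are all assumptions for the example. It ignores staleness markers entirely, so it would miss a series that went stale and came back within five minutes, and it counts the first sample in the range as an appearance even if the series already existed before the range.

```python
LOOKBACK = 5 * 60  # seconds; the default lookback mentioned above


def count_appearances(timestamps):
    """Count how many times a series (re)appeared, given sorted sample timestamps."""
    if not timestamps:
        return 0
    gaps = (cur - prev for prev, cur in zip(timestamps, timestamps[1:]))
    return 1 + sum(1 for gap in gaps if gap > LOOKBACK)


# Sample times (in seconds) for a made-up alert that fired three separate times.
fired = [0, 15, 30, 900, 915, 3000]
print(count_appearances(fired))
```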

PrometheusMissingMetricsWish written at 00:48:51


Prometheus and the case of the stuck metrics

My home desktop can go down, crash, or lock up every so often (for example when it gets too cold). I run Prometheus on it for various reasons, and when this happens I not infrequently wind up looking at graphs of various things (either in Prometheus or in Grafana). Much of the time, these graphs have a weird structure around the time of the crash. The various metrics will be wiggling back and forth as usual before the crash, but then they go flat and just run on in straight lines at some level before they disappear entirely. It took me a while to work out what was going on.

These flat results happen because Prometheus will look backward a certain amount of time in order to find the most recent sample in a time series, by default five minutes. When my machine goes down, no new samples are being written in any time series, so the last pre-crash sample is returned as the 'current' sample for the next five minutes or so, resulting in flat lines (or rate-based things going to zero). Essentially the time series has become stuck at its last recorded value.
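The stuck behaviour can be reproduced with a toy model of the lookback. This Python sketch is an assumption-laden simplification: the function name and sample data are invented, staleness markers are left out, and the exact handling of a sample that sits precisely on the edge of the window is not something to rely on.

```python
LOOKBACK = 5 * 60  # seconds; the default lookback described above


def instant_value(samples, at):
    """Return the newest sample value in the lookback window ending at 'at', or None."""
    candidates = [(ts, v) for ts, v in samples if at - LOOKBACK < ts <= at]
    return max(candidates)[1] if candidates else None


# Made-up samples every 15 seconds; the machine dies right after t=60.
samples = [(0, 3.0), (15, 4.5), (30, 2.0), (45, 5.5), (60, 4.0)]

for at in (60, 120, 300, 420):
    print(at, instant_value(samples, at))
```

The queries at 120 and 300 seconds still return the last pre-crash value of 4.0, which is the flat line, and only the query at 420 seconds finally comes back empty.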

If you've rebooted machines you're collecting metrics from or had Prometheus collectors fail, then looked at graphs of the relevant metrics, you may have noticed that you don't see this. This is because Prometheus is smart and has an explicit concept of stale entries. In particular, it will immediately mark time series as stale under the right conditions:

If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, that time series will be marked as stale. If a target is removed, its previously returned time series will be marked as stale soon afterwards.

What this means is that if a target fails to scrape, all time series from it are immediately marked as stale. If another machine goes down or a collector fails, that target scrape will fail (possibly after a bit of a timeout), and all of its time series go away on the spot. Instead of getting stuck time series in your graphs, you get an empty void.

What's special about my home machine is that I'm running Prometheus on the machine itself, and also that the machine crashed (or at least that the Prometheus process was terminated) instead of everything shutting down in an orderly way. When the machine Prometheus is running on just stops abruptly, Prometheus doesn't see any failed targets and it doesn't have a chance to do any cleanup it might normally do in an orderly shutdown. The only way for time series to disappear is through there being no samples in the past five minutes, so for the first few minutes of my home machine being down, I get stuck time series.

(It's not entirely clear to me what Prometheus does here when the main process shuts down properly. I would probably have to pull raw TSDB data with timestamps in order to be sure, and that's too much work right now.)

PrometheusStuckMetrics written at 00:52:18


Dot-separated DNS name components aren't even necessarily subdomains, illustrated

I recently wrote an entry about my pragmatic sysadmin view on subdomains and DNS zones. At the end of the entry I mentioned that we had a case where we had DNS name components that didn't create what I thought of as a subdomain, in the form of the hostnames we assign for the IPMIs of our servers. These names are in the form '<host>.ipmi.core.sandbox' (in one of our internal sandboxes), but I said that 'ipmi.core.sandbox' is neither a separate DNS zone nor something that I consider a subdomain.

There's only one problem with this description; it's wrong. It's been so long since I actually dealt with an IPMI hostname that I mis-remembered our naming scheme for them, which I discovered when I needed to poke at one by hand the other day. Our actual IPMI naming scheme puts the 'ipmi' bit first, giving us host names of the form 'ipmi.<host>.core.sandbox' (as before, for the IPMI for <host>; the host itself doesn't have an interface on the core.sandbox subnet).

What this naming scheme creates is middle name components that clearly don't create subdomains in any meaningful sense. If we have host1, host2, and host3 with IPMIs, we get the following IPMI names:

ipmi.host1.core.sandbox
ipmi.host2.core.sandbox
ipmi.host3.core.sandbox

It's pretty obviously silly to talk about 'host1.core.sandbox' being a subdomain, much more so than 'ipmi.core.sandbox' in my first IPMI naming scheme. These names could as well be 'ipmi-<host>'; we just picked a dot instead of a dash as a separator, and dot has special meaning in host names. The 'ipmi.core.sandbox' version would at least create a namespace in core.sandbox for IPMIs, while this version has no single namespace for them, instead scattering the names all over.

(The technicality here is DNS resolver search paths. You could use 'host1.core.sandbox' as a DNS search path, although it would be silly.)

PS: Tony Finch also wrote about "What is a subdomain?" in an entry that's worth reading, especially for historical and general context.

SubdomainsAndDNSZonesII written at 22:39:58
