Wandering Thoughts


There are multiple uses for metrics (and collecting metrics)

In a comment on my entry on the overhead of the Prometheus host agent's 'perf' collector, a commentator asked a reasonable question:

Not to be annoying, but: is any of the 'perf data' you collect here honestly 'actionable data' ? [...] In my not so humble opinion, you should only collect the type of data that you can actually act on.

It's true that the perf data I might collect isn't actionable data (and thus not actionable metrics), but in my view this is far from the only reason to collect metrics. I can readily see at least three or four different reasons to collect metrics.

The first and obvious purpose is actionable metrics, things that will get you to do things, often by triggering alerts. This can be the metric by itself, such as free disk space on the root of a server (or the expiry time of a TLS certificate), or the metric in combination with other data, such as detecting that the DNS SOA record serial number for one of your DNS zones doesn't match across all of your official DNS servers.
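Both of those examples translate directly into Prometheus alerting rules. Here's a sketch of what they might look like; the metric names are the standard host agent and Blackbox ones, but the thresholds, rule names, and durations are invented for illustration:

```yaml
groups:
  - name: actionable.rules
    rules:
      # Alert when the root filesystem drops below 10% free space
      # (hypothetical threshold).
      - alert: RootDiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m

      # Alert when a TLS certificate checked through Blackbox will
      # expire within two weeks (again a hypothetical threshold).
      - alert: TLSCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
```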

The second reason is to use the metrics to help understand how your systems are behaving; here your systems might be either physical (or at least virtual) servers, or software systems. Often a big reason to look at this information is because something mysterious happened and you want to look at relatively detailed information on what was going on at the time. While you could collect this data only when you're trying to better understand ongoing issues, my view is that you also want to collect it when things are normal so that you have a baseline to compare against.

(And since sometimes things go bad slowly, you want to have a long baseline. We experienced this with our machine room temperatures.)

Sometimes, having 'understanding' metrics available will allow you to head off problems beforehand, because metrics that you thought were only going to be for understanding problems as and after they happened can be turned into warning signs of a problem so you can mitigate it. This happened to us when server memory usage information allowed us to recognize and then mitigate a kernel memory leak (there was also a case with SMART drive data).

The third reason is to understand how (and how much) your systems are being used and how that usage is changing over time. This is often most interesting when you look at relatively high level metrics instead of what are effectively low-level metrics from the innards of your systems. One popular sub-field of this is projecting future resource needs, both hardware level things like CPU, RAM, and disk space and larger scale things like the likely future volume of requests and other actions your (software) systems may be called on to handle.

(Both of these two reasons can combine together in exploring casual questions about your systems that are enabled by having metrics available.)

A fourth semi-reason to collect metrics is as an experiment, to see if they're useful or not. You can usually tell what are actionable metrics in advance, but you can't always tell what will be useful for understanding your various systems or understanding how they're used. Sometimes metrics turn out to be uninformative and boring, and sometimes metrics turn out to reveal surprises.

My impression of the modern metrics movement is that the general wisdom is to collect everything that isn't too expensive (either to collect or to store), because more data is better than less data and you're usually not sure in advance what's going to be meaningful and useful. You create alerts carefully and to a limited extent (and in modern practice, often focusing on things that people using your services will notice), but for the underlying metrics, the more the potentially better.

MetricsHaveManyUses written at 22:56:24


The trade-offs in not using WireGuard to talk to our cloud server

We recently set up our first cloud server in order to check the external reachability of some of our services, where the cloud server runs a Prometheus Blackbox instance and our Prometheus server talks to it to have it do checks and return the results. Originally, I was planning for there to be a WireGuard tunnel between our Prometheus server and the cloud VM, which Prometheus would use to talk to Blackbox. In the actual realized setup, there's no WireGuard and we use restrictive firewall rules to restrict potentially dangerous access to Blackbox to the Prometheus server.

I had expected to use WireGuard for a combination of access control to Blackbox and to deal with the cloud server having a potentially variable public IP. In practice, this cloud provider gives us a persistent public IP (as far as I can tell from their documentation) and required us to set up firewall rules either way (by default all inbound traffic is blocked), so not doing WireGuard meant a somewhat simpler configuration. Especially, it meant not needing to set up WireGuard on the Prometheus server.
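The firewall side of this setup can be quite small. As a sketch, an nftables version might look like the following (the Prometheus server's address is an invented placeholder, and 9115 is Blackbox's default port):

```
# nftables sketch for the cloud server: drop all inbound traffic
# except SSH and Blackbox queries from the Prometheus server.
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    tcp dport 22 accept
    # 192.0.2.10 stands in for the Prometheus server's public IP.
    ip saddr 192.0.2.10 tcp dport 9115 accept
  }
}
```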

(My plan for WireGuard and the public IP problem was to have the cloud server periodically ping the Prometheus server over WireGuard. This would automatically teach the Prometheus server's WireGuard the current public IP, while the WireGuard internal IP of the cloud server would stay constant. The cloud server's Blackbox would listen only on its internal WireGuard IP, not anything else.)
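Sketched as a WireGuard configuration on the cloud server, that plan would look roughly like this (all names, keys, and addresses here are invented placeholders):

```
[Interface]
# The cloud server's fixed internal WireGuard IP; Blackbox would
# listen only on this address.
Address = 172.16.9.2/32
PrivateKey = <cloud server private key>

[Peer]
# The Prometheus server. PersistentKeepalive makes the cloud server
# send traffic periodically, which keeps the Prometheus end's idea
# of the cloud server's current public IP up to date.
PublicKey = <Prometheus server public key>
Endpoint = prometheus.example.org:51820
AllowedIPs = 172.16.9.1/32
PersistentKeepalive = 25
```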

In some ways the result of relying on a firewall instead of WireGuard is more secure, in that an attacker would have to steal our IP address instead of stealing our WireGuard peer private key. In practice neither is worth worrying about, since all an attacker would get is our Blackbox configuration (and the ability to make assorted Blackbox probes from our cloud VM, which has no special permissions).

The one clear thing we lose in not using WireGuard is that the Prometheus server is now querying Blackbox using unencrypted HTTP over the open Internet. If there is some Intrusion Prevention System (IPS) in the path between us and the cloud server, it may someday decide that it is unhappy with this HTTP traffic (perhaps it trips some detection rule) and that it should block said traffic. An encrypted WireGuard tunnel would hide all of our Prometheus HTTP query traffic (and responses) from any in-path IPS.

(Of course we have alerts that would tell us that we can't talk to the cloud server's Blackbox. But it's better not to have our queries blocked at all.)

There are various ways to work around this, but they all give us a more complicated configuration on at least the cloud server so we aren't doing any of them (yet). And of course we can switch to the WireGuard approach when (if) we have this sort of problem.

CloudVMNoWireGuardTradeoffs written at 23:42:11


Thoughts on (not) automating the setup of our first cloud server

I recently set up our first cloud server, in a flailing way that's probably familiar to anyone who still remembers their first cloud VM (complete with a later discovery of cloud provider 'upsell'). The background for this cloud server is that we want to check external reachability of some of our systems, in addition to the internal reachability already checked by our metrics and monitoring system. The actual implementation of this is quite simple; the cloud server runs an instance of the Prometheus Blackbox agent for service checks, and our Prometheus server performs a subset of our Blackbox service checks through it (in addition to the full set of service checks that are done through our local Blackbox instance).

(Access to the cloud server's Blackbox instance is guarded with firewall rules, because giving access to Blackbox is somewhat risky.)

The proper modern way to set up cloud servers is with some automated provisioning system, so that you wind up with 'cattle' instead of 'pets' (partly because every so often the cloud provider is going to abruptly terminate your server and maybe lose its data). We don't use such an automation system for our existing physical servers, so I opted not to try to learn both a cloud provider's way of doing things and a cloud server automation system at the same time, and set up this cloud server by hand. The good news for us is that the actual setup process for this server is quite simple, since it does so little and reuses our existing Blackbox setup from our main Prometheus server (all of which is stored in our central collection of configuration files and other stuff).

(As a result, this cloud server is installed in a way fairly similar to our other machine build instructions. Since it lives in the cloud and is completely detached from our infrastructure, it doesn't have our standard local setup and customizations.)

In a way this is also the bad news. If this server and its operating environment was more complicated to set up, we would have more motivation to pick one of the cloud server automation systems, learn it, and build our cloud server's configuration in it so we could have, for example, a command line 'rebuild this machine and tell me its new IP' script that we could run as needed. Since rebuilding the machine as needed is so simple and fast, it's probably never going to motivate us into learning a cloud server automation system (at least not by itself, if we had a whole collection of simple cloud VMs we might feel differently, but that's unlikely for various reasons).

Although setting up a new instance of this cloud server is simple enough, it's also not trivial. Doing it by hand means dealing with the cloud vendor's website and going through a bunch of clicking on things to set various settings and options we need. If we had a cloud automation system we knew and already had all set up, it would be better to use it. If we're going to do much more with cloud stuff, I suspect we'll soon want to automate things, both to make us less annoyed at working through websites and to keep everything consistent and visible.

(Also, cloud automation feels like something that I should be learning sooner or later, and now I have a cloud environment I can experiment with. Possibly my very first step should be exploring whatever basic command line tools exist for the particular cloud vendor we're using, since that would save dealing with the web interface in all its annoyance.)

FirstCloudVMAndAutomation written at 22:52:33


Where NS records show up in DNS replies depends on who you ask

Suppose, not hypothetically, that you're trying to check the NS records for a bunch of subdomains to see if one particular DNS server is listed (because it shouldn't be). In DNS, there are two places that have NS records for a subdomain; the nameservers for the subdomain itself (which lists NS records as part of the zone's full data), and the nameservers for the parent domain, which have to tell resolvers what the authoritative DNS servers for the subdomain are. Today I discovered that these two sorts of DNS servers can return NS records in different parts of the DNS reply.

(These parent domain NS records are technically not glue records, although I think they may commonly be called that and DNS people will most likely understand what you mean if you call them 'NS glue records' or the like.)

A DNS server's answer to your query generally has three sections, although not all of them may be present in any particular reply. The answer section contains the 'resource records' that directly answer your query, the 'authority' section contains NS records of the DNS servers for the domain, and the 'additional' section contains potentially helpful additional data, such as the addresses of some of the DNS servers in the authority section. Now, suppose that you ask a DNS server (one that has the data) for the NS records for a (sub)domain.

If you send your NS record query to either a DNS resolver (a DNS server that will make recursive queries of its own to answer your question) or to an authoritative DNS server for the domain, the NS records will show up in the answer section. You asked a (DNS) question and you got an answer, so this is exactly what you'd expect. However, if you send your NS record query to an authoritative server for the parent domain, its reply may not have any NS records in the answer section (in fact the answer section can be empty); instead, the NS records show up in the authority section. This can be surprising if you're only printing the answer section, for example because you're using 'dig +noall +answer' to get compact, grep'able output.
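Concretely, the difference shows up in which dig flags you need (a sketch; the domain and server names are placeholders):

```sh
# Asking a resolver or the subdomain's own authoritative server: the
# NS records are in the answer section, so this prints them.
dig +noall +answer NS sub.example.org @ns1.sub.example.org

# Asking an authoritative server for the parent domain: the NS
# records are in the authority section, so you must ask for that
# section too or you'll see nothing.
dig +noall +authority +answer NS sub.example.org @ns1.example.org
```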

(If the server you send your query to is authoritative for both the parent domain and the subdomain, I believe you get NS records in the answer section and they come from the subdomain's zone records, not any NS records explicitly listed in the parent.)

This makes a certain amount of sense in the DNS mindset once you (I) think about it. The DNS server is authoritative for the parent domain but not for the subdomain you're asking about, so it can't give you an 'answer'; it doesn't know the answer and isn't going to make a recursive query to the subdomain's listed DNS servers. And the parent domain's DNS server may well have a different list of NS records than the subdomain's authoritative DNS servers have. So all the parent domain's DNS server can do is fill in the authority section with the NS records it knows about and send this back to you.

So if you (I) are querying a parent domain authoritative DNS server for NS records, you (I) should remember to use 'dig +noall +authority +answer', not my handy 'cdig' script that does 'dig +noall +answer'. Using the latter will just lead to some head scratching about how the authoritative DNS server for the university's top level domain doesn't seem to want to tell me about its DNS subdomain delegation data.

DNSRepliesWhereNSRecordsShowUp written at 22:08:38


All configuration files should support some form of file inclusion

Over on the Fediverse, I said something:

Every configuration file format should have a general 'include this file' feature, and it should support wildcards (for 'include subdir/*.conf'). Sooner or later people are going to need it, especially if your software gets popular.

It's unfortunate that standard YAML does not support this, although it's also sort of inevitable (YAML doesn't require files at all). This leaves everyone using YAML for their configuration file format to come up with various hacks.

(If this feature is hard-coded, it should use file extensions.)

There are a variety of reasons why people wind up wanting to split up a configuration file into multiple pieces. Obvious ones include that it's easier to coordinate multiple people or things wanting to add settings, a single giant file can be hard to read and deal with, and it's easy to write some parts by hand and automatically generate others. A somewhat less obvious reason is that this makes it easy to disable or re-enable an entire cluster of configuration settings; you can do it by simply renaming or moving around a file, instead of having to comment out a whole block in a giant file and then comment it back in later.

(All of these broadly have to do with operating the software in the large, possibly at scale, possibly packaged by one group of people and used by another. I think this is part of why file inclusion is often not an initial feature in configuration file formats.)

One of the great things about modern (Linux) systems and some modern software is the pervasive use of such 'drop-in' included configuration files (or sub-files, or whatever you want to call them). Pretty much everyone loves them and they've turned out to be very useful for eliminating whole classes of practical problems. Implementing them is not without issues, since you wind up having to decide what to do about clashing configuration directives (usually 'the last one read wins', and then you define it so that files are read in name-sorted order) and often you have to implement some sort of section merging (so that parts of some section can be specified in more than one file). But the benefits are worth it.
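As a sketch of what implementing this involves, here is a minimal loader for a hypothetical 'key = value' configuration format with an 'include <glob>' directive. It reads matching files in name-sorted order so that 'the last one read wins' is deterministic, and guards against include loops:

```python
import glob
import os

def load_config(path, seen=None):
    """Load a hypothetical 'key = value' config file that supports
    'include <glob>' directives (globs are relative to the file)."""
    seen = seen if seen is not None else set()
    real = os.path.realpath(path)
    if real in seen:
        # Guard against include loops (a.conf includes b.conf
        # which includes a.conf).
        return {}
    seen.add(real)

    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if line.startswith('include '):
                pattern = line[len('include '):].strip()
                base = os.path.dirname(real)
                # Read matching files in name-sorted order so that
                # 'the last one read wins' is well defined.
                for inc in sorted(glob.glob(os.path.join(base, pattern))):
                    settings.update(load_config(inc, seen))
            else:
                key, _, value = line.partition('=')
                settings[key.strip()] = value.strip()
    return settings
```

Real formats need more (section merging, error reporting), but even this much is enough to let people split files up, drop fragments into a directory, and disable a fragment by renaming it.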

As mentioned, one subtle drawback of YAML as a configuration file format is that there's no general, direct YAML feature for 'include a file'. Programs that use YAML have to implement this themselves, by defining schemas that have elements with special file inclusion semantics, such as Prometheus's scrape_config_files: section in its configuration file, which lets you include files of scrape_config directives:

# Scrape config files specifies a list of globs.
# Scrape configs are read from all matching files
# and appended to the list of scrape configs.
scrape_config_files:
  [ - <filepath_glob> ... ]

That this only includes scrape_config directives and not anything else shows some of the limitations of this approach. And since it's not a general YAML feature, general YAML linters and so on won't know to look at these included files.

However, this sort of inclusion is still much better than not having any sort of inclusion at all. Every YAML based configuration file format should support something like it, at least for any configuration section that gets large (for example, because it can have lots of repeated elements).

ConfigurationFilesWantIncludes written at 23:17:58


Some thoughts on when you can and can't lower OpenSSH's 'LoginGraceTime'

In a comment on my entry on sshd's 'MaxStartups' setting, Etienne Dechamps mentioned that they lowered LoginGraceTime, which defaults to two minutes (which is rather long). At first I was enthusiastic about making a similar change to lower it here, but then I started thinking it through and now I don't think it's so simple. Instead, I think there are three broad situations for how much time to log in you give people connecting to your SSH server.

The best case for a quite short login grace time is if everyone connecting authenticates through an already unlocked and ready SSH keypair. If this is the case, the only thing slowing down logins is the need to bounce a certain number of packets back and forth between the client and you, possibly on slow networks. You're never waiting for people to do something, just for computers to do some calculations and for the traffic to get back and forth. Etienne Dechamps' 20 seconds ought to be long enough for this even under unfavourable network situations and in the face of host load.

(If you do only use keypairs, you can cut off a lot of SSH probes right away by configuring sshd to not even offer password authentication as an option.)
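In the keypair-only case, the corresponding sshd_config settings might be something like this (a sketch; the 20 seconds is Etienne Dechamps' value, not a recommendation of mine):

```
# Only allow public key authentication, cutting off password-guessing
# probes immediately.
PasswordAuthentication no
KbdInteractiveAuthentication no
# A short grace period is workable when no human action is needed
# during authentication.
LoginGraceTime 20
```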

The intermediate case is if people have to unlock their keypair or hardware token, touch their hardware token to confirm key usage, say yes to a SSH agent prompt, or otherwise take manual action that is normally short. In addition to the network and host delays you had with unlocked and ready keypairs, now you have to give fallible people time to notice the need for action and respond to carry it out accurately. Even if 20 seconds is often enough for this, it feels rushed to me and I think you're likely to see some amount of people failing to log in; you really want something longer, although I don't know how much longer.

The worst case is if people authenticate with passwords. Here you have fallible humans carefully typing in their password, getting it wrong (because they have N passwords they've memorized and have to pick the right one, among other things), trying again, and so on. Sometimes this will be a reasonably fast process, much like in the intermediate case, but some of the time it will not be. Setting a mere 20 second timeout on this will definitely cut people off at the knees some of the time. Plus, my view is that you don't want people entering their passwords to feel that they're in a somewhat desperate race against time; that feels like it's going to cause various sorts of mistakes.

For our sins, we have plenty of people who authenticate to us today using passwords. As a result I think we're not in a good position to lower sshd's LoginGraceTime by very much, and so it's probably simpler to leave it at two minutes. Two minutes is fine and generous for people, and it doesn't really cost us anything when dealing with SSH probes (well, once we increase MaxStartups).

OpenSSHLoginGraceTimeThoughts written at 21:48:37


What affects what server host key types OpenSSH will offer to you

Today, for reasons beyond the scope of this entry, I was checking the various SSH host keys that some of our servers were using, by connecting to them and trying to harvest their SSH keys. When I tried this with a CentOS 7 host, I discovered that while I could get it to offer its RSA host key, I could not get it to offer an Ed25519 key. At first I wrote this off as 'well, CentOS 7 is old', but then I noticed that this machine actually had an Ed25519 host key in /etc/ssh, and this evening I did some more digging to try to turn up the answer, which turned out to not be what I expected.

(CentOS 7 apparently didn't use to support Ed25519 keys, but it clearly got updated at some point with support for them.)

So, without further delay and as a note to myself, the SSH host key types a remote OpenSSH server will offer to you are controlled by the intersection of three (or four) things:

  • What host key algorithms your client finds acceptable. With modern versions of OpenSSH you can find out your general list with 'ssh -Q HostKeyAlgorithms', although this may not be the algorithms offered for any particular connection. You can see the offered algorithms with 'ssh -vv <host>', in the 'debug2: host key algorithms' line (well, the first line).

    (You may need to alter this, among other settings, to talk to sufficiently old SSH servers.)

  • What host key algorithms the OpenSSH server has been configured to offer in any 'HostKeyAlgorithms' lines in sshd_config, or some default host key algorithm list if you haven't set this. I think it's relatively uncommon to set this, but on some Linuxes this may be affected by things like system-wide cryptography policies that are somewhat opaque and hard to inspect.

  • What host keys are on the server configured in 'HostKey' directives, in your sshd_config (et al). If you have no HostKey directives, a default set is used. Once you have any HostKey directive, only explicitly listed keys are ever used. Related to this is that the host key files must actually exist and have the proper permissions.

(I believe that you can see the union of the latter two with 'ssh -vv' in the second 'debug2: host key algorithms:' line. I wish ssh would put 'client' and 'server' into these lines.)

This last issue was the problem with this particular CentOS 7 server. Somehow, it had wound up with an /etc/ssh/sshd_config that had explicit HostKey lines but didn't include its Ed25519 key file. It supported Ed25519 fine, but it couldn't offer an Ed25519 key because it didn't have one. Oops, as they say.
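The fix on a machine like this is simply making sure sshd_config's explicit HostKey lines cover every key you want offered, something like:

```
# Once any HostKey directive is present, only the listed keys are
# used, so every key type you want offered must appear here (and
# the files must exist with proper permissions).
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
```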

(It's possible that this is the result of CentOS 7's post-release addition of Ed25519 keys combined with us customizing this server's /etc/ssh/sshd_config before then, since this server has an sshd_config.rpmnew.)

This also illustrates that your system may generate keys (or have generated keys) for key algorithms it's not going to use. The mere presence of an Ed25519 key in /etc/ssh doesn't mean that it's actually going to get used, or at least used by the server.

Just to be confusing, what SSH key types the OpenSSH ssh program will offer for host-based authentication aren't necessarily the same as what will be offered by the server on the same machine. The OpenSSH ssh doesn't have a 'HostKey' directive and will use any host key it finds using a set of hard-coded names, provided that it's allowed by the client 'HostKeyAlgorithms' setting. So you can have your ssh client trying to use an Ed25519 or ECDSA host key that will never be offered by the OpenSSH server.

PS: Yes, we still have CentOS 7 machines running, although not for much longer. That was sort of why I was looking at the SSH host keys for this machine.

OpenSSHServerKeyTypes written at 23:17:15


OpenSSH sshd's 'MaxStartups' setting and Internet-accessible machines

Last night, one of our compute servers briefly stopped accepting SSH connections, which set off an alert in our monitoring system. On compute servers, the usual cause for this is that some program (or set of them) has run the system out of memory, but on checking the logs I saw that this wasn't the case. Instead, sshd had logged the following (among other things):

sshd[649662]: error: beginning MaxStartups throttling
sshd[649662]: drop connection #11 from [...]:.. on [...]:22 past MaxStartups

I'm pretty sure I'd seen this error before, but this time I did some reading up on things.

MaxStartups is a sshd configuration setting that controls how many concurrent unauthenticated connections there can be. This can either be a flat number or a setup that triggers random dropping of such connections with a certain probability. According to the manual page (and to comments in the current Ubuntu 22.04 /etc/ssh/sshd_config), the default value is '10:30:100', which drops 30% of new connections if there are already 10 unauthenticated connections and all of them if there are 100 such connections (and a scaled drop probability between those two).
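My reading of the manual page's description of the probabilistic form can be sketched as follows; this is an illustration of the documented behaviour, not sshd's actual code:

```python
def drop_probability(n, begin=10, rate=30, full=100):
    """Probability (0..1) that sshd drops a new unauthenticated
    connection under 'MaxStartups begin:rate:full', per the
    sshd_config manual page: rate% once there are 'begin'
    unauthenticated connections, rising linearly to 100% at 'full'."""
    if n < begin:
        return 0.0
    if n >= full:
        return 1.0
    frac = (n - begin) / (full - begin)
    return (rate + (100 - rate) * frac) / 100.0
```

With the default '10:30:100', an eleventh unauthenticated connection already has a 30% chance of being dropped, which is why a burst of attacker probes can make a machine intermittently refuse legitimate SSH connections.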

(OpenSSH sshd also can apply a per-'source' limit using PerSourceMaxStartups, where a source can be an individual IPv4 or IPv6 address or a netblock, based on PerSourceNetBlockSize.)

Normal systems probably don't have any issue with this setting and its default, but for our sins some of our systems are exposed to the Internet for SSH logins, and attackers probe them (and these attackers are back in action these days after a pause we noticed in February). Apparently enough attackers were making enough attempts early this morning to trigger this limit. Unfortunately this limit is a global setting, with no way to give internal IPs a higher limit than external ones (MaxStartups is not one of the directives that can be included in Match blocks).

Now that I've looked into this, I think that we may want to increase this setting in our environment. Ten unauthenticated connections is not all that many for an Internet-exposed system that's under constant SSH probes, and our Internet-accessible systems aren't short of resources; they could likely afford a lot more such connections. Our logs suggest we see this periodically across a number of systems, which is more or less what I'd expect if they come from attackers randomly hitting our systems. Probably we want to keep the random drop bit instead of creating a hard wall, but increase the starting point of the random drops to 20 or 30 or so.
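That change is a one-line sshd_config setting (the specific numbers here are a sketch of the idea, not a tested recommendation):

```
# Start randomly dropping 30% of new unauthenticated connections
# once there are 30 of them, scaling up to dropping all of them
# at 100 (versus the default of '10:30:100').
MaxStartups 30:30:100
```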

(Unfortunately I don't think sshd reports how many concurrent unauthenticated connections it has until it starts dropping them, so you can't see how often you're coming close to the edge.)

OpenSSHMaxStartupsGotcha written at 22:34:43


We have our first significant batch of servers that only have UEFI booting

UEFI has been the official future of x86 PC firmware for a very long time, and for much of that time your machine's UEFI firmware has still been willing to boot your systems the traditional way x86 PCs booted before UEFI, with 'BIOS MBR' (generally using UEFI CSM booting). Some people have no doubt switched to booting their servers with UEFI (booting) years ago, but for various reasons we have long preferred BIOS (MBR) booting and almost always configured our servers that way if given a choice. Over the years we've wound up with a modest number of servers which only supported UEFI booting, but the majority of our servers and especially our generic 1U utility servers all supported BIOS MBR booting.

Well, those days are over now. We're refreshing our stock of generic 1U utility servers and the new generation are UEFI booting only. This is probably not surprising to anyone, as Intel has been making noises about getting rid of UEFI CSM booting for some time, and was apparently targeting 'by 2024' for server platforms. Well, it is 2024 and here we are with new Intel based server hardware without what Intel calls 'legacy boot support'.

(I'm aware we're late to this party, and it's quite possible that server vendors dropped legacy boot mode a year or two ago. We don't buy generic 1U servers very often; we tend to buy them in batches when we have the money and this doesn't happen regularly.)

To be honest, I don't expect UEFI booting to make much of a visible difference in our lives, and it may improve them in some ways (for example if our Linux kernels use UEFI to store crash information). I think we were right to completely avoid the early implementations of UEFI booting, but it ought to work fine by now if server vendors are accepting Intel shoving legacy boot support overboard. There will be new things we'll have to do on servers with mirrored system disks when we replace a failed disk, but Ubuntu's multi-disk UEFI boot story is in decent shape these days and our system disks don't fail that often.

(However, UEFI booting does introduce some new failure modes. We probably won't run into corrupted EFI System Partitions, since their contents don't get changed very often these days.)

FirstUEFIBootOnlyServer written at 22:48:08


Having a machine room can mean having things in your machine room

Today we discovered something:

Apparently our (university) machine room now comes with the bonus of a visiting raccoon. I have nothing against Toronto's charming trash pandas, but I do have a strong preference for them to be outdoors and maybe a bit distant.

(There are so far no signs that the raccoon has decided to be a resident of the machine room. Hopefully it is too cool in the room for it to be interested in that.)

Naturally there is a story here. This past Monday morning (what is now two days ago), we discovered that over the weekend, one of the keyboards we keep sitting around our machine room had been fairly thoroughly smashed, with keycaps knocked off and some of them scattered some distance around the rack. This was especially alarming because the keyboard (and its associated display) were in our rack of fileservers, which are some of our most critical servers. The keyboard had definitely not been smashed up last Friday, and nothing else seemed to have been disturbed or moved, not even the wires dangling near the keyboard.

Initially we suspected that some contractor had been in the room over the weekend to do work on the air conditioning, wire and fiber runs that go through it (and are partially managed by other people in entirely other groups), or something of that nature, had dropped something on the keyboard, and had decided not to mention it to anyone. Today people poked around the assorted bits of clutter in the corners of the room and discovered increasingly clear evidence of animal presence near our rack of fileservers. The fileserver rack (and the cluttered corner where further evidence was uncovered) are right by a vertical wiring conduit that runs up through the ceiling to higher floors. One speculation is that our (presumed) raccoon was jumping into our fileserver rack in order to climb up to get back into the wiring conduit.

Probably not coincidentally, we had recently had some optical fiber runs between floors suddenly go bad after years of service and with no activity near them that we knew of. One cause we had already been speculating about was animals either directly damaging a fiber strand or bending it enough to cause transmission problems. And in the process of investigating this, last week we'd found out that there was believed to be some degree of animal presence up in the false ceiling of the floor our machine room is on.

We haven't actually seen the miscreant in question, and I hope we don't (trapping it is the job of specialists that the university has already called in). My hope is that the raccoon has decided that our machine room is entirely boring and not worth coming back to, because a raccoon that felt like playing around with the blinking lights and noise-making things could probably do an alarming amount of damage.

(I've always expected that we periodically have mice under the raised floor of our machine room, but the thought of a raccoon is a new one. I'll just consider it a charm of having physical servers in our own modest machine room.)

MachineRoomRaccoon written at 22:09:28
