Wandering Thoughts archives

2019-10-30

Chrony has been working well for us (on Linux, where we use it)

We have a variety of machines around here that run NTP servers, for various reasons. In the beginning they all ran some version of the classic NTP daemon, NTPD, basically because that was your only option and was what everyone provided. Later, OpenBSD changed over to OpenNTPD and so our OpenBSD machines followed along as they were upgraded. Then various Linuxes started switching their default NTP daemon to chrony, and eventually that spread to our usage (first for me personally and then for our servers). These days, when we need to set up an NTP daemon on one of our Ubuntu machines, we reach for chrony. It's what we use on our Ubuntu fileservers and also on an additional machine that we use to provide time to firewalls that are on one of our isolated management subnets.

At the moment this means we have three different NTP daemon implementations running in our environment. An assortment of OpenBSD machines of various versions run various versions of OpenNTPD, a small number of CentOS 7 machines run NTPD version '4.2.6p5' (plus whatever modifications Red Hat has done), and a number of Ubuntu machines run chrony. This has given us some interesting cross comparisons of how all of these work for us in practice, and the quick summary is that chrony is the least troublesome of the three implementations.

Our experience with the CentOS 7 NTPD is that it takes a surprisingly long time after the daemon is started or restarted (including from a system reboot) for it to declare that it has good time. Chrony seems to synchronize faster, or at least be more willing to declare that it has good time (since what we get to see is mostly what chrony reports through SNTP). Chrony also appears to update the system clock most frequently of the three NTP implementations, which sometimes turns out to matter for ntpdate.
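(If you want to poke at this yourself, the usual way to see chrony's own view of its synchronization state is through chronyc; a minimal check is just the following two commands, with their output omitted here:)

chronyc tracking      # current offset and whether chrony considers itself synchronized
chronyc sources -v    # the time sources chrony is using and their state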

(I don't want to draw any conclusions from our OpenNTPD experience, since our primary experience is with versions that are many years out of date by now.)

I do mildly wish that Linux distributions could agree on where to put chrony's configuration file; Ubuntu puts it in /etc/chrony, while Fedora just puts it in /etc. But this only affects me, since all of our servers with chrony are Ubuntu (although we may someday get some CentOS 8 servers, which will presumably follow Fedora here).
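(For illustration, a stripped-down chrony.conf for the sort of setup we're talking about might look like the following sketch; the server names and the subnet are made up, but the directives are standard chrony ones:)

# Ubuntu: /etc/chrony/chrony.conf    Fedora: /etc/chrony.conf
server ntp1.example.org iburst
server ntp2.example.org iburst
driftfile /var/lib/chrony/chrony.drift
# step the clock at startup if it is badly off
makestep 1 3
# serve time to clients on an isolated management subnet
allow 192.168.100.0/24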

(Chrony also has the reassuring property that it will retry failed DNS lookups. Normally this is not an issue for us, but we've had two power failures this year where our internal DNS infrastructure wasn't working afterward until various things got fixed. Hopefully this isn't a concern for most people.)

ChronyWorksWell written at 23:20:17

2019-10-29

Netplan's interface naming and issues with it

Back in March, I wrote an entry about our problem with Netplan and routes on Ubuntu 18.04. In a comment on the entry, Trent Lloyd wrote a long and quite detailed reply that covered how netplan actually works here. If you use Netplan to any deep level, it is well worth reading in full. My short and perhaps inaccurate summary is that Netplan is mostly a configuration translation layer on top of networkd, and its translation is relatively blind and brute force. This straight translation then puts limits on what alterations and matchings you can do, because of how Netplan will translate these to networkd directives and how they will work (or not work).

One of the things that this creates is a confusing interface naming problem. Suppose that you have a standardly created Netplan YAML file that looks like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [...]

The eno1 looks like it is an interface name, but it is actually two things at once; it is both a Netplan section name (this is my name for it; Netplan generally calls it a 'device name') and a network interface name. This section will cause Netplan to create a file /run/systemd/network/10-netplan-eno1.network (where eno1 is being used as a section name) that will start out with:

[Match]
Name=eno1

My original problem with routes doesn't actually require us to attach routes to an interface by name, as I thought when I wrote the entry. Instead it requires us to attach routes to a Netplan section by name, and it is just that Ubuntu creates a Netplan configuration where the two are silently the same.
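As an illustration (with made-up route addresses), attaching a static route to that section looks something like this; the 'routes' block goes under the section name, which in the stock configuration just happens to also be the interface name:

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [...]
      routes:
        - to: 172.16.0.0/16
          via: 192.168.1.254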

(This split is also part of my confusion over what was possible with netplan wildcards. Netplan wildcards are for matching interface names, not section names. Because of how Netplan creates networkd configuration files and how networkd works, all things that are going to apply to a given interface must have the same section name, as I understand the situation.)

Trent Lloyd ends his comment (except for a parenthetical) by asking:

[...] Perhaps we should look at changing the default configurations to show a functional 'name' so that this kind of task is more obvious to the average user?

I endorse this. I think that it would make things clearer and simpler if there was a visible split in the default configuration between the section name and the interface name, so that my previous example would be:

network:
  version: 2
  renderer: networkd
  ethernets:
    mainif:
      match:
        name: eno1
      addresses: [...]

This is more verbose for a simple case, but that is the YAML bed that Netplan has decided to lie in.

This would make it possible to write generic Netplan rules that applied to your main interface regardless of what it was called, and provide silent guidance for what I now feel are the best practices for any additional interfaces you might later set up.

(Then it would be good to document the merging rules for sections, such as that you absolutely have to use 'mainif:' (or whatever) for everything that you are going to merge together and that there is no wildcard matching at that level. In general the Netplan documentation suffers badly from not describing what is actually going on; since that strongly affects what you can do and what will and won't work, this is a serious issue.)
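(To sketch what I mean by the merging rule, as I understand it: with the match-based layout above, a separate drop-in file can only add routes to the 'mainif' section by reusing exactly the same section name. The file name and addresses here are made up:)

# /etc/netplan/60-extra-routes.yaml
network:
  version: 2
  ethernets:
    mainif:
      routes:
        - to: 10.10.0.0/16
          via: 192.168.1.254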

Another approach would be to allow defining a Netplan level 'section alias', so your section would still be called 'eno1' but it could have the alias of 'mainif', and then any other Netplan configuration for 'mainif' would be folded into it when Netplan wrote out the networkd configuration files that actually do things.

PS: Since Netplan has two backends, networkd and NetworkManager, your guess is as good as mine for how this would get translated in a NetworkManager based setup. This uncertainty is one of the problems of making Netplan so tightly coupled to its backend in what I will politely call an underdocumented way.

PPS: None of this changes my general opinion of Netplan, which is that I hope it goes away.

NetplanNamingProblem written at 23:00:05

2019-10-23

The DBus daemon and out of memory conditions (and systemd)

We have a number of systems where, for reasons beyond the scope of this entry, we enable strict overcommit. In this mode, when you reach the system's memory limits the Linux kernel will deny memory allocations but usually not trigger the OOM killer to terminate processes. It's up to programs to deal with failed memory allocations as best they can, which doesn't always go very well. In our current setup, on the machines we most commonly operate this way, we've set the vm.admin_reserve_kbytes sysctl to reserve enough space for root so that most or all of our system management scripts keep working and we at least don't get deluged in email from cron about jobs failing. This mostly works.

(The sysctl is documented in vm.txt.)
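(For concreteness, the overall setup is roughly the following sketch; the file name and the specific numbers are illustrative rather than our exact values:)

# /etc/sysctl.d/99-overcommit.conf
# strict overcommit: don't promise more memory than swap plus ratio% of RAM
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
# reserve space for root (strictly, cap_sys_admin) processes, in KiB
vm.admin_reserve_kbytes = 131072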

Recently, several of these machines hit an interesting failure mode that required rebooting them, even after the memory pressure was over. The problem is DBus, or more specifically the DBus daemon. The direct manifestation of the problem is that dbus-daemon logs an error message:

dbus-daemon[670]: [system] dbus-daemon transaction failed (OOM), sending error to sender inactive

After this error message is logged, attempts to do certain sorts of systemd-related DBus operations hang until they time out (if the software doing them has a timeout). Logins over SSH take quite a while to give you a shell, for example, as they fail to create sessions:

pam_systemd(sshd:session): Failed to create session: Connection timed out

The most relevant problem for us on these machines is that attempts to query metrics from the Prometheus host agent start hanging, likely because we have it set to pull information from systemd and this is done over DBus. Eventually there are enough hung metric probes that the host agent starts refusing our attempts immediately.

The DBus daemon is not easy to restart (systemd will normally refuse to let you do it directly, for example), so I haven't found any good way of clearing this state. So far my method of recovering a system in this state is to reboot it, which I generally have to do with 'reboot -f' because a plain 'reboot' hangs (it's probably trying to talk to systemd over DBus).

I believe that part of what creates this issue is that the DBus daemon is not protected by vm.admin_reserve_kbytes. That sysctl specifically reserves space for UID 0 processes, but dbus-daemon doesn't run as UID 0; it runs as its own UID (often messagebus), for good security related reasons. As far as I know, there's no way to protect an arbitrary UID through vm.admin_reserve_kbytes; it specifically applies only to processes that hold a relatively powerful Linux security capability, cap_sys_admin. And unified cgroups (cgroup v2) don't have a true guaranteed memory reservation, just a best effort one (and we're using cgroup v1 anyway, which doesn't have anything here).

We're probably making this DBus issue much more likely to happen by having the Prometheus host agent talk to systemd, since this generates DBus traffic every time our Prometheus setup pulls host metrics from the agent (currently, every 15 seconds). At the same time, the systemd information is useful to find services that are dead when they shouldn't be and other problems.
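(The moving parts here are roughly the following; the target host is made up, but the flag is the node_exporter systemd collector, which is what generates DBus traffic on every scrape:)

# on each host: run the host agent with the systemd collector enabled
node_exporter --collector.systemd

# in prometheus.yml: scrape the host agent every 15 seconds
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['ourhost.example.com:9100']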

(It would be an improvement if the Prometheus host agent would handle this sort of DBus timeout during queries, but that would only mean we got host metrics back, not that DBus was healthy again.)

PS: For us, all of this is happening on Ubuntu 18.04 with their version of systemd 237 and dbus 1.12.2. However I suspect that this isn't Ubuntu specific. I also doubt that this is systemd specific; I rather suspect that any DBus service using the system bus is potentially affected, and it's just that the most commonly used ones are from systemd and its related services.

(In fact on our Ubuntu 18.04 servers there doesn't seem to be much on the system bus apart from systemd related things, so if there are DBus problems at all, it's going to be experienced with them.)

DBusAndOOM written at 22:31:07

2019-10-19

Ubuntu LTS is (probably) still the best Linux for us and many people

I write a certain amount of unhappy things about Ubuntu here. This is not because I hate Ubuntu, contrary to how it may appear; I don't write about things that I hate, because I try to think about them as little as possible (cf spam). Ubuntu is a tool for us, and I actually think it is a good tool, which is part of why we use it and keep using it. So today I'm going to write about the attractions of Ubuntu, specifically Ubuntu LTS, for people who want to get stuff done with their servers and for their users without too much fuss and bother (that would be us).

In no particular order:

  • It has a long support period, which reduces churn and the make-work of rebuilding and testing a service that is exactly the same except on top of a new OS and new versions of packages. We routinely upgrade many of our machines every other LTS version (which means reinstalling them), so we get around four years of life out of a given install (I wrote about this years ago here).

    (We have a whole raft of machines that were installed in the summer and fall of 2016, when 16.04 was fresh, and which will be rebuilt in the summer and fall of 2020 on 20.04.)

  • It has a regular and predictable release schedule, which is good for our planning in various ways. This includes figuring out whether we want to hold off on building a new service right now so that we can base it on the next LTS release.

    (This regularity and predictability is one reason our Linux ZFS fileservers are based on Ubuntu instead of CentOS. 18.04 was there at the time, and CentOS 8 was unknown and uncertain.)

  • It has a large collection of packages (which mostly work, despite my grumbling). Building local copies of software is a pain in the rear and we want to do it as little as possible, ideally not at all.

  • It has relatively current software and refreshes its software on a regular basis (every two years, due to the LTS release cadence), which lets us avoid the problems caused by using zombie Linux distributions. This regular refresh is part of the appeal of the regular and predictable release schedule.

  • Since it's popular, it's well supported by software (often along with Debian). For two examples that are relevant to us, Grafana provides .debs and Certbot is available through a PPA.

  • Debian has made a number of good, sysadmin friendly decisions about how to organize configuration files for applications and Ubuntu has inherited them. For example, they have the right approach to Apache configuration.

I don't know of another Linux distribution that has all of these good things, and that includes both Debian and CentOS (despite what I said about Debian only a year ago). CentOS has very long support but neither predictable releases nor current software, and even with EPEL's improved state it may not have the package selection. Debian has unpredictable releases and a shorter support period.

(As a purely pragmatic matter we're unlikely to switch to something that is simply about as good as Ubuntu, even if it existed. Since switching or using two Linuxes has real costs, the new thing would have to be clearly better. We do use CentOS for some rare machines because the extremely long support period is useful enough for them.)

UbuntuLTSStillBestChoice written at 22:58:39

2019-10-15

The Ubuntu package roulette

Today I got to re-learn a valuable lesson, which is that just because something is packaged in Ubuntu doesn't mean that it actually works. Oh, it's probably not totally broken, but there's absolutely no guarantee that the package will be fully functional or won't contain problems that cause cron to email you errors at least once a day because of an issue that's been known since 2015.

I know the technical reasons for this: Ubuntu pretty much blindly imports packages from Debian, and Debian is an anarchy where partially broken packages can rot quietly. Possibly completely non-functional packages can rot too; I don't actually know how Debian handles that sort of situation. Ubuntu's import is mostly blind because Ubuntu doesn't have the people to do any better. This is also where people point out that the package in question is clearly in Ubuntu's universe repository, which the fine documentation euphemistically describes as 'community maintained'.

(I have my opinions on Ubuntu's community nature or lack thereof, but this is not the right entry for that.)

All of this doesn't matter; it is robot logic. What matters is the experience for people who attempt to use Ubuntu packages. Once you enable universe (and you probably will), Ubuntu's command line package management tools don't particularly make it clear where your packages live (not in the way that Fedora's dnf clearly names the repository that every package you install will come from, for example). It's relatively difficult to even see this after the fact for installed packages. The practical result is that an Ubuntu package is an Ubuntu package, and so most random packages are a spin on the roulette wheel with an uncertain bet. Probably it will pay off, but sometimes you lose.
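(About the best you can do after the fact is something like 'apt-cache policy'; the package and version here are hypothetical and the output is abbreviated, but the only hint that you're dealing with universe is the component buried in the repository line:)

$ apt-cache policy somepackage
somepackage:
  Installed: 1.2-3
  Candidate: 1.2-3
  Version table:
 *** 1.2-3 500
        500 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
        100 /var/lib/dpkg/status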

(And then if you gripe about it, some people may show up to tell you that it's your fault for using something from universe. This is not a great experience for people using Ubuntu, either.)

I'm not particularly angry about this particular case; this is why I set up test machines. I'm used to this sort of thing from Ubuntu. I'm just disappointed, and I'm sad that Ubuntu has created a structure that gives people bad experiences every so often.

(And yes, I blame Ubuntu here, not Debian, for reasons beyond the scope of this entry.)

UbuntuPackageRoulette written at 23:25:14

