Wandering Thoughts

2021-01-20

A realization about the Linux CPU pressure stall information

Modern versions of the Linux kernel have a set of metrics that are intended to give you a better high-level indicator of where (or why) your system is 'loaded' than the venerable load average, in the form of Pressure Stall Information (also). The top level version of this is exposed as three files in /proc/pressure, called cpu, memory, and io. If your distribution uses cgroup2, each cgroup also has its own version of these that's specific to the cgroup.

(Recent versions of systemd will use this information as part of systemd-oomd and probably other things.)

As covered in the documentation, the io and memory PSI files have two lines, one for 'some' and one for 'full', while cpu has only 'some'. The 'some' metric is for when some (one or more) tasks were delayed for lack of the resource, while the 'full' metric is for when all tasks were delayed.
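
As an illustration, here's roughly what reading these files looks like (the numbers are made up for the example; the 'total' fields are cumulative stall time in microseconds):

$ cat /proc/pressure/cpu
some avg10=2.04 avg60=0.75 avg300=0.40 total=157622723
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=34772
full avg10=0.00 avg60=0.00 avg300=0.00 total=30741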

When I first read about all of this, I didn't immediately see why the cpu pressure information only had the 'some' metric; I had to think about it. The answer is that unlike memory and IO, where tasks can be entirely stalled with none of them getting any of the resource yet, there's always some task getting CPU if there's demand for it. A 'full' stall on CPU would require that you have runnable tasks but that nothing was actually being scheduled.

Looking at the total system (for /proc/pressure), it's basically impossible for CPU to stall this way without a serious kernel problem; the kernel scheduler itself would have to be stuck somehow. However, I'm not sure that this is impossible for an individual cgroup. Since you can arrange a hierarchy of per-cgroup priorities for CPU time, it wouldn't surprise me if you could completely starve a victim cgroup. Right now the 'cpu.pressure' file for cgroups only has the 'some' metric, just like the /proc/pressure version, but perhaps that will change in the future.

(Linux also has low-priority 'idle' scheduling, as covered in sched(7), so you might be able to manipulate all of the tasks in a cgroup into that scheduling class so that they get starved that way.)
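
For example (with a made-up program name), chrt will start something under the SCHED_IDLE policy:

$ chrt --idle 0 ./some-long-computation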

PSICpuWhyNoFull written at 00:38:57

2021-01-14

Understanding WireGuard's AllowedIPs setting (and a bit of tcpdump)

WireGuard is usually described as a VPN, but it's really a secure IP tunnel system with VPNs as the most common use; I've been using WireGuard on Linux for a while. One of the settings you configure for WireGuard peers is AllowedIPs, which back in 2017 I vaguely described as something that '[...] controls which traffic is allowed to flow inside the secure tunnel'. Recently I had an opportunity to discover that my understanding of AllowedIPs was incomplete, so here is a more precise and verbose version.

I'll start by quoting the current wg(8) manpage:

AllowedIPs — a comma-separated list of IP (v4 or v6) addresses with CIDR masks from which incoming traffic for this peer is allowed and to which outgoing traffic for this peer is directed.

As the manual page says, AllowedIPs affects both incoming traffic from your peers and outgoing traffic to your peers. For incoming traffic from a peer, the AllowedIPs setting determines what source IP addresses the traffic can have. Packets from a peer that have an IP source address that's not in the peer's AllowedIPs will be silently dropped by WireGuard.

(WireGuard knows which peer an incoming packet is from because of the cryptography involved.)

For outgoing traffic, the AllowedIPs setting determines which peer a packet will be sent to, based on the packet's destination address. If there is no peer that matches the destination address, the WireGuard interface will reject the packet at the IP level, possibly generating an ICMP message to that effect that it sends to the source IP. If you're using ping on the same machine, it will probably report 'ping: sendmsg: Required key not available'.
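
As a concrete sketch (with made-up keys, addresses, and file names), a peer section from a configuration file you'd load with 'wg setconf wg0 /etc/wireguard/wg0.conf' might look like this:

[Peer]
PublicKey = <the peer's public key>
Endpoint = peer.example.org:51820
AllowedIPs = 10.27.0.2/32, 192.168.50.0/24

With this, WireGuard will only accept decrypted packets from this peer if their source IP is 10.27.0.2 or in 192.168.50.0/24, and will only send packets out through this peer if their destination IP falls in one of those two ranges.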

How AllowedIPs affects incoming traffic is basically a safety feature; I see it as a form of firewalling. How AllowedIPs affects outgoing traffic is essential, since you can have multiple peers attached to a single WireGuard interface and thus have to pick which peer a given packet will be sent to. I believe that AllowedIPs can't overlap between peers on the same WireGuard interface, but I haven't tested it.

(You can have multiple WireGuard interfaces, each with different peers, and I believe you can duplicate AllowedIPs ranges between peers on different WireGuard interfaces. Getting the right traffic to the right WireGuard interface is up to you; you may need policy based routing or perhaps network namespaces.)

By itself, configuring AllowedIPs for a peer on a particular WireGuard interface doesn't cause the Linux kernel to actually route traffic for that IP address range to the WireGuard interface; where traffic for an IP address range is routed is separate from what peers are configured. You can route traffic to a WireGuard interface without peers configured to handle it, and configure a wider AllowedIPs for a peer than you route traffic to the WireGuard interface.
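
For example (still with made-up names and addresses), with plain 'wg' you would need something like the following before packets for the second AllowedIPs range above actually reach the WireGuard interface; wg-quick will add such routes for you based on AllowedIPs, but 'wg' itself never touches the routing table:

# ip route add 192.168.50.0/24 dev wg0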

If you're receiving WireGuard traffic, your AllowedIPs doesn't restrict what destination IP addresses the traffic can have. If your peer configured its peer entry for you with an AllowedIPs of 0.0.0.0/0, it can send you traffic with any random destination IP it feels like. Similarly, if you're sending WireGuard traffic to a peer, that traffic can have any source IP address you want, although the peer will drop it if it's not in their AllowedIPs for you.

If you use 'tcpdump' on a WireGuard interface to test all of this, the traffic you see is conceptually outside WireGuard itself. For outgoing packets, what you see is before WireGuard has picked which peer the packet will be sent to or rejected it because no peer matches. For incoming packets, what you see is after WireGuard has dropped packets for having a source IP that doesn't match the peer's AllowedIPs. This means that outgoing packets may not actually be sent, but incoming packets have definitely been accepted (well, by WireGuard).

(I don't know if there's any easy way to see if WireGuard has dropped some incoming packets because they don't match the peer's AllowedIPs. I suppose you could try to correlate the arrival of encrypted WireGuard packets from a peer's current IP with a lack of packets showing up on the WireGuard interface.)
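
One crude way to do that correlation (with made-up interface names, and assuming the usual WireGuard port of 51820) is to watch the encrypted traffic on the outside interface at the same time as the decrypted traffic on the WireGuard one:

$ tcpdump -ni eth0 'udp port 51820' &
$ tcpdump -ni wg0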

WireGuardAllowedIPs written at 23:57:24

2021-01-11

How to make Bash fail badly on Ubuntu 16.04 by typo'ing a command name

Here's something I did today, more or less presented in illustrated form, and which I'm glad that I didn't run into before now (since our 16.04 machines don't have long to live):

$ tail -f /var/log/logfile | fgrep uncommon | efgrep -v '(neg1|neg2)'
No command 'efgrep' found, did you mean:
 Command 'vfgrep' from package 'atfs' (universe)
 Command 'dfgrep' from package 'debian-goodies' (main)
 Command 'egrep' from package 'grep' (main)
 Command 'fgrep' from package 'grep' (main)
 Command 'zfgrep' from package 'zutils' (universe)
 Command 'zfgrep' from package 'gzip' (main)
efgrep: command not found
^C^C^\

Congratulations, your shell is now hung unless both your 'tail -f' and your 'fgrep' produce output frequently (if they do, fgrep and then tail will try to write into a closed pipe and then exit). If they don't, for example if the log file doesn't see many updates and most of them don't match the fgrep string, you need to kill the tail to get out of this. This will also happen in simpler cases where you just have the 'tail -f' and the typo'd command.

(When I did this I was in the process of iterating a command line, and I typo'd my change from fgrep to egrep because now I needed to filter out more than one thing.)

The simple thing to say about this is that it only happens on Ubuntu 16.04, not on 18.04 or 20.04, and it happens because Ubuntu's normal /etc/bash.bashrc defines a command_not_found_handle function that winds up running a helper program to produce this 'did you mean' report. The helper program comes from the command-not-found package, which is installed because it's Recommended by ubuntu-standard.
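
If you're curious whether your own shell session has such a handler defined (and what it does), Bash will print it for you:

$ declare -f command_not_found_handle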

Merely having a command_not_found_handle function doesn't produce this problem, even with the 16.04 version of Bash (which is reported as '4.3.48(1)'), because I don't seem to get it if I define the trivial one of:

command_not_found_handle() {
   printf "%s: command not found\n" "$1" 1>&2;
   return 127
}

I don't know if the Ubuntu 16.04 problem is something in how the helper program is implemented, something in Bash, or both combining together. As a result, I can't confidently say that this problem is gone in later versions of Bash, as it may just be patched over by a change in the helper program.

(When the 16.04 Bash is hung, the process listing only has the main Bash shell process, the tail, and the fgrep, so it's not an obvious problem with the helper program where it refuses to exit or something.)

Given the uncertainties around this Ubuntu feature and the general noise it produces when you typo a command (plus that our users can't install packages), I'm sort of tempted to remove the command-not-found package on some or all of our Ubuntu machines. But perhaps that's just because I'm grumpy right now.
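
If we do, the removal itself is a single standard command:

# apt-get remove command-not-found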

(Removing the package does make the 16.04 Bash work right. It winds up running basically my trivial function above, after some if checks.)

BashNotFoundHang written at 23:36:49

2021-01-01

An interesting and puzzling bit of Linux server utilization

Today, due to looking at dashboards in our Prometheus system supplemented with some use of 'perf top', I discovered that a person here had managed an interesting performance achievement on one of our SLURM CPU nodes. What initially attracted my attention was that the node had sustained 100% CPU usage, but it's the details that make it interesting.

This node has a 1G Ethernet interface, and it was fully saturating that interface with NFS reads from one of our fileservers. However, the 100% CPU usage was not from user CPU; instead it was about 97% system time, 2% iowait, 0.2% in softirq, and a tiny fraction in user time (under a tenth of a percent; I've rounded all of these). This was already relatively interesting, since it strongly suggests that almost nothing was being done with all of that NFS IO. As kind of expected, there were a ton of interrupts and context switches (although far more than can be explained by plain network IO). The node had a lot of processes running and blocked, running up the load average, so it wasn't just a single process doing all of this.

The 'perf top' output was interesting in a suggestive way; a sample looked like this:

 66.44%  [kernel]    [k] native_queued_spin_lock_slowpath
 12.07%  [kernel]    [k] _raw_spin_lock
  4.28%  [kernel]    [k] list_lru_count_one
  3.36%  [kernel]    [k] isolate_lru_pages.isra.59
  2.64%  [kernel]    [k] putback_inactive_pages
  2.02%  [kernel]    [k] __isolate_lru_page
  1.24%  [kernel]    [k] super_cache_count
  1.09%  [kernel]    [k] shrink_page_list
 [...]

For some reason this workload (whatever it was) was lighting the kernel on fire with spinlock contention. Since there were multiple processes (and 'perf top --sort comm,dso' said that the kernel activity was distributed across a number of them), the obvious theory is that they were contending over locks related to one or perhaps a few virtual memory areas shared between them. Alternately, perhaps they were all trying to read from the same NFS file and that was causing massive amounts of kernel contention.

(Looking at perf-top(1) suggests that I should have tried 'perf top -g', and perhaps worked out how to get a general record of a few minutes to analyze it in detail. Brendan Gregg's perf page has some examples I should study.)
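
If I get another chance, a minimal version of that general record (the duration here is arbitrary) would be something like:

# perf record -a -g -- sleep 120
# perf report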

Now that I've looked at it in the kernel source, isolate_lru_pages has a suggestive comment to the effect that:

For pagecache intensive workloads, this function is the hottest spot in the kernel (apart from copy_*_user functions).

I didn't see any of the copy-to-user calls in the 'perf top' results, but I'm not sure if they're involved on mmap()'d files or if NFS is otherwise somewhat special.

One of the things this illustrates once again is that system level metrics (even detailed ones) don't necessarily tell you underlying causes. They may give you high-level whats ('this system is spending almost all of its time in the kernel'), but without the all-important why. It also shows me that I should learn more about Linux performance troubleshooting. Exploring 'perf top' on our various systems has been interesting in general, but I've only scratched the surface of what I can do with perf and I'm probably going to need to know more sooner or later.

(This load has quietly gone away now so I can't dig into it further.)

HighKernelTimePuzzle written at 23:13:24

2020-12-20

Who I think CentOS Stream is and isn't for

As time goes by and Red Hat people write more official posts (via), I've grown some opinions on who it feels that CentOS Stream is for and is not for. Since I also feel that Red Hat people are not being completely straightforward, I am going to write out my views, as an outsider to the CentOS project but someone who uses CentOS (and Ubuntu LTS) and has in the past used RHEL.

If you want to see if your software will work with upcoming package updates for the latest version of Red Hat Enterprise Linux, CentOS Stream is definitely for you. Giving people a preview of RHEL updates is the express purpose of CentOS Stream. You will not necessarily always be able to build software for RHEL using CentOS Stream, because of periodic ABI and API issues, but Red Hat will probably have a solution for that at some point. CentOS Stream is also probably for you if you operate RHEL systems and want some 'canary' ones to see how future package updates are likely to affect you (and whether you see any new bugs or issues).

It's quite possible that for an open source project, using CentOS Stream will also be the easiest way of testing if your software (probably) works on the latest version of RHEL (at least in a fully updated state). However, I'm not all that familiar with the options Red Hat has here (such as their 'Universal Base Image'), and Red Hat may introduce more in the future. If you're a commercial company, the CentOS project distro FAQ makes it clear that Red Hat wants you to buy commercial RHEL licenses.

(Since CentOS Stream for 8 will end package updates in mid 2024, open source projects who want to continue testing for RHEL 8 beyond that point are going to have a problem. I expect that at least some open source projects will simply drop official compatibility with RHEL 8 at that point.)

If for some reason you want to contribute to Red Hat Enterprise, CentOS Stream is also officially for you. The most likely case I can imagine is that you use RHEL (or CentOS) and you have a bugfix or small improvement that you want to push for and contribute, one that Red Hat itself seems unlikely to take on. If you intend to make larger contributions, I hope that you have a strong commercial reason for helping a commercial company with its commercial product.

(To me, this raises significant questions about the future of EPEL. However, Red Hat funds Fedora too, so it could decide to work on funding EPEL should a supply of outside volunteers start to dry up.)

If you want a free RPM based Linux distribution that probably has a five-year support period, CentOS Stream may deliver what you're looking for. Officially it's supposed to, at least at the moment, but one can have concerns about the future and as far as I can tell this is not what CentOS Stream is 'supposed' to be for, at least in the eyes of Red Hat (and Red Hat calls the shots). openSUSE Leap is RPM based and free, but it's only supported for 3 years. Fedora, of course, is RPM based and free but only supported for one year, with version to version migration and an expectation that you'll always be on the most current version.

If you want a well supported Linux distribution with an extremely long support period and fast security updates, CentOS was never really for you in the first place and CentOS Stream is worse (because it definitely has no more than five years of package updates and may have longer delays for security updates). In the past you could press CentOS into service as such a thing and ignore, for example, the delay between RHEL security updates and the rebuilt CentOS security updates that corresponded to them, but you were always accepting a compromise. Now that compromise has blown up on you, although so far only for users of CentOS 8.

(CentOS currently says it will continue supporting CentOS 7 through the end of life of RHEL 7 itself. Since Red Hat calls the shots, I am not certain we should be as confident of that as we might have been before.)

If you want a free version of Red Hat Enterprise Linux, the traditional purpose of CentOS, Red Hat more or less says that CentOS Stream is not for you. It is extremely likely that one goal of Red Hat's management is to make it so that no such thing exists any more (and certainly not one that they help pay for). At the moment the closest you can probably come is Oracle Linux, which is also a rebuild of RHEL, but that's if you're willing to put your trust in Oracle (many people are not). It's possible that you can use CentOS Stream for this anyway, but if a lot of people do so I expect Red Hat to take additional steps to discourage this, such as by dropping 'CentOS Stream for 8' support when RHEL 9 comes out (or even before then).

Since 'a free version of RHEL, including its long support period' was in fact what many people used CentOS for, this leaves a significant number of CentOS users without an obvious Linux distribution to use at the moment, although options may well appear in the future.

CentOSStreamWhoFor written at 23:45:14

2020-12-17

Limiting the Nouveau kernel driver's messages via removal

Over on the Fediverse, I said:

Current status: solving software problems triggered by hardware problems with 'rmmod <module>'. It even worked.

(Modules cannot incessantly log kernel messages when they are unloaded. I was just glad the module did unload, given likely broken hardware that it was complaining about.)

Naturally there is a story here.

We have a collection of hand-built AMD Threadripper based compute servers (we bought all the parts, including 4U cases, and assembled them). In order to boot, these machines need video cards, since they don't have on-CPU GPUs and the standard AMD Threadripper motherboards we bought don't come with an onboard GPU the way server motherboards do. So we dug around in the department's spare parts collection and came up with an assortment of old NVidia cards to stick in these machines.

(Where by 'old' I mean things like Quadro FX 570s, Quadro K420s, a Quadro NVS 285, and even one GeForce 8400 GS, as identified by lspci.)

This morning, after rebooting one of these machines to bring it into service, it began logging hundreds of kernel messages a second from the nouveau driver, along the lines of:

nouveau 0000:41:00.0: fifo: PBDMA0: 80000000 [SIGNATURE] ch 1 [007fcf3000 DRM] subc 0 mthd 0000 data 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80006000 [GPFIFO GPPTR SIGNATURE] ch 0 [007fcf4000 DRM] subc 0 mthd 0000 data 00000000

This completely overwhelmed the machine (and ran it out of disk space), and didn't do great things to our central syslog server (which got quite busy handling these).

At first I thought that this was yet another case of not ratelimiting kernel messages when you should. It is, but after I was able to reboot the machine through trickery and examine the early kernel messages from the nouveau driver, it turns out to probably also be broken hardware:

nouveau 0000:41:00.0: DRM: DCB conn 00: 00001030
nouveau 0000:41:00.0: DRM: DCB conn 01: 00002146
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
nouveau 0000:41:00.0: disp: chid 0 mthd 0088 data f0000000 00007088 00000000
nouveau 0000:41:00.0: fifo: write fault at 0000001000 engine 04 [BAR1] client 08 [HOST_CPU_NB] reason 00 [PDE] on channel -1 [007fd25000 unknown]
nouveau 0000:41:00.0: fifo: write fault at 0000040000 engine 05 [BAR2] client 08 [HOST_CPU_NB] reason 02 [PTE] on channel -1 [007fd76000 unknown]
nouveau 0000:41:00.0: fifo: DROPPED_MMU_FAULT 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80000000 [SIGNATURE] ch 0 [007fcf4000 DRM] subc 7 mthd 1ffc data ffeff7f7
nouveau 0000:41:00.0: fifo: read fault at 0000009000 engine 04 [BAR1] client 07 [HOST_CPU] reason 00 [PDE] on channel -1 [007fd25000 unknown]

That's not looking too healthy in general, and this is old hardware (the machine has one of the Quadro K420s).

I attempted various things to get the nouveau driver to shut off these messages, without success, and then while I was flailing around I had a crazy idea: perhaps I could just rmmod the entire driver. It might leave the machine without much of a video console, but all of these compute servers are effectively headless (they don't normally have a screen plugged in). Somewhat to my surprise, this worked and with the driver unloaded, the messages naturally stopped.
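
The command itself is anticlimactic, and the second line here is a sketch of how you might keep the module from coming back on a later reboot, which I haven't needed to do (depending on the distribution you may also need to regenerate the initramfs):

# rmmod nouveau
# echo 'blacklist nouveau' >/etc/modprobe.d/blacklist-nouveau.conf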

(I was worried about the nouveau driver hanging on unload because it was unable to cleanly shut down the hardware it was talking to, since it was clearly having trouble talking to the hardware in general. And before I checked lsmod I was worried about the driver having a non-zero usage count due to the generic console system or something. It seems a little bit alarming that the kernel driver for your console can have a zero usage count.)

NouveauMessageLimitByRemoval written at 23:05:28

2020-12-12

My views on the suitability of CentOS Stream

In a comment on my most recent entry on CentOS Stream, Ben Cotton said:

I honestly believe that CentOS Stream will be suitable for the majority of CentOS Linux users, and a huge improvement for some. [...]

At one level, I agree with Ben Cotton on this. There's every indication that CentOS Stream won't be worse than plain CentOS 8 as far as bugs and security issues go; while it will now be getting (some) package versions before RHEL does instead of afterward, Red Hat has also apparently drastically increased its pre-release testing of packages. The move from CentOS 8 to CentOS Stream does cost you an extra five years of package updates, but I also feel that you shouldn't run ancient Linux distribution versions, so you probably shouldn't be running most CentOS installs for longer than five years anyway.

(I measure these five years from the release of RHEL 8, since what matters is increasingly ancient software versions. And since RHEL freezes package versions well in advance of the actual release, that means that by the end of five years after release the packages are often six or more years out of date. A lot changes in six years.)

So at that level, if you're already running CentOS 8 as a general OS, I believe that CentOS Stream will be a perfectly fine replacement for it and I don't see a strong reason to, say, migrate your existing systems to Ubuntu LTS. There's a good indication that CentOS Stream will not create more bugs and instability, while migrating to Ubuntu LTS is both a bunch of work and won't get you a much longer support period (20.04 LTS support will run out in early 2025, while I believe that CentOS Stream for 8 support will end in late 2024).

Unfortunately, that's only at one level, the level that ignores the risks now lurking in the future. The blunt fact of the matter is that the IBM-ized Red Hat has now shown us that they are willing to drastically change the support period for an existing CentOS product with basically no notice. We have only Red Hat's word that CentOS Stream for 8 support will continue through the end of full maintenance for RHEL 8 in late 2024, or actually we don't even have that; Red Hat has made no promises not to change things around again, for example when RHEL 9 is released. Red Hat has made it clear that they decide how this goes and that what the CentOS board feels doesn't really matter; the board can at best mitigate the damage (as they apparently did this time around, including getting Red Hat to allow CentOS Stream for 8 to continue longer than Red Hat wanted).

(Red Hat has also made it relatively clear that their only interest in CentOS today is as a way to give people a free preview of what will be in the current RHEL in the future. This neither requires nor rewards supporting and funding CentOS Stream for RHEL 8 after RHEL 9 comes out. It also implicitly encourages things that get in the way of using CentOS Stream as a substitute for RHEL.)

Any commercial company can change direction at the drop of a hat, so Canonical (or SUSE) could also decide to make similar abrupt changes with their Linux distributions (yes, Ubuntu is Canonical's thing, not a community thing, but that's another entry). However, Canonical has not done this so far (instead they've delivered a very consistent experience for over a decade), while Red Hat just has. There's a bigger difference in practice between 'never' and 'once' than there is between 'once' and 'several'.

If I had a CentOS based environment that I had to plan the next iteration of (for example CentOS 7 and I was considering what next), I'm not sure I would build the next iteration on CentOS Stream. It might well be time to start considering alternatives, ones with a longer record of stability in what had been promised and delivered to people. Certainly at this point Ubuntu LTS has a more than a decade record of basically running like clockwork; there are LTS releases every other April, and they get supported for five years from release. There are real limits on the 'support' you get (see also), but at least you know what you're getting and it seems very likely that there won't be abrupt changes in the future.

(Debian doesn't have Canonical's clockwork precision but may give you more or less the same support period and release frequency, but see also. I don't know enough about SUSE to say anything there, but it does use RPM instead of .debs and I like RPMs better. The Debian community is probably the most stable and predictable one; Debian is extremely unlikely to change its fundamental nature after all this time.)

CentOSStreamSuitability written at 23:50:23; Add Comment

2020-12-11

CentOS's switch to CentOS Stream has created a lot of confusion

After the news broke of CentOS's major change in what it is, a number of sysadmins here at the university have been discussing the whole issue. One of the things that has become completely clear to me during these discussions is that the limited ways this shift has been communicated have created a great deal of confusion, leaving sysadmins with a bunch of reasonable questions about the switch and no clear answers (cf).

(It doesn't help that the current CentOS Stream FAQ is clearly out of date in light of this announcement and contains some contradictory information.)

This confusion matters, because it's affecting people's decisions and is hampering any efforts to leave people feeling good (or at least 'not too unhappy') about this change. If Red Hat and CentOS care about this, they need to fix this, and soon. Their current information is not up to the job and is leaving people lost, unhappy, and increasingly likely to move to something else, even if they might be fine with CentOS Stream if they fully understood it. The longer the confusion goes unaddressed, the more bridges are being burned.

(The limited communication and information also creates a certain sort of impression about how much Red Hat, at least, cares about CentOS users and all of this.)

The points of confusion that I've seen (and had) include what the relationship between updates to CentOS Stream and updates to RHEL will be, how well tested updates in Stream will be, how security issues will be handled (with more clarity and detail than the current FAQ), what happens when a new RHEL release comes out, and whether old versions of packages will be available in Stream so you can revert updates or synchronize systems to old packages. It's possible that some of these are obvious to people in the CentOS project who work with Stream, but they're not obvious to all of the sysadmins who are suddenly being exposed to this. There are probably others; you could probably build up quite a collection by quietly listening to various discussions of this and absorbing the points of confusion and incorrect ideas that people have been left with.

CentOSStreamConfusion written at 00:57:17

2020-12-08

CentOS's switch to Stream is a major change in what CentOS is

The news of the time interval is that CentOS is making a major change in what it is, one that makes it significantly less useful to many people, although their blog entry CentOS Project shifts focus to CentOS Stream will not tell you that. I had some reactions on Twitter, and this is an expanded explanation of my views.

What CentOS has been up until this change is people taking the source code for Red Hat Enterprise Linux that Red Hat was releasing (that was happening due to both obligation and cultural factors), rebuilding it to essentially identical binaries (except for trademarks they were obliged to remove and, these days, digital signatures they could not duplicate), and distributing the binaries, installer, and so on for free. When RHEL released updated packages for a RHEL version, CentOS rebuilt them and distributed them, so you could ride along with RHEL updates for as long as RHEL was doing them at all. If you did not have the money to pay for RHEL, this appealed to an overlapping set of two sorts of people, those who wanted to run machines with an extremely long package update period (even if they became zombies) and those who needed to run (commercial) software that worked best or only on RHEL.

(We are both sorts of people, as covered in an older entry about why we have CentOS machines.)

The switch to CentOS Stream makes two major changes to what CentOS is from CentOS 8 onward (CentOS 7 is currently unaffected). First, it shortens the package update period to no more than five years, because package updates for the CentOS Stream version of RHEL <x> stop at the end of RHEL's five year full support period. In practice CentOS Stream for <x> is not likely to be immediately available when RHEL <x> is launched, and you won't install it immediately even if it was, so you will get less than five years of package updates before you must switch or operate machines without someone providing security updates for you.

(It's unclear if there will be a way to upgrade from one version of CentOS Stream to another, or if the answer will be the traditional RHEL one of 'reinstall your machines from scratch'.)

Second, CentOS is no longer what RHEL is except for those required trademark changes. Instead it “tracks just ahead of a current RHEL release”, to quote the blog entry (emphasis theirs), which currently appears to mean that it will contain versions of packages that are not released to RHEL yet. The CentOS distro FAQ is explicit that this will sometimes mean that CentOS Stream has a different ABI and even API than RHEL does, and it's unclear how stable and bug-free those packages will be. If CentOS Stream is intended to be an in-testing preview of RHEL updates, they will probably be less stable and bug-free than RHEL is, and there will be more risk in using CentOS Stream than in using RHEL. But perhaps this is too pessimistic a view. Right now we don't know and the CentOS project is pretty vague and is not making any promises. On the one hand they explicitly say that CentOS Stream will be “serving as the upstream (development) branch of Red Hat Enterprise Linux” (in the blog post); on the other hand they also say that “we expect CentOS Stream to have fewer bugs and more runtime features than RHEL” (in the FAQ in Q5).

(Also, it seems very unlikely that commercial software vendors will conflate the two the way they currently often say 'supported on RHEL/CentOS <x>', although I would expect the software to work on CentOS.)

All of this significantly reduces the advantages of using CentOS over something like Ubuntu LTS and increases the general risks of using CentOS. For a start, CentOS no longer gives us a longer support period than Ubuntu LTS; both are at most five years. Since using additional Linux distributions has a cost all by itself, and since CentOS no longer appears to have significant advantages over Ubuntu LTS for us, I expect that we will be migrating our CentOS 7 machines to Ubuntu 22.04 LTS in a couple of years, thereby reducing the number of different Linux distributions we have to operate.

There is a bit of me that regrets this. CentOS is at the end of a long line of my and our history with Red Hat, and it's sad to see it go. But I guess I now have an answer to my two year old uncertainty over CentOS's future (and I no longer need to write the entry about what things CentOS was our best option for).

PS: It's possible that another distribution will arise that does what CentOS did until now. But I don't know if there is enough interest there any more, or if all of the organizations (and people) who might care enough have moved on.

PPS: Oracle is trying to attract CentOS users, but like plenty of other people, I have no trust in Oracle. We are extremely unlikely to use Oracle Linux instead of Ubuntu LTS, even if we would get (much) longer package updates if all went well; the risks just aren't worth it.

CentOSStreamBigChanges written at 18:54:43

2020-12-06

Linux's hostname -s switch is now safe for many people, but the situation is messy

Slightly over a decade ago I wrote an entry about our discovery that 'hostname -s' sometimes did DNS lookups, depending on the exact version involved. We discovered this the hard way, when our DNS lookups failed at one point and suddenly 'hostname -s' itself started failing unexpectedly. We recently had a reason to use 'hostname -s' again, which caused me to remember this old issue and check the current situation. The good news is that common versions of hostname now don't do DNS lookups.

Well, probably, because it turns out that the Linux situation with hostname is much more complicated and tangled than I had any idea of before I started looking. It appears that there are no fewer than four sources for hostname, and which version you wind up using can depend on your Linux distribution. On top of that, the source you're probably using is distributed in an unusual way that makes it hard for me to say exactly when its 'hostname -s' became safe. So let's start with the basics.

If you check with 'rpm -qf /usr/bin/hostname' or 'dpkg -S /usr/bin/hostname' on appropriate systems (Fedora, CentOS, Debian, and Ubuntu), you will probably find that the version of hostname you're using comes from a 'hostname' package. This package has no upstream as such, and no source repository; the canonical source seems to be the Debian package. Old versions of its source can be found in its part of debsources. This version has handled 'hostname -s' correctly since somewhere between 2.95 (which doesn't) and 3.04 (which does).
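
As an illustration, the queries look like this (the rpm output line, including the package version, is made up for the example):

$ dpkg -S /usr/bin/hostname
hostname: /usr/bin/hostname
$ rpm -qf /usr/bin/hostname
hostname-3.20-7.fc31.x86_64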

(Based on the information shown in its part of debsources, hostname 2.95 was part of Debian 5.0 (Lenny), released in 2009, and hostname 3.04 was part of Debian 6.0 (Squeeze), released in 2011.)

Arch Linux seems to use a hostname that comes from the GNU inetutils project. The relevant code currently appears to do a DNS lookup if you use '-s', but it will proceed if the DNS lookup fails instead of erroring out (the way the hostname of a decade ago behaved). This does mean that under some conditions your 'hostname -s' command may stall for some time while its DNS lookup times out, instead of running essentially instantly.

The Linux manpages project has two manpages online for hostname (1, 2). The default one is from net-tools, and the other one is from GNU coreutils. The GNU coreutils version has no '-s' option (or other commonly supported ones), and as a result I would be surprised if many Linuxes used it. The net-tools version is apparently the original upstream of the plain hostname package version. Based on the Fedora 11 bug report about this, back a decade ago Fedora was using the net-tools version of hostname (I don't know about Debian). The current net-tools version of hostname.c now bypasses DNS lookups when used with '-s', a change that was made in 2015.

(While Fedora still packages net-tools, their package only has a few of its binaries. And apparently net-tools as a whole may be basically unmaintained; the last commits in the repository seem to be from 2018, and it was 2016 when it was particularly actively developed.)

HostnameSwitchFine written at 00:44:36
