Wandering Thoughts


Getting Xorg to let you terminate the X server with Ctrl + Alt + Backspace

This is a saga. You can skip to the end for the actual answer if you're impatient.

Yesterday I wrote about the history of terminating the X server with Ctrl + Alt + Backspace. I've known about this feature for a long time, but I only wind up using it very occasionally, even for Cinnamon on my laptop. This infrequent usage explains how I only recently noticed that it had stopped working on my office machine. When I read the Xorg manpage for another reason recently, I stumbled over the current XKB mechanism and decided to write a little entry about it, which I'd save for a day when I was extra tired. Then I did some research first, got some surprises, and wrote yesterday's entry instead.

My initial assumption about why C-A-B wasn't working for me was that the Xorg people had switched it off relatively recently (or changed some magic thing in how you had to turn it on). This 2010 SE question and its answers taught me otherwise; the switch had happened a very long time ago, and I was relatively certain that I had used C-A-B since then on my machines. So what had changed?

These days, the X server is mostly configured through configuration file snippets in a directory; on at least Fedora, this is /etc/X11/xorg.conf.d. In my office workstation's directory, I found a 00-keyboard.conf that dated from the start of 2015 and looked like this:

# Read and parsed by systemd-localed.
# It's probably wise not to edit this
# file manually too freely.
Section "InputClass"
   Identifier "system-keyboard"
   MatchIsKeyboard "on"
   Option "XkbLayout" "us"
   Option "XkbModel" "pc105+inet"
   Option "XkbVariant" "terminate:ctrl_alt_bksp,"
EndSection

I scanned this and said to myself 'well, it's setting the magic XKB option, so something else must be wrong'. I switched to using XKB back at the end of 2015, so at first I thought that my setxkbmap usage was overwriting this. However, inspection of the manpage told me that I was wrong (the settings are normally merged), and an almost identical 00-keyboard.conf on my home workstation worked with my normal setxkbmap. So yesterday I tiredly posted my history entry and muttered to myself.

This morning, with fresh eyes, I looked at this again and noticed the important thing: this file is setting the XKB keyboard variant, not the XKB options. It should actually be setting "XkbOptions", not "XkbVariant". Since there's no such keyboard variant, this actually did nothing except fool me. I might have noticed the issue if I'd run 'setxkbmap -query', but perhaps not.

All of this leads to the three ways to enable Ctrl + Alt + Backspace termination of the X server, at least on a systemd based system. First, as part of your X session startup you can run setxkbmap to specifically enable C-A-B, among any other XKB changes you're already making:

setxkbmap -option 'compose:rwin' -option 'ctrl:nocaps' -option 'terminate:ctrl_alt_bksp'

Second, you can manually create or edit a configuration file snippet in /etc/X11/xorg.conf.d or your equivalent to specify this. If you already have a 00-keyboard.conf or the equivalent, the option you want is:

Option "XkbOptions" "terminate:ctrl_alt_bksp"

(A trailing comma is okay, apparently.)
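For reference, a complete and corrected 00-keyboard.conf might look like this; the layout and model here are copied from my broken snippet above, so adjust them to whatever your system actually uses:

```
Section "InputClass"
   Identifier "system-keyboard"
   MatchIsKeyboard "on"
   Option "XkbLayout" "us"
   Option "XkbModel" "pc105+inet"
   Option "XkbOptions" "terminate:ctrl_alt_bksp"
EndSection
```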

Third, if you have Fedora or perhaps any systemd-based distribution, you can configure this the official way by running localectl with a command like this:

localectl --no-convert set-x11-keymap us pc105+inet "" terminate:ctrl_alt_bksp

There is a bear trap lurking here. That innocent looking "" is very important, as covered in the Arch wiki page. As they write (with my emphasis):

To set a model, variant, or options, all preceding fields need to be specified, but the preceding fields can be skipped by passing an empty string with "". [...]

Given that my original xorg.conf snippet had what should be the XKB options as the XKB variant, it seems very likely that back in January 2015, something ran localectl and left out that all-important "".
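In other words, the difference between a broken and a working setup here is probably just one missing argument:

```
# My guess at what happened in early 2015: without the empty variant
# field, the options wind up as the (nonexistent) XKB variant:
localectl --no-convert set-x11-keymap us pc105+inet terminate:ctrl_alt_bksp

# With the "" to skip the variant, the options land where they should:
localectl --no-convert set-x11-keymap us pc105+inet "" terminate:ctrl_alt_bksp
```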

(That I didn't really notice for a bit more than three years shows some mixture of how little I use C-A-B and how willing I am to shrug and ignore minor mysteries involving keyboards and X.)

My laptop had been set up and maintained as a stock Fedora machine; these days that apparently means that this option isn't enabled in the xorg.conf stuff. Unlike on my workstation (where I edited 00-keyboard.conf directly), on the laptop I did it the official way through localectl. I determined the other command line parameters by looking at the existing 00-keyboard.conf; I believe that on the laptop, the model (the 'pc105+inet' bit) was blank, as was the variant.

Sidebar: How my machines got to their Xorg keyboard state

I assume that before that early 2015 change, my office workstation's Xorg configuration had the magic XkbOptions setting that made it work. I'm pretty sure that C-A-B worked at some point since 2010 or 2011 or so. My home machine has a 00-keyboard.conf from October 2011, which is about when I installed Fedora on it, with comments that say it was created by system-setup-keyboard, and that has the necessary XkbOptions setting. My office machine's Fedora install dates to 2006, so it might have had any number of configuration oddities that confused things at some point.

(My home machine got a completely new Fedora 15 install in 2011 as part of digging myself out of my Fedora 8 hole. My office workstation never got stuck on an older Fedora release the way my home machine did, so the Fedora install's never been rebuilt from scratch. Sometimes I get vaguely tempted by the idea of a from-scratch rebuild, but then I get terrified of how much picky work it would be just to get back to where I am now.)

XorgBackspaceTerminate written at 23:11:47; Add Comment


A broad overview of how modern Linux systems boot

For reasons beyond the scope of this entry, today I feel like writing down a broad and simplified overview of how modern Linux systems boot. Due to being a sysadmin who has stubbed his toe here repeatedly, I'm going to especially focus on points of failure.

  1. The system loads and starts the basic bootloader somehow, through either BIOS MBR booting or UEFI. This can involve many steps on its own and any number of things can go wrong, such as unsigned UEFI bootloaders on a Secure Boot system. Generally these failures are the most total; the system reports there's nothing to boot, or it repeatedly reboots, or the bootloader aborts with what is generally a cryptic error message.

    On a UEFI system, the bootloader needs to live in the EFI system partition, which is normally a FAT32 filesystem. Some people have had luck making this a software RAID mirror with the right superblock format; see the comments on this entry.

  2. The bootloader loads its configuration file and perhaps additional modules from somewhere, usually your /boot but also perhaps your UEFI system partition. Failures here can result in extremely cryptic errors, dropping you into a GRUB shell, or ideally a message saying 'can't find your menu file'. The configuration file location is usually hardcoded, which is sometimes unfortunate if your distribution has picked a bad spot.

    For GRUB, this spot has to be on a filesystem and storage stack that GRUB understands, which is not necessarily the same as what your Linux kernel understands. Fortunately GRUB understands a lot these days, so under normal circumstances you're unlikely to run into this.

    (Some GRUB setups have a two stage configuration file, where the first stage just finds and loads the second one. This allows you more flexibility in where the second stage lives, which can be important on UEFI systems.)

  3. Using your configuration file, the bootloader loads your chosen Linux kernel and an initial ramdisk into memory and transfers control to the kernel. The kernel and initramfs image also need to come from a filesystem that your bootloader understands, but with GRUB the configuration file allows you to be very flexible about how they're found and where they come from (and it doesn't have to be the same place as grub.cfg is, although on a non-UEFI system both are usually in /boot).

    There are two things that can go wrong here; your grub.cfg can have entries for kernels that don't exist any more, or GRUB can fail to locate and bring up the filesystem where the kernel(s) are stored. The latter can happen if, for example, your grub.cfg has the wrong UUIDs for your filesystems. It's possible to patch this up on the fly so you can boot your system.

  4. The kernel starts up, creates PID 1, and runs /init from the initramfs as PID 1. This process and the things that it runs then flail around doing various things, with the fundamental goal of finding and mounting your real root filesystem and transferring control to it. In the process of doing this it will try to assemble software RAID devices and other storage stuff like LVM, perhaps set sysctls, and so on. The obvious and traditional failure mode here is that the initramfs can't find or mount your root filesystem for some reason; this usually winds up dropping you into some sort of very minimal rescue shell. If this happens to you, you may want to boot from a USB live image instead; they tend to have more tools and a better environment.

    (Sometimes the reasons for failure are obscure and annoying.)

    On many traditional systems, the initramfs /init was its own separate thing, often a shell script, and was thus independent from and different from your system's real init. On systemd based systems, the initramfs /init is actually systemd itself and so even very early initramfs boot is under systemd's control. In general, a modern initramfs is a real (root) filesystem that processes in the initramfs will see as /, and its contents (both configuration files and programs) are usually copied from the versions in your root filesystem. You can inspect the whole thing with lsinitrd or lsinitramfs.

    Update: It turns out that the initramfs init is still a shell script in some Linux distributions, prominently Debian and Ubuntu. The initramfs init being systemd may be a Red Hat-ism (Fedora and RHEL). Thanks to Ben Hutchings in the comments for the correction.

    How the initramfs /init pivots into running your real system's init daemon on your real system's root filesystem is beyond the scope of this entry. The commands may be simple (systemd just runs 'systemctl switch-root'), but how they work is complicated.

    (That systemd is the initramfs /init is convenient in a way, because it means that you don't need to learn an additional system to inspect how your initramfs works; instead you can just look at the systemd units included in the initramfs and follow along in the systemd log.)
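    If you want to poke at your own initramfs this way, the command (and the file name under /boot) varies by distribution; a quick sketch, with the exact image names being whatever your system actually has:

    ```
    # Fedora/RHEL style (dracut-built initramfs):
    lsinitrd /boot/initramfs-$(uname -r).img | less

    # Debian/Ubuntu style:
    lsinitramfs /boot/initrd.img-$(uname -r) | less
    ```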

  5. Your real init system starts up to perform basic system setup, bringing the system to the state we think of as its normal baseline; basically, this is everything you usually get if you boot into a modern single user mode. This does things like set the hostname, remount the root filesystem read-write, apply your sysctl settings (from the real root filesystem this time), configure enough networking so that you have a loopback device and the IPv4 and IPv6 localhost addresses, have udev fiddle around with hardware, and especially mount all of your local filesystems (which includes activating underlying storage systems like software RAID and LVM, if they haven't been activated already in the initramfs).

    The traditional thing that fails here is that one or more of your local filesystems can't be mounted. This often causes this process to abort and drop you into a single user rescue shell environment.

    (On a systemd system the hostname is actually set twice, once in the initramfs and then again in this stage.)

  6. With your local filesystems mounted and other core configuration in place, your init system continues on to boot your system the rest of the way. This does things like configure your network (well, perhaps; these days some systems may defer it until you log in), start all of the system's daemons, and eventually enable logins on text consoles and perhaps start a graphical login environment like GDM or LightDM. At the end of this process, your system is fully booted.

    Things that fail here are problems like a daemon not starting or, more seriously, the system not finding the network devices it expects and so not getting itself on the network at all. Usually the end result is that you still wind up with a login prompt (either a text console or graphics), it's just that there were error messages (which you may not have seen) or some things aren't working. Very few modern systems abort the boot and drop into a rescue environment for failures during this stage.

    On a systemd system, this transfers control from the initramfs systemd to the systemd binary on your root filesystem (which takes over as PID 1), but systemd maintains continuity of its state and boot process and you can see the whole thing in journalctl. The point where the switch happens is reported as 'Starting Switch Root...' and then 'Switching root.'

All of System V init, Upstart, and systemd have this distinction between the basic system setup steps and the later 'full booting' steps, but they implement it in different ways. Systemd doesn't draw a hard distinction between the two phases and you can shim your own steps into either portion in basically the same way. System V init tended to implement the early 'single user' stage as a separate nominal runlevel, runlevel 'S', that the system transitioned through on the way to its real target runlevel. Upstart is sort of a hybrid; it has a startup event that's emitted to trigger a number of things before things start fully booting.

(This really is an overview. Booting Linux on PC hardware has become a complicated process at the best of times, with a lot of things to set up and fiddle around with.)

LinuxBootOverview written at 00:17:31; Add Comment


The mess Ubuntu 18.04 LTS has made of libreadline

I'll start with my tweet:

I see that Ubuntu still hasn't provided libreadline6 for Ubuntu 18.04 LTS, despite that being the default and best readline to compile against on both 16.04 LTS and 14.04 LTS. Binaries that work across LTS versions? Evidently we can't have that.

Even the new expanded Twitter is too little space to really explain things for people who don't already have some idea of what I'm talking about, so let's expand that out.

Let's suppose that you've built yourself a program that uses GNU Readline on an Ubuntu 14.04 or Ubuntu 16.04 machine (perhaps a custom shell). You have a mixed environment, with common binaries used across multiple hosts (for example, because you have NFS fileservers). When you try to run this program on Ubuntu 18.04, here is what will happen:

<program>: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory

What is going on here (besides there being no libreadline.so.6 on Ubuntu 18.04) is that shared libraries on Linux have a version, which is the number you see after the .so, and programs are linked against a specific version of each shared library. This shared library version number changes if the library's ABI changes so that it wouldn't be safe for old programs to call the new version of the library (for example because a structure changed size).

The standard Ubuntu (and Debian) naming scheme for shared library packages is that 'libreadline6' is the Ubuntu package for version 6 of libreadline (ie for libreadline.so.6). So you would normally fix this problem on Ubuntu 18.04 by installing the 18.04 version of libreadline6. Unfortunately no such package exists, which is the problem.
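As a minimal illustration of that naming scheme, you can derive the binary package name from a library's soname with plain shell string operations (the soname here is just the one from the error above):

```shell
# Map a shared library soname to its Debian/Ubuntu package name:
# libreadline.so.6 -> libreadline6
soname="libreadline.so.6"
base="${soname%%.so.*}"    # everything before ".so." -> libreadline
ver="${soname##*.so.}"     # everything after ".so."  -> 6
pkg="${base}${ver}"
echo "$pkg"                # libreadline6
```

(You can see which sonames a binary actually wants with 'objdump -p <program> | grep NEEDED', or just from the loader error itself.)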

Over the years, Ubuntu has packaged various versions of GNU Readline. Of the still supported LTS releases, 14.04 and 16.04 ship libreadline5 and libreadline6, while Ubuntu 18.04 ships libreadline5 and libreadline7 and does not have libreadline6. So, can you build a program on 14.04 or 16.04 so that it uses libreadline5, which would perhaps let you run it on 18.04? Unfortunately you can't even do this, as 14.04 and 16.04 only let you build programs that use libreadline6.

(The libreadline-dev package on 14.04 and 16.04 installs libreadline6-dev, and there is no libreadline5-dev.)

The reason Debian and Ubuntu package multiple versions of GNU Readline is for backwards compatibility, so that programs compiled on older versions of the distribution, using older versions of the shared library, will still run on the new version. That's why libreadline5 is packaged even on 18.04. But this time around, Ubuntu apparently decided to throw GNU Readline backwards compatibility to 14.04 and 16.04 under the bus for some reason, or perhaps they just didn't bother to check and notice despite the fact that this should be a routine check when putting an Ubuntu release together (especially an LTS one).

If you're in this situation, the good news is that there is a simple manual fix. Just download a suitable binary .deb of libreadline6 by hand (for example from the 16.04 package) and install it on your 18.04 system. This appears to work fine and hasn't blown up on us yet. If you have a lot of 18.04 systems to install, you probably want to add this to your install automation. Perhaps someday you'll be able to take it out in favour of installing the official Ubuntu 18.04 version of libreadline6, but based on the current state of affairs I wouldn't hold my breath about that.
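Concretely, the manual fix is only a couple of commands. Note that VERSION below is a placeholder; the actual current 16.04 version of libreadline6 is whatever packages.ubuntu.com says it is:

```
# Download the 16.04 libreadline6 .deb (eg via the links on its
# packages.ubuntu.com page), then install it directly with dpkg:
sudo dpkg -i libreadline6_VERSION_amd64.deb

# Verify that programs can now find the library:
ldconfig -p | grep libreadline.so.6
```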

(There are various standard Ubuntu programs that use GNU Readline, such as /usr/bin/ftp. However, they're all specific to the particular Ubuntu release and so they all use its GNU Readline version, whatever that is; on 18.04, they all use libreadline7. Should you copy the 16.04 ftp binary over to an 18.04 machine you'd have this problem with it too, but there's very little reason to do that; 18.04 comes with its own version of ftp, after all.)

PS: Ubuntu's Launchpad.net is such a mess that I can't even tell if this has been reported as a bug. Oh, apparently this is probably the right page, and also it looks like no such bug has been filed. It's sad that I could only find it by the 'Bug Reports' link on the packages.ubuntu.com page for it.

Ubuntu1804ReadlineMess written at 00:56:39; Add Comment


Taking over program names in Linux is generally hard

One reaction to the situation with net-tools versus iproute2, where the Linux code for ifconfig, netstat, and so on is using old and incomplete interfaces and is basically unmaintained, is that the new and actively maintained iproute2 should provide its own reimplementations of ifconfig, netstat, and so on that preserve the interface (or as much of it as possible) while using modern mechanisms. Setting aside the question of whether the people developing iproute2 even like the ifconfig interface and are willing to spend their time writing a version of it, there are additional difficulties in doing this kind of name takeover in Linux.

The core problem is that existing Linux distributions and existing systems will already have those programs provided from a completely different package. This generally has two effects. First, some Linux distributions will disagree with what you're doing and want to keep providing those programs from the other package, which means that the upstream package has to be able to build and install things without its version of the programs it's theoretically trying to take over (ie, the new release of iproute2 has to be able to build without its version of ifconfig et al).

Second, when distributions decide that they trust and prefer your versions of the programs better than the old ones, they have to be able to do some sort of package upgrade or migration that replaces the other package with a version of your package that has your version of the programs included. There are also inevitably going to be distributions that will want to give users a choice of which version of the programs to install, which means that some of the time the distribution will actually build two binary packages for your package, one with your core tools ('iproute2') and one with your replacements for the other package's programs (a hypothetical 'iproute2-nettools', that has to cleanly replace 'net-tools').

Some of this work has to be done by the developers of the new package; they have to make replacement programs that are compatible enough that users won't complain, and then they have to make it possible to not build these programs or build them but not install them. Other portions of the work have to be done by distributions, who have to package all of this up, make sure that they don't accidentally create package conflicts, make sure package upgrades will work well and won't blow up dependencies, and so on. Since this complicates the lives of distributions and the people preparing packages, it's not something that they're likely to undertake casually. In fact, distributions are probably not likely to undertake it at all unless the developers of the new package actively try to push for it, or unless (and until) the programs in the old package become clearly broken and basically force themselves to be replaced.

(I'm generously assuming here that the old package is truly abandoned and everyone agrees that it has to go sometime. If there are people who want it to stay, you have additional problems.)

All of this is a consequence of there being multiple Linux distributions that will make different decisions, and of Linux distributions being developed independently from each other and from the upstream packages. If everything was handled by a single group of developers, such takeovers would have much less to worry about and to coordinate (and you wouldn't have packaging work being done over and over again in different packaging systems).

TakingOverNamesHard written at 01:44:39; Add Comment


There's real reasons for Linux to replace ifconfig, netstat, et al

One of the ongoing system administration controversies in Linux is that there is an ongoing effort to obsolete the old, cross-Unix standard network administration and diagnosis commands of ifconfig, netstat and the like and replace them with fresh new Linux specific things like ss and the ip suite. Old sysadmins are generally grumpy about this; they consider it yet another sign of Linux's 'not invented here' attitude that sees Linux breaking from well-established Unix norms to go its own way. Although I'm an old sysadmin myself, I don't have this reaction. Instead, I think that it might be both sensible and honest for Linux to go off in this direction. There are two reasons for this, one ostensible and one subtle.

The ostensible surface issue is that the current code for netstat, ifconfig, and so on operates in an inefficient way. Per various people, netstat et al operate by reading various files in /proc, and doing this is not the most efficient thing in the world (either on the kernel side or on netstat's side). You won't notice this on a small system, but apparently there are real impacts on large ones. Modern commands like ss and ip use Linux's netlink sockets, which are much more efficient. In theory netstat, ifconfig, and company could be rewritten to use netlink too; in practice this doesn't seem to have happened and there may be political issues involving different groups of developers with different opinions on which way to go.

(Netstat and ifconfig are part of net-tools, while ss and ip are part of iproute2.)

However, the deeper issue is the interface that netstat, ifconfig, and company present to users. In practice, these commands are caught between two masters. On the one hand, the information the tools present and the questions they let us ask are deeply intertwined with how the kernel itself does networking, and in general the tools are very much supposed to report the kernel's reality. On the other hand, the users expect netstat, ifconfig and so on to have their traditional interface (in terms of output, command line arguments, and so on); any number of scripts and tools fish things out of ifconfig output, for example. As the Linux kernel has changed how it does networking, this has presented things like ifconfig with a deep conflict; their traditional output is no longer necessarily an accurate representation of reality.

For instance, here is ifconfig output for a network interface on one of my machines:

 ; ifconfig -a
 em0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 128.100.3.XX  netmask  broadcast
    inet6 fe80::6245:cbff:fea0:e8dd  prefixlen 64  scopeid 0x20<link>
    ether 60:45:cb:a0:e8:dd  txqueuelen 1000  (Ethernet)

There are no other 'em0:...' devices reported by ifconfig, which is unfortunate because this output from ifconfig is not really an accurate picture of reality:

; ip -4 addr show em0
  inet 128.100.3.XX/24 brd scope global em0
    valid_lft forever preferred_lft forever
  inet 128.100.3.YY/24 brd scope global secondary em0
    valid_lft forever preferred_lft forever

This interface has an IP alias, set up through systemd's networkd. Perhaps there once was a day when all IP aliases on Linux had to be set up through additional alias interfaces, which ifconfig would show, but these days each interface can have multiple IPs and directly setting them this way is the modern approach.

This issue presents programs like ifconfig with an unappealing choice. They can maintain their traditional output, which is now sometimes a lie but which keeps people's scripts working, or they can change the output to better match reality and probably break some scripts. It's likely to be the case that the more they change their output (and arguments and so on) to match the kernel's current reality, the more they will break scripts and tools built on top of them. And some people will argue that those scripts and tools that would break are already broken, just differently; if you're parsing ifconfig output on my machine to generate a list of all of the local IP addresses, you're already wrong.

(If you try to keep the current interface while lying as little as possible, you wind up having arguments about what to lie about and how. If you can only list one IPv4 address per interface in ifconfig, how do you decide which one?)
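For what it's worth, the modern way for a script to get that list is to parse 'ip -o' output, which is one line per address (secondaries included) and so much friendlier to tools than ifconfig's layout. A sketch, using captured sample output with made-up TEST-NET addresses standing in for a live system:

```shell
# 'ip -4 -o addr show' emits one line per IPv4 address; field 4 is the
# address with its prefix length. The sample here is hypothetical output.
sample='1: lo    inet 127.0.0.1/8 scope host lo
2: em0    inet 192.0.2.10/24 brd 192.0.2.255 scope global em0
2: em0    inet 192.0.2.11/24 brd 192.0.2.255 scope global secondary em0'

# Strip the /prefix from field 4 to get the bare addresses:
addrs=$(printf '%s\n' "$sample" | awk '{ sub(/\/.*$/, "", $4); print $4 }')
printf '%s\n' "$addrs"
```

On a real system you'd replace the sample with '$(ip -4 -o addr show)'; the secondary address shows up like any other, which is exactly what ifconfig gets wrong.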

In a sense, deprecating programs like ifconfig and netstat that have wound up with interfaces that are inaccurate but hard to change is the honest approach. Their interfaces can't be fixed without significant amounts of pain and they still work okay for many systems, so just let them be while encouraging people to switch to other tools that can be more honest.

(This elaborates on an old tweet of mine.)

PS: I believe that the kernel interfaces that ifconfig and so on currently use to get this information are bound by backwards compatibility issues themselves, so getting ifconfig to even know that it was being inaccurate here would probably take code changes.

ReplacingNetstatNotBad written at 01:31:08; Add Comment


I'm worried about Wayland but there's not much I can do about it

In a comment on my entry about how I have a boring desktop, Opk asked a very good question:

Does it concern you at all that Wayland may force change on you? It may be a good few years away yet and perhaps fvwm will be ported.

Oh my yes, I'm definitely worried about this (and it turns out that I have been for quite some time, which also goes to show how long Wayland has been slowly moving forward). The FVWM people have said that they're not going to try to write a Wayland version of FVWM, which means that when Wayland inevitably takes over I'm going to need a new 'window manager' (in Wayland this is a lot more than just what it is in X) and possibly an entirely new desktop environment to go with it.

The good news is that apparently XWayland provides a reasonably good way to let X programs still display on a Wayland server, so I won't be forced to abandon as many X things as I expected. I may even be able to continue to run remote X programs via SSH and XWayland, which is important for my work desktop. This X to Wayland bridge will mean that I can keep not just programs with no Wayland equivalent but also old favorites like xterm, where I simply don't want to use what will be the Wayland equivalent (I don't like gnome-terminal or konsole very much).

The bad news for me is two-fold. First, I'm not attracted to tiling window managers at all, and since tiling window managers are the in thing, they're the most common alternate window managers for Wayland (based on various things, such as the Arch list). There seems to be a paucity of traditional stacking Wayland WMs that are as configurable as fvwm is, although perhaps there will be alternate methods in Wayland to do things like have keyboard and mouse bindings. It's possible that this will change when Wayland starts becoming more dominant, but I'm not holding my breath; heavily customized Linux desktop environments have been feeling more and more like extreme outliers over the years.

Second, it seems at least reasonably likely that a lot of current tray applets and notification systems will stop being general and start becoming tightly bound to mainstream desktop environments like Gnome 3, KDE, and Cinnamon. We've already seen this with Gnome 3 and Cinnamon, which have 'applets' that are now JavaScript extensions that run in the context of the Gnome and Cinnamon shells and simply can't be used outside them. In a Wayland world that focuses attention more than ever on a few mainstream desktop environments, will there be any equivalent of stalonetray and things for it like pnmixer?

(The people writing tiling Wayland window managers like Sway will almost certainly want there to be, because it will be hard to have a viable alternate environment without them. The question is whether major projects like NetworkManager will oblige or whether NM will use its limited development resources elsewhere.)

So yes, I worry about all of this. But in practice it's a very abstracted worry. To start with, Wayland is still not really here yet. Fedora is using it more, but it's by no means universal even for Gnome (where it's the default), and I believe that KDE (and other supported desktop environments) don't even really try to use it. At this rate it will be years and years before anyone is seriously talking about abandoning X (since Gnome programs will still face pressure to be usable in KDE, Cinnamon, and other desktop environments that haven't yet switched to Wayland).

(I believe that Fedora is out ahead of other Linux distributions, too. People like Debian will probably be trying to support X and pressure people to support X for years to come.)

More significantly, there's nothing I can do about all of this. How Wayland in general and Wayland environments develop is far beyond my ability to influence; in practice I'm a far outlier in window manager and desktop land, and so I'll have to make do with whatever is available. If I'm lucky it will be something generally comparable to my current environment; if I'm not, well, I can use Cinnamon and it will probably survive in a Wayland-only world. I might even learn enough Cinnamon shell and JavaScript to customize it a bit.

(If I had a lot of energy and enthusiasm, perhaps I would be trying to write the stacking, construction kit style Wayland window manager and compositor of my dreams. I don't have anything like that energy. I do hope other people do, and while I'm hoping I hope that they like textual icon managers as much as I do.)

WaylandWorries written at 01:33:05; Add Comment


How you run out of inodes on an extN filesystem (on Linux)

I've mentioned that we ran out of inodes on a Linux server and covered what the high level problem was, but I've never described the actual mechanics of how and why you can run out of inodes on a filesystem, or more specifically on an extN filesystem. I have to be specific about the filesystem type, because how this is handled varies from filesystem to filesystem; some either have no limit on how many inodes you can have or have such a high limit that you're extremely unlikely to run into it.

The fundamental reason you can run out of inodes on an extN filesystem is that extN statically allocates space for inodes; every extN filesystem reserves space for a fixed number of inodes, and you can never have more than that. If you use 'df -i' on an extN filesystem, you can see this number for the filesystem, and you can also see it with dumpe2fs, which reports other useful details as well. Here, let's look at an ext4 filesystem:

# dumpe2fs -h /dev/md10
Block size:               4096
Blocks per group:         32768
Inodes per group:         8192

I'm showing this information because it leads to the important parameter for how many inodes any particular extN filesystem has, which is the bytes/inode ratio (mke2fs's -i argument). By default this is 16 KB, ie there will be one inode for every 16 KB of space in the filesystem, and as the mke2fs manpage covers, it's not too sensible to set it below 4 KB (the usual extN block size).
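As a sketch of that arithmetic (the function name is mine, and real mke2fs rounds the count per block group, so the actual number will differ slightly):

```python
def extn_inode_count(fs_size_bytes, bytes_per_inode=16 * 1024):
    """Approximate how many inodes mke2fs will reserve for a
    filesystem of a given size at a given bytes/inode ratio
    (mke2fs's -i argument, defaulting to 16 KB)."""
    return fs_size_bytes // bytes_per_inode

# A 100 GiB filesystem at the default ratio gets roughly
# 6.5 million inodes, fixed for the life of the filesystem.
print(extn_inode_count(100 * 2**30))  # → 6553600
```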

The existence of the bytes/inode ratio gives us a straightforward answer for how you can run a filesystem out of inodes: you simply create lots of files that are smaller than this ratio. ExtN implicitly assumes that each inode will on average use at least 16 KB of disk space; if on average your inodes use less, you will run out of inodes before you run out of disk space. One tricky thing here is that this space doesn't have to be used up by regular files, because other sorts of inodes can be small too. Probably the easiest other source is directories; if you have lots of directories with a relatively small number of subdirectories and files in each, it's quite possible for many of them to be smaller than 16 KB, and in some cases you can have a great many subdirectories.

(In our problem directory hierarchy, almost all of the directories are 4 KB, although a few are significantly larger. And the hierarchy can have a lot of subdirectories when things go wrong.)

Another case is symbolic links. Most symbolic links are quite small, and in fact ext4 may be able to store your symbolic link entirely in the inode itself. This means that you can potentially use up a lot of inodes without using any disk space (well, beyond the space for the directories that the symbolic links are in). There are other sorts of special files that also use little or no disk space, but you probably don't have tons of them in an extN filesystem unless something unusual is going on.

(If you do have tens of thousands of Unix sockets or FIFOs or device files, though, you might want to watch out. Or even tons of zero-length regular files that you're using as flags and a persistence mechanism.)

Most people will never run into this on most filesystems, because most filesystems have an average inode size usage that's well above 16 KB. There are usually plenty of files over 16 KB, not that many symbolic links, and relatively few (small) directories compared to the regular files. For instance, one of my relatively ordinary Fedora root filesystems has a bytes/inode ratio of roughly 73 KB per inode, and another is at 41 KB per inode.

(You can work out your filesystem's bytes/inode ratio simply by dividing the space used in KB by the number of inodes used.)
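A hedged sketch of doing that division from Python with os.statvfs (the function name is my invention; on filesystems that don't report inode counts this returns None):

```python
import os

def bytes_per_inode_used(path="/"):
    """Rough bytes/inode ratio for the filesystem holding path:
    bytes actually used divided by inodes actually used."""
    st = os.statvfs(path)
    used_inodes = st.f_files - st.f_ffree
    if used_inodes <= 0:
        return None
    used_bytes = (st.f_blocks - st.f_bfree) * st.f_frsize
    return used_bytes / used_inodes
```

If this number is drifting down toward your filesystem's configured bytes/inode ratio, you're heading for trouble.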

HowInodesRunOut written at 01:10:42


ZFS on Linux's development version now has much better pool recovery for damaged pools

Back in March, I wrote about how much better ZFS pool recovery was coming, along with what turned out to be some additional exciting features, such as the long-awaited feature of shrinking ZFS pools by removing vdevs. The good news for people using ZFS on Linux is that most of both features have very recently made it into the ZFS on Linux development source tree. This is especially relevant and important if you have a damaged ZFS on Linux pool that either doesn't import or panics your system when you do import it.

(These changes are OpenZFS 9075 and its dependencies such as OpenZFS 8961, and the vdev removal changes, although there are followup fixes to them such as OpenZFS 9290.)

These changes aren't yet in any ZFS on Linux release and I suspect that they won't appear until 0.8.0 is released someday (ie, they won't be ported into the current 0.7.x release branch). However, it's fairly easy to build ZFS on Linux from source if you need to temporarily run the latest version in order to recover or copy data out of a damaged pool that you can't otherwise get at. I believe that some pool recovery can be done as a one-time import and then you can revert back to a released version of ZFS on Linux to use the now-recovered pool, but certainly not all pool import problems can be repaired like this.

(As far as vdev removal goes, it currently requires permanently using a version of ZFS that supports it, because it adds a device_removal feature to your pool that will never deactivate, per zpool-features. This may change at some point in the future, but I wouldn't hold my breath. It seems miraculous enough that we've gotten vdev removal after all of these years, even if it's only for single devices and mirror vdevs.)

I haven't tried out either of these features, but I am running a recently built development version of ZFS on Linux with them included and nothing has exploded so far. As far as things go in general, ZFS on Linux has a fairly large test suite and these changes added tests along with their code. And of course they've been tested upstream and OmniOS CE had enough confidence in them to incorporate them.

ZFSOnLinuxBetterPoolImport written at 22:26:45


How we're going to be doing custom NFS mount authorization on Linux

We have a long standing system of custom NFS mount authorization on our current OmniOS-based fileservers. This system has been working reliably for years, but our next generation of fileservers will use a different OS, almost certainly Linux, and our current approach doesn't work on Linux, so we had to develop a new one.

One of the big attributes of our current system is that it doesn't require the clients to do anything special; they do NFS mount requests or NFS activity, and provided that their SSH daemon is running, they get automatically checked and authorized. This is important to making the system completely reliable, which is very important if we're going to use it for our own machines (which are absolutely dependent on NFS working). However, the goals of our NFS authorization have shifted so that we no longer require this for our own machines. In light of that, we decided to adopt a more straightforward approach on Linux, one that requires client machines to explicitly do a manual step on boot before they can get NFS access.

The overall 'authorization' system works via firewall rules, where only machines in a particular ipset table can talk to the NFS ports on the fileserver. Control over actual NFS mounts and NFS level access is still done through exportfs and so on, but you have to be in the ipset table in order to even get that far. To get authorized, ie to get added to the ipset table, your client machine makes a connection to a specific TCP port on the fileserver. This ends up causing a Go program to make a connection to the SSH server on the client machine and verify its host key against a known_hosts file that we maintain; if the key verifies, we add the client's IP address to the ipset table, and if it fails to verify, we explicitly remove the client's IP address from the table.

(This connection can be done as simply as 'nc FILESERVER PORT </dev/null >/dev/null'. In practice clients may want to record the output from the port, because we spit out status messages, including potentially important ones about why a machine failed verification. We syslog them too, but those syslog logs aren't accessible to other people.)
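The client side really is that simple; here is a minimal Python sketch of the nc equivalent (the function name is mine), which connects, sends nothing, and returns whatever status output the fileserver writes back:

```python
import socket

def poke_authorization_port(host, port, timeout=10):
    """Connect to the fileserver's authorization TCP port, send
    nothing (like 'nc ... </dev/null'), and return whatever
    status messages the other end sends before closing."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.shutdown(socket.SHUT_WR)  # signal that we have no input
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")
```

A client would log this function's return value somewhere, since it may contain the reason a verification failed.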

This Go program can actually check and handle multiple IP addresses at once (doing so in parallel). In this mode, it runs from cron every few minutes to re-verify all of the currently authorized hosts. The program is sufficiently fast that it can complete this full re-verification in under a second (and with negligible resource usage); in practice, the speed limit is how long of a timeout we use to wait for machines to respond.

To handle fileserver reboots, verified IPs are persistently recorded by touching a file (with the name of their IP address) in a magic directory. On boot and on re-verification, we merge all of the IPs from this directory with the IPs from the ipset table and verify them all. Any IPs that pass verification but aren't in the ipset table are added back to the table (and any IPs in the ipset table but not recorded on disk are persisted to disk), which means that on boot all IPs will be re-added to the ipset table without the client having to do anything.

Clients theoretically don't have to do anything once they've booted and been authorized, but because things can always go wrong we're going to recommend that they re-poke the magic TCP port every so often from cron, perhaps every five or ten minutes. That will ensure that any NFS outage should have a limited duration and thus hopefully a limited impact.

(In theory the parallel Go checker is so fast that we could just extract all of the client IPs from our known_hosts and always try to verify them, say, once every fifteen minutes. In practice I think we're unlikely to do this because there are various potential issues and it's probably unlikely to help much in practice.)

We're probably going to provide people with a little Python program that automatically does the client side of the verification for all current NFS mounts and all mounts in /etc/fstab, and then logs the results and so on. This seems more friendly than asking all of the people involved to write their own set of scripts or commands for this.
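One piece of such a client program might look like this sketch, which pulls the set of NFS server names out of fstab-formatted text (all server names and paths here are hypothetical examples, not our real ones):

```python
def nfs_servers_from_fstab(fstab_text):
    """Return the set of server names from NFS entries in
    fstab-formatted text (entries look like SERVER:/path)."""
    servers = set()
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mountpoint, fstype, options, ...
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4") and ":" in fields[0]:
            servers.add(fields[0].split(":", 1)[0])
    return servers

sample = """
# hypothetical /etc/fstab excerpt
/dev/sda1  /      ext4  defaults  0 1
fs1:/h/281 /h/281 nfs   rw,hard   0 0
fs2:/w/42  /w/42  nfs4  rw,hard   0 0
"""
print(sorted(nfs_servers_from_fstab(sample)))  # → ['fs1', 'fs2']
```

The real program would then poke each server's authorization port and log what came back.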

PS: Our own machines on trusted subnets are handled by just having a blanket allow rule in the firewall for those subnets. You only have to be in the ipset table if you're not on one of those subnets.

CustomMountAuthorizationII written at 00:34:33


You probably need to think about how to handle core dumps on modern Linux servers

Once upon a time, life was simple. If and when your programs hit fatal problems, they generally dumped core in their current directory under the name core (sometimes you could make them be core.<PID>). You might or might not ever notice these core files, and some of the time they might not get written at all because of various permissions issues (see the core(5) manpage). Then complications ensued due to things like Apport, ABRT, and systemd-coredump, where an increasing number of Linux distributions have decided to take advantage of the full power of the kernel.core_pattern sysctl to capture core dumps themselves.

(The Ubuntu Apport documentation claims that it's disabled by default on 'stable' releases. This does not appear to be true any more.)

In a perfect world, systems like Apport would capture core dumps from system programs for themselves and arrange that everything else was handled in the traditional way, by writing a core file. Unfortunately this is not a perfect world. In this world, systems like Apport almost always either discard your core files entirely or hide them away where you need special expertise to find them. In many situations this may not be what you want, in which case you need to think about what you do want and what's the best way to get it.

I think that your options break down like this:

  • If you're only running distribution-provided programs, you can opt to leave Apport and its kin intact. Intercepting and magically handling core dumps from standard programs is their bread and butter, and the result will probably give you the smoothest way to file bug reports with your distribution. Since you're not running your own programs, you don't care about how Apport (doesn't) handle core dumps for non-system programs.

  • Disable any such system and set kernel.core_pattern to something useful; I like 'core.%u.%p'. If the system only runs your services, with no users having access to it, you might want to have all core dumps written to some central directory that you monitor; otherwise, you probably want to set it so that core dumps go in the process's current directory.

    The drawback of this straightforward approach is that you'll fail to capture core dumps from some processes.

  • Set up your own program to capture core dumps and save them somewhere. The advantage of such a program is that you can capture core dumps under more circumstances and also that you can immediately trigger alerting and other things if particular programs or processes die. You could even identify when you have a core dump for a system program and pass the core dump on to Apport, systemd-coredump, or whatever the distribution's native system is.

    One drawback of this is that if you're not careful, your core dump handler can hang your system.
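For the middle option, making the setting persistent is just a matter of a sysctl.d fragment; here's a sketch (the filename is arbitrary, and 'core.%u.%p' is merely my preferred pattern; see core(5) for the full list of % escapes):

```
# /etc/sysctl.d/50-coredump.conf
# Write core files as core.<uid>.<pid> in the crashing
# process's current directory instead of piping them to
# Apport, ABRT, or systemd-coredump.
kernel.core_pattern = core.%u.%p
```

(Remember that processes also need a nonzero core file size limit, eg 'ulimit -c unlimited', before anything gets written at all.)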

If you have general people running things on your servers and those things may run into segfaults and otherwise dump core, it's my view that you probably want to do the middle option of just having them write traditional core files to the current directory. People doing development tend to like having core files for debugging, and this option is likely to be a lot easier than trying to educate everyone on how to extract core dumps from the depths of the system (if this is even possible; it's theoretically possible with systemd at least).

Up until now we've just passively accepted the default of Apport on our Ubuntu 16.04 systems, but now that we're considering what we want to change for Ubuntu 18.04 and I've been reminded of this whole issue by Julia Evans' How to get a core dump for a segfault on Linux (where she ran into the Apport issue), I think we want to change things to the traditional 'write a core file' setup (which is how it was in Ubuntu 14.04).

(Also, Apport has had its share of security issues over the years, eg 1, 2.)

PS: Since systemd now wants to handle core dumps, I suspect that this is going to be an issue in more and more Linux distributions. Or maybe everyone is going to make sure that that part of systemd doesn't get turned on.

CoreDumpsOnServers written at 21:56:54

