Getting Xorg to let you terminate the X server with Ctrl + Alt + Backspace
This is a saga. You can skip to the end for the actual answer if you're impatient.
Yesterday I wrote about the history of terminating the X server with Ctrl + Alt + Backspace. I've known about this feature for a long time, but I only wind up using it very occasionally, even for Cinnamon on my laptop. This infrequent usage explains how I only recently noticed that it had stopped working on my office machine. When I read the Xorg manpage for another reason recently, I stumbled over the current XKB mechanism. I decided to write a little entry about it, which I decided to save for a day when I was extra tired. Then I decided to do some research first, got some surprises, and wrote yesterday's entry instead.
My initial assumption about why C-A-B wasn't working for me was that the Xorg people had switched it off relatively recently (or changed some magic thing in how you had to turn it on). This 2010 SE question and its answers taught me otherwise; the switch had happened a very long time ago, and I was relatively certain that I had used C-A-B since then on my machines. So what had changed?
These days, the X server is mostly configured through configuration
file snippets in a directory; on at least Fedora, this is
/etc/X11/xorg.conf.d. In my office workstation's directory, I
found a 00-keyboard.conf that dated from the start of 2015 and
looked like this:
# Read and parsed by systemd-localed.
# It's probably wise not to edit this
# file manually too freely.
Section "InputClass"
        Identifier "system-keyboard"
        MatchIsKeyboard "on"
        Option "XkbLayout" "us"
        Option "XkbModel" "pc105+inet"
        Option "XkbVariant" "terminate:ctrl_alt_bksp,"
EndSection
I scanned this and said to myself 'well, it's setting the magic XKB
option, so something else must be wrong'. I switched to using XKB
back at the end of 2015, so at
first I thought that my
setxkbmap usage was overwriting this.
However inspection of the manpage
told me that I was wrong (the settings are normally merged), and
an almost identical
00-keyboard.conf on my home workstation worked
with my normal
setxkbmap. So yesterday I tiredly posted my history
entry and muttered to myself.
This morning, with fresh eyes, I looked at this again and noticed
the important thing: this file is setting the XKB keyboard variant,
not the XKB options; it is using "XkbVariant" where it should be using
"XkbOptions". Since there's no such keyboard variant, this
actually did nothing except fool me. I might have noticed the issue
if I'd run '
setxkbmap -query', but perhaps not.
All of this leads to the three ways to enable Ctrl + Alt + Backspace
termination of the X server, at least on a systemd based system.
First, as part of your X session startup you can run setxkbmap
to specifically enable C-A-B, among any other XKB changes you're
making:

setxkbmap -option 'compose:rwin' -option 'ctrl:nocaps' -option 'terminate:ctrl_alt_bksp'
Second, you can manually create or edit a configuration file in
/etc/X11/xorg.conf.d or your equivalent to specify
this. If you already have a
00-keyboard.conf or the equivalent,
the option you want is:
Option "XkbOptions" "terminate:ctrl_alt_bksp"
(A trailing comma is okay, apparently.)
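Putting it together, a minimal 00-keyboard.conf with the option in the right place might look like the following (the layout and model lines here are illustrative; use whatever matches your own system):

```
Section "InputClass"
        Identifier "system-keyboard"
        MatchIsKeyboard "on"
        Option "XkbLayout" "us"
        Option "XkbModel" "pc105+inet"
        Option "XkbOptions" "terminate:ctrl_alt_bksp"
EndSection
```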
Third, if you have Fedora or perhaps any systemd-based distribution,
you can configure this the official way by running localectl
with a command like this:
localectl --no-convert set-x11-keymap us pc105+inet "" terminate:ctrl_alt_bksp
There is a bear trap lurking here. That innocent looking
"" is very
important, as covered in the Arch wiki page.
As they write (with my emphasis):
To set a model, variant, or options, all preceding fields need to be specified, but the preceding fields can be skipped by passing an empty string with "".
Given that my original xorg.conf snippet had what should be the
XKB options as the XKB variant, it seems very likely that back
in January 2015, something ran
localectl and left out that crucial empty "" argument.
(That I didn't really notice for a bit more than three years shows some mixture of how little I use C-A-B and how willing I am to shrug and ignore minor mysteries involving keyboards and X.)
My laptop had been set up and maintained
as a stock Fedora machine; these days that apparently means that
this option isn't enabled in the xorg.conf stuff. Unlike on my
workstation (where I edited 00-keyboard.conf directly), I did it
the official way through
localectl. I determined the other command
line parameters by looking at the existing settings; I
believe that on the laptop, the model (the 'pc105+inet' bit) was
blank, as was the variant.
Sidebar: How my machines got to their Xorg keyboard state
I assume that before that early 2015 change, my office workstation's Xorg configuration had the magic XkbOptions setting that made it work. I'm pretty sure that C-A-B worked at some point since 2010 or 2011 or so. My home machine has a 00-keyboard.conf from October 2011, which is about when I installed Fedora on it, with comments that say it was created by system-setup-keyboard, and that has the necessary XkbOptions setting. My office machine's Fedora install dates to 2006, so it might have had any number of configuration oddities that confused things at some point.
(My home machine got a completely new Fedora 15 install in 2011 as part of digging myself out of my Fedora 8 hole. My office workstation never got stuck on an older Fedora release the way my home machine did, so the Fedora install's never been rebuilt from scratch. Sometimes I get vaguely tempted by the idea of a from-scratch rebuild, but then I get terrified of how much picky work it would be just to get back to where I am now.)
A broad overview of how modern Linux systems boot
For reasons beyond the scope of this entry, today I feel like writing down a broad and simplified overview of how modern Linux systems boot. Due to being a sysadmin who has stubbed his toe here repeatedly, I'm going to especially focus on points of failure.
- The system loads and starts the basic bootloader somehow, through either
BIOS MBR booting or UEFI. This can involve many steps on its own
and any number of things can go wrong, such as unsigned UEFI
bootloaders on a Secure Boot system.
Generally these failures are the most total; the system reports there's
nothing to boot, or it repeatedly reboots, or the bootloader aborts
with what is generally a cryptic error message.
On a UEFI system, the bootloader needs to live in the EFI system partition, which is always a FAT32 filesystem. Some people have had luck making this a software RAID mirror with the right superblock format; see the comments on this entry.
- The bootloader loads its configuration file and perhaps additional
modules from somewhere, usually your
/boot but also perhaps your UEFI system partition. Failures here can result in extremely cryptic errors, dropping you into a GRUB shell, or ideally a message saying 'can't find your menu file'. The configuration file location is usually hardcoded, which is sometimes unfortunate if your distribution has picked a bad spot.
For GRUB, this spot has to be on a filesystem and storage stack that GRUB understands, which is not necessarily the same as what your Linux kernel understands. Fortunately GRUB understands a lot these days, so under normal circumstances you're unlikely to run into this.
(Some GRUB setups have a two stage configuration file, where the first stage just finds and loads the second one. This allows you more flexibility in where the second stage lives, which can be important on UEFI systems.)
- Using your configuration file, the bootloader loads your
chosen Linux kernel and an initial ramdisk into memory and
transfers control to the kernel. The kernel and initramfs image
also need to come from a filesystem that your bootloader understands,
but with GRUB the configuration file allows you to be very flexible
about how they're found and where they come from (and it doesn't have to be the same
place your grub.cfg is, although on a non-UEFI system both are usually in /boot).
There are two things that can go wrong here; your
grub.cfg can have entries for kernels that don't exist any more, or GRUB can fail to locate and bring up the filesystem where the kernel(s) are stored. The latter can happen if, for example, your
grub.cfg has the wrong UUIDs for your filesystems. It's possible to patch this up on the fly so you can boot your system.
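As an illustration of patching things up on the fly, if your grub.cfg points at the wrong place you can often boot by hand from the GRUB shell with a sequence of commands along these lines (the partition, kernel version, and root device here are all made-up examples):

```
grub> ls                            # see what disks and partitions GRUB can find
grub> set root=(hd0,msdos1)         # the partition holding your /boot
grub> linux /vmlinuz-4.17.2 root=/dev/sda2 ro
grub> initrd /initramfs-4.17.2.img
grub> boot
```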
- The kernel starts up, creates PID 1, and runs
/init from the initramfs as PID 1. This process and things that it runs then flail around doing various things, with the fundamental goal of finding and mounting your real root filesystem and transferring control to it. In the process of doing this it will try to assemble software RAID devices and other storage stuff like LVM, perhaps set sysctls, and so on. The obvious and traditional failure mode here is that the initramfs can't find or mount your root filesystem for some reason; this usually winds up dropping you into some sort of very minimal rescue shell. If this happens to you, you may want to boot from a USB live image instead; they tend to have more tools and a better environment.
(Sometimes the reasons for failure are obscure and annoying.)
On many traditional systems, the initramfs
/init was its own separate thing, often a shell script, and was thus independent from and different from your system's real init. On systemd based systems, the initramfs
/init is actually systemd itself and so even very early initramfs boot is under systemd's control. In general, a modern initramfs is a real (root) filesystem that processes in the initramfs will see as
/, and its contents (both configuration files and programs) are usually copied from the versions in your root filesystem. You can inspect the whole thing with tools like lsinitrd (on dracut-based systems such as Fedora) or lsinitramfs (on Debian and Ubuntu).
Update: It turns out that the initramfs init is still a shell script in some Linux distributions, prominently Debian and Ubuntu. The initramfs init being systemd may be a Red Hat-ism (Fedora and RHEL). Thanks to Ben Hutchings in the comments for the correction.
How the initramfs
/init pivots into running your real system's init daemon on your real system's root filesystem is beyond the scope of this entry. The commands may be simple (systemd just runs 'systemctl switch-root'), but how they work is complicated.
(That systemd is the initramfs
/init is convenient in a way, because it means that you don't need to learn an additional system to inspect how your initramfs works; instead you can just look at the systemd units included in the initramfs and follow along in the systemd log.)
- Your real init system starts up to perform basic system setup to
bring the system to a state that we think of as the normal basic
way it is; basically, this is everything you usually get if you
boot into a modern single user mode. This does things like set
the hostname, mount the root filesystem so it can be written to,
apply your sysctl settings (from the real root filesystem this
time), configure enough networking so that you have a loopback
device and the IPv4 and IPv6 localhost addresses, have udev fiddle
around with hardware, and especially mount all of your local
filesystems (which includes activating underlying storage systems
like software RAID and LVM, if they haven't been activated already
in the initramfs).
The traditional thing that fails here is that one or more of your local filesystems can't be mounted. This often causes this process to abort and drop you into a single user rescue shell environment.
(On a systemd system the hostname is actually set twice, once in the initramfs and then again in this stage.)
- With your local filesystems mounted and other core configuration
in place, your init system continues on to boot your system the
rest of the way. This does things like configure your network
(well, perhaps; these days some systems may defer it until you
log in), start all of the system's daemons, and eventually enable
logins on text consoles and perhaps start a graphical login
environment like GDM or LightDM. At the end of this process, your
system is fully booted.
Things that fail here are problems like a daemon not starting or, more seriously, the system not finding the network devices it expects and so not getting itself on the network at all. Usually the end result is that you still wind up with a login prompt (either a text console or graphics), it's just that there were error messages (which you may not have seen) or some things aren't working. Very few modern systems abort the boot and drop into a rescue environment for failures during this stage.
On a systemd system, this transfers control from the initramfs systemd to the systemd binary on your root filesystem (which takes over as PID 1), but systemd maintains continuity of its state and boot process and you can see the whole thing in
journalctl. The point where the switch happens is reported as 'Starting Switch Root...' and then 'Switching root.'
All of System V init, Upstart, and systemd have this distinction
between the basic system setup steps and the later 'full booting'
steps, but they implement it in different ways. Systemd doesn't
draw a hard distinction between the two phases and you can shim
your own steps into either portion in basically the same way. System
V init tended to implement the early 'single user' stage as a
separate nominal runlevel, runlevel 'S', that the system transitioned
through on the way to its real target runlevel. Upstart is sort of
a hybrid; it has a
startup event that's
emitted to trigger a number of things before the system goes on to boot fully.
(This really is an overview. Booting Linux on PC hardware has become a complicated process at the best of times, with a lot of things to set up and fiddle around with.)
The mess Ubuntu 18.04 LTS has made of libreadline
I'll start with my tweet:
I see that Ubuntu still hasn't provided libreadline6 for Ubuntu 18.04 LTS, despite that being the default and best readline to compile against on both 16.04 LTS and 14.04 LTS. Binaries that work across LTS versions? Evidently we can't have that.
Even the new expanded Twitter is too little space to really explain things for people who don't already have some idea of what I'm talking about, so let's expand that out.
Let's suppose that you've built yourself a program that uses GNU Readline on an Ubuntu 14.04 or Ubuntu 16.04 machine (perhaps a custom shell). You have a mixed environment, with common binaries used across multiple hosts (for example, because you have NFS fileservers). When you try to run this program on Ubuntu 18.04, here is what will happen:
<program>: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
What is going on here (besides there being no libreadline.so.6 on
Ubuntu 18.04) is that shared libraries on Linux have a version,
which is the number you see after the
.so, and programs are linked
against a specific version of each shared library. This shared
library version number changes if the library's ABI changes so that
it wouldn't be safe for old programs to call the new version of the
library (for example because a structure changed size).
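You can see which versioned shared libraries a given binary was linked against with ldd, which is a quick way to check whether a program will want libreadline.so.6 or libreadline.so.7. A sketch, using /bin/ls as a stand-in (a readline-using program would show a libreadline.so.N line, or 'not found' if that version isn't installed):

```shell
# Show the versioned shared libraries a binary was linked against.
ldd /bin/ls | grep '\.so'
```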
The standard Ubuntu (and Debian) naming scheme for shared library packages is that 'libreadline6' is the Ubuntu package for version 6 of libreadline (ie for libreadline.so.6). So you would normally fix this problem on Ubuntu 18.04 by installing the 18.04 version of libreadline6. Unfortunately no such package exists, which is the problem.
Over the years, Ubuntu has packaged various versions of GNU Readline. Of the still supported LTS releases, 14.04 and 16.04 ship libreadline5 and libreadline6, while Ubuntu 18.04 ships libreadline5 and libreadline7 and does not have libreadline6. So, can you build a program on 14.04 or 16.04 so that it uses libreadline5, which would perhaps let you run it on 18.04? Unfortunately you can't even do this, as 14.04 and 16.04 only let you build programs that use libreadline6.
(The libreadline-dev package on 14.04 and 16.04 installs libreadline6-dev, and there is no libreadline5-dev.)
The reason Debian and Ubuntu package multiple versions of GNU Readline is for backwards compatibility, so that programs compiled on older versions of the distribution, using older versions of the shared library, will still run on the new version. That's why libreadline5 is packaged even on 18.04. But this time around, Ubuntu apparently decided to throw GNU Readline backwards compatibility to 14.04 and 16.04 under the bus for some reason, or perhaps they just didn't bother to check and notice despite the fact that this should be a routine check when putting an Ubuntu release together (especially an LTS one).
If you're in this situation, the good news is that there is a simple
manual fix. Just download a suitable binary
.deb of libreadline6
by hand (for example from the 16.04 package) and install it
on your 18.04 system. This appears to work fine and hasn't blown
up on us yet. If you have a lot of 18.04 systems to install, you
probably want to add this to your install automation. Perhaps
someday you'll be able to take it out in favour of installing the
official Ubuntu 18.04 version of libreadline6, but based on the
current state of affairs I wouldn't hold my breath about that.
(There are various standard Ubuntu programs that use GNU Readline,
such as /usr/bin/ftp. However, they're all specific to the
particular Ubuntu release and so they all use its GNU Readline
version, whatever that is; on 18.04, they all use libreadline7.
Should you copy the 16.04
ftp binary over to an 18.04 machine
you'd have this problem with it too, but there's very little reason
to do that; 18.04 comes with its own version of
ftp, after all.)
PS: Ubuntu's Launchpad.net is such a mess that I can't even tell if this has been reported as a bug. Oh, apparently this is probably the right page, and also it looks like no such bug has been filed. It's sad that I could only find it by the 'Bug Reports' link on the packages.ubuntu.com page for it.
Taking over program names in Linux is generally hard
One reaction to the situation with net-tools versus iproute2, where the Linux code for
ifconfig, netstat, and so on is using old and incomplete interfaces and is
basically unmaintained, is that the new and actively maintained
iproute2 should provide
its own reimplementations of
ifconfig, netstat, and so on that
preserve the interface (or as much of it as possible) while using
modern mechanisms. Setting aside the question of whether the people
developing iproute2 even like the
ifconfig interface and are
willing to spend their time writing a version of it, there are
additional difficulties in doing this kind of name takeover in Linux.
The core problem is that existing Linux distributions and existing
systems will already have those programs provided from a completely
different package. This generally has two effects. First, some Linux
distributions will disagree with what you're doing and want to keep
providing those programs from the other package, which means that
the upstream package has to be able to build and install things
without its version of the programs it's theoretically trying to
take over (ie, the new release of iproute2 has to be able to build
without its version of
ifconfig et al).
Second, when distributions decide that they trust and prefer your versions of the programs better than the old ones, they have to be able to do some sort of package upgrade or migration that replaces the other package with a version of your package that has your version of the programs included. There are also inevitably going to be distributions that will want to give users a choice of which version of the programs to install, which means that some of the time the distribution will actually build two binary packages for your package, one with your core tools ('iproute2') and one with your replacements for the other package's programs (a hypothetical 'iproute2-nettools', that has to cleanly replace 'net-tools').
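In Debian-style packaging, the "cleanly replace" part is expressed through declared package relationships. A hypothetical control stanza for such a split binary package (every name and field value here is invented for illustration) might look like:

```
Package: iproute2-nettools
Depends: iproute2
Conflicts: net-tools
Replaces: net-tools
Provides: net-tools
Description: ifconfig, netstat, and friends reimplemented on top of iproute2
```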
Some of this work has to be done by the developers of the new package; they have to make replacement programs that are compatible enough that users won't complain, and then they have to make it possible to not build these programs or build them but not install them. Other portions of the work have to be done by distributions, who have to package all of this up, make sure that they don't accidentally create package conflicts, make sure package upgrades will work well and won't blow up dependencies, and so on. Since this complicates the lives of distributions and the people preparing packages, it's not something that they're likely to undertake casually. In fact, distributions are probably not likely to undertake it at all unless the developers of the new package actively try to push for it, or unless (and until) the programs in the old package become clearly broken and basically force themselves to be replaced.
(I'm generously assuming here that the old package is truly abandoned and everyone agrees that it has to go sometime. If there are people who want it to stay, you have additional problems.)
All of this is the consequence of there being multiple Linux distributions that will make different decisions and that Linux distributions are developed independently from each other and from the upstream packages. If everything was handled by a single group of developers, such takeovers would have much less to worry about and to coordinate (and you wouldn't have packaging work being done over and over again in different packaging systems).
There's real reasons for Linux to replace ifconfig, netstat, et al
One of the ongoing system administration controversies in Linux is
that there is an ongoing effort to obsolete the old, cross-Unix
standard network administration and diagnosis commands of
ifconfig, netstat, and the like and replace them with a fresh new Linux-specific
suite. Old sysadmins are generally grumpy about this; they consider
it yet another sign of Linux's 'not invented here' attitude that
sees Linux breaking from well-established Unix norms to go its own
way. Although I'm an old sysadmin myself, I don't have this reaction.
Instead, I think
that it might be both sensible and honest for Linux to go off in
this direction. There are two reasons for this, one ostensible and one deeper.
The ostensible surface issue is that the current code for
ifconfig, netstat, and so on operates in an inefficient way. Per various sources,
netstat et al operate by reading various files in /proc, and
doing this is not the most efficient thing in the world (either on
the kernel side or on netstat's side). You won't notice this on a
small system, but apparently there are real impacts on large ones. Modern commands
like ip use Linux's netlink sockets, which are much more
efficient. In theory
ifconfig and company could be
rewritten to use netlink too; in practice this doesn't seem to have
happened and there may be political issues involving different groups
of developers with different opinions on which way to go.
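You can see the text-file approach for yourself: per-interface statistics live in /proc/net/dev, and a tool gets them by reading and parsing that file. A minimal sketch of the idea:

```shell
# How netstat-style tools traditionally gather data: read a text file
# under /proc and parse it. This prints the name of every network
# interface the kernel knows about (the first two lines are headers).
awk -F'[: ]+' 'NR > 2 {print $2}' /proc/net/dev
```

On any Linux machine the output will include at least the loopback interface, lo.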
However, the deeper issue is the interface that netstat, ifconfig,
and company present to users. In practice, these commands are caught
between two masters. On the one hand, the information the tools
present and the questions they let us ask are deeply intertwined
with how the kernel itself does networking, and in general the tools
are very much supposed to report the kernel's reality. On the other
hand, the users expect
ifconfig and so on to have their
traditional interface (in terms of output, command line arguments,
and so on); any number of scripts and tools fish things out of
ifconfig output, for example. As the Linux kernel has changed how
it does networking, this has presented things like ifconfig with
a deep conflict; their traditional output is no longer necessarily
an accurate representation of reality.
For instance, here is
ifconfig output for a network interface on
one of my machines:
; ifconfig -a
[...]
em0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 128.100.3.XX  netmask 255.255.255.0  broadcast 128.100.3.255
        inet6 fe80::6245:cbff:fea0:e8dd  prefixlen 64  scopeid 0x20<link>
        ether 60:45:cb:a0:e8:dd  txqueuelen 1000  (Ethernet)
[...]
There are no other '
em0:...' devices reported by ifconfig, which
is unfortunate because this output from
ifconfig is not really an
accurate picture of reality:
; ip -4 addr show em0
[...]
    inet 128.100.3.XX/24 brd 128.100.3.255 scope global em0
       valid_lft forever preferred_lft forever
    inet 128.100.3.YY/24 brd 128.100.3.255 scope global secondary em0
       valid_lft forever preferred_lft forever
This interface has an IP alias, set up through systemd's networkd. Perhaps there once was a day when all IP
aliases on Linux had to be set up through additional alias interfaces,
which is what ifconfig would show, but these days each interface can have
multiple IPs and directly setting them this way is the modern approach.
This issue presents programs like
ifconfig with an unappealing
choice. They can maintain their traditional output, which is now
sometimes a lie but which keeps people's scripts working, or they
can change the output to better match reality and probably break
some scripts. It's likely to be the case that the more they change
their output (and arguments and so on) to match the kernel's current
reality, the more they will break scripts and tools built on top
of them. And some people will argue that those scripts and tools
that would break are already broken, just differently; if you're using
ifconfig output on my machine to generate a list of all
of the local IP addresses, you're already wrong.
(If you try to keep the current interface while lying as little as
possible, you wind up having arguments about what to lie about and
how. If you can only list one IPv4 address per interface in
ifconfig's output, how do you decide which one?)
In a sense, deprecating programs like ifconfig that
have wound up with interfaces that are inaccurate but hard to change
is the honest approach. Their interfaces can't be fixed without
significant amounts of pain and they still work okay for many
systems, so just let them be while encouraging people to switch to
other tools that can be more honest.
(This elaborates on an old tweet of mine.)
PS: I believe that the kernel interfaces that
ifconfig and so on
currently use to get this information are bound by backwards
compatibility issues themselves, so getting
ifconfig to even know
that it was being inaccurate here would probably take code changes.
I'm worried about Wayland but there's not much I can do about it
In a comment on my entry about how I have a boring desktop, Opk asked a very good question:
Does it concern you at all that Wayland may force change on you? It may be a good few years away yet and perhaps fvwm will be ported.
Oh my yes, I'm definitely worried about this (and it turns out that I have been for quite some time, which also goes to show how long Wayland has been slowly moving forward). The FVWM people have said that they're not going to try to write a Wayland version of FVWM, which means that when Wayland inevitably takes over I'm going to need a new 'window manager' (in Wayland this is a lot more than just what it is in X) and possibly an entirely new desktop environment to go with it.
The good news is that apparently XWayland provides a reasonably
good way to let X programs still display on a Wayland server, so I
won't be forced to abandon as many X things as I expected. I may
even be able to continue to run remote X programs via SSH and
XWayland, which is important for my work desktop. This X to Wayland bridge will mean
that I can keep not just programs with no Wayland equivalent but
also old favorites like
xterm, where I simply don't want to use
what will be the Wayland equivalent (I don't like gnome-terminal
or konsole very much).
The bad news for me is two-fold. First, I'm not attracted to tiling window managers at all, and since tiling window managers are the in thing, they're the most common alternate window managers for Wayland (based on various things, such as the Arch list). There seems to be a paucity of traditional stacking Wayland WMs that are as configurable as fvwm is, although perhaps there will be alternate methods in Wayland to do things like have keyboard and mouse bindings. It's possible that this will change when Wayland starts becoming more dominant, but I'm not holding my breath; heavily customized Linux desktop environments have been feeling more and more like extreme outliers over the years.
(The people writing tiling Wayland window managers like Sway will certainly want there to be, because it will be hard to have a viable alternate environment without them. The question is whether major projects like NetworkManager will oblige or whether NM will use its limited development resources elsewhere.)
So yes, I worry about all of this. But in practice it's a very abstracted worry. To start with, Wayland is still not really here yet. Fedora is using it more, but it's by no means universal even for Gnome (where it's the default), and I believe that KDE (and other supported desktop environments) don't even really try to use it. At this rate it will be years and years before anyone is seriously talking about abandoning X (since Gnome programs will still face pressure to be usable in KDE, Cinnamon, and other desktop environments that haven't yet switched to Wayland).
(I believe that Fedora is out ahead of other Linux distributions, too. People like Debian will probably be trying to support X and pressure people to support X for years to come.)
(If I had a lot of energy and enthusiasm, perhaps I would be trying to write the stacking, construction kit style Wayland window manager and compositor of my dreams. I don't have anything like that energy. I do hope other people do, and while I'm hoping I hope that they like textual icon managers as much as I do.)
How you run out of inodes on an extN filesystem (on Linux)
I've mentioned that we ran out of inodes on a Linux server and covered what the high level problem was, but I've never described the actual mechanics of how and why you can run out of inodes on a filesystem, or more specifically on an extN filesystem. I have to be specific about the filesystem type, because how this is handled varies from filesystem to filesystem; some either have no limit on how many inodes you can have or have such a high limit that you're extremely unlikely to run into it.
The fundamental reason you can run out of inodes on an extN filesystem
is that extN statically allocates space for inodes; in every extN
filesystem, there is space for so many inodes reserved, and you can
never have any more than this. If you use '
df -i' on an extN
filesystem, you can see this number for the filesystem, and you can
also see it with
dumpe2fs, which will tell you other important
information. Here, let's look at an ext4 filesystem:
# dumpe2fs -h /dev/md10
[...]
Block size:               4096
[...]
Blocks per group:         32768
[...]
Inodes per group:         8192
[...]
I'm showing this information because it leads to the important
parameter for how many inodes any particular extN filesystem has,
which is the bytes/inode ratio (mke2fs's
-i argument). By default
this is 16 KB, ie there will be one inode for every 16 KB of space
in the filesystem, and as the
mke2fs manpage covers, it's
not too sensible to set it below 4 KB (the usual extN block size).
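The dumpe2fs numbers from earlier encode this ratio directly: the space in one block group divided by that group's inode count gives the bytes/inode figure. A quick shell check:

```shell
# 4096-byte blocks * 32768 blocks per group = 128 MiB of space per
# group, shared among 8192 inodes: 16384 bytes (16 KB) per inode.
block_size=4096
blocks_per_group=32768
inodes_per_group=8192
echo $(( block_size * blocks_per_group / inodes_per_group ))   # prints 16384
```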
The existence of the bytes/inode ratio gives us a straightforward answer for how you can run a filesystem out of inodes: you simply create lots of files that are smaller than this ratio. ExtN implicitly assumes that each inode will on average use at least 16 KB of disk space; if on average your inodes use less, you will run out of inodes before you run out of disk space. One tricky thing here is that this space doesn't have to be used up by regular files, because other sorts of inodes can be small too. Probably the easiest other source is directories; if you have lots of directories with a relatively small number of subdirectories and files in each, it's quite possible for many of them to be smaller than 16 KB, and in some cases you can have a great many subdirectories.
(In our problem directory hierarchy, almost all of the directories are 4 KB, although a few are significantly larger. And the hierarchy can have a lot of subdirectories when things go wrong.)
Another case is symbolic links. Most symbolic links are quite small, and in fact ext4 may be able to store your symbolic link entirely in the inode itself. This means that you can potentially use up a lot of inodes without using any disk space (well, beyond the space for the directories that the symbolic links are in). There are other sorts of special files that also use little or no disk space, but you probably don't have tons of them in an extN filesystem unless something unusual is going on.
(If you do have tens of thousands of Unix sockets or FIFOs or device files, though, you might want to watch out. Or even tons of zero-length regular files that you're using as flags and a persistence mechanism.)
Most people will never run into this on most filesystems, because most filesystems have an average inode size usage that's well above 16 KB. There are usually plenty of files over 16 KB, not that many symbolic links, and relatively few (small) directories compared to the regular files. For instance, one of my relatively ordinary Fedora root filesystems has a bytes/inode ratio of roughly 73 KB per inode, and another is at 41 KB per inode.
(You can work out your filesystem's bytes/inode ratio simply by dividing the space used in KB by the number of inodes used.)
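As a sketch of that division (the numbers here are invented for illustration, not from a real filesystem):

```python
def bytes_per_inode_in_use(used_kb, inodes_used):
    """A filesystem's effective KB-per-inode ratio, computed from the
    used-space figure df reports and the used-inode figure df -i reports."""
    return used_kb / inodes_used

# e.g. 30 GB of used space spread across 420,000 used inodes works out
# to about 75 KB per inode, comfortably above the default 16 KB ratio.
print(round(bytes_per_inode_in_use(30 * 1024**2, 420_000)))  # → 75
```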
ZFS on Linux's development version now has much better pool recovery for damaged pools
Back in March, I wrote about how much better ZFS pool recovery was coming, along with what turned out to be some additional exciting features, such as the long-awaited feature of shrinking ZFS pools by removing vdevs. The good news for people using ZFS on Linux is that most of both features have very recently made it into the ZFS on Linux development source tree. This is especially relevant and important if you have a damaged ZFS on Linux pool that either doesn't import or panics your system when you do import it.
These changes aren't yet in any ZFS on Linux release and I suspect that they won't appear until 0.8.0 is released someday (ie, they won't be ported into the current 0.7.x release branch). However, it's fairly easy to build ZFS on Linux from source if you need to temporarily run the latest version in order to recover or copy data out of a damaged pool that you can't otherwise get at. I believe that some pool recovery can be done as a one-time import and then you can revert back to a released version of ZFS on Linux to use the now-recovered pool, but certainly not all pool import problems can be repaired like this.
(As far as vdev removal goes, it currently requires permanently using a version of ZFS that supports it, because it adds a device_removal feature to your pool that will never deactivate. This may change at some point in the future, but I wouldn't hold my breath. It seems miraculous enough that we've gotten vdev removal after all of these years, even if it's only for single devices and mirrors.)
I haven't tried out either of these features, but I am running a recently built development version of ZFS on Linux with them included and nothing has exploded so far. As far as things go in general, ZFS on Linux has a fairly large test suite and these changes added tests along with their code. And of course they've been tested upstream and OmniOS CE had enough confidence in them to incorporate them.
How we're going to be doing custom NFS mount authorization on Linux
We have a long standing system of custom NFS mount authorization on our current OmniOS-based fileservers. This system has been working reliably for years, but our next generation of fileservers will use a different OS, almost certainly Linux, and our current approach doesn't work on Linux, so we had to develop a new one.
One of the big attributes of our current system is that it doesn't require the clients to do anything special; they do NFS mount requests or NFS activity, and provided that their SSH daemon is running, they get automatically checked and authorized. This is important to making the system completely reliable, which is very important if we're going to use it for our own machines (which are absolutely dependent on NFS working). However, the goals of our NFS authorization have shifted so that we no longer require this for our own machines. In light of that, we decided to adopt a more straightforward approach on Linux, one that requires client machines to explicitly do a manual step on boot before they can get NFS access.
The overall 'authorization' system works via firewall rules, where only machines in a particular ipset table can talk to the NFS ports on the fileserver. Control over actual NFS mounts and NFS level access is still done through and so on, but you have to be in the ipset table in order to even get that far. To get authorized, ie to get added to the ipset table, your client machine makes a connection to a specific TCP port on the fileserver. This ends up causing a Go program to make a connection to the SSH server on the client machine and verify its host key against a known_hosts file that we maintain; if the key verifies, we add the client's IP address to the ipset table, and if it fails to verify, we explicitly remove the client's IP address from the table.
(This connection can be done as simply as 'nc FILESERVER PORT </dev/null >/dev/null'. In practice clients may want to record the output from the port, because we spit out status messages, including potentially important ones about why a machine failed verification. We syslog them too, but those syslog logs aren't accessible to other people.)
This Go program can actually check and handle multiple IP addresses at once (doing so in parallel). In this mode, it runs from cron every few minutes to re-verify all of the currently authorized hosts. The program is sufficiently fast that it can complete this full re-verification in under a second (and with negligible resource usage); in practice, the speed limit is how long of a timeout we use to wait for machines to respond.
To handle fileserver reboots, verified IPs are persistently recorded by touching a file (with the name of their IP address) in a magic directory. On boot and on re-verification, we merge all of the IPs from this directory with the IPs from the ipset table and verify them all. Any IPs that pass verification but aren't in the ipset table are added back to the table (and any IPs in the ipset table but not recorded on disk are persisted to disk), which means that on boot all IPs will be re-added to the ipset table without the client having to do anything.
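The merge-and-reverify step described above can be sketched as plain set arithmetic. This is my own illustration of the logic, not the actual Go program; all the names are made up:

```python
def reconcile(persisted, in_ipset, verified):
    """Given IPs recorded on disk, IPs currently in the ipset table, and
    the subset of all of those that passed host-key verification, work
    out what to add to the live table, what to newly record on disk,
    and what to drop everywhere."""
    candidates = set(persisted) | set(in_ipset)
    ok = candidates & set(verified)
    to_add = ok - set(in_ipset)       # passed, but missing from the live table
    to_record = ok - set(persisted)   # passed, but not yet persisted on disk
    to_drop = candidates - ok         # failed verification: remove everywhere
    return to_add, to_record, to_drop

add, record, drop = reconcile(
    persisted={"10.0.0.5", "10.0.0.9"},
    in_ipset={"10.0.0.9", "10.0.0.12"},
    verified={"10.0.0.5", "10.0.0.9"},
)
# After a reboot, 10.0.0.5 goes back into the table from its on-disk
# record, while the unverifiable 10.0.0.12 gets dropped.
```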
Clients theoretically don't have to do anything once they've booted and been authorized, but because things can always go wrong we're going to recommend that they re-poke the magic TCP port every so often from cron, perhaps every five or ten minutes. That will ensure that any NFS outage should have a limited duration and thus hopefully a limited impact.
(In theory the parallel Go checker is so fast that we could just extract all of the client IPs from our known_hosts and always try to verify them, say, once every fifteen minutes. In practice I think we're unlikely to do this, because there are various potential issues and it's probably not going to help much.)
We're probably going to provide people with a little Python program that automatically does the client side of the verification for all current NFS mounts and all mounts in /etc/fstab, and then logs the results and so on. This seems more friendly than asking all of the people involved to write their own set of scripts or commands for this.
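One core piece of such a client program would be working out which fileservers to poke from /etc/fstab. Here's a rough sketch of that part, assuming standard fstab format; it's my illustration, not our actual program:

```python
def nfs_servers(fstab_text):
    """Pull the set of NFS server names out of fstab-format text,
    so each one can then be poked on its authorization port."""
    servers = set()
    for line in fstab_text.splitlines():
        fields = line.split()
        # skip blank lines, comments, and non-NFS entries
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        if fields[2].startswith("nfs") and ":" in fields[0]:
            servers.add(fields[0].split(":", 1)[0])
    return sorted(servers)

sample = """\
/dev/sda1  /      ext4  defaults  1 1
fs1:/h/281 /h/281 nfs   hard,intr 0 0
"""
print(nfs_servers(sample))  # → ['fs1']
```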
PS: Our own machines on trusted subnets are handled by just having a blanket allow rule in the firewall for those subnets. You only have to be in the ipset table if you're not on one of those subnets.
You probably need to think about how to handle core dumps on modern Linux servers
Once upon a time, life was simple. If and when your programs hit fatal problems, they generally dumped core in their current directory under the name core (sometimes you could make them be core.<PID>). You might or might not ever notice these core files, and some of the time they might not get written at all because of various permissions issues (see the core(5) manpage).
Then complications ensued due to things like Apport, ABRT, and systemd-coredump, where an increasing number of Linux distributions have decided to take advantage of the full power of the kernel.core_pattern sysctl to capture core dumps themselves.
(The Ubuntu Apport documentation claims that it's disabled by default on 'stable' releases. This does not appear to be true any more.)
In a perfect world, systems like Apport would capture core dumps from system programs for themselves and arrange that everything else was handled in the traditional way, by writing a core file.
Unfortunately this is not a perfect world. In this world, systems
like Apport almost always either discard your core files entirely
or hide them away where you need special expertise to find them.
Under many situations this may not be what you want, in which case
you need to think about what you do want and what's the best way
to get it.
I think that your options break down like this:
- If you're only running distribution-provided programs, you can
opt to leave Apport and its kin intact. Intercepting and magically
handling core dumps from standard programs is their bread and butter,
and the result will probably give you the smoothest way to file bug
reports with your distribution. Since you're not running your own
programs, you don't care about how Apport (doesn't) handle core dumps
for non-system programs.
- Disable any such system and set kernel.core_pattern to something useful; I like 'core.%u.%p'. If the system only runs your services, with no users having access to it, you might want to have all core dumps written to some central directory that you monitor; otherwise, you probably want to set it so that core dumps go in the process's current directory.
The drawback of this straightforward approach is that you'll fail to capture core dumps from some processes.
- Set up your own program to capture core dumps and save them
somewhere. The advantage of such a program is that you can capture
core dumps under more circumstances and also that you can immediately
trigger alerting and other things if particular programs or
processes die. You could even identify when you have a core dump
for a system program and pass the core dump on to Apport,
systemd-coredump, or whatever the distribution's native system is.
One drawback of this is that if you're not careful, your core dump handler can hang your system.
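For the third option, core(5) says that if kernel.core_pattern starts with a '|', the kernel runs the named program with the %-specifiers as arguments and writes the core dump to its standard input. Here's a minimal sketch of the saving part of such a handler; the handler path, function names, and naming scheme are all my own invention:

```python
# Sketch of the guts of a core_pattern pipe handler. With something like
#   kernel.core_pattern = |/usr/local/sbin/save-core %p %u %e
# (a hypothetical path), the kernel runs the handler with the expanded
# %p/%u/%e values as arguments and feeds the core dump on stdin.
import pathlib

def save_core(stream, pid, uid, comm, destdir):
    """Stream a core dump from `stream` into destdir, named so that we
    can later tell which program and user it came from."""
    dest = pathlib.Path(destdir) / f"core.{comm}.{uid}.{pid}"
    with open(dest, "wb") as out:
        while True:
            chunk = stream.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)
    return dest
```

A real handler would also need to cap the dump size, handle write errors, and be careful not to block for long, since a slow or hung handler is exactly how such a setup can hurt the whole system.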
If you have general people running things on your servers and those things may run into segfaults and otherwise dump core, it's my view that you probably want to do the middle option of just having them write traditional core files to the current directory. People doing development tend to like having core files for debugging, and this option is likely to be a lot easier than trying to educate everyone on how to extract core dumps from the depths of the system (if this is even possible; it's theoretically possible with systemd at least).
Up until now we've just passively accepted the default of Apport on our Ubuntu 16.04 systems, but now that we're considering what we want to change for Ubuntu 18.04 and I've been reminded of this whole issue by Julia Evans' How to get a core dump for a segfault on Linux (where she ran into the Apport issue), I think we want to change things to the traditional 'write a core file' setup (which is how it was in Ubuntu 14.04).
PS: Since systemd now wants to handle core dumps, I suspect that this is going to be an issue in more and more Linux distributions. Or maybe everyone is going to make sure that that part of systemd doesn't get turned on.