Taking over program names in Linux is generally hard
One reaction to the situation with net-tools versus iproute2, where
the Linux code for netstat and so on is using old and incomplete
interfaces and is basically unmaintained, is that the new and
actively maintained iproute2 should provide its own reimplementations
of netstat and so on that preserve the interface (or as much of it
as possible) while using modern mechanisms. Setting aside the
question of whether the people developing iproute2 even like the
ifconfig interface and are willing to spend their time writing a
version of it, there are additional difficulties in doing this kind
of name takeover in practice.
The core problem is that existing Linux distributions and existing
systems will already have those programs provided from a completely
different package. This generally has two effects. First, some Linux
distributions will disagree with what you're doing and want to keep
providing those programs from the other package, which means that
the upstream package has to be able to build and install things
without its version of the programs it's theoretically trying to
take over (ie, the new release of iproute2 has to be able to build
without its version of
ifconfig et al).
Second, when distributions decide that they trust and prefer your versions of the programs over the old ones, they have to be able to do some sort of package upgrade or migration that replaces the other package with a version of your package that has your version of the programs included. There are also inevitably going to be distributions that will want to give users a choice of which version of the programs to install, which means that some of the time the distribution will actually build two binary packages for your package, one with your core tools ('iproute2') and one with your replacements for the other package's programs (a hypothetical 'iproute2-nettools', which has to cleanly replace 'net-tools').
Some of this work has to be done by the developers of the new package; they have to make replacement programs that are compatible enough that users won't complain, and then they have to make it possible to not build these programs or build them but not install them. Other portions of the work have to be done by distributions, who have to package all of this up, make sure that they don't accidentally create package conflicts, make sure package upgrades will work well and won't blow up dependencies, and so on. Since this complicates the lives of distributions and the people preparing packages, it's not something that they're likely to undertake casually. In fact, distributions are probably not likely to undertake it at all unless the developers of the new package actively try to push for it, or unless (and until) the programs in the old package become clearly broken and basically force themselves to be replaced.
(I'm generously assuming here that the old package is truly abandoned and everyone agrees that it has to go sometime. If there are people who want it to stay, you have additional problems.)
All of this is a consequence of the fact that there are multiple Linux distributions that will make different decisions, and that Linux distributions are developed independently from each other and from the upstream packages. If everything were handled by a single group of developers, such takeovers would have much less to worry about and coordinate (and you wouldn't have packaging work being done over and over again in different packaging systems).
There's real reasons for Linux to replace ifconfig, netstat, et al
One of the ongoing system administration controversies in Linux is
that there is an ongoing effort to obsolete the old, cross-Unix
standard network administration and diagnosis commands of
netstat and the like and replace them with fresh new Linux specific
suite. Old sysadmins are generally grumpy about this; they consider
it yet another sign of Linux's 'not invented here' attitude that
sees Linux breaking from well-established Unix norms to go its own
way. Although I'm an old sysadmin myself, I don't have this reaction.
Instead, I think
that it might be both sensible and honest for Linux to go off in
this direction. There are two reasons for this, one ostensible and
one deeper.
The ostensible surface issue is that the current code for ifconfig
and so on operates in an inefficient way. Per various reports,
netstat et al operate by reading various files in /proc, and
doing this is not the most efficient thing in the world (either on
the kernel side or on netstat's side). You won't notice this on a
small system, but apparently there are real impacts on large ones.
Modern commands like ip use Linux's netlink sockets, which are much
more efficient. In theory ifconfig and company could be rewritten
to use netlink too; in practice this doesn't seem to have happened,
and there may be political issues involving different groups of
developers with different opinions on which way to go.
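To make the old mechanism concrete, here is a small sketch (my own illustration, not netstat's actual code) of the kind of text parsing that tools like netstat have to do: each connection is one row of /proc/net/tcp, with the local address and port packed into a hex field that has to be decoded.

```python
# Sketch of how netstat-style tools read socket state on Linux:
# parse rows of /proc/net/tcp, where addresses are hex-encoded
# "ADDR:PORT" fields and the state is a hex code (0A = LISTEN).
import socket
import struct

def parse_tcp_row(line):
    """Parse one data row of /proc/net/tcp into (ip, port, state)."""
    fields = line.split()
    local, state = fields[1], fields[3]
    hexip, hexport = local.split(":")
    # The IPv4 address is a little-endian 32-bit value in hex.
    ip = socket.inet_ntoa(struct.pack("<I", int(hexip, 16)))
    return ip, int(hexport, 16), int(state, 16)

# A sample row for a listener on 127.0.0.1 port 22:
SAMPLE = ("0: 0100007F:0016 00000000:0000 0A 00000000:00000000 "
          "00:00000000 00000000 0 0 12345 1 0000000000000000 100 0 0 10 0")
print(parse_tcp_row(SAMPLE))  # ('127.0.0.1', 22, 10)
```

The kernel has to format all of this text and the tool has to parse it back, for every socket, which is where the inefficiency on large systems comes from.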
However, the deeper issue is the interface that netstat, ifconfig,
and company present to users. In practice, these commands are caught
between two masters. On the one hand, the information the tools
present and the questions they let us ask are deeply intertwined
with how the kernel itself does networking, and in general the tools
are very much supposed to report the kernel's reality. On the other
hand, the users expect
ifconfig and so on to have their
traditional interface (in terms of output, command line arguments,
and so on); any number of scripts and tools fish things out of
ifconfig output, for example. As the Linux kernel has changed how
it does networking, this has presented things like ifconfig with
a deep conflict; their traditional output is no longer necessarily
an accurate representation of reality.
For instance, here is
ifconfig output for a network interface on
one of my machines:
; ifconfig -a
[...]
em0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 128.100.3.XX  netmask 255.255.255.0  broadcast 128.100.3.255
        inet6 fe80::6245:cbff:fea0:e8dd  prefixlen 64  scopeid 0x20<link>
        ether 60:45:cb:a0:e8:dd  txqueuelen 1000  (Ethernet)
[...]
There are no other 'em0:...' devices reported by ifconfig, which
is unfortunate because this output from ifconfig is not really an
accurate picture of reality:
; ip -4 addr show em0
[...]
    inet 128.100.3.XX/24 brd 128.100.3.255 scope global em0
       valid_lft forever preferred_lft forever
    inet 128.100.3.YY/24 brd 128.100.3.255 scope global secondary em0
       valid_lft forever preferred_lft forever
This interface has an IP alias, set up through systemd's networkd.
Perhaps there once was a day when all IP aliases on Linux had to
be set up through additional alias interfaces, which ifconfig would
show, but these days each interface can have multiple IPs and
directly setting them this way is the modern approach.
This issue presents programs like
ifconfig with an unappealing
choice. They can maintain their traditional output, which is now
sometimes a lie but which keeps people's scripts working, or they
can change the output to better match reality and probably break
some scripts. It's likely to be the case that the more they change
their output (and arguments and so on) to match the kernel's current
reality, the more they will break scripts and tools built on top
of them. And some people will argue that those scripts and tools
that would break are already broken, just differently; if you're
using ifconfig output on my machine to generate a list of all
of the local IP addresses, you're already wrong.
(If you try to keep the current interface while lying as little as
possible, you wind up having arguments about what to lie about and
how. If you can only list one IPv4 address per interface in ifconfig output,
how do you decide which one?)
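For scripts that genuinely need every local address, modern iproute2 can emit JSON that is safe to consume programmatically, which sidesteps the whole question of what ifconfig's text output should show. A sketch, assuming an iproute2 recent enough to support the -json flag:

```python
import json

def local_ipv4_addresses(ip_json):
    """Extract every IPv4 address from 'ip -json addr' output,
    including secondary addresses that ifconfig doesn't show."""
    addrs = []
    for iface in json.loads(ip_json):
        for a in iface.get("addr_info", []):
            if a.get("family") == "inet":
                addrs.append((iface["ifname"], a["local"]))
    return addrs

# In real use you would feed this the output of 'ip -json addr';
# here is a trimmed sample of what that JSON looks like.
SAMPLE = '''[{"ifname": "em0", "addr_info": [
    {"family": "inet",  "local": "128.100.3.5"},
    {"family": "inet",  "local": "128.100.3.6"},
    {"family": "inet6", "local": "fe80::6245:cbff:fea0:e8dd"}]}]'''
print(local_ipv4_addresses(SAMPLE))
# [('em0', '128.100.3.5'), ('em0', '128.100.3.6')]
```

(The sample interface name and addresses are invented for illustration.)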
In a sense, deprecating programs like ifconfig, which
have wound up with interfaces that are inaccurate but hard to change,
is the honest approach. Their interfaces can't be fixed without
significant amounts of pain and they still work okay for many
systems, so just let them be while encouraging people to switch to
other tools that can be more honest.
(This elaborates on an old tweet of mine.)
PS: I believe that the kernel interfaces that
ifconfig and so on
currently use to get this information are bound by backwards
compatibility issues themselves, so getting
ifconfig to even know
that it was being inaccurate here would probably take code changes.
I'm worried about Wayland but there's not much I can do about it
In a comment on my entry about how I have a boring desktop, Opk asked a very good question:
Does it concern you at all that Wayland may force change on you? It may be a good few years away yet and perhaps fvwm will be ported.
Oh my yes, I'm definitely worried about this (and it turns out that I have been for quite some time, which also goes to show how long Wayland has been slowly moving forward). The FVWM people have said that they're not going to try to write a Wayland version of fvwm, which means that when Wayland inevitably takes over I'm going to need a new 'window manager' (in Wayland this is a lot more than just what it is in X) and possibly an entirely new desktop environment to go with it.
The good news is that apparently XWayland provides a reasonably
good way to let X programs still display on a Wayland server, so I
won't be forced to abandon as many X things as I expected. I may
even be able to continue to run remote X programs via SSH and
XWayland, which is important for my work desktop. This X to Wayland bridge will mean
that I can keep not just programs with no Wayland equivalent but
also old favorites like
xterm, where I simply don't want to use
what will be the Wayland equivalent (I don't like gnome-terminal
or konsole very much).
The bad news for me is two-fold. First, I'm not attracted to tiling window managers at all, and since tiling window managers are the in thing, they're the most common alternate window managers for Wayland (based on various things, such as the Arch list). There seems to be a paucity of traditional stacking Wayland WMs that are as configurable as fvwm is, although perhaps there will be alternate methods in Wayland to do things like have keyboard and mouse bindings. It's possible that this will change when Wayland starts becoming more dominant, but I'm not holding my breath; heavily customized Linux desktop environments have been feeling more and more like extreme outliers over the years.
(The people writing tiling Wayland window managers like Sway will almost certainly want such methods to exist, because it will be hard to have a viable alternate environment without them. The question is whether major projects like NetworkManager will oblige or whether NM will use its limited development resources elsewhere.)
So yes, I worry about all of this. But in practice it's a very abstracted worry. To start with, Wayland is still not really here yet. Fedora is using it more, but it's by no means universal even for Gnome (where it's the default), and I believe that KDE (and other supported desktop environments) don't even really try to use it. At this rate it will be years and years before anyone is seriously talking about abandoning X (since Gnome programs will still face pressure to be usable in KDE, Cinnamon, and other desktop environments that haven't yet switched to Wayland).
(I believe that Fedora is out ahead of other Linux distributions, too. People like Debian will probably be trying to support X and pressure people to support X for years to come.)
(If I had a lot of energy and enthusiasm, perhaps I would be trying to write the stacking, construction kit style Wayland window manager and compositor of my dreams. I don't have anything like that energy. I do hope other people do, and while I'm hoping I hope that they like textual icon managers as much as I do.)
How you run out of inodes on an extN filesystem (on Linux)
I've mentioned that we ran out of inodes on a Linux server and covered what the high level problem was, but I've never described the actual mechanics of how and why you can run out of inodes on a filesystem, or more specifically on an extN filesystem. I have to be specific about the filesystem type, because how this is handled varies from filesystem to filesystem; some either have no limit on how many inodes you can have or have such a high limit that you're extremely unlikely to run into it.
The fundamental reason you can run out of inodes on an extN filesystem
is that extN statically allocates space for inodes; in every extN
filesystem, there is space for so many inodes reserved, and you can
never have any more than this. If you use '
df -i' on an extN
filesystem, you can see this number for the filesystem, and you can
also see it with
dumpe2fs, which will tell you other important
information. Here, let's look at an ext4 filesystem:
# dumpe2fs -h /dev/md10
[...]
Block size:               4096
[...]
Blocks per group:         32768
[...]
Inodes per group:         8192
[...]
I'm showing this information because it leads to the important
parameter for how many inodes any particular extN filesystem has,
which is the bytes/inode ratio (the mke2fs -i argument). By default
this is 16 KB, ie there will be one inode for every 16 KB of space
in the filesystem, and as the
mke2fs manpage covers, it's
not too sensible to set it below 4 KB (the usual extN block size).
The existence of the bytes/inode ratio gives us a straightforward answer for how you can run a filesystem out of inodes: you simply create lots of files that are smaller than this ratio. ExtN implicitly assumes that each inode will on average use at least 16 KB of disk space; if on average your inodes use less, you will run out of inodes before you run out of disk space. One tricky thing here is that this space doesn't have to be used up by regular files, because other sorts of inodes can be small too. Probably the easiest other source is directories; if you have lots of directories with a relatively small number of subdirectories and files in each, it's quite possible for many of them to be smaller than 16 KB, and in some cases you can have a great many subdirectories.
(In our problem directory hierarchy, almost all of the directories are 4 KB, although a few are significantly larger. And the hierarchy can have a lot of subdirectories when things go wrong.)
Another case is symbolic links. Most symbolic links are quite small, and in fact ext4 may be able to store your symbolic link entirely in the inode itself. This means that you can potentially use up a lot of inodes without using any disk space (well, beyond the space for the directories that the symbolic links are in). There are other sorts of special files that also use little or no disk space, but you probably don't have tons of them in an extN filesystem unless something unusual is going on.
(If you do have tens of thousands of Unix sockets or FIFOs or device files, though, you might want to watch out. Or even tons of zero-length regular files that you're using as flags and a persistence mechanism.)
Most people will never run into this on most filesystems, because most filesystems have an average inode size usage that's well above 16 KB. There are usually plenty of files over 16 KB, not that many symbolic links, and relatively few (small) directories compared to the regular files. For instance, one of my relatively ordinary Fedora root filesystems has a bytes/inode ratio of roughly 73 KB per inode, and another is at 41 KB per inode.
(You can work out your filesystem's bytes/inode ratio simply by dividing the space used in KB by the number of inodes used.)
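With the used-space and used-inode numbers that 'df -k' and 'df -i' report, that works out to the following small sketch:

```python
def effective_bytes_per_inode(space_used_kb, inodes_used):
    """A live filesystem's average KB of disk space per used inode.
    If this falls below the mkfs-time ratio (16 KB by default), the
    filesystem is on track to run out of inodes before it runs out
    of disk space."""
    return space_used_kb / inodes_used

# For example, 2,000,000 KB used across 50,000 used inodes:
print(effective_bytes_per_inode(2_000_000, 50_000))  # 40.0 (KB per inode)
```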
ZFS on Linux's development version now has much better pool recovery for damaged pools
Back in March, I wrote about how much better ZFS pool recovery was coming, along with what turned out to be some additional exciting features, such as the long-awaited feature of shrinking ZFS pools by removing vdevs. The good news for people using ZFS on Linux is that most of both features have very recently made it into the ZFS on Linux development source tree. This is especially relevant and important if you have a damaged ZFS on Linux pool that either doesn't import or panics your system when you do import it.
These changes aren't yet in any ZFS on Linux release and I suspect that they won't appear until 0.8.0 is released someday (ie, they won't be ported into the current 0.7.x release branch). However, it's fairly easy to build ZFS on Linux from source if you need to temporarily run the latest version in order to recover or copy data out of a damaged pool that you can't otherwise get at. I believe that some pool recovery can be done as a one-time import and then you can revert back to a released version of ZFS on Linux to use the now-recovered pool, but certainly not all pool import problems can be repaired like this.
(As far as vdev removal goes, it currently requires permanently
using a version of ZFS that supports it, because it adds a
device_removal feature to your pool that will never deactivate.
This may change at some point in the future, but I wouldn't hold
my breath. It seems miraculous enough that we've gotten vdev removal
after all of these years, even if it's only for single devices and
mirror vdevs.)
I haven't tried out either of these features, but I am running a recently built development version of ZFS on Linux with them included and nothing has exploded so far. As far as things go in general, ZFS on Linux has a fairly large test suite and these changes added tests along with their code. And of course they've been tested upstream and OmniOS CE had enough confidence in them to incorporate them.
How we're going to be doing custom NFS mount authorization on Linux
We have a long standing system of custom NFS mount authorization on our current OmniOS-based fileservers. This system has been working reliably for years, but our next generation of fileservers will use a different OS, almost certainly Linux, and our current approach doesn't work on Linux, so we had to develop a new one.
One of the big attributes of our current system is that it doesn't require the clients to do anything special; they do NFS mount requests or NFS activity, and provided that their SSH daemon is running, they get automatically checked and authorized. This is important to making the system completely reliable, which is very important if we're going to use it for our own machines (which are absolutely dependent on NFS working). However, the goals of our NFS authorization have shifted so that we no longer require this for our own machines. In light of that, we decided to adopt a more straightforward approach on Linux, one that requires client machines to explicitly do a manual step on boot before they can get NFS access.
The overall 'authorization' system works via firewall rules, where
only machines in a particular ipset table
can talk to the NFS ports on the fileserver. Control over actual
NFS mounts and NFS level access is still done through NFS exports
and so on, but you have to be in the ipset table in order to even
get that far. To get authorized, ie to get added to the ipset table,
your client machine makes a connection to a specific TCP port on
the fileserver. This ends up causing a Go program to make a
connection to the SSH server on the client machine and verify its
host key against a
known_hosts file that we maintain; if the key verifies, we add
the client's IP address to the ipset table, and if it fails to
verify, we explicitly remove the client's IP address from the table.
(This connection can be done as simply as '
nc FILESERVER PORT
</dev/null >/dev/null'. In practice clients may want to record the
output from the port, because we spit out status messages, including
potentially important ones about why a machine failed verification.
We syslog them too, but those syslog logs aren't accessible to other
people.)
This Go program can actually check and handle multiple IP addresses at once (doing so in parallel). In this mode, it runs from cron every few minutes to re-verify all of the currently authorized hosts. The program is sufficiently fast that it can complete this full re-verification in under a second (and with negligible resource usage); in practice, the speed limit is how long of a timeout we use to wait for machines to respond.
To handle fileserver reboots, verified IPs are persistently recorded by touching a file (with the name of their IP address) in a magic directory. On boot and on re-verification, we merge all of the IPs from this directory with the IPs from the ipset table and verify them all. Any IPs that pass verification but aren't in the ipset table are added back to the table (and any IPs in the ipset table but not recorded on disk are persisted to disk), which means that on boot all IPs will be re-added to the ipset table without the client having to do anything.
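The merge step itself is just set arithmetic. A sketch of the logic (my own illustration; the real implementation is part of the Go program):

```python
def merge_authorized_ips(persisted_ips, ipset_ips):
    """Given the IPs persisted as files on disk and the IPs currently
    in the ipset table, work out the full set to re-verify and what
    each store is missing from the other."""
    return {
        "to_verify": persisted_ips | ipset_ips,
        "add_to_ipset": persisted_ips - ipset_ips,
        "persist_to_disk": ipset_ips - persisted_ips,
    }

# After a reboot the ipset table starts out empty, so everything
# persisted on disk is a candidate to be re-added:
r = merge_authorized_ips({"10.1.1.5", "10.1.1.6"}, set())
print(sorted(r["add_to_ipset"]))  # ['10.1.1.5', '10.1.1.6']
```

(The IP addresses here are invented examples.) Every IP in "to_verify" still has to pass the SSH host key check before it actually lands in the ipset table.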
Clients theoretically don't have to do anything once they've booted and been authorized, but because things can always go wrong we're going to recommend that they re-poke the magic TCP port every so often from cron, perhaps every five or ten minutes. That will ensure that any NFS outage should have a limited duration and thus hopefully a limited impact.
(In theory the parallel Go checker is so fast that we could just
extract all of the client IPs from our
known_hosts and always
try to verify them, say, once every fifteen minutes. In practice I
think we're unlikely to do this because there are various potential
issues and it's probably unlikely to help much in practice.)
We're probably going to provide people with a little Python program
that automatically does the client side of the verification for all
current NFS mounts and all mounts in
/etc/fstab, and then logs
the results and so on. This seems more friendly than asking all of
the people involved to write
their own set of scripts or commands for this.
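A minimal sketch of what such a client-side program might look like (the port number and fstab parsing details here are illustrative assumptions, not our actual implementation):

```python
import socket

AUTH_PORT = 7777  # hypothetical; whatever port the fileserver listens on

def nfs_servers_from_fstab(fstab_text):
    """Return the unique server names of all NFS mounts listed in
    fstab text; NFS entries look like 'server:/path /mnt nfs ...'."""
    servers = set()
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and not fields[0].startswith("#") \
           and fields[2].startswith("nfs") and ":" in fields[0]:
            servers.add(fields[0].split(":", 1)[0])
    return servers

def poke(server, port=AUTH_PORT, timeout=10):
    """Connect to the fileserver's authorization port and return its
    status output; the equivalent of the nc invocation, but capturing
    what the server says so it can be logged."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        return conn.makefile().read()

# Parsing a sample fstab (server names here are invented):
SAMPLE_FSTAB = """\
# /etc/fstab
/dev/sda1      /        ext4 defaults 0 1
fs1:/h/281     /h/281   nfs  defaults 0 0
fs2:/cs/mail   /cs/mail nfs4 rw       0 0
"""
print(sorted(nfs_servers_from_fstab(SAMPLE_FSTAB)))  # ['fs1', 'fs2']
```

In real use the program would read /etc/fstab (and the currently mounted filesystems), poke each distinct fileserver, and log what each one said.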
PS: Our own machines on trusted subnets are handled by just having a blanket allow rule in the firewall for those subnets. You only have to be in the ipset table if you're not on one of those subnets.