Wandering Thoughts archives

2018-05-30

Taking over program names in Linux is generally hard

One reaction to the situation with net-tools versus iproute2, where the Linux code for ifconfig, netstat, and so on is using old and incomplete interfaces and is basically unmaintained, is that the new and actively maintained iproute2 should provide its own reimplementations of ifconfig, netstat, and so on that preserve the interface (or as much of it as possible) while using modern mechanisms. Setting aside the question of whether the people developing iproute2 even like the ifconfig interface and are willing to spend their time writing a version of it, there are additional difficulties in doing this kind of name takeover in Linux.

The core problem is that existing Linux distributions and existing systems will already have those programs provided from a completely different package. This generally has two effects. First, some Linux distributions will disagree with what you're doing and want to keep providing those programs from the other package, which means that the upstream package has to be able to build and install things without its version of the programs it's theoretically trying to take over (ie, the new release of iproute2 has to be able to build without its version of ifconfig et al).

Second, when distributions decide that they trust your versions of the programs and prefer them to the old ones, they have to be able to do some sort of package upgrade or migration that replaces the other package with a version of your package that has your version of the programs included. There are also inevitably going to be distributions that will want to give users a choice of which version of the programs to install, which means that some of the time the distribution will actually build two binary packages for your package, one with your core tools ('iproute2') and one with your replacements for the other package's programs (a hypothetical 'iproute2-nettools', that has to cleanly replace 'net-tools').

Some of this work has to be done by the developers of the new package; they have to make replacement programs that are compatible enough that users won't complain, and then they have to make it possible to not build these programs or build them but not install them. Other portions of the work have to be done by distributions, who have to package all of this up, make sure that they don't accidentally create package conflicts, make sure package upgrades will work well and won't blow up dependencies, and so on. Since this complicates the lives of distributions and the people preparing packages, it's not something that they're likely to undertake casually. In fact, distributions are probably not likely to undertake it at all unless the developers of the new package actively try to push for it, or unless (and until) the programs in the old package become clearly broken and basically force themselves to be replaced.

(I'm generously assuming here that the old package is truly abandoned and everyone agrees that it has to go sometime. If there are people who want it to stay, you have additional problems.)

All of this is a consequence of there being multiple Linux distributions that make different decisions, and of Linux distributions being developed independently from each other and from the upstream packages. If everything was handled by a single group of developers, such takeovers would have much less to worry about and to coordinate (and you wouldn't have packaging work being done over and over again in different packaging systems).

TakingOverNamesHard written at 01:44:39

2018-05-25

There's real reasons for Linux to replace ifconfig, netstat, et al

One of the ongoing system administration controversies in Linux is the effort to obsolete the old, cross-Unix standard network administration and diagnosis commands of ifconfig, netstat and the like and replace them with fresh new Linux specific things like ss and the ip suite. Old sysadmins are generally grumpy about this; they consider it yet another sign of Linux's 'not invented here' attitude that sees Linux breaking from well-established Unix norms to go its own way. Although I'm an old sysadmin myself, I don't have this reaction. Instead, I think that it might be both sensible and honest for Linux to go off in this direction. There are two reasons for this, one ostensible and one subtle.

The ostensible surface issue is that the current code for netstat, ifconfig, and so on operates in an inefficient way. Per various people, netstat et al operate by reading various files in /proc, and doing this is not the most efficient thing in the world (either on the kernel side or on netstat's side). You won't notice this on a small system, but apparently there are real impacts on large ones. Modern commands like ss and ip use Linux's netlink sockets, which are much more efficient. In theory netstat, ifconfig, and company could be rewritten to use netlink too; in practice this doesn't seem to have happened and there may be political issues involving different groups of developers with different opinions on which way to go.

(Netstat and ifconfig are part of net-tools, while ss and ip are part of iproute2.)
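To give a feel for what 'reading various files in /proc' involves, here is a minimal Python sketch that decodes lines of /proc/net/tcp, the text file that lists IPv4 TCP sockets. It is only an illustration of the general approach, not net-tools' actual code. The function names are made up for this example, and the address decoding assumes a little-endian machine such as x86.

```python
import socket
import struct


def parse_proc_net_tcp_line(line):
    """Return (local IPv4 address, local port, state code) from one data line."""
    fields = line.split()
    # fields[0] is the slot number ("0:"), fields[1] is local_address,
    # fields[2] is rem_address and fields[3] is the socket state in hex.
    hex_addr, hex_port = fields[1].split(":")
    # The address is a 32-bit value printed as hex; on a little-endian
    # machine (assumed here) the bytes appear reversed.
    addr = socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))
    return addr, int(hex_port, 16), fields[3]


def listening_sockets(path="/proc/net/tcp"):
    """List (address, port) for every socket in state 0A (listening)."""
    result = []
    with open(path) as f:
        next(f, None)  # skip the header line
        for line in f:
            if not line.strip():
                continue
            addr, port, state = parse_proc_net_tcp_line(line)
            if state == "0A":
                result.append((addr, port))
    return result
```

Every socket is one line of text that has to be formatted by the kernel and then parsed again by the program, which is where the overhead on systems with very many sockets comes from.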

However, the deeper issue is the interface that netstat, ifconfig, and company present to users. In practice, these commands are caught between two masters. On the one hand, the information the tools present and the questions they let us ask are deeply intertwined with how the kernel itself does networking, and in general the tools are very much supposed to report the kernel's reality. On the other hand, the users expect netstat, ifconfig and so on to have their traditional interface (in terms of output, command line arguments, and so on); any number of scripts and tools fish things out of ifconfig output, for example. As the Linux kernel has changed how it does networking, this has presented things like ifconfig with a deep conflict; their traditional output is no longer necessarily an accurate representation of reality.

For instance, here is ifconfig output for a network interface on one of my machines:

 ; ifconfig -a
 [...]
 em0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 128.100.3.XX  netmask 255.255.255.0  broadcast 128.100.3.255
    inet6 fe80::6245:cbff:fea0:e8dd  prefixlen 64  scopeid 0x20<link>
    ether 60:45:cb:a0:e8:dd  txqueuelen 1000  (Ethernet)
 [...]

There are no other 'em0:...' devices reported by ifconfig, which is unfortunate because this output from ifconfig is not really an accurate picture of reality:

; ip -4 addr show em0
[...]
  inet 128.100.3.XX/24 brd 128.100.3.255 scope global em0
    valid_lft forever preferred_lft forever
  inet 128.100.3.YY/24 brd 128.100.3.255 scope global secondary em0
    valid_lft forever preferred_lft forever

This interface has an IP alias, set up through systemd's networkd. Perhaps there once was a day when all IP aliases on Linux had to be set up through additional alias interfaces, which ifconfig would show, but these days each interface can have multiple IPs and directly setting them this way is the modern approach.

This issue presents programs like ifconfig with an unappealing choice. They can maintain their traditional output, which is now sometimes a lie but which keeps people's scripts working, or they can change the output to better match reality and probably break some scripts. It's likely to be the case that the more they change their output (and arguments and so on) to match the kernel's current reality, the more they will break scripts and tools built on top of them. And some people will argue that those scripts and tools that would break are already broken, just differently; if you're parsing ifconfig output on my machine to generate a list of all of the local IP addresses, you're already wrong.

(If you try to keep the current interface while lying as little as possible, you wind up having arguments about what to lie about and how. If you can only list one IPv4 address per interface in ifconfig, how do you decide which one?)
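As a small illustration of the difference, here is a Python sketch that pulls every IPv4 address out of 'ip -4 addr show' style text output, using the em0 output from above as its sample (with the same XX and YY placeholders). The regular expression and function name are just for this example, and scraping text like this is exactly the sort of fragile thing the entry is talking about.

```python
import re

# Matches lines like "inet 128.100.3.XX/24 brd ... scope global secondary em0".
# Requiring whitespace after "inet" keeps "inet6" lines from matching.
INET_RE = re.compile(r"^\s*inet\s+(\S+)/(\d+)\b(.*)$")


def ipv4_addresses(ip_output):
    """Return a list of (address, prefix length, is_secondary) tuples."""
    addrs = []
    for line in ip_output.splitlines():
        m = INET_RE.match(line)
        if m:
            is_secondary = "secondary" in m.group(3).split()
            addrs.append((m.group(1), int(m.group(2)), is_secondary))
    return addrs


SAMPLE = """\
  inet 128.100.3.XX/24 brd 128.100.3.255 scope global em0
    valid_lft forever preferred_lft forever
  inet 128.100.3.YY/24 brd 128.100.3.255 scope global secondary em0
    valid_lft forever preferred_lft forever
"""

for addr, prefixlen, secondary in ipv4_addresses(SAMPLE):
    print(addr, prefixlen, "secondary" if secondary else "primary")
```

Run against the sample, this reports both addresses on em0, where anything built on the ifconfig output above would only ever see the first one.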

In a sense, deprecating programs like ifconfig and netstat that have wound up with interfaces that are inaccurate but hard to change is the honest approach. Their interfaces can't be fixed without significant amounts of pain and they still work okay for many systems, so just let them be while encouraging people to switch to other tools that can be more honest.

(This elaborates on an old tweet of mine.)

PS: I believe that the kernel interfaces that ifconfig and so on currently use to get this information are bound by backwards compatibility issues themselves, so getting ifconfig to even know that it was being inaccurate here would probably take code changes.

ReplacingNetstatNotBad written at 01:31:08

2018-05-17

I'm worried about Wayland but there's not much I can do about it

In a comment on my entry about how I have a boring desktop, Opk asked a very good question:

Does it concern you at all that Wayland may force change on you? It may be a good few years away yet and perhaps fvwm will be ported.

Oh my yes, I'm definitely worried about this (and it turns out that I have been for quite some time, which also goes to show how long Wayland has been slowly moving forward). The FVWM people have said that they're not going to try to write a version of fvwm for Wayland, which means that when Wayland inevitably takes over I'm going to need a new 'window manager' (in Wayland this is a lot more than just what it is in X) and possibly an entirely new desktop environment to go with it.

The good news is that apparently XWayland provides a reasonably good way to let X programs still display on a Wayland server, so I won't be forced to abandon as many X things as I expected. I may even be able to continue to run remote X programs via SSH and XWayland, which is important for my work desktop. This X to Wayland bridge will mean that I can keep not just programs with no Wayland equivalent but also old favorites like xterm, where I simply don't want to use what will be the Wayland equivalent (I don't like gnome-terminal or konsole very much).

The bad news for me is two-fold. First, I'm not attracted to tiling window managers at all, and since tiling window managers are the in thing, they're the most common alternate window managers for Wayland (based on various things, such as the Arch list). There seems to be a paucity of traditional stacking Wayland WMs that are as configurable as fvwm is, although perhaps there will be alternate methods in Wayland to do things like have keyboard and mouse bindings. It's possible that this will change when Wayland starts becoming more dominant, but I'm not holding my breath; heavily customized Linux desktop environments have been feeling more and more like extreme outliers over the years.

Second, it seems at least reasonably likely that a lot of current tray applets and notification systems will stop being general and start becoming tightly bound to mainstream desktop environments like Gnome 3, KDE, and Cinnamon. We've already seen this with Gnome 3 and Cinnamon, which have 'applets' that are now JavaScript extensions that run in the context of the Gnome and Cinnamon shells and simply can't be used outside them. In a Wayland world that focuses attention more than ever on a few mainstream desktop environments, will there be any equivalent of stalonetray and things for it like pnmixer?

(The people writing tiling Wayland window managers like Sway will almost certainly want there to be, because it will be hard to have a viable alternate environment without them. The question is whether major projects like NetworkManager will oblige or whether NM will use its limited development resources elsewhere.)

So yes, I worry about all of this. But in practice it's a very abstracted worry. To start with, Wayland is still not really here yet. Fedora is using it more, but it's by no means universal even for Gnome (where it's the default), and I believe that KDE (and other supported desktop environments) don't even really try to use it. At this rate it will be years and years before anyone is seriously talking about abandoning X (since Gnome programs will still face pressure to be usable in KDE, Cinnamon, and other desktop environments that haven't yet switched to Wayland).

(I believe that Fedora is out ahead of other Linux distributions, too. People like Debian will probably be trying to support X and pressure people to support X for years to come.)

More significantly, there's nothing I can do about all of this. How Wayland in general and Wayland environments develop is far beyond my ability to influence; in practice I'm a far outlier in window manager and desktop land, and so I'll have to make do with whatever is available. If I'm lucky it will be something generally comparable to my current environment; if I'm not, well, I can use Cinnamon and it will probably survive in a Wayland-only world. I might even learn enough Cinnamon shell and JavaScript to customize it a bit.

(If I had a lot of energy and enthusiasm, perhaps I would be trying to write the stacking, construction kit style Wayland window manager and compositor of my dreams. I don't have anything like that energy. I do hope other people do, and while I'm hoping I hope that they like textual icon managers as much as I do.)

WaylandWorries written at 01:33:05

2018-05-16

How you run out of inodes on an extN filesystem (on Linux)

I've mentioned that we ran out of inodes on a Linux server and covered what the high level problem was, but I've never described the actual mechanics of how and why you can run out of inodes on a filesystem, or more specifically on an extN filesystem. I have to be specific about the filesystem type, because how this is handled varies from filesystem to filesystem; some either have no limit on how many inodes you can have or have such a high limit that you're extremely unlikely to run into it.

The fundamental reason you can run out of inodes on an extN filesystem is that extN statically allocates space for inodes; in every extN filesystem, there is space for so many inodes reserved, and you can never have any more than this. If you use 'df -i' on an extN filesystem, you can see this number for the filesystem, and you can also see it with dumpe2fs, which will tell you other important information. Here, let's look at an ext4 filesystem:

# dumpe2fs -h /dev/md10
[...]
Block size:               4096
[...]
Blocks per group:         32768
[...]
Inodes per group:         8192
[...]

I'm showing this information because it leads to the important parameter for how many inodes any particular extN filesystem has, which is the bytes/inode ratio (mke2fs's -i argument). By default this is 16 KB, ie there will be one inode for every 16 KB of space in the filesystem, and as the mke2fs manpage covers, it's not too sensible to set it below 4 KB (the usual extN block size).
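The connection between the dumpe2fs numbers and the bytes/inode ratio is simple arithmetic, sketched here in Python. The function names are made up for this example, and the 1 TiB filesystem size is only an illustrative figure.

```python
def bytes_per_inode(block_size, blocks_per_group, inodes_per_group):
    """Bytes of filesystem space per inode, from dumpe2fs -h fields."""
    return block_size * blocks_per_group // inodes_per_group


def total_inodes(filesystem_bytes, ratio):
    """Approximate inode count for a filesystem made with this ratio."""
    return filesystem_bytes // ratio


# The values from the dumpe2fs output above: each block group covers
# 32768 * 4096 bytes (128 MB) and has room for 8192 inodes.
ratio = bytes_per_inode(4096, 32768, 8192)
print(ratio // 1024, "KB per inode")

# An illustrative 1 TiB filesystem at that ratio.
print(total_inodes(2**40, ratio), "inodes")
```

With the values shown, each block group covers 128 MB and holds 8192 inodes, which works out to the default 16 KB per inode.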

The existence of the bytes/inode ratio gives us a straightforward answer for how you can run a filesystem out of inodes: you simply create lots of files that are smaller than this ratio. ExtN implicitly assumes that each inode will on average use at least 16 KB of disk space; if on average your inodes use less, you will run out of inodes before you run out of disk space. One tricky thing here is that this space doesn't have to be used up by regular files, because other sorts of inodes can be small too. Probably the easiest other source is directories; if you have lots of directories with a relatively small number of subdirectories and files in each, it's quite possible for many of them to be smaller than 16 KB, and in some cases you can have a great many subdirectories.

(In our problem directory hierarchy, almost all of the directories are 4 KB, although a few are significantly larger. And the hierarchy can have a lot of subdirectories when things go wrong.)

Another case is symbolic links. Most symbolic links are quite small, and in fact ext4 may be able to store your symbolic link entirely in the inode itself. This means that you can potentially use up a lot of inodes without using any disk space (well, beyond the space for the directories that the symbolic links are in). There are other sorts of special files that also use little or no disk space, but you probably don't have tons of them in an extN filesystem unless something unusual is going on.

(If you do have tens of thousands of Unix sockets or FIFOs or device files, though, you might want to watch out. Or even tons of zero-length regular files that you're using as flags and a persistence mechanism.)

Most people will never run into this on most filesystems, because most filesystems have an average space usage per inode that's well above 16 KB. There are usually plenty of files over 16 KB, not that many symbolic links, and relatively few (small) directories compared to the regular files. For instance, one of my relatively ordinary Fedora root filesystems has a bytes/inode ratio of roughly 73 KB per inode, and another is at 41 KB per inode.

(You can work out your filesystem's bytes/inode ratio simply by dividing the space used in KB by the number of inodes used.)
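If you'd rather not do the division by hand from df and 'df -i' output, the same numbers are available through statvfs(). This is a minimal Python sketch with made-up function names. It returns None for filesystems that report no inodes in use, which includes ones that don't report an inode count at all.

```python
import os


def used_bytes_per_inode(total_blocks, free_blocks, fragment_size,
                         total_inodes, free_inodes):
    """Average bytes of used space per used inode, or None if unknown."""
    used_bytes = (total_blocks - free_blocks) * fragment_size
    used_inodes = total_inodes - free_inodes
    if used_inodes <= 0:
        return None
    return used_bytes / used_inodes


def ratio_for_path(path):
    """Compute the ratio for the filesystem that contains 'path'."""
    st = os.statvfs(path)
    return used_bytes_per_inode(st.f_blocks, st.f_bfree, st.f_frsize,
                                st.f_files, st.f_ffree)


if __name__ == "__main__":
    ratio = ratio_for_path("/")
    if ratio is None:
        print("no inode usage reported for /")
    else:
        print("%.1f KB per inode in use on /" % (ratio / 1024))
```

If the result is heading down toward the filesystem's bytes/inode ratio (16 KB by default), the filesystem is using up inodes about as fast as it is using up space.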

HowInodesRunOut written at 01:10:42

2018-05-12

ZFS on Linux's development version now has much better pool recovery for damaged pools

Back in March, I wrote about how much better ZFS pool recovery was coming, along with what turned out to be some additional exciting features, such as the long-awaited feature of shrinking ZFS pools by removing vdevs. The good news for people using ZFS on Linux is that most of both features have very recently made it into the ZFS on Linux development source tree. This is especially relevant and important if you have a damaged ZFS on Linux pool that either doesn't import or panics your system when you do import it.

(These changes are OpenZFS 9075 and its dependencies such as OpenZFS 8961, and the vdev removal changes, although there are followup fixes to them such as OpenZFS 9290.)

These changes aren't yet in any ZFS on Linux release and I suspect that they won't appear until 0.8.0 is released someday (ie, they won't be ported into the current 0.7.x release branch). However, it's fairly easy to build ZFS on Linux from source if you need to temporarily run the latest version in order to recover or copy data out of a damaged pool that you can't otherwise get at. I believe that some pool recovery can be done as a one-time import and then you can revert back to a released version of ZFS on Linux to use the now-recovered pool, but certainly not all pool import problems can be repaired like this.

(As far as vdev removal goes, it currently requires permanently using a version of ZFS that supports it, because it adds a device_removal feature to your pool that will never deactivate, per zpool-features. This may change at some point in the future, but I wouldn't hold my breath. It seems miraculous enough that we've gotten vdev removal after all of these years, even if it's only for single devices and mirror vdevs.)

I haven't tried out either of these features, but I am running a recently built development version of ZFS on Linux with them included and nothing has exploded so far. As far as things go in general, ZFS on Linux has a fairly large test suite and these changes added tests along with their code. And of course they've been tested upstream and OmniOS CE had enough confidence in them to incorporate them.

ZFSOnLinuxBetterPoolImport written at 22:26:45

2018-05-08

How we're going to be doing custom NFS mount authorization on Linux

We have a long standing system of custom NFS mount authorization on our current OmniOS-based fileservers. This system has been working reliably for years, but our next generation of fileservers will use a different OS, almost certainly Linux, and our current approach doesn't work on Linux, so we had to develop a new one.

One of the big attributes of our current system is that it doesn't require the clients to do anything special; they do NFS mount requests or NFS activity, and provided that their SSH daemon is running, they get automatically checked and authorized. This is important to making the system completely reliable, which is very important if we're going to use it for our own machines (which are absolutely dependent on NFS working). However, the goals of our NFS authorization have shifted so that we no longer require this for our own machines. In light of that, we decided to adopt a more straightforward approach on Linux, one that requires client machines to explicitly do a manual step on boot before they can get NFS access.

The overall 'authorization' system works via firewall rules, where only machines in a particular ipset table can talk to the NFS ports on the fileserver. Control over actual NFS mounts and NFS level access is still done through exportfs and so on, but you have to be in the ipset table in order to even get that far. To get authorized, ie to get added to the ipset table, your client machine makes a connection to a specific TCP port on the fileserver. This ends up causing a Go program to make a connection to the SSH server on the client machine and verify its host key against a known_hosts file that we maintain; if the key verifies, we add the client's IP address to the ipset table, and if it fails to verify, we explicitly remove the client's IP address from the table.

(This connection can be done as simply as 'nc FILESERVER PORT </dev/null >/dev/null'. In practice clients may want to record the output from the port, because we spit out status messages, including potentially important ones about why a machine failed verification. We syslog them too, but those syslog logs aren't accessible to other people.)
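For clients that want to keep the status messages, the same poke can be done from Python. This is a minimal sketch, not the eventual client program. It assumes the fileserver writes its status text and then closes the connection, and the function name and timeout are made up for this example.

```python
import socket


def poke_fileserver(host, port, timeout=10.0):
    """Connect to the authorization port, send nothing, and return
    whatever status text the server writes before it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break  # the server closed the connection
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

A caller would log the returned text somewhere local and treat socket.timeout or another OSError as 'could not reach the fileserver' instead of as a verification failure.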

This Go program can actually check and handle multiple IP addresses at once (doing so in parallel). In this mode, it runs from cron every few minutes to re-verify all of the currently authorized hosts. The program is sufficiently fast that it can complete this full re-verification in under a second (and with negligible resource usage); in practice, the speed limit is how long of a timeout we use to wait for machines to respond.
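The real checker is a Go program, but the general shape of 'check many hosts at once and let the timeout set the pace' can be sketched in Python with a thread pool. Here verify_host is a hypothetical stand-in for the actual SSH host key check, passed in as a function, and any exception from it is treated as a failed verification.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(ips, verify_host, max_workers=32):
    """Run verify_host(ip) for every IP in parallel.

    Returns a dict mapping each IP to True (verified) or False (failed,
    timed out, or raised an error).
    """
    ips = list(ips)
    if not ips:
        return {}

    def safe_verify(ip):
        try:
            return bool(verify_host(ip))
        except Exception:
            return False

    workers = min(max_workers, len(ips))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map() yields results in the same order as its input.
        return dict(zip(ips, pool.map(safe_verify, ips)))
```

Because the checks overlap, the total time is roughly that of the slowest host, not the sum over all hosts, which is why the timeout ends up being the practical speed limit.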

To handle fileserver reboots, verified IPs are persistently recorded by touching a file (with the name of their IP address) in a magic directory. On boot and on re-verification, we merge all of the IPs from this directory with the IPs from the ipset table and verify them all. Any IPs that pass verification but aren't in the ipset table are added back to the table (and any IPs in the ipset table but not recorded on disk are persisted to disk), which means that on boot all IPs will be re-added to the ipset table without the client having to do anything.
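The merge step can be expressed as a few set operations. This Python sketch is an illustration of the logic described above, not the real code: the function names are made up, it assumes only IPs that pass verification get persisted, and it leaves open what happens to the on-disk file of an IP that fails (that isn't covered here).

```python
import os


def load_persisted_ips(directory):
    """Each verified IP is recorded as a file named after the IP."""
    return set(os.listdir(directory))


def reconcile(on_disk, in_ipset, verified):
    """Work out what to change after re-verifying every known IP.

    on_disk and in_ipset are sets of IP strings; verified maps an IP to
    True or False (a missing IP counts as a failure).
    """
    candidates = on_disk | in_ipset
    passed = {ip for ip in candidates if verified.get(ip, False)}
    failed = candidates - passed
    return {
        # Passed but missing from the table, eg after a fileserver reboot.
        "add_to_ipset": passed - in_ipset,
        # In the table and still valid, but never recorded on disk.
        "persist_to_disk": passed - on_disk,
        # Failed verification, so explicitly removed from the table.
        "remove_from_ipset": failed & in_ipset,
    }
```

Right after a reboot the ipset table is empty, so every persisted IP that still verifies lands in add_to_ipset, which is how clients get re-authorized without doing anything.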

Clients theoretically don't have to do anything once they've booted and been authorized, but because things can always go wrong we're going to recommend that they re-poke the magic TCP port every so often from cron, perhaps every five or ten minutes. That will ensure that any NFS outage should have a limited duration and thus hopefully a limited impact.

(In theory the parallel Go checker is so fast that we could just extract all of the client IPs from our known_hosts and always try to verify them, say, once every fifteen minutes. In practice I think we're unlikely to do this because there are various potential issues and it probably wouldn't help much.)

We're probably going to provide people with a little Python program that automatically does the client side of the verification for all current NFS mounts and all mounts in /etc/fstab, and then logs the results and so on. This seems more friendly than asking all of the people involved to write their own set of scripts or commands for this.

PS: Our own machines on trusted subnets are handled by just having a blanket allow rule in the firewall for those subnets. You only have to be in the ipset table if you're not on one of those subnets.

CustomMountAuthorizationII written at 00:34:33
