Wandering Thoughts archives

2015-01-25

The long term problem with ZFS on Linux is its license

Since I've recently praised ZFS on Linux as your only real choice today for an advanced filesystem, I need to bring up the long term downside because, awkwardly, I do believe that btrfs is probably going to be the best pragmatic option in the long term and is going to see wider adoption once it works reliably.

The core of the problem is ZFS's license, which I've written about before. What I didn't write about back then because I didn't know enough at the time was the full effects on ZoL of not being included in distributions. The big effect is it will probably never be easy or supported to make your root filesystem a ZFS pool. Unless distributions restructure their installers (and they have no reason to do so), a ZFS root filesystem needs first class support in the installer and it will almost certainly be rather difficult (both politically and otherwise) to add this. This means no installer-created filesystem can be a ZFS one, and the root filesystem has to be created in the installer.

(Okay, you can shuffle around your root filesystem after the basic install is done. But that's a big pain.)

In turn this means that ZFS on Linux is probably always going to be a thing for experts. To use it you need to leave disk space untouched in the installer (or add disk space later), then at least fetch the ZoL packages from an additional repository and have them auto-install on your kernel. And of course you have to live with a certain amount of lack of integration in all of the bits (especially if you go out of your way to use a ZFS root filesystem).

(And as I've seen there are issues with mixing ZFS and non-ZFS filesystems. I suspect that these issues will turn out to be relatively difficult to fix, if they can be at all. Certainly things seem much more likely to work well if all of your filesystems are ZFS filesystems.)

PS: Note that in general having non-GPLv2, non-bundled kernel modules is not an obstacle to widespread adoption if people want what you have to offer. A large number of people have installed binary modules for their graphics cards, for one glaring example. But I don't think that fetching these modules has been integrated into installers despite how popular they are.

(Also, I may be wrong here. If ZFS becomes sufficiently popular, distributions might at least make it easy for people to make third party augmented installers that have support for ZFS. Note that ZFS support in an installer isn't as simple as the choice of another filesystem; ZFS pools are set up quite differently from normal filesystems and good ZFS root pool support has to override things like setup for software RAID mirroring.)

ZFSOnLinuxRootFSProblem written at 04:20:46

2015-01-23

A problem with gnome-terminal in Fedora 21, and tracking it down

Today I discovered that Fedora 21 subtly broke some part of my environment to the extent that gnome-terminal refuses to start. More than that, it refuses to start with a completely obscure error message:

; gnome-terminal
Error constructing proxy for org.gnome.Terminal:/org/gnome/Terminal/Factory0: Error calling StartServiceByName for org.gnome.Terminal: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.gnome.Terminal exited with status 8

If you're here searching for the cause of this error message, let me translate it: what it really means is that your session's dbus-daemon could not start /usr/libexec/gnome-terminal-server when gnome-terminal asked it to. In many cases, it may be because your system's environment has not initialized $LC_CTYPE or $LANG to some UTF-8 locale at the time that your session was being set up (even if one of these environment variables gets set later, by the time you're running gnome-terminal). In the modern world, an increasing number of Gnome bits absolutely insist on being in a UTF-8 locale and fail hard if they aren't.

Some of you may be going 'what?' here. What you suspect is correct; the modern Gnome 3 'gnome-terminal' program is basically a cover script rather than an actual terminal emulator. Instead of opening up a terminal window itself, it exists to talk over DBus to a master gnome-terminal-server process (which will theoretically get started on demand). It is the g-t-s process that is the actual terminal emulator, creates the windows, starts the shells, and all. And yes, one process handles all of your gnome-terminal windows; if that process ever hits a bug (perhaps because of something happening in one window) and dies, all of them die. Let's hope g-t-s doesn't have any serious bugs.

To find the cause of this issue, well, if I'm being honest a bunch of this was found with an Internet search of the error message. This didn't turn up my exact problem but it did turn up people reporting locale problems and also a mention of gnome-terminal-server, which I hadn't known about before. For actual testing and verification I did several things:

  • first I used strace on gnome-terminal itself, which told me nothing useful.

  • I discovered that starting gnome-terminal-server by hand before running gnome-terminal made everything work.

  • I used dbus-monitor --session to watch DBus messages when I tried to start gnome-terminal. This didn't really tell me anything that I couldn't have seen from the error message, but it did verify that there was really a DBus message being sent.

  • I found the dbus-daemon process that was handling my session DBus and used 'strace -f -p ...' on it while I ran gnome-terminal. This eventually wound up with it starting gnome-terminal-server and g-t-s exiting after writing a message to standard error. Unfortunately the default strace settings truncated the message, so I reran strace while adding '-e write=2' to completely dump all messages to standard error. This got me the helpful error message from g-t-s:
    Non UTF-8 locale (ANSI_X3.4-1968) is not supported!

    (If you're wondering if dbus-daemon sends standard error from either itself or processes that it starts to somewhere useful, ha ha no, sorry, we're all out of luck. As far as I can tell it specifically sends standard error to /dev/null.)

  • I dumped the environment of the dbus-daemon process with 'tr '\0' '\n' </proc/<PID>/environ | less' and inspected what environment variables it had set. This showed that it had been started without my usual $LC_CTYPE setting (cf).
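The last two steps can be kept around as small shell helpers. This is only a sketch: the function names are made up for illustration, and the pgrep pattern in the usage comment is an assumption about how the session dbus-daemon was started (check with ps if it matches nothing).

```shell
#!/bin/sh
# Hypothetical helper names; the underlying commands are the ones described above.

# Print a NUL-separated environment file with one VAR=value per line.
show_environ() {
    tr '\000' '\n' < "$1"
}

# Show only the locale variables that a running process was started with.
locale_env_of_pid() {
    show_environ "/proc/$1/environ" | grep -E '^(LANG|LC_[A-Z]+)=' ||
        echo "no LANG or LC_* variables in the environment of PID $1"
}

# Follow a process and everything it starts, dumping all writes to
# file descriptor 2 (standard error) in full instead of truncated.
trace_stderr_of_pid() {
    strace -f -e write=2 -p "$1"
}

# Usage (the pgrep pattern is an assumption; adjust it to your session):
#   pid=$(pgrep -u "$(id -u)" -f 'dbus-daemon.*--session' | head -n 1)
#   locale_env_of_pid "$pid"
#   trace_stderr_of_pid "$pid"      # then run gnome-terminal in another window
```

Note that /proc/<PID>/environ shows the environment the process was started with, which is exactly what matters here; variables set later in your shell never reach an already-running dbus-daemon.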

With this in hand I could manually reproduce the problem by trying to start gnome-terminal-server with $LC_CTYPE unset, and then I could fix up my X startup scripts to set $LC_CTYPE before they ran dbus-launch.
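A sketch of that kind of fix in an X startup script might look like the following. The helper name and the en_US.UTF-8 fallback are illustrative assumptions, not what the original scripts contain, and the dbus-launch line is left as a comment so the fragment can be tried safely.

```shell
#!/bin/sh
# Sketch only: make sure LC_CTYPE is a UTF-8 locale before the session bus starts.
# 'ensure_utf8_ctype' and the en_US.UTF-8 default are assumptions.

ensure_utf8_ctype() {
    # Locale precedence is LC_ALL, then LC_CTYPE, then LANG.
    case "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}" in
        *.UTF-8* | *.utf8*) ;;                       # already a UTF-8 locale
        *) LC_CTYPE=en_US.UTF-8; export LC_CTYPE ;;  # otherwise force one
    esac
    # Caveat: a non-UTF-8 LC_ALL would still override LC_CTYPE.
}

ensure_utf8_ctype

# Only now start the session bus, so that dbus-daemon (and everything it
# spawns, such as gnome-terminal-server) inherits the setting:
#   eval "$(dbus-launch --sh-syntax --exit-with-session)"
```

The important part is the ordering, not the particular locale: whatever sets the locale has to run before dbus-launch does.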

(This entry is already long enough so I am going to skip my usual rant about Gnome and especially Gnome 3 making problems like this very difficult for even experienced system administrators to debug because there are now so many opaque moving parts to even running Gnome programs standalone, much less in a full Gnome environment. How is anyone normal supposed to debug this when gnome-terminal can't even be bothered to give you a useful error summary in addition to the detailed error report from DBus?)

GnomeTerminalUTF8Required written at 01:54:19

2015-01-22

How to set up static networking with systemd-networkd, or at least how I did

I recently switched my Fedora 21 office workstation from Fedora's old /etc/init.d/network init script based method of network setup to using the (relatively new) systemd network setup functionality, for reasons that I covered yesterday. The systemd documentation is a little bit scant and not complete, so in the process I accumulated some notes that I'm going to write down.

First, I'm going to assume that you're having networkd take over everything from the ground up, possibly including giving your physical network devices stable names. If you were previously doing this through udev, you'll need to comment out bits of /etc/udev/rules.d/70-persistent-net.rules (or wherever your system put it).

To configure your networking you need to set up two files for each network connection. The first file will describe the underlying device, using .link files for physical devices and .netdev files for VLANs, bridges, and so on. For physical links, you can use various things to identify the device (I use just the MAC address, which matches what I was doing in udev) and then set its name with 'Name=' in the '[Link]' section. Just to make you a bit confused, the VLANs set up on a physical device are not configured in its .link file.

The second file describes the actual networking on the device (physical or virtual), including virtual devices associated with it; this is done with .network files. Again you can use various things to identify which device you want to operate on; I used the name of the device (a [Match] section with Name=<whatever>). Most of the setup will be done in the [Network] section, including telling networkd what VLANs to create. If you want IP aliases on a given interface, specify multiple addresses. Although it's not documented, experimentally the last address specified becomes the primary (default) address of the interface, ie the default source address for traffic going out that interface.

(This is unfortunately reversed from what I expected, which was that the first address specified would be the primary. Hopefully the systemd people will not change this behavior but document it, and then provide a way of specifying primary versus secondary addresses.)
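As a concrete illustration, here is a minimal sketch of such a .link and .network pair. It writes the files into a scratch directory rather than /etc/systemd/network so they can be reviewed first; the MAC address, the IP addresses (taken from the documentation ranges), and the file names are all placeholders.

```shell
#!/bin/sh
# Sketch: write example networkd files into a scratch directory for review.
# The MAC address, IP addresses, and file names are all placeholders.
outdir=./networkd-sketch
mkdir -p "$outdir"

# Name the physical device, matching it by MAC address.
cat > "$outdir/em0.link" <<'EOF'
[Match]
MACAddress=00:11:22:33:44:55

[Link]
Name=em0
EOF

# Configure addressing on that device. Going by the experimental
# observation above, the last Address= line becomes the primary address.
cat > "$outdir/em0.network" <<'EOF'
[Match]
Name=em0

[Network]
# IP alias first, intended primary address last
Address=192.0.2.11/24
Address=192.0.2.10/24
Gateway=192.0.2.1
EOF
```

Since the address ordering is undocumented behavior, it is worth checking the result with 'ip addr list' after the interface comes up instead of trusting the file order.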

If you're setting up IP aliases for an interface, it's important to know that ifconfig will now be misleading. In the old approach, alias interfaces got created (eg 'em0:0') and showed the alias IP. In the networkd world those interfaces are not created and you need to turn to 'ip addr list' in order to see your IP aliases. Not knowing this can be very alarming, since in ifconfig it looks like your aliases disappeared. In general you can expect networkd to give you somewhat different ifconfig and ip output because it does stuff somewhat differently.

For setting up VLANs, the VLAN= name in your physical device's .network file is paired up with the [NetDev] Name= setting in your VLAN's .netdev file. You then create another .network file with a [Match] Name= setting of your VLAN's name to configure the VLAN interface's IP address and so on. Unfortunately this is a bit tedious, since your .netdev VLAN file basically exists to set a single value (the [VLAN] Id= setting); it would be more convenient (although less pure) if you could just put that information into a new [VLAN] section in the .network file that specified Name and Id together.
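The three-file pairing for a VLAN can be sketched like this, again written to a scratch directory. The interface name em0.151 follows the naming used in this entry; the addresses are placeholders. Besides the Id= setting, the .netdev file also has to say Kind=vlan in its [NetDev] section.

```shell
#!/bin/sh
# Sketch: the three files involved in one VLAN, with placeholder addresses.
outdir=./networkd-vlan-sketch
mkdir -p "$outdir"

# 1. The parent device's .network file names the VLAN to create.
cat > "$outdir/em0.network" <<'EOF'
[Match]
Name=em0

[Network]
Address=192.0.2.10/24
VLAN=em0.151
EOF

# 2. The .netdev file defines the VLAN device; Name= must match VLAN= above.
cat > "$outdir/em0.151.netdev" <<'EOF'
[NetDev]
Name=em0.151
Kind=vlan

[VLAN]
Id=151
EOF

# 3. A second .network file configures the VLAN interface itself.
cat > "$outdir/em0.151.network" <<'EOF'
[Match]
Name=em0.151

[Network]
Address=198.51.100.10/24
EOF
```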

If you're uniquely specifying physical devices in .link files (eg with a MAC address for all of them, with no wildcards) and devices in .network files, I believe that the filenames of all of these files are arbitrary. I chose to give my VLANs filenames of eg 'em0.151.netdev' (where em0.151 is the interface name) just in case. As you can see, there seems to be relatively little constraint on the interface names and I was able to match the names required by my old Fedora ifcfg-* setup so that I didn't have to change any of my scripts et al.

You don't need to define a lo interface; networkd will set one up automatically and do the right thing.

Once you have everything set up in /etc/systemd/network, you need to enable this by (in my case) 'chkconfig --del network; systemctl enable systemd-networkd' and then rebooting. If you have systemd .service units that want to wait for networking to be up, you also want to enable the systemd-networkd-wait-online.service unit, which does what it says in its manpage, and then make your units depend on it in the usual way. Note that this is not quite the same as setting your SysV init script ordering so that your init scripts came after network, since this service waits for at least one interface to be plugged in to something (unfortunately there's no option to override this). While systemd still creates the 'sys-subsystem-net-devices-<name>.device' pseudo-devices, they will now appear faster and with less configured than they did with the old init scripts.

(I used to wait for the appearance of the em0.151 device as a sign that the underlying em0 device had been fully configured with IP addresses attached and so on. This is no longer the case in the networkd world, so this hack broke on me.)
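One way to express the dependency is a drop-in for the unit that needs the network. The sketch below writes one for a hypothetical 'myservice.service' into a scratch directory; the unit name and the drop-in file name are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: a drop-in that orders a hypothetical 'myservice.service' after
# systemd-networkd-wait-online.service. Written to a scratch directory.
outdir=./networkd-units-sketch/myservice.service.d
mkdir -p "$outdir"

cat > "$outdir/wait-online.conf" <<'EOF'
[Unit]
Wants=systemd-networkd-wait-online.service
After=systemd-networkd-wait-online.service
EOF

# As root, the switch-over itself is the sequence described above:
#   chkconfig --del network
#   systemctl enable systemd-networkd
#   systemctl enable systemd-networkd-wait-online.service
# and the drop-in would be installed under
#   /etc/systemd/system/myservice.service.d/
```

Depending on network-online.target instead of naming the wait-online service directly is the more generic form and should work equally well once the wait-online service is enabled.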

In another unfortunate thing, there's no syntax checker for networkd files and it is somewhat hard to get warning messages. networkd will log complaints to the systemd journal, but it won't print them out on the console during boot or anything (at least not that I saw). However I believe that you can start or restart it while the system is live and then see if things complain.

(Why yes I did make a mistake the first time around. It turns out that the Label= setting in the [Address] section of .network files is not for a description of what the address is and does not like 'labels' that have spaces or other funny games in them.)
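For checking a live system, something along these lines should do. The helper name is made up; the journalctl options used are the standard ones for selecting a unit, the current boot, and a minimum priority.

```shell
#!/bin/sh
# Hypothetical helper: show what networkd logged during the current boot.
# Extra arguments are passed through to journalctl.
networkd_log() {
    journalctl --no-pager -b -u systemd-networkd "$@"
}

# Usage, on a live system (as root):
#   systemctl restart systemd-networkd
#   networkd_log -p warning     # only warnings and worse
```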

On the whole, systemd-networkd doesn't cover all of the cases but then neither did Fedora ifcfg-* files. I was able to transform all of my rather complex ifcfg-* setup into networkd control files with relatively little effort and hassle and the result came very close to working the first time. My networkd config files have a few more lines than my ifcfg-* files, but on the other hand I feel that I fully understand my networkd files and will in the future even after my current exposure to them fades.

(My ifcfg-* files also contain a certain amount of black magic and superstition, which I'm happy to not be carrying forward, and at least some settings that turn out to be mistakes now that I've actually looked them up.)

SystemdNetworkdSetup written at 00:43:05

2015-01-21

Why I'm switching to systemd's networkd stuff for my networking

Today I gave in to temptation and switched my Fedora 21 office workstation from doing networking through Fedora's old /etc/rc.d/init.d/network init script and its /etc/sysconfig/network-scripts/ifcfg-* system to using systemd-networkd. Before I write about what you have to set up to do this, I want to ramble a bit about why I even thought about it, much less went ahead.

The proximate cause is that I was hoping to get a faster system boot. At some point in the past few Fedora versions, bringing up my machine's networking through the network init script became the single slowest part of booting by a large margin, taking on the order of 20 to 30 seconds (and stalling a number of downstream startup jobs). I had no idea just what was taking so long, but I hoped that by switching to something else I could improve the situation.

The deeper cause is that Fedora's old network init script system is a serious mess. All of the work is done by a massive set of intricate shell scripts that use relatively undocumented environment variables set in ifcfg-* files (and the naming of the files themselves). Given the pile of scripts involved, it's absolutely no surprise to me that it takes forever to grind through processing all of my setup. In general the whole thing has all of the baroque charm of the evolved forms of System V init; the best thing I can say about it is that it generally works and you can build relatively sophisticated static setups with it.

(While there is some documentation for what variables can be set hiding in /usr/share/doc/initscripts/sysconfig.txt, it's not complete and for some things you get to decode the shell scripts yourself.)

What systemd's networkd stuff brings to the table for this is the same thing that systemd brings to the table relative to SysV init scripts: you have a well documented way of specifying what you want, which is then directly handled instead of being run through many, many layers of shell scripts. As an additional benefit it gets handled faster and perhaps better.

(I firmly believe that a mess of fragile shell scripts that source your ifcfg-* files and do magic things is not the right architecture. Robust handling of configuration files requires real parsing and so on, not shell script hackery. I don't really care who takes care of this (I would be just as happy with a completely separate system) and I will say straight up that systemd-networkd is not my favorite implementation of this idea and suffers from various flaws. But I like it more than the other options.)

In theory NetworkManager might fill this ecological niche already. In practice NetworkManager has never felt like something that was oriented towards my environment, instead feeling like it targeted machines and people who were going to do all of this through GUIs, and I've run into some issues with it. In particular I'm pretty sure that I'd struggle quite a bit to find documentation on how to set up a NM configuration (from the command line or in files) that duplicates my current network setup; with systemd, it was all in the manual pages. There is a serious (re)assurance value from seeing what you want to configure be clearly documented.

(My longer range reason for liking systemd's move here is that it may bring more uniformity to how you configure networking setups across various Linux flavours.)

SystemdNetworkdWhy written at 02:08:42

2015-01-16

Using systemd-run to limit something's RAM consumption on the fly

A year ago I wrote about using cgroups to limit something's RAM consumption, in order to limit the resources that building Firefox with make could use. At the time my approach with an explicitly configured cgroup and the direct use of cgexec was the only way to do it on my machines; although systemd has facilities to do this in general, my version could not do this for ad hoc user-run programs. Well, I've upgraded to Fedora 21 and that's now changed, so here's a quick guide to doing it the systemd way.

The core command is systemd-run, which we use to start a command with various limits set. The basic command is:

systemd-run --user --scope -p LIM1=VAL1 -p LIM2=VAL2 [...] CMD ARG [...]

The --user makes things run as ourselves with no special privileges, and is necessary to get things to run. The --scope basically means 'run this as a subcommand', although systemd considers it a named object while it's running. Systemd-run will make up a name for it (and report the name when it starts your command), or you can use --unit NAME to give it your own name.

The limits you can set are covered in systemd.resource-control. Since systemd is just using cgroups, the limits you can set up are just the cgroup limits (and the documentation will tell you exactly what the mapping is, if you need it). Conveniently, systemd-run allows you to specify memory limits in GB (or MB), not just bytes. The specific limits I set up in the original entry give us a final command of:

systemd-run --user --scope -p MemoryLimit=3G -p CPUShares=512 -p BlockIOWeight=500 make

(Here I'm once again running make as my example command.)

You can inspect the parameters of your new scope with 'systemctl show --user <scope>', and change them on the fly with 'systemctl set-property --user <scope> LIM=VAL'. I'll leave potential uses of this up to your imagination. systemd-cgls can be used to show all of the scopes and find any particular one that's running this way (and show its processes).

(It would be nice if systemd-cgtop gave you a nice rundown of what resources were getting used by your confined scope, but as far as I can tell it doesn't. Maybe I'm missing a magic trick here.)

Now, there's a subtle semantic difference between what we're doing here and what I did in the original entry. With cgexec, everything that ran in our confine cgroup shared the same limit even if they were started completely separately. With systemd-run, separately started commands have separate limits; if you start two makes in parallel, each of them can use 3 GB of RAM. I'm not sure yet how you fix this in the official systemd way, but I think it involves defining a slice and then attaching our scopes to it.

(On the other hand, this separation of limits for separate commands may be something you consider a feature.)
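An untested sketch of what the slice approach might look like follows. It assumes that a user-level slice unit accepts the same resource-control settings in its [Slice] section and that scopes attached to it share those limits; the name confine.slice simply reuses the name of the old cgroup.

```shell
#!/bin/sh
# Untested sketch: a shared slice that carries the limits, written to a
# scratch directory. 'confine.slice' is an assumed name.
outdir=./slice-sketch
mkdir -p "$outdir"

cat > "$outdir/confine.slice" <<'EOF'
[Unit]
Description=Shared limits for confined builds

[Slice]
MemoryLimit=3G
CPUShares=512
BlockIOWeight=500
EOF

# After installing the file as ~/.config/systemd/user/confine.slice and
# running 'systemctl --user daemon-reload', scopes could be attached with:
#   systemd-run --user --scope --slice=confine.slice make
```

If this works as hoped, two makes started this way would compete for the same 3 GB rather than getting 3 GB each.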

Sidebar: systemd-run versus cgexec et al

In Fedora 20 and Fedora 21, cgexec works okay for me but I found that systemd would periodically clear out my custom confine cgroup and I'd have to do 'systemctl restart cgconfig' to recreate it (generally anything that caused systemd to reload itself would do this, including yum package updates that poked systemd). Now that the Fedora 21 version of systemd-run supports -p, using it and doing things the systemd way is just easier.

(I wrap the entire invocation up in a script, of course.)
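Such a wrapper could look roughly like this. It is a sketch, not the actual script: the function name and the CONFINE_MEM and CONFINE_DRYRUN variables are hypothetical knobs added for illustration, while the systemd-run options are the ones from the command above.

```shell
#!/bin/sh
# Sketch of a wrapper around the systemd-run invocation shown earlier.
# 'confine', CONFINE_MEM and CONFINE_DRYRUN are hypothetical names.
confine() {
    mem="${CONFINE_MEM:-3G}"
    # Prepend the systemd-run invocation to the caller's command line.
    set -- systemd-run --user --scope \
        -p "MemoryLimit=$mem" -p CPUShares=512 -p BlockIOWeight=500 "$@"
    if [ -n "${CONFINE_DRYRUN:-}" ]; then
        printf '%s\n' "$*"      # only show what would be run
    else
        "$@"
    fi
}

# Usage:
#   confine make
#   CONFINE_MEM=2G confine make -j4
#   CONFINE_DRYRUN=1 confine make     # print the command instead of running it
```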

SystemdForMemoryLimiting written at 02:00:50

2015-01-05

Today on Linux, ZFS is your only real choice for an advanced filesystem

Yesterday I wrote about what I consider advanced filesystems to be in general, namely filesystems with the minimum feature of checksums so you know when your data has been damaged and ideally with some ability to use redundancy to repair from damage. As far as I know, today on Linux there are only two filesystems that are advanced in this way: btrfs and ZFS, via ZFS on Linux.

(If you don't care about disk checksums, you have lots of choice among perfectly good filesystems. I would just run ext4 unless you had a good reason to know that eg XFS was a better choice in your particular environment; it's what I do and what most people do, so ext4 gets a lot of exercise and attention.)

In theory, you might choose either and you might even default to btrfs as the in-kernel solution. In practice, I believe that you only have one real choice today and that choice is ZFS on Linux. This is not because ZFS might be better than btrfs on a technical level (although I believe it is), it is simply because people keep having problems with btrfs (the latest example I was exposed to was this one). Far too many things I read about btrfs wind up saying stuff like 'it's been stable for a few months since the last problem' or 'I had a problem recently but it wasn't too bad' or the like. Btrfs does not appear to be stable yet and it doesn't appear likely to be stable any time soon; everything I wrote in 2013 about why not to consider btrfs yet still applies.

Btrfs will hopefully someday be one of the filesystems of the future. But it is not the filesystem of today unless you feel very daring. If you want an advanced filesystem today on Linux, your only real option is ZFS on Linux.

Now, ZoL is not perfect. People do still report problems with it from time to time, including kernel memory issues, and you will want to test it in your environment to make sure it works okay. But from all the reports I've read there are plenty of people running it in production in various ways (in more demanding circumstances than mine) and it isn't blowing up in their faces.

In short, ZFS on Linux is something that you can reasonably consider today, and in practice things will probably work fine. I think that considering btrfs today is demonstrably relatively crazy.

(I'm aware that Facebook is using btrfs internally to some degree. Facebook also has Chris Mason working for them to find and fix their btrfs problems and likely a team that immediately packages those changes up into custom Facebook kernels. See also.)

ZFSOnLinuxvsBtrfsToday written at 02:25:21


