2015-01-25
The long term problem with ZFS on Linux is its license
Since I've recently praised ZFS on Linux as your only real choice today for an advanced filesystem, I need to bring up the long term downside because, awkwardly, I do believe that btrfs is probably going to be the best pragmatic option in the long term and is going to see wider adoption once it works reliably.
The core of the problem is ZFS's license, which I've written about before. What I didn't cover back then, because I didn't know enough at the time, was the full effect on ZoL of not being included in distributions. The big effect is that it will probably never be easy or supported to make your root filesystem a ZFS pool. Unless distributions restructure their installers (and they have no reason to do so), a ZFS root filesystem needs first class support in the installer, and adding this will almost certainly be rather difficult (both politically and otherwise). This means that no installer-created filesystem can be a ZFS one, and the root filesystem has to be created in the installer.
(Okay, you can shuffle around your root filesystem after the basic install is done. But that's a big pain.)
In turn this means that ZFS on Linux is probably always going to be a thing for experts. To use it you need to leave disk space untouched in the installer (or add disk space later), then at least fetch the ZoL packages from an additional repository and have them auto-install on your kernel. And of course you have to live with a certain lack of integration in all of the bits (especially if you go out of your way to use a ZFS root filesystem).
(And as I've seen there are issues with mixing ZFS and non-ZFS filesystems. I suspect that these issues will turn out to be relatively difficult to fix, if they can be at all. Certainly things seem much more likely to work well if all of your filesystems are ZFS filesystems.)
PS: Note that in general having non-GPLv2, non-bundled kernel modules is not an obstacle to widespread adoption if people want what you have to offer. A large number of people have installed binary modules for their graphics cards, for one glaring example. But I don't think that fetching these modules has been integrated into installers despite how popular they are.
(Also, I may be wrong here. If ZFS becomes sufficiently popular, distributions might at least make it easy for people to make third party augmented installers that have support for ZFS. Note that ZFS support in an installer isn't as simple as the choice of another filesystem; ZFS pools are set up quite differently from normal filesystems and good ZFS root pool support has to override things like setup for software RAID mirroring.)
2015-01-23
A problem with gnome-terminal in Fedora 21, and tracking it down
Today I discovered that Fedora 21 subtly broke some part of my environment
to the extent that gnome-terminal refuses to start. More than that, it
refuses to start with a completely obscure error message:
; gnome-terminal Error constructing proxy for org.gnome.Terminal:/org/gnome/Terminal/Factory0: Error calling StartServiceByName for org.gnome.Terminal: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.gnome.Terminal exited with status 8
If you're here searching for the cause of this error message, let
me translate it: what it really means is that your session's
dbus-daemon could not start /usr/libexec/gnome-terminal-server
when gnome-terminal asked it to. In many cases, it may be because
your system's environment has not initialized $LC_CTYPE or
$LANG to some UTF-8 locale at the time that your session was
being set up (even if one of these environment variables gets set
later, by the time you're running gnome-terminal). In the modern
world, an increasing number of Gnome bits absolutely insist on
being in a UTF-8 locale and fail hard if they aren't.
Some of you may be going 'what?' here. What you suspect is correct; the modern Gnome 3 'gnome-terminal' program is basically a cover script rather than an actual terminal emulator. Instead of opening up a terminal window itself, it exists to talk over DBus to a master gnome-terminal-server process (which will theoretically get started on demand). It is the g-t-s process that is the actual terminal emulator, creates the windows, starts the shells, and all. And yes, one process handles all of your gnome-terminal windows; if that process ever hits a bug (perhaps because of something happening in one window) and dies, all of them die. Let's hope g-t-s doesn't have any serious bugs.
To find the cause of this issue, well, if I'm being honest a bunch of
this was found with an Internet search of the error message. This didn't
turn up my exact problem but it did turn up people reporting locale
problems and also a mention of gnome-terminal-server, which I hadn't
known about before. For actual testing and verification I did several
things:
- first I used strace on gnome-terminal itself, which told me nothing useful.
- I discovered that starting gnome-terminal-server by hand before running gnome-terminal made everything work.
- I used 'dbus-monitor --session' to watch DBus messages when I tried to start gnome-terminal. This didn't really tell me anything that I couldn't have seen from the error message, but it did verify that there was really a DBus message being sent.
- I found the dbus-daemon process that was handling my session DBus and used 'strace -f -p ...' on it while I ran gnome-terminal. This eventually wound up with it starting gnome-terminal-server and g-t-s exiting after writing a message to standard error. Unfortunately the default strace settings truncated the message, so I reran strace while adding '-e write=2' to completely dump all messages written to standard error. This got me the helpful error message from g-t-s: 'Non UTF-8 locale (ANSI_X3.4-1968) is not supported!'

  (If you're wondering if dbus-daemon sends standard error from either itself or processes that it starts to somewhere useful, ha ha no, sorry, we're all out of luck. As far as I can tell it specifically sends standard error to /dev/null.)

- I dumped the environment of the dbus-daemon process with 'tr '\0' '\n' </proc/<PID>/environ | less' and inspected what environment variables it had set. This showed that it had been started without my usual $LC_CTYPE setting (cf).
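As a side note, the tr trick in the last step generalizes to inspecting any process's environment. A minimal runnable sketch, inspecting our own shell's environment as a stand-in (substitute the dbus-daemon PID in real use; the locale variable names grepped for are the usual suspects):

```shell
# Dump a process's environment one variable per line by translating the
# NUL separators in /proc/<PID>/environ into newlines. We use our own
# shell ($$) here purely as an example target; point pid at the
# dbus-daemon process in real debugging.
pid=$$
tr '\0' '\n' </proc/"$pid"/environ | grep -E '^(LANG|LC_ALL|LC_CTYPE)=' \
  || echo "no locale variables set for PID $pid"
```

If the grep prints nothing and you get the fallback message, you've found the same problem this entry describes.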
With this in hand I could manually reproduce the problem by trying
to start gnome-terminal-server with $LC_CTYPE unset, and then I
could fix up my X startup scripts to
set $LC_CTYPE before they ran dbus-launch.
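For concreteness, the fix amounts to something like the following near the top of an X startup script; the exact file involved and the en_US.UTF-8 value are illustrative assumptions (use whatever UTF-8 locale you actually want):

```shell
# Make sure a UTF-8 locale is set before the session dbus-daemon starts,
# so that everything it spawns (including gnome-terminal-server)
# inherits it. The specific locale here is just an example.
LC_CTYPE=en_US.UTF-8
export LC_CTYPE
```

The important part is that this happens before dbus-launch (or whatever starts your session bus) runs, since setting it later in an already-running session doesn't help the already-started dbus-daemon.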
(This entry is already long enough so I am going to skip my usual rant
about Gnome and especially Gnome 3 making problems like this very
difficult for even experienced system administrators to debug because
there are now so many opaque moving parts to even running Gnome programs
standalone, much less in a full Gnome environment. How is anyone normal
supposed to debug this when gnome-terminal can't even be bothered to
give you a useful error summary in addition to the detailed error report
from DBus?)
2015-01-22
How to set up static networking with systemd-networkd, or at least how I did
I recently switched my Fedora 21 office workstation from Fedora's
old /etc/init.d/network init script based method of network setup
to using the (relatively new) systemd network setup functionality,
for reasons that I covered yesterday. The
systemd documentation is a little bit scant and incomplete, so
in the process I accumulated some notes that I'm going to write
down.
First, I'm going to assume that you're having networkd take over
everything from the ground up, possibly including giving your
physical network devices stable names. If you were previously
doing this through udev, you'll need to comment out bits of
/etc/udev/rules.d/70-persistent-net.rules (or wherever your
system put it).
To configure your networking you need to set up two files for each
network connection. The first file will describe the underlying
device, using .link files for
physical devices and .netdev files for
VLANs, bridges, and so on. For physical links, you can use various
things to identify the device (I use just the MAC address, which
matches what I was doing in udev) and then set its name with 'Name='
in the '[Link]' section. Just to make you a bit confused, the
VLANs set up on a physical device are not configured in its .link
file.
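As an illustration, a minimal .link file along these lines might look like this (the file name and MAC address are made up; em0 matches the device name used elsewhere in this entry):

```ini
# /etc/systemd/network/10-em0.link (example; MAC address is made up)
[Match]
MACAddress=00:16:3e:12:34:56

[Link]
Name=em0
```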
The second file describes the actual networking on the device
(physical or virtual), including virtual devices associated with
it; this is done with .network files.
Again you can use various things to identify which device you want
to operate on; I used the name of the device (a [Match] section
with Name=<whatever>). Most of the setup will be done in the
[Network] section, including telling networkd what VLANs to create.
If you want IP aliases on a given interface, specify multiple
addresses. Although it's not documented, experimentally the last
address specified becomes the primary (default) address of the
interface, ie the default source address for traffic going out
that interface.
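A sketch of such a .network file (all addresses here are example values, and the em0.151 VLAN name matches the one used later in this entry):

```ini
# /etc/systemd/network/em0.network (example values)
[Match]
Name=em0

[Network]
VLAN=em0.151
# With multiple addresses, the last Address= listed becomes the
# primary (default source) address for the interface.
Address=192.0.2.10/24
Address=192.0.2.1/24
Gateway=192.0.2.254
```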
(This is unfortunately reversed from what I expected, which was that the first address specified would be the primary. Hopefully the systemd people will not change this behavior but document it, and then provide a way of specifying primary versus secondary addresses.)
If you're setting up IP aliases for an interface, it's important
to know that ifconfig will now be misleading. In the old approach,
alias interfaces got created (eg 'em0:0') and showed the alias
IP. In the networkd world those interfaces are not created and you
need to turn to 'ip addr list' in order to see your IP aliases.
Not knowing this can be very alarming, since in ifconfig it looks
like your aliases disappeared. In general you can expect networkd
to give you somewhat different ifconfig and ip output because
it does stuff somewhat differently.
For setting up VLANs, the VLAN= name in your physical device's
.network file is paired up with the [NetDev] Name= setting
in your VLAN's .netdev file. You then create another .network
file with a [Match] Name= setting of your VLAN's name to configure
the VLAN interface's IP address and so on. Unfortunately this is a
bit tedious, since your .netdev VLAN file basically exists to set
a single value (the [VLAN] Id= setting); it would be more
convenient (although less pure) if you could just put that information
into a new [VLAN] section in the .network file that specified
Name and Id together.
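Putting the two VLAN files together, a sketch with example values (note that the .netdev file also needs a Kind=vlan setting in its [NetDev] section):

```ini
# /etc/systemd/network/em0.151.netdev (example)
[NetDev]
Name=em0.151
Kind=vlan

[VLAN]
Id=151

# /etc/systemd/network/em0.151.network (example)
[Match]
Name=em0.151

[Network]
Address=198.51.100.5/24
```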
If you're uniquely specifying physical devices in .link files (eg
with a MAC address for all of them, with no wildcards) and devices
in .network files, I believe that the filenames of all of these
files are arbitrary. I chose to give my VLANs filenames of eg
'em0.151.netdev' (where em0.151 is the interface name) just in
case. As you can see, there seems to be relatively little constraint
on the interface names and I was able to match the names required
by my old Fedora ifcfg-* setup so that I didn't have to change
any of my scripts et al.
You don't need to define a lo interface; networkd will set one
up automatically and do the right thing.
Once you have everything set up in /etc/systemd/network, you need
to enable this by (in my case) 'chkconfig --del network; systemctl
enable systemd-networkd' and then rebooting. If you have systemd
.service units that want to wait for networking to be up, you
also want to enable the systemd-networkd-wait-online.service unit,
which does what it says in its manpage,
and then make your units depend on it in the usual way. Note that
this is not quite the same as setting your SysV init script ordering
so that your init scripts came after network, since this service
waits for at least one interface to be plugged in to something
(unfortunately there's no option to override this). While systemd
still creates the 'sys-subsystem-net-devices-<name>.device'
pseudo-devices, they will now appear faster and with less configured
than they did with the old init scripts.
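The 'usual way' of making a unit depend on the wait-online service is a stanza along these lines in your own .service unit or a drop-in for it (the need for an explicit Wants= as well as After= depends on your setup):

```ini
# Stanza for your own .service unit (or a drop-in), making it wait
# for networking via systemd-networkd-wait-online.service.
[Unit]
Wants=systemd-networkd-wait-online.service
After=systemd-networkd-wait-online.service
```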
(I used to wait for the appearance of the em0.151 device as a
sign that the underlying em0 device had been fully configured
with IP addresses attached and so on. This is no longer the case
in the networkd world, so this hack broke on me.)
In another unfortunate thing, there's no syntax checker for networkd files and it is somewhat hard to get warning messages. networkd will log complaints to the systemd journal, but it won't print them out on the console during boot or anything (at least not that I saw). However I believe that you can start or restart it while the system is live and then see if things complain.
(Why yes I did make a mistake the first time around. It turns out
that the Label= setting in the [Address] section of .network
files is not for a description of what the address is and does not
like 'labels' that have spaces or other funny games in them.)
On the whole, systemd-networkd doesn't cover all of the cases but
then neither did Fedora ifcfg-* files. I was able to transform
all of my rather complex ifcfg-* setup into networkd control files
with relatively little effort and hassle and the result came very
close to working the first time. My networkd config files have a
few more lines than my ifcfg-* files, but on the other hand I
feel that I fully understand my networkd files and will in the
future even after my current exposure to them fades.
(My ifcfg-* files also contain a certain amount of black magic
and superstition, which I'm happy to not be carrying forward, and
at least some settings that turn out to be mistakes now that I've
actually looked them up.)
2015-01-21
Why I'm switching to systemd's networkd stuff for my networking
Today I gave in to temptation and switched
my Fedora 21 office workstation from doing networking through
Fedora's old /etc/rc.d/init.d/network init script and its
/etc/sysconfig/network-scripts/ifcfg-* system to using systemd-networkd.
Before I write about what you have to set up to do this, I want to
ramble a bit about why I even thought about it, much less went
ahead.
The proximate cause is that I was hoping to get a faster system
boot. At some point in the past few Fedora versions, bringing up
my machine's networking through the network init script became
the single slowest part of booting by a large margin, taking on the
order of 20 to 30 seconds (and stalling a number of downstream
startup jobs). I had no idea just what was taking so long, but I
hoped that by switching to something else I could improve the
situation.
The deeper cause is that Fedora's old network init script system
is a serious mess. All of the work is done by a massive set of
intricate shell scripts that use relatively undocumented environment
variables set in ifcfg-* files (and the naming of the files
themselves). Given the pile of scripts involved, it's absolutely
no surprise to me that it takes forever to grind through processing
all of my setup. In general the whole thing has all of the baroque
charm of the evolved forms of System V init; the best thing I can
say about it is that it generally works and you can build relatively
sophisticated static setups with it.
(While there is some documentation for what variables can be set
hiding in /usr/share/doc/initscripts/sysconfig.txt, it's not
complete and for some things you get to decode the shell scripts
yourself.)
What systemd's networkd stuff brings to the table for this is the same thing that systemd brings to the table relative to SysV init scripts: you have a well documented way of specifying what you want, which is then directly handled instead of being run through many, many layers of shell scripts. As an additional benefit it gets handled faster and perhaps better.
(I firmly believe that a mess of fragile shell scripts that source
your ifcfg-* files and do magic things is not the right architecture.
Robust handling of configuration files requires real parsing and
so on, not shell script hackery. I don't really care who takes care
of this (I would be just as happy with a completely separate system)
and I will say straight up that systemd-networkd is not my favorite
implementation of this idea and suffers from various flaws. But I
like it more than the other options.)
In theory NetworkManager might fill this ecological niche already. In practice NetworkManager has never felt like something that was oriented towards my environment, instead feeling like it targeted machines and people who were going to do all of this through GUIs, and I've run into some issues with it. In particular I'm pretty sure that I'd struggle quite a bit to find documentation on how to set up a NM configuration (from the command line or in files) that duplicates my current network setup; with systemd, it was all in the manual pages. There is a serious (re)assurance value from seeing what you want to configure be clearly documented.
(My longer range reason for liking systemd's move here is that it may bring more uniformity to how you configure networking setups across various Linux flavours.)
2015-01-16
Using systemd-run to limit something's RAM consumption on the fly
A year ago I wrote about using cgroups to limit something's
RAM consumption, for limiting the resources
that make'ing Firefox could use when I did it. At the time my
approach with an explicitly configured cgroup and the direct use of
cgexec was the only way to do it on my machines; although systemd has facilities
to do this in general, my version could not do this for ad hoc
user-run programs. Well, I've upgraded to Fedora 21 and that's now
changed, so here's a quick guide to doing it the systemd way.
The core command is systemd-run, which
we use to start a command with various limits set. The basic command
is:
systemd-run --user --scope -p LIM1=VAL1 -p LIM2=VAL2 [...] CMD ARG [...]
The --user makes things run as ourselves with no special privileges,
and is necessary to get things to run. The --scope basically
means 'run this as a subcommand', although systemd considers it a
named object while it's running. Systemd-run will make up a name
for it (and report the name when it starts your command), or you
can use --unit NAME to give it your own name.
The limits you can set are covered in systemd.resource-control. Since systemd is just using cgroups, the limits you can set up are just the cgroup limits (and the documentation will tell you exactly what the mapping is, if you need it). Conveniently, systemd-run allows you to specify memory limits in GB (or MB), not just bytes. The specific limits I set up in the original entry give us a final command of:
systemd-run --user --scope -p MemoryLimit=3G -p CPUShares=512 -p BlockIOWeight=500 make
(Here I'm once again running make as my example command.)
You can inspect the parameters of your new scope with 'systemctl
show --user <scope>', and change them on the fly with 'systemctl
set-property --user <scope> LIM=VAL'. I'll leave potential uses
of this up to your imagination. systemd-cgls can be used to show
all of the scopes and find any particular one that's running this
way (and show its processes).
(It would be nice if systemd-cgtop gave you a nice rundown of what
resources were getting used by your confined scope, but as far as I can
tell it doesn't. Maybe I'm missing a magic trick here.)
Now, there's a subtle semantic difference between what we're doing
here and what I did in the original entry. With cgexec,
everything that ran in our confine cgroup shared the same limit
even if they were started completely separately. With systemd-run,
separately started commands have separate limits; if you start two
makes in parallel, each of them can use 3 GB of RAM. I'm not sure
yet how you fix this in the official systemd way, but I think it
involves defining a slice
and then attaching our scopes to it.
(On the other hand, this separation of limits for separate commands may be something you consider a feature.)
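A guess at what the slice approach might look like; the slice name 'confine' (echoing the old cgroup) and exactly how --slice interacts with --user scopes are assumptions I haven't verified:

```ini
# /etc/systemd/system/confine.slice (hypothetical)
[Unit]
Description=Shared resource limits for ad hoc builds

[Slice]
MemoryLimit=3G
CPUShares=512
BlockIOWeight=500
```

Scopes would then presumably be attached with something like 'systemd-run --user --scope --slice=confine make', so that two parallel makes share the one 3 GB limit instead of getting 3 GB each.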
Sidebar: systemd-run versus cgexec et al
In Fedora 20 and Fedora 21, cgexec works okay for me but I found
that systemd would periodically clear out my custom confine cgroup
and I'd have to do 'systemctl restart cgconfig' to recreate it
(generally anything that caused systemd to reload itself would do
this, including yum package updates that poked systemd). Now that
the Fedora 21 version of systemd-run supports -p, using it and
doing things the systemd way is just easier.
(I wrap the entire invocation up in a script, of course.)
2015-01-05
Today on Linux, ZFS is your only real choice for an advanced filesystem
Yesterday I wrote about what I consider advanced filesystems in general, namely filesystems with the minimum feature of checksums so you know when your data has been damaged, and ideally with some ability to use redundancy to repair that damage. As far as I know, today on Linux there are only two filesystems that are advanced in this way: btrfs and ZFS, via ZFS on Linux.
(If you don't care about disk checksums, you have lots of choice among perfectly good filesystems. I would just run ext4 unless you have a good reason to know that eg XFS is a better choice in your particular environment; it's what I do and what most people do, so ext4 gets a lot of exercise and attention.)
In theory, you might choose either and you might even default to btrfs as the in-kernel solution. In practice, I believe that you only have one real choice today and that choice is ZFS on Linux. This is not because ZFS might be better than btrfs on a technical level (although I believe it is), it is simply because people keep having problems with btrfs (the latest example I was exposed to was this one). Far too many things I read about btrfs wind up saying stuff like 'it's been stable for a few months since the last problem' or 'I had a problem recently but it wasn't too bad' or the like. Btrfs does not appear to be stable yet and it doesn't appear likely to be stable any time soon; everything I wrote in 2013 about why not to consider btrfs yet still applies.
Btrfs will hopefully someday be one of the filesystems of the future. But it is not the filesystem of today unless you feel very daring. If you want an advanced filesystem today on Linux, your only real option is ZFS on Linux.
Now, ZoL is not perfect. People do still report problems with it from time to time, including kernel memory issues, and you will want to test it in your environment to make sure it works okay. But from all the reports I've read there are plenty of people running it in production in various ways (in more demanding circumstances than mine) and it isn't blowing up in their faces.
In short, ZFS on Linux is something that you can reasonably consider today, and in practice things will probably work fine. I think that considering btrfs today is demonstrably relatively crazy.
(I'm aware that Facebook is using btrfs internally to some degree. Facebook also has Chris Mason working for them to find and fix their btrfs problems and likely a team that immediately packages those changes up into custom Facebook kernels. See also.)