2014-07-31
The temptation to rebuild my office machine with its data in ZFS on Linux
A while back, I added a scratch drive to my office workstation to have more space for relatively unimportant data, primarily ISOs for virtual machine installs and low priority virtual machine images and extra disks (which I used to test, eg, virtual fileservers and iSCSI backends). About a month ago I gave in to temptation and rebuilt that disk space as a ZFS pool using ZFS on Linux, fundamentally because I've come to really like ZFS in general and I really wanted to try out ZFS on Linux for something real. The whole experience has gone well so far; it wasn't particularly difficult to build DKMS based RPMs from the development repositories, everything has worked fine, and even upgrading kernels has been completely transparent to date.
(The advantage of building for DKMS is that the ZFS modules get automatically rebuilt when I install Fedora kernel upgrades.)
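For completeness, the build itself is basically the usual autotools dance done in the ZoL development trees; here is a rough sketch, where the make target names are taken from the ZoL build instructions of the time, so check them against your own checkout:
# sketch: building DKMS RPMs from a ZoL development checkout; the
# separate spl tree currently needs the same treatment before zfs
./autogen.sh
./configure
make rpm-utils rpm-dkms
# ('rpm-utils' and 'rpm-dkms' are the target names as I understand them;
# verify against the build documentation in your checkout)
# install the resulting RPMs with yum; dkms then rebuilds the modules
# for each new Fedora kernel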
All of this smooth sailing has been steadily ratcheting up the temptation to go whole hog on ZFS on Linux by rebuilding my machine to have all of my user data in a ZFS pool instead of in the current 'filesystems on LVM on software RAID-1' setup (I'd leave the root filesystem and swap as they are now). One of the reasons this is extra tempting is that I actually have an easy path to it because my office workstation only has two main disks (and can easily fit another two temporarily).
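Mechanically the end state is easy enough to describe; here is a minimal sketch of the sort of pool I have in mind, with entirely hypothetical disk and dataset names standing in for the real ones:
# hypothetical: a mirrored pool for user data on the two new disks
zpool create -O atime=off maindata mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2
zfs create maindata/homes
zfs create maindata/vmware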
I've been thinking about this for a while and so I've come up with a bunch of reasons why this is not quite as crazy as it sounds:
- given that we have plenty of spare disks and I can put an extra
two in my case temporarily, moving to ZFS is basically only a
matter of copying a lot of data to new disks. It would be tedious
but not particularly adventurous.
- I have automated daily backups of almost all of my machine (and
I could make that all of my machine) and I don't keep anything
really crucial on it, like say my email (all of that lives on our
servers). If ZFS totally implodes and eats my data on the machine,
I can restore from them.
- I can simply save my current pair of disks to have a fairly fast
and easy way to migrate back from ZFS. If it doesn't work out, I
can just stick them back in the machine and basically rsync my
changes back to my original filesystems (see the sketch after this
list). Again, tedious but not adventurous.
- My machine has 16GB of RAM and I mostly don't use much of it,
which means that I think I'm relatively unlikely to run into what
I consider the major potential problem with ZoL. And if I'm wrong,
well, migrating back will merely be tedious instead of actively
painful.
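To be concrete about that rsync step, what I have in mind is roughly the following, with hypothetical mount points for the ZFS datasets and the remounted old filesystems; this is a sketch, not a tested recipe:
# copy post-migration changes from the ZFS version of my data back
# to the old filesystem from the saved pair of disks
rsync -aHX --delete /zfs/home/ /oldfs/home/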
Set against all of this 'not as crazy as it looks' is that this would be daring and there wouldn't be any greater reason beyond just having my data in ZFS on my local machine. We're extremely unlikely to have any use for ZoL in production for, well, a long time into the future.
(And I probably wouldn't be building up enough confidence with ZoL to do this to my home machine, because most of these reasons don't apply there. I have no spare data disks, no daily backups, and no space for extra drives in my case at home. On the attractive side, a ZFS L2ARC would neatly solve my SSD dilemma and my case does have room for one.)
One of the reasons that this is quite daring is the possibility that at some point I wouldn't be able to apply Fedora kernel updates or upgrade to a new Fedora version because ZoL doesn't support the newer kernels yet. Since I also use VMWare I already sort of live with this possibility today, but VMWare would be a lot easier to do without for some noticeable amount of time than, well, all of my data.
(The real answer is that that sort of thing would force me to migrate
back to my current setup, either with the rsync approach or just
by throwing another two new disks in and copying things from scratch.)
Still, the temptation is there and it keeps getting stronger the
longer that my current ZoL pool is trouble free (and every time I
run 'zpool scrub' on it and admire the assurance of no errors
being detected).
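(For anyone who hasn't had the pleasure, the ritual is simply the following, with 'tank' standing in for whatever your pool is called.)
zpool scrub tank
# and then some time later, go admire the 'No known data errors':
zpool status tank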
(Having written this, I suspect that I'll give in in a week or two.)
PS: I don't think I currently have any software that requires filesystem features that ZoL doesn't currently implement, like POSIX ACLs. If I'm wrong about that a move to ZFS would be an immediate failure. I admit that I deliberately put a VMWare virtual machine into my current ZFS pool because I half expected VMWare to choke on ZFS (it hasn't so far).
PPS: I have no particular interest in going all in on ZoL and putting
my root filesystem into ZFS, for various reasons. I like
having the root filesystem in a really simple setup that needs as
few moving parts as possible, so today it's not even in LVM, just
on a software RAID-1 partition (the same is true for swap). Flexible
space management is really overkill because the size of modern disks
means that it's easy to give the root filesystem way more space
than it will ever need. I do crazy things like save a copy of every
RPM I've ever installed or updated through yum and my work root
filesystem (/var included) is still under 60 GB.
2014-07-23
One of SELinux's important limits
People occasionally push SELinux as the cure for security problems and look down on people who routinely disable it (as we do). I have some previously expressed views on this general attitude, but what I feel like pointing out today is that SELinux's security has some important intrinsic limits. One big one is that SELinux only acts at process boundaries.
By its nature, SELinux exists to stop a process (or a collection of them) from doing 'bad things' to the rest of the system and to the outside environment. But there are any number of dangerous exploits that do not cross a process's boundaries this way; the most infamous recent one is Heartbleed. SELinux can do nothing to stop these exploits because they happen entirely inside the process, in spheres fully outside its domain. SELinux can only act if the exploit seeks to exfiltrate data (or influence the outside world) through some new channel that the process does not normally use, and in many cases the exploit doesn't need to do that (and often doesn't bother).
Or in short, SELinux cannot stop your web server or your web browser from getting compromised, only from doing new stuff afterwards. Sending all of the secrets that your browser or server already has access to off to someone in the outside world? There's nothing SELinux can do about that (assuming that the attacker is competent). This is a large and damaging territory that SELinux doesn't help with.
(Yes, yes, privilege separation. There are a number of ways in which this is the mathematical security answer instead of the real one, including that most network related programs today are not privilege separated. Chrome exploits also have demonstrated that privilege separation is very hard to make leak-proof.)
2014-07-14
Unmounting recoverable stale NFS mounts on Linux
Suppose that you have NFS mounts go stale on your Linux clients by accident; perhaps you have disabled sharing of some filesystem on the fileserver without quite unmounting it on all the clients first. Now you try to unmount them on the clients and you get the cheerful error message:
# umount /cs/dtr
/cs/dtr was not found in /proc/mounts
You have two problems here. The first problem is in umount.nfs
and it is producing the error message you see here. This error
message happens because at least some versions of the kernel helpfully
change the /proc/mounts output for an NFS mount that they have
detected as stale. Instead of the normal output, you get:
fs8:/cs/4/dtr /cs/dtr\040(deleted) nfs rw,nosuid,....
(Instead of plain '/cs/dtr' as the mount point.)
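A quick way to see whether a client's kernel has marked any mounts this way is simply to look for the marker:
grep -F '(deleted)' /proc/mounts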
This rewritten mount point of course does not match what is in /etc/mtab and umount.nfs
errors out with the above error message. As far as I can tell from
our brief experience with this today, there is no way to cause the
kernel to reverse its addition of this '\040(deleted)' marker.
You can make the filesystem non-stale (by re-exporting it or the
like), have IO to the NFS mount on the client work fine, and the
kernel will still keep it there. You are screwed. To get around
this you need to build a patched version of nfs-utils (see
also). You want
to modify utils/mount/nfsumount.c; search for the error message
to find where.
(Note that compiling nfs-utils only gets you a mount.nfs binary.
This is actually the same program as umount.nfs; it checks its
name when invoked to decide what to do, so you need to get it
invoked under the right name in some way.)
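(The simplest approach I know of is a symlink or hard link to the freshly built binary, which should wind up in utils/mount; something like the following.)
cd utils/mount
ln -s mount.nfs umount.nfs
./umount.nfs /cs/dtr    # run as root, and see below for the kernel catch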
Unfortunately you're not out of the woods because as far as I can
tell many versions of the Linux kernel absolutely refuse to let you
unmount a stale NFS mountpoint. The actual umount() system calls
will fail with ESTALE even when you can persuade or patch
umount.nfs to make them. As far as I know the only way to recover
from this is to somehow make the NFS mount non-stale; at this point
a patched umount.nfs can make a umount() system call that will
succeed. Otherwise you get to reboot the NFS client.
(I have tested and both the umount() and umount2() system
calls will fail.)
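If the server is a stock Linux NFS server, 're-exporting it or the like' generally comes down to restoring the entry in /etc/exports and re-syncing the kernel's export table, roughly:
# on the server, after putting the export back in /etc/exports
exportfs -r
# then a patched umount.nfs on the client can finally unmount it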
The kernel side of this problem has apparently been fixed in 3.12
via this commit
(found via here), so on really
modern distributions such as Ubuntu 14.04 all you may need to do
is build a patched umount.nfs. It is very much an issue on older
ones such as Ubuntu 12.04 (and perhaps even CentOS 7, although maybe
this fix got backported). In the meantime try not to let your NFS
mounts become stale, or at least don't let the client kernels notice
that they are stale.
(If an NFS mount is stale on the server but nothing on the client
has noticed yet, you can still successfully unmount it without
problems. But the first df or whatever that gets an ESTALE back
from the server blows everything up.)
For additional information on this see eg Ubuntu bug 974374 or Fedora bug 980088 or this linux-nfs mailing list message and thread.
2014-07-13
Early impressions of CentOS 7
For reasons involving us being unimpressed with Ubuntu 14.04, we're building our second generation iSCSI backends on top of CentOS 7 (basically because it just came out in time). We have recently put the first couple of them into production so now seems a good time to report my early impressions of CentOS 7.
I'll start with the installation, which has impressed me in two different ways. The first is that it does RAID setup the right way: you define filesystems (or swap areas), tell the installer that you want them to be RAID-1, and it magically figures everything out and does it right. The second is that it is the first installer I've ever used that can reliably and cleanly reinstall itself over an already-installed system (and it's even easy to tell it how to do this). You would think that this would be trivial, but I've seen any number of installers explode; a common failure point in Linux installers is assembling existing RAID arrays on the disks and then failing to completely disassemble them before trying to repartition the disks. CentOS 7 has no problems, which is something that I really appreciate.
(Some installers are so bad that one set of build instructions I wrote recently started out with 'if these disks have been used before, completely blank them out with dd beforehand using a live CD'.)
Some people will react badly to the installer being a graphical one and also perhaps somewhat confusing. I find it okay but I don't think it's perfect. It is kind of nice to be able to do steps in basically whatever order works for you instead of being forced into a linear order, but on the other hand it's possible to overlook some things.
After installation, everything has been trouble free so far. While I
think CentOS 7 still uses NetworkManager, it does so far better than
Red Hat Enterprise Linux 6 did; in other words the networking
works and I don't particularly notice that it's using NetworkManager
behind the scenes. We can (and do) set things up in
/etc/sysconfig/network-scripts in the traditional manner. CentOS 7
defaults to 'consistent network device naming' but unlike Ubuntu 14.04
it works and the names are generally sane. On our hardware we get Ethernet device names of
enp1s0f0, enp1s0f1, and enp7s0; the first two are the onboard 10G-T ports
and the third is the add-on 1G card. We can live with that.
(The specific naming scheme that CentOS 7 normally uses is described in the Red Hat documentation here, which I am sad to note needs JavaScript to really see anything.)
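Setting things up 'in the traditional manner' means ifcfg files that look much like they always have; here is a hypothetical static configuration for one of the onboard ports, with made-up addresses:
# /etc/sysconfig/network-scripts/ifcfg-enp1s0f0 (hypothetical addresses)
DEVICE=enp1s0f0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
PREFIX=24
GATEWAY=192.0.2.1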
CentOS 7 uses systemd and has mostly converted things away from
/etc/init.d startup scripts. Some people may have an explosive
reaction to this shift but I don't; I've been using systemd on my
Fedora systems for some time and I actually like it and think
it's a pretty good init system (see also the second
sidebar here). Everything seems to
work in the usual systemd way and I didn't have any particular
problems adding, eg, a serial getty. I did quite appreciate that
systemd automatically activated a serial getty based on a serial
console being configured in the kernel command line.
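(For the record, doing it by hand is just the normal systemd template unit dance; ttyS1 here is a hypothetical port.)
systemctl enable serial-getty@ttyS1.service
systemctl start serial-getty@ttyS1.service
# the automatic case comes from something like console=ttyS0,115200 on
# the kernel command line, which makes systemd start serial-getty@ttyS0
# on its own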
Overall I guess the good news is that I don't have anything much to say because stuff just works and I haven't run into any unpleasant surprises. The one thing that stands out is how nice the installer is.
2014-07-11
You want to turn console blanking off on your Linux servers
Let's start with the tweets:
@thatcks: Everyone should strongly consider adding 'consoleblank=0' to the kernel command line on your Linux servers. #sysadmin
@thatcks: The Linux kernel blanking the console screen is both unnecessary and dangerous on modern servers and modern setups. You want it off.
By default if you leave a Linux machine sitting idle at a text console, the kernel will blank the display after a while (I believe it's normally ten minutes of inactivity); Linux has probably done this since the very early days. Back in the time of CRT displays this made a reasonable amount of sense, because it avoided burning in the login prompt or whatever other static text was probably on the screen. Screen burn-in is not really an issue in the modern age with LCDs, and it's even less of an issue with modern servers that spend a close approximation to all of their time without a display plugged in at all.
The problem with this console blanking is that it is a kernel function and thus the kernel has to reverse it. More specifically, the kernel has to be fairly alive and responding to the keyboard in order to unblank the screen. There are plenty of ways to get a kernel so hung that it is not alive enough to do this, at which point any helpful diagnostic messages the kernel may have printed on its way down are lost, locked away behind that blank screen. We have had this happen to us more than once.
And that is why you don't want your servers to ever blank their
consoles; it's not getting you anything worthwhile and it can really
hurt you. The best way to disable it is, as I tweeted, to add
'consoleblank=0' to the kernel command line arguments.
(Some people fiddle around with 'setterm -blank 0' in various
ways but the kernel argument is more sure and easier.)
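To be specific, on a grub2 based system like CentOS 7 disabling it for good means adding consoleblank=0 to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the grub configuration; you can also check the current timeout through sysfs. A sketch:
# current blanking timeout in seconds; 0 means blanking is disabled
cat /sys/module/kernel/parameters/consoleblank
# after editing GRUB_CMDLINE_LINUX in /etc/default/grub:
grub2-mkconfig -o /boot/grub2/grub.cfg    # 'update-grub' on Ubuntu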
(I found out about 'consoleblank=0' and a bunch of additional
useful information from this stackexchange question and its answers,
when I finally decided to see if we could disable console blanking on
our new iSCSI backends. I admit that my motivation for it was rather
more petty than the reason here; a blank console can sometimes make
their KVM-over-IP Java program freak out in a really irritating way
and I was getting tired of that happening to my sessions.)
2014-07-02
Bash is letting locales destroy shell scripting (at least on Linux)
Here, let me present you something in illustrated form, on a system
where /bin/sh is Bash:
$ cat Demo
#!/bin/sh
for i in "$@"; do
    case "$i" in
        *[A-Z]*) echo "$i has upper case";;
    esac
done
$ env - LANG=en_US.UTF-8 ./Demo a b C y z
b has upper case
C has upper case
y has upper case
z has upper case
$ env - LANG=en_US.UTF-8 /bin/dash ./Demo a b C y z
C has upper case
$ env - ./Demo a b C y z # no locale
C has upper case
$
I challenge you to make sense of either part of Bash's behavior in the en_US.UTF-8 locale.
(Contrary to my initial tweet, this behavior has apparently been in Bash for some time. It's also somewhat system dependent; Bash 4.2.25 on Ubuntu 12.04 behaves this way but 4.2.45 on FreeBSD doesn't.)
There are no two ways to describe this behavior: this is braindamaged.
It is at best robot logic on Bash's part to allow [A-Z] to match
lower case characters. It is also terribly destructive to bash's
utility for shell scripting. If I cannot even count on glob
operations that are not even in a file context operating sanely,
why am I using bash to write shell scripts at all? On many systems,
this means eschewing '#!/bin/sh' entirely because (as we're seeing
here) /bin/sh can be Bash and Bash will behave this way even when
invoked as sh.
(I have to assume that not matching a as upper case is a Bash
bug but that the rest of the behavior is intended. It makes more
sense than the other way around.)
What Bash has done here is to strew land mines in the way of my
scripts working right in what is now a common environment. If I
want to continue using shell scripts I have to start trying to
defensively defeat Bash. What will do it? Today, probably setting
LC_COLLATE=C or better yet LC_ALL=C. In all of my scripts.
I might as well switch to Python or Perl even for small things;
they are clearly less likely to cause me heartburn in the future
by going crazy.
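To be concrete, a defensive version of the Demo script looks something like this; either pinning the locale or switching to POSIX character classes (which say what you actually mean) will do it:
#!/bin/sh
# force the traditional locale so that [A-Z] is an ASCII range again
LC_ALL=C; export LC_ALL
for i in "$@"; do
    case "$i" in
        # alternately, *[[:upper:]]*) matches upper case in any locale
        *[A-Z]*) echo "$i has upper case";;
    esac
done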
There's another problem with this behavior, which is that it is not
what any other POSIX-compatible shell I could find does (on Ubuntu
14.04). Dash (the normal /bin/sh on many Linuxes), mksh, ksh, and
even zsh don't match here. This means that having Bash as /bin/sh
creates a serious behavior difference, not just adds non-POSIX
features that you may accidentally (or deliberately) use in
'#!/bin/sh' scripts.
(Yes, yes, I've written about this before.
But the examples back then were vaguely sensible things for locales
to apply to. What is happening in the Demo script is very, very
far over the line. What is next, GNU grep deciding that your
'[A-Z]' should match case-independently in some locales? That's
just as justified as what Bash is doing here.)
PS: This is actually making me rethink the idea of having /bin/sh
be Bash on our Ubuntu machines, which is the case for historical
reasons. The pain of rooting out
bashisms from our scripts may be less than the pain of dealing with
Bash braindamage.
Sidebar: the bug continues
If you change the [A-Z] to [a-z] and try Demo with all upper
case letters, it will match A-Y but think Z doesn't match. This
is symmetrical in what you could consider a weird way. A quick test
suggests that all other letters besides 'a' (in the [A-Z] case)
and 'Z' (in the [a-z] case) match 'correctly', if we assume that a
case independent match is correct in the first place.
Because I was masochistic tonight this has been filed as GNU Bash bug 108609 (tested against bash git tip), although savannah.gnu.org may have eaten the actual text I put in (it sent the text to me in email but I can't read the text through the web). My bug is primarily to report the missing 'a' and 'Z' and only lightly touches on the whole craziness of [A-Z] matching any lower case characters at all, so I encourage other people to file their own bugs about that. I have opted for a low-stress approach myself since I don't expect my bug report to go anywhere.