2016-05-26
Why SELinux is inherently complex
The root of SELinux's problems is that SELinux is a complex security mechanism that is hard to get right. Unfortunately this complexity is not (just) an implementation artifact of the current SELinux code; instead, it's inherent in what SELinux is trying to do.
What SELinux is trying to do is understand 'valid' program behavior and confine programs to it at a fine grained level in an environment where all of the following are true:
- Programs are large, complex, and can legitimately do many things
(this is especially so because we are really talking about entire
assemblages of programs, not just single binaries). After all,
SELinux is intended to secure things like web servers, database
engines, and mailers, all of which have huge amounts of functionality.
- Programs legitimately access things that are spread all over the
system and intermingled tightly with things that they should not
be able to touch. This requires fine-grained selectivity about
what programs can and cannot access.
- Programs use and rely on outside libraries that can have unpredictable, opaque, and undocumented internal behavior, including about what resources those libraries access. Since we're trying to confine all of the program's observed behavior, this necessarily includes the behavior of the libraries that it uses.
All of this means that thoroughly understanding program behavior is very hard, yet such a thorough understanding is the core prerequisite for a SELinux policy that is both correct and secure. Even once you've got a thorough understanding, the issue with libraries means that it can be kicked out from underneath you by a library update.
(Such insufficient understanding of program behavior is almost certainly the root cause of a great many of the SELinux issues that got fixed here.)
This complexity is inherent in trying to understand program behavior
in the unconfined environment of a general Unix system, where
programs can touch devices in /dev, configuration files under
/etc, run code from libraries in /lib, run helper programs from
/usr/bin, poke around in files in various places in /var/log
and /var, maybe read things from /usr/lib or /usr/share, make
network calls to various services, and so on. All the while they're
not supposed to be able to look at many things from those places
or do many 'wrong' operations. Your program that does DNS lookups
likely needs to be able to make TCP connections to port 53, but you
probably don't want it to be able to make TCP connections to port
25 (or 22). And maybe it needs to make some additional connections
to local services, depending on what NSS libraries got loaded by
glibc when it parsed /etc/nsswitch.conf.
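In SELinux policy terms, that sort of fine-grained distinction comes down to allow rules that name specific port types. As a rough sketch (the mydns_t domain name is hypothetical; dns_port_t, smtp_port_t, and the name_connect permission are from the standard reference policy):

```
# Let the hypothetical mydns_t domain make TCP connections to
# port 53, which the reference policy labels dns_port_t.
allow mydns_t dns_port_t:tcp_socket name_connect;
# Nothing here grants access to smtp_port_t (25) or ssh_port_t (22),
# so those connections are denied by default.
```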
(Cryptography libraries have historically done some really creative
and crazy things on startup in the name of trying to get some
additional randomness, including reading /etc/passwd and running
ps and netstat. Yes, really (via).)
SELinux can be simple, but it requires massive reorganization of a
typical Linux system and application stack. For example, life would
be much simpler if all confined services ran inside defined directory
trees and had no access to anything outside their tree (ie everything
was basically chroot()'d or close to it); then you could write
really simple file access rules (or at least start with them).
Similar things could be done with services provided to applications
(for example, 'all logging must be done through this interface'),
requirements to explicitly document required incoming and outgoing
network traffic, and so on.
(What all of these do is make it easier to understand expected program behavior, either by limiting what programs can do to start with or by requiring them to explicitly document their behavior in order to have it work at all.)
Sidebar: the configuration change problem
The problem gets much worse when you allow system administrators to substantially change the behavior of programs in unpredictable ways by changing their configurations. There is no scalable automated way to parse program configuration files and determine what they 'should' be doing or accessing based on the configuration, so now you're back to requiring people to recreate that understanding of program behavior, or at least a fragment of it (the part that their configuration changes affected).
This generously assumes that all points where sysadmins can change program configuration come prominently marked with 'if you touch this, you need to do this to the SELinux setup'. As you can experimentally determine today, this is not the case.
2016-05-25
SELinux is beyond saving at this point
SELinux has problems. It has a complexity problem (in that it is quite complex), it has technical problems with important issues like usability and visibility, it has pragmatic problems with getting in the way, and most of all it has a social problem. At this point, I no longer believe that SELinux can be saved and become an important part of the Linux security landscape (at least if Linux remains commonly used).
The fundamental reason why SELinux is beyond saving at this point is that after something like a decade of SELinux's toxic mistake, the only people who are left in the SELinux community are the true believers, the people who believe that SELinux is not a sysadmin usability nightmare, that those who disable it are fools, and so on. A narrowing community is what naturally happens when you double down on calling other people names; if people say you are an idiot for questioning the SELinux way, well, you generally leave.
If the SELinux community was going to change its mind about these
issues, the people involved have had years of opportunities to do
so. Yet the SELinux ship sails on pretty much as it ever has. These
people are never going to consider anything close to what I once
suggested in order to change course; instead, I
confidently expect them to ride the 'SELinux is totally fine' train
all the way into the ground. I'm sure they will be shocked and upset
when something like OpenBSD's pledge() is integrated either in Linux
libraries or as a kernel security module (or both) and people start
switching to it.
(As always, real security is people, not math. A beautiful mathematical security system that people don't really use is far less useful and important than a messy, hacky one that people do use.)
(As for why I care about SELinux despite not using it and thinking it's the wrong way, see this. Also, yes, SELinux can do useful things if you work hard enough.)
2016-05-04
The better way to clear SMART disk complaints, with safety provided by ZFS
A couple of months ago I wrote about clearing SMART complaints
about one of my disks by very carefully
overwriting sectors on it, and how ZFS made this kind of safe. In
a comment, Christian Neukirchen
recommended using hdparm --write-sector to overwrite sectors with
read errors instead of the complicated dance with dd that I used
in my entry. As it happens, that disk
coughed up a hairball of smartd complaints today, so I got a
chance to go through my procedures again and the advice is spot on.
Using hdparm makes things much simpler.
So my revised steps are:
- Scrub my ZFS pool in the hopes that this will make the problem go
  away. It didn't, which means that any read errors in the partition
  for the ZFS pool are in space that ZFS shouldn't
  be using.
- Use dd to read all of the ZFS partition. I did this with
  'dd if=/dev/sdc7 of=/dev/null bs=512k conv=noerror iflag=direct'.
  This hit several bad spots, each of which produced kernel errors
  that included a line like this:

    blk_update_request: I/O error, dev sdc, sector 1748083315

- Use 'hdparm --read-sector' to verify that this is indeed the bad
  sector:

    hdparm --read-sector 1748083315 /dev/sdc

  If this is the correct sector, hdparm will report a read error and
  the kernel will log a failed SATA command. Note that this is not a
  normal disk read, as hdparm is issuing a low-level read, so you
  don't get a normal message; instead you get something like this:

    ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata3.00: irq_stat 0x40000001
    ata3.00: failed command: READ SECTOR(S) EXT
    ata3.00: cmd 24/00:01:73:a2:31/00:00:68:00:00/e0 tag 3 pio 512 in
             res 51/40:00:73:a2:31/00:00:68:00:00/00 Emask 0x9 (media error)
    [...]

  The important thing to notice here is that you don't get the sector
  reported (at least not in decoded form), so you have to rely on
  getting the sector number correct in the hdparm command instead of
  being able to cross check it against earlier kernel logs.

  (Sector 1748083315 is 0x6831a273 in hex. All the bytes are there
  in the cmd part of the message, but clearly shuffled around.)

- Use 'hdparm --write-sector' to overwrite the sector, forcing it to
  be spared out:

    hdparm --write-sector 1748083315 <magic option> /dev/sdc

  (hdparm will tell you what the hidden magic option you need is
  when you use --write-sector without it.)

- Scrub my ZFS pool again and then re-run the dd to make sure that
  I got all of the problems.
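One way to double check that you've typed the right sector number into hdparm is to convert it to hex yourself and eyeball it against the lba bytes of the logged ATA command. A little sketch of this, using the sector number from this entry's logs:

```shell
# Print the kernel's decimal sector number in hex; all of its bytes
# should appear (shuffled) in the cmd and res fields of the ATA error,
# eg 24/00:01:73:a2:31/00:00:68:00:00/e0 for 0x6831a273.
sector=1748083315
printf 'sector %d is 0x%x\n' "$sector" "$sector"
```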
I was pretty sure I'd gotten everything even before the re-scrub
and the re-dd scan, because smartd reported that there were no
more currently unreadable (pending) sectors or offline uncorrectable
sectors, both of which it had been complaining about before.
This was a lot easier and more straightforward to go through than
my previous procedure, partly because I can directly reuse the
sector numbers from the kernel error messages without problems and
partly because hdparm does exactly what I want.
There's probably a better way to scan the hard drive for read
errors than dd. I'm a little bit nervous about my 512KB block
size here potentially hiding a second bad sector that's sufficiently
close to the first, but especially with direct IO I think it's a
tradeoff between speed and thoroughness. Possibly I should explore
how well the badblocks program works here, since it's the obvious
candidate.
(These days I force dd to use direct IO when talking to disks
because that way dd does much less damage to the machine's overall
performance.)
(This is the kind of entry that I write because I just looked up my first entry for how to do it again, so clearly I'm pretty likely to wind up doing this a third time. I could just replace the drive, but at this point I don't have enough drive bay slots in my work machine's case to do this easily. Also, I'm a peculiar combination of stubborn and lazy where it comes to hardware.)
2016-05-02
How I think you set up fair share scheduling under systemd
When I started writing this entry, I was going to say that systemd automatically does fair share scheduling between users and describe the mechanisms that make that work. However, this turns out to be false as far as I can see; systemd can easily do fair share scheduling, but it doesn't do this by default.
The basic mechanics of fair share scheduling are straightforward.
If you put all of each user's processes into a separate cgroup it
happens automatically. Well. Sort of. You see,
it's not good enough to put each user into a separate cgroup; you
have to make it a CPU accounting cgroup, and a memory accounting
cgroup, and so on. Systemd normally puts all processes for a single
user under a single cgroup, which you can see in eg systemd-cgls
output and by looking at /sys/fs/cgroup/systemd/user.slice, but
by default it doesn't enable any CPU or memory or IO accounting for
them. Without those enabled, the traditional Linux (and Unix)
behavior of 'every process for itself' still applies.
(You can still use systemd-run to add your own limits here, but I'm not quite sure how this works
out.)
Now, I haven't tested the following, but from reading the documentation
it seems that what you need to do to get fair share scheduling for
users is to enable DefaultCPUAccounting and DefaultBlockIOAccounting
for all user units by creating an appropriate file in
/etc/systemd/user.conf.d, as covered in the systemd-user.conf
manpage
and the systemd.resource-control manpage.
You probably don't want to turn this on for system units, or at least
I wouldn't.
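Concretely, and with the same caveat that I haven't tested this, my reading of the manpages suggests a drop-in file along these lines (the filename is arbitrary and hypothetical; the keys are real systemd [Manager] settings):

```
# /etc/systemd/user.conf.d/accounting.conf (hypothetical name)
[Manager]
DefaultCPUAccounting=yes
DefaultBlockIOAccounting=yes
```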
I don't think there's any point in turning on DefaultMemoryAccounting.
As far as I can see there is no kernel control that limits a cgroup's
share of RAM, just the total amount of RAM it can use, so cgroups
just can't enforce a fair share scheduling of RAM the way you can
for CPU time (unless I've overlooked something here). Unfortunately,
missing fair share memory allocation definitely hurts the overall
usefulness of fair share scheduling; if you want to ensure that no
user can take an 'unfair' share of the machine, it's often just as
important to limit RAM as CPU usage.
(Having discovered this memory limitation, I suspect that we won't bother trying to enable fair share scheduling in our Ubuntu 16.04 installs.)
2016-04-25
Why you mostly don't want to do in-place Linux version upgrades
I mentioned yesterday that we don't do in-place distribution upgrades, eg to go from Ubuntu 12.04 to 14.04; instead we rebuild starting from scratch. It's my view that in-place upgrades of at least common Linux distributions are often a bad idea for a server fleet even when they're supported. I have three reasons for this, in order of increasing importance.
First, an in place upgrade generally involves more service downtime or at least instability than a server swap. In-place upgrades generally take some time (possibly in the hours range), during which things may be at least a little bit unstable as core portions of the system are swapped around (such as core shared libraries, Apache and MySQL/PostgreSQL installs, the mailer, your IMAP server, and so on). A server swap is a few minutes of downtime and you're done.
Second, it's undeniable that an in-place upgrade is a bit more risky than a server replacement. With a server replacement you can build and test the replacement in advance, and you also can revert back to the old version of the server if there are problems with the new one (which we've had to do a few times). For most Linux servers, an in place OS upgrade is a one way thing that's hard to test.
(In theory you can test it by rebuilding an exact duplicate of your current server and then running it through an in-place upgrade, but if you're going to go to that much more work why not just build a new server to start with?)
But those are relatively small reasons. The big reason to rebuild from scratch is that an OS version change means that it's time to re-evaluate whether what you were customizing on the old OS still needs to be done, if you're doing it the right way, and if you now need additional customizations because of new things on the OS. Or, for that matter, because your own environment has changed and some thing you were reflexively doing is now pointless or wrong. Sometimes this is an obvious need, such as Ubuntu's shift from Upstart in 14.04 LTS to systemd in 16.04, but often it can be more subtle than that. Do you still need that sysctl setting, that kernel module blacklist, or that bug workaround, or has the new release made it obsolete?
Again, in theory you can look into this (and prepare new configuration files for new versions of software) by building out a test server before you do in-place upgrades of your existing fleet. In practice I think it's much easier to do this well and to have everything properly prepared if you start from scratch with the new version. Starting from scratch gives you a totally clean slate where you can carefully track and verify every change you do to a stock install.
Of course all of this assumes that you have spare servers that you can use for this. You may not for various reasons, and in that case an in-place upgrade can be the best option in practice despite everything I've written. And when it is your best option, it's great if your Linux (or other OS) actively supports it (Debian and I believe Ubuntu), as opposed to grudging support (Fedora) or no support at all (RHEL/CentOS).
2016-04-24
Why we have CentOS machines as well as Ubuntu ones
I'll start with the tweets that I ran across semi-recently (via @bridgetkromhout):
@alicegoldfuss: "If you're running Ubuntu and some guy comes in and says 'we should use Redhat'...fuck that guy." - @mipsytipsy #SREcon16
mipsytipsy: alright, ppl keep turning this into an OS war; it is not. supporting multiple things is costly so try to avoid it.
This is absolutely true. But, well, sometimes you wind up with exceptions despite how you may feel.
We're an Ubuntu shop; it's the Linux we run and almost all of our machines are Linux machines. Despite this we still have a few CentOS machines lurking around, so today I thought I'd explain why they persist despite their extra support burden.
The easiest machine to explain is the one machine running CentOS 6. It's running CentOS 6 for the simple reason that that's basically the last remaining supported Linux distribution that Sophos PureMessage officially runs on. If we want to keep running PureMessage in our anti-spam setup (and we do), CentOS 6 is it. We'd rather run this machine on Ubuntu and we used to before Sophos's last supported Ubuntu version aged out of support.
Our current generation iSCSI backends run CentOS 7 because of the long support period it gives us. We treat these machines as appliances and freeze them once installed, but we still want at least the possibility of applying security updates if there's a sufficiently big issue (an OpenSSH exposure, for example). Because these machines are so crucial to our environment we want to qualify them once and then never touch them again, and CentOS has a long enough support period to more than cover their expected five year lifespan.
Finally, we have a couple of syslog servers and a console server that run CentOS 7. This is somewhat due to historical reasons, but in general we're happy with this choice; these are machines that are deliberately entirely isolated from our regular management infrastructure and that we want to just sit in a corner and keep working smoothly for as long as possible. Basing them on CentOS 7 gives us a very long support period and means we probably won't touch them again until the hardware is old enough to start worrying us (which will probably take a while).
The common feature here is the really long support period that RHEL and CentOS gives us. If all we want is basic garden variety server functionality (possibly because we're running our own code on top, as with the iSCSI backends), we don't really care about using the latest and greatest software versions and it's an advantage to not have to worry about big things like OS upgrades (which for us is actually 'build completely new instance of the server from scratch'; we don't attempt in-place upgrades of that degree and they probably wouldn't really work anyways for reasons out of the scope of this entry).
2016-04-09
Why your Ubuntu server stalls a while on boot if networking has problems
Yesterday I wrote on how to shoot yourself in the foot by making
a mistake in /etc/network/interfaces.
I kept digging into this today, and so now I can tell you why this
happens and what you can do about it. The simple answer is that it
comes from /etc/init/failsafe.conf.
What failsafe.conf is trying to do is kind of hard to explain
without a background in Upstart (Ubuntu's 'traditional' init system).
A real System V init system is always in a 'runlevel', and this
drives what it does (eg it determines which /etc/rcN.d directory
to process). Upstart sort of half abandons runlevels; they are not
built into Upstart itself and some /etc/init jobs don't use them,
but there's a standard Upstart event to set the runlevel and
many /etc/init jobs are started and stopped based on this runlevel
event.
Let's simplify that: Upstart's runlevel stuff is a way of avoiding
specifying real dependencies for /etc/init jobs and handling them
for /etc/rcN.d scripts. Instead jobs can just say 'start on
runlevel [2345]' and get started once the system has finished its
basic boot processing, whatever that is and whatever it takes.
Since the Upstart runlevel is not built in, something must generate
an appropriate 'runlevel N' event during boot at an appropriate
time. That thing is /etc/init/rc-sysinit.conf, which in turn
must be careful to run only at some appropriate point in Upstart's
boot process, once this basic boot processing is done. When is basic
boot processing done? Well, the rc-sysinit.conf answer is 'when
filesystems are there and static networking is up', which in Upstart
terms means when the filesystem(7)
and static-network-up Upstart events
are emitted by something.
So what happens if networking doesn't come fully up, for instance
if your /etc/network/interfaces has a mistake in it? If Upstart
left things as they were, your system would just hang in early boot;
rc-sysinit.conf would be left waiting for an Upstart event that
would never happen. This is what failsafe.conf is there for. It
waits a while for networking to come up, and if that doesn't happen
it emits a special Upstart event that tells rc-sysinit.conf to
go on anyways.
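You can see this interlock in the job conditions themselves. Paraphrased from a stock Ubuntu 14.04 install (the exact text can vary between releases):

```
# /etc/init/rc-sysinit.conf: don't start the runlevel machinery
# until basic boot is done -- or until failsafe gives up waiting.
start on (filesystem and static-network-up) or failsafe-boot

# /etc/init/failsafe.conf sleeps through a series of fixed delays
# and then emits the escape-hatch event, roughly:
#   sleep <a while>
#   initctl emit --no-wait failsafe-boot
```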
In the abstract this is a sensible idea. In the concrete, failsafe.conf
has a number of problems:
- the timeout is hardcoded, which means that it's guaranteed to
be too long for some people and probably not long enough for
others.
- it doesn't produce any useful messages when it has to delay,
and if you're not using Plymouth
it's totally silent. Servers typically don't run Plymouth.
- Upstart as a whole has a very inflexible view of what 'static
networking is up' means. It apparently requires that every 'auto'
interface listed in /etc/network/interfaces both exist and have
link signal (have a cable plugged in and be connected to something);
see eg this bug and this bug. You don't get to say 'proceed even
without link signal' or 'this interface is optional' or the like.
For Ubuntu versions that use Upstart, you can fix this by changing
/etc/init/failsafe.conf to shorten the timeouts and print out
actual messages (anything you output with eg echo will wind up
on the console). We're in the process of doing this locally; I
opted to print out a rather verbose message for my usual reasons.
Of course, all of this is going to be inapplicable in the upcoming
Ubuntu 16.04, since Ubuntu switched from Upstart to systemd as of
15.04 (cf).
However Ubuntu has put something similar to failsafe.conf
into their systemd setup and thus I expect that we'll wind up making
similar modifications to it in some way.
(A true native systemd setup has a completely different and generally more granular way of handling failures to bring up networking, but I don't expect Ubuntu to make that big of a change any time soon.)
2016-04-08
How to shoot yourself in the foot with /etc/network/interfaces on Ubuntu
Today I had one of those self inflicted learning experiences that I get myself into from time to time. I will start with the summary and then tell you the story of how I did this to myself.
The summary is that errors in /etc/network/interfaces can cause
your system to stall silently during boot for a potentially significant
amount of time.
One sort of error is a syntax error or omitting a line. Another sort of error is accidentally duplicating an IP address between an interface's primary address and one of its aliases. If you do the latter, you will get weird errors in log files and from tools that don't actually help you.
How I discovered this is that today I was doing a test install of a new web server in a VM image. Our standard practice for web server hosts is that we don't make their hostname be the actual website name; instead they have a real hostname and then one or more website names as aliases. On most of our web servers, these are IP aliases. However, we're running short of IP addresses on our primary network and when I set up this new host I decided to make its single website just be another A record to its single IP address.
When I reached the end of the install process, I'd forgotten this
detail; instead I thought the server needed the website name added as
an IP alias. So I looked up the IP address for the website name and
slavishly added to /etc/network/interfaces something like:
auto eth0:0
address <IP>
netmask 255.255.255.0
network <blah>.0
(The sharp eyed will notice that there are two errors here.)
Then I rebooted the machine and it just sat there for quite a while.
After a couple of reboots and poking several things (eg, trying an
older kernel) I wound up looking at interfaces in a rescue shell
and noticed my silly mistake. Or rather, my obvious silly mistake:
I'd left out the 'iface eth0:0 inet static' before the address
et al. So I fixed that and rebooted the machine.
Imagine my surprise when the machine still hung during boot. But
this time I let it sit for long enough that the Ubuntu boot process
timed out whatever it needed to, and the machine actually came up.
When it did, I poked around to try to find out what was wrong and
eventually noticed that I had no eth0:0 alias device. This led
me to notice that the IP address I was trying to give to eth0:0
was the same address that eth0 already had, at which point I
finally figured out what was wrong and was able to fully correct
it.
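For the record, a correct alias stanza needs both the iface line I omitted and an address that eth0 doesn't already have; something like this (the address here is a made-up documentation-range one):

```
auto eth0:0
iface eth0:0 inet static
    address 192.0.2.11
    netmask 255.255.255.0
```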
The good news is that now I know another place to look if an Ubuntu machine has mysterious 'hang during boot' problems. (Technically it was a stall, but stalling several minutes with no messages about it is functionally equivalent to a hang from the sysadmin perspective.)
(This is why I test my install instructions in virtual machines before going to the bother of getting real hardware set up. Sometimes it winds up feeling overly nitpicky, and sometimes very much not.)
2016-03-30
My view of Debian's behavior on package upgrades with new dependencies
In the primary Reddit discussion of my entry about actually learning apt and dpkg, yentity asked an entirely sensible question:
Why is this [apt-get's
--with-new-pkgs] not enabled by default in debian / ubuntu ?
The ultimate answer here is 'because Debian has made a philosophical
choice'. Specifically, Debian has decided that no matter what the
person building the new version of a Debian package wants or feels
is necessary, an 'apt-get upgrade' will never add additional
packages to your system. If the builder of the package insists that
a new version requires an additional package to be installed, it is
better for the upgrade to not happen. Only 'apt-get install <pkg>'
(or 'apt-get dist-upgrade') will ever add new packages to your
system.
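You can see the difference without changing anything by using apt-get's simulation mode. A sketch (this assumes a pending upgrade where the new version of some package has grown a new dependency):

```shell
# Plain upgrade: a package whose new version needs a new package
# is reported as 'kept back' rather than upgraded.
apt-get -s upgrade

# With --with-new-pkgs, the simulated upgrade is allowed to pull
# in the new dependency as well, and lists what it would add.
apt-get -s --with-new-pkgs upgrade
```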
Regardless of what you think about its merits, this is a coherent position for Debian to take. In an anarchic packaging environment with individual Debian developers going their own way, it even has a fair amount of appeal. It certainly means that package maintainers have a strong pragmatic incentive not to add new dependencies, which probably serves to minimize them (which is one reason Debian has apt-get behave this way).
My personal view is that I prefer an environment where package builders are trusted to do the right thing with package dependencies in new versions of their packages, whatever that is. Packages can get new dependencies for all sorts of good reasons, including that what used to be a single package is being split up into several ones. As a sysadmin outsider, I'm not in a good position to second guess the package maintainer on what dependencies are right and whether or not a new one is harmful to my system, so in a trustworthy environment I'll just auto-install new dependencies (as we now do on Ubuntu where it's possible).
(The Debian package format has also made some structural decisions that make things like splitting packages more of a pain. In an RPM-based system, other packages often don't notice or care if you split yours up; in a Debian one, they're more likely to notice.)
It's worth pointing out that this trust fundamentally requires work and politics, in that it requires a policy on 'no unneeded dependencies' (and 'no surprises in package upgrades') and then a group of people who are empowered to judge and enforce the policy (overriding package maintainers when necessary). This sort of control probably does not go well with a relatively anarchic project and it's certainly a point of argument (and one could say that Debian already has enough of those).
2016-03-28
An awkward confession and what we should do about it
I have an awkward confession.
At this point, we have been running
Ubuntu machines for at least nine years or so, starting with Ubuntu
6.06 and moving forward from there. In all of that time, one of the
things I haven't done (and I don't think we've done generally) is
really dive in and learn about Debian packaging and package management.
Oh sure, we can fiddle around with apt-get and a number of other
superficial things, we've built modified preseeded install environments,
and I've learned enough to modify existing Debian packages and
rebuild them. But that's all. That leaves vast oceans of both dpkg
and APT usage that we have barely touched, plus all of the additional
tools and scripts around the Debian package ecosystem (some of which
have been mentioned here by commentators).
I don't have a good explanation for why this has happened, and in particular why I haven't dug into Debian packaging (because diving into things is one of the things that I do). I can put together theories (including me not being entirely fond of Ubuntu even from the start), but it's all just speculation and if I'm honest it's post-facto excuses and rationalization.
But whatever the reason, it is definitely embarrassing and, in the long run, harmful.
There are clearly things in the whole Debian package ecology that
would improve our lives if we knew them (for example, I only recently
discovered apt-get's --with-new-pkgs option). Yet what I can
only describe as my stubborn refusal to dig into Debian packaging
is keeping me from this stuff. I need to fix that. I don't necessarily
need to know all of the advanced stuff (I may never create a Debian
package from scratch), but I should at least understand the big
picture and the details that matter to us.
(It should not be the case that I still know much more about RPM and yum/dnf than I do about the Debian equivalents.)
My goal is to be not necessarily an expert but at least honestly knowledgeable, both about the practical nuts and bolts operation of the system and about how everything works conceptually (including such perennial hot topics for me as the principles of the debconf system).
With all of that said, I have to admit that as yet I haven't figured out where I should start reading. Debian has a lot of documentation, but in the past my experience has been that much of it assumes a certain amount of initial context that I don't have yet. Possibly I should start by just reading through all of the APT and dpkg related manpages, trying to sort everything out, and keeping notes about things that I don't understand. Then I can seek further information.
(As is traditional on Wandering Thoughts, I'm writing this partly to spur myself into action.)