2016-05-26
Why SELinux is inherently complex
The root of SELinux's problems is that SELinux is a complex security mechanism that is hard to get right. Unfortunately this complexity is not simply an implementation artifact of the current SELinux code; instead, it's inherent in what SELinux is trying to do.
What SELinux is trying to do is understand 'valid' program behavior and confine programs to it at a fine-grained level in an environment where all of the following are true:
- Programs are large, complex, and can legitimately do many things
(this is especially so because we are really talking about entire
assemblages of programs, not just single binaries). After all,
SELinux is intended to secure things like web servers, database
engines, and mailers, all of which have huge amounts of functionality.
- Programs legitimately access things that are spread all over the
system and intermingled tightly with things that they should not
be able to touch. This requires fine-grained selectivity about
what programs can and cannot access.
- Programs use and rely on outside libraries that can have unpredictable, opaque, and undocumented internal behavior, including about what resources those libraries access. Since we're trying to confine all of the program's observed behavior, this necessarily includes the behavior of the libraries that it uses.
All of this means that thoroughly understanding program behavior is very hard, yet such a thorough understanding is the core prerequisite for a SELinux policy that is both correct and secure. Even when you've got a thorough understanding once, the issue with libraries means that it can be kicked out from underneath you by a library update.
(Such insufficient understanding of program behavior is almost certainly the root cause of a great many of the SELinux issues that got fixed here.)
This complexity is inherent in trying to understand program behavior
in the unconfined environment of a general Unix system, where
programs can touch devices in /dev, configuration files under
/etc, run code from libraries in /lib, run helper programs from
/usr/bin, poke around in files in various places in /var/log
and /var, maybe read things from /usr/lib or /usr/share, make
network calls to various services, and so on. All the while they're
not supposed to be able to look at many things from those places
or do many 'wrong' operations. Your program that does DNS lookups
likely needs to be able to make TCP connections to port 53, but you
probably don't want it to be able to make TCP connections to port
25 (or 22). And maybe it needs to make some additional connections
to local services, depending on what NSS libraries got loaded by
glibc when it parsed /etc/nsswitch.conf.
(Cryptography libraries have historically done some really creative
and crazy things on startup in the name of trying to get some
additional randomness, including reading /etc/passwd and running
ps and netstat. Yes, really (via).)
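This gap between observed and intended behavior is also why mechanically generated policy is so tempting and so dangerous. As a sketch of the usual workflow (the module name is made up, and I'm assuming auditd is logging AVC denials to the standard audit log):

    # show recent SELinux AVC denials
    ausearch -m avc -ts recent

    # mechanically turn those denials into candidate allow rules
    # ('mydns' is just a placeholder module name)
    ausearch -m avc -ts recent | audit2allow -m mydns

The catch is that audit2allow can only encode what the program was observed to do, not what it should legitimately be doing; telling those two apart is exactly the understanding problem described above.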
SELinux can be simple, but it requires massive reorganization of a
typical Linux system and application stack. For example, life would
be much simpler if all confined services ran inside defined directory
trees and had no access to anything outside their tree (ie everything
was basically chroot()'d or close to it); then you could write
really simple file access rules (or at least start with them).
Similar things could be done with services provided to applications
(for example, 'all logging must be done through this interface'),
requirements to explicitly document required incoming and outgoing
network traffic, and so on.
(What all of these do is make it easier to understand expected program behavior, either by limiting what programs can do to start with or by requiring them to explicitly document their behavior in order to have it work at all.)
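As a concrete illustration of the 'defined directory tree' idea, you don't even need SELinux for a crude version of it; systemd can already express something like it. This is a hypothetical unit fragment (all the names are made up), not a recommendation:

    # hypothetical service unit fragment
    [Service]
    # chroot() the service into its own tree before it starts
    # (ExecStart is then relative to the new root)
    RootDirectory=/srv/mysvc
    ExecStart=/bin/mysvc
    # run it unprivileged so escaping the tree is harder
    User=mysvc

A policy for a service like this could start from 'nothing outside /srv/mysvc' and then grow only explicit, documented exceptions.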
Sidebar: the configuration change problem
The problem gets much worse when you allow system administrators to substantially change the behavior of programs in unpredictable ways by changing their configurations. There is no scalable automated way to parse program configuration files and determine what they 'should' be doing or accessing based on the configuration, so now you're back to requiring people to recreate that understanding of program behavior, or at least a fragment of it (the part that their configuration changes affected).
This generously assumes that all points where sysadmins can change program configuration come prominently marked with 'if you touch this, you need to do this to the SELinux setup'. As you can experimentally determine today, this is not the case.
2016-05-25
SELinux is beyond saving at this point
SELinux has problems. It has a complexity problem (in that it is quite complex), it has technical problems with important issues like usability and visibility, it has pragmatic problems with getting in the way, and most of all it has a social problem. At this point, I no longer believe that SELinux can be saved and become an important part of the Linux security landscape (at least if Linux remains commonly used).
The fundamental reason why SELinux is beyond saving at this point is that after something like a decade of SELinux's toxic mistake, the only people left in the SELinux community are the true believers: the people who believe that SELinux is not a sysadmin usability nightmare, that those who disable it are fools, and so on. This narrowing is what naturally happens when a community doubles down on calling outsiders names; if people tell you that you're an idiot for questioning the SELinux way, well, you generally leave.
If the SELinux community was going to change its mind about these
issues, the people involved have had years of opportunities to do
so. Yet the SELinux ship sails on pretty much as it ever has. These
people are never going to consider anything close to what I once
suggested in order to change course; instead, I
confidently expect them to ride the 'SELinux is totally fine' train
all the way into the ground. I'm sure they will be shocked and upset
when something like OpenBSD's pledge() is integrated either in Linux
libraries or as a kernel security module (or both) and people start
switching to it.
(As always, real security is people, not math. A beautiful mathematical security system that people don't really use is far less useful and important than a messy, hacky one that people do use.)
(As for why I care about SELinux despite not using it and thinking it's the wrong way, see this. Also, yes, SELinux can do useful things if you work hard enough.)
2016-05-04
The better way to clear SMART disk complaints, with safety provided by ZFS
A couple of months ago I wrote about clearing SMART complaints
about one of my disks by very carefully
overwriting sectors on it, and how ZFS made this kind of safe. In
a comment, Christian Neukirchen
recommended using hdparm --write-sector to overwrite sectors with
read errors instead of the complicated dance with dd that I used
in my entry. As it happens, that disk
coughed up a hairball of smartd complaints today, so I got a
chance to go through my procedures again and the advice is spot on.
Using hdparm makes things much simpler.
So my revised steps are:
- Scrub my ZFS pool in the hopes that this will make the problem go
  away. It didn't, which means that any read errors in the partition
  for the ZFS pool are in space that ZFS shouldn't be using.

- Use dd to read all of the ZFS partition. I did this with 'dd
  if=/dev/sdc7 of=/dev/null bs=512k conv=noerror iflag=direct'. This
  hit several bad spots, each of which produced kernel errors that
  included a line like this:

    blk_update_request: I/O error, dev sdc, sector 1748083315

- Use hdparm --read-sector to verify that this is indeed the bad
  sector:

    hdparm --read-sector 1748083315 /dev/sdc

  If this is the correct sector, hdparm will report a read error and
  the kernel will log a failed SATA command. Note that this is not a
  normal disk read, as hdparm is issuing a low-level read, so you
  don't get a normal message; instead you get something like this:

    ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata3.00: irq_stat 0x40000001
    ata3.00: failed command: READ SECTOR(S) EXT
    ata3.00: cmd 24/00:01:73:a2:31/00:00:68:00:00/e0 tag 3 pio 512 in
             res 51/40:00:73:a2:31/00:00:68:00:00/00 Emask 0x9 (media error)
    [...]

  The important thing to notice here is that you don't get the sector
  reported (at least not in decoded form), so you have to rely on
  getting the sector number correct in the hdparm command instead of
  being able to cross check it against earlier kernel logs.

  (Sector 1748083315 is 0x6831a273 in hex. All the bytes are there in
  the cmd part of the message, but clearly shuffled around; a quick
  way to check the conversion is shown just after this list.)

- Use hdparm --write-sector to overwrite the sector, forcing it to be
  spared out:

    hdparm --write-sector 1748083315 <magic option> /dev/sdc

  (hdparm will tell you what the hidden magic option you need is when
  you use --write-sector without it.)

- Scrub my ZFS pool again and then re-run the dd to make sure that I
  got all of the problems.
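That quick hex check needs nothing more than the shell's printf:

    printf '%x\n' 1748083315
    # prints 6831a273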
I was pretty sure I'd gotten everything even before the re-scrub
and the re-dd scan, because smartd reported that there were no
more currently unreadable (pending) sectors or offline uncorrectable
sectors, both of which it had been complaining about before.
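If you don't want to wait for smartd to weigh in, you can ask smartctl directly. These are the standard smartmontools attribute names, although disk vendors vary:

    smartctl -A /dev/sdc | egrep 'Current_Pending_Sector|Offline_Uncorrectable'

Both raw counts should drop back to zero once everything has been overwritten and spared out.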
This was a lot easier and more straightforward to go through than
my previous procedure, partly because I can directly reuse the
sector numbers from the kernel error messages without problems and
partly because hdparm does exactly what I want.
There's probably a better way to scan the hard drive for read
errors than dd. I'm a little bit nervous about my 512KB block
size here potentially hiding a second bad sector that's sufficiently
close to the first, but especially with direct IO I think it's a
tradeoff between speed and thoroughness. Possibly I should explore
how well the badblocks program works here, since it's the obvious
candidate.
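For the record, the obvious starting invocation would be something like this (read-only is badblocks' default mode, which is what you want on a disk with live data, and -b matches the disk's 512-byte sectors):

    # read-only scan of the whole disk; -s shows progress, -v reports errors
    badblocks -b 512 -s -v /dev/sdc

I haven't actually tried this on this disk, so treat it as a sketch rather than a tested recipe.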
(These days I force dd to use direct IO when talking to disks
because that way dd does much less damage to the machine's overall
performance.)
(This is the kind of entry that I write because I just looked up my first entry for how to do it again, so clearly I'm pretty likely to wind up doing this a third time. I could just replace the drive, but at this point I don't have enough drive bay slots in my work machine's case to do this easily. Also, I'm a peculiar combination of stubborn and lazy where it comes to hardware.)
2016-05-02
How I think you set up fair share scheduling under systemd
When I started writing this entry, I was going to say that systemd automatically does fair share scheduling between users and describe the mechanisms that make that work. However, this turns out to be false as far as I can see; systemd can easily do fair share scheduling, but it doesn't do this by default.
The basic mechanics of fair share scheduling are straightforward.
If you put all of each user's processes into a separate cgroup it
happens automatically. Well. Sort of. You see,
it's not good enough to put each user into a separate cgroup; you
have to make it a CPU accounting cgroup, and a memory accounting
cgroup, and so on. Systemd normally puts all processes for a single
user under a single cgroup, which you can see in eg systemd-cgls
output and by looking at /sys/fs/cgroup/systemd/user.slice, but
by default it doesn't enable any CPU or memory or IO accounting for
them. Without those enabled, the traditional Linux (and Unix)
behavior of 'every process for itself' still applies.
(You can still use systemd-run to add your own limits here, but I'm not quite sure how this works
out.)
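One way to poke at this by hand is to flip accounting on for a single user's slice at runtime; this is a sketch, assuming uid 1000 and a systemd new enough to have set-property:

    # turn on CPU accounting for one user's slice
    systemctl set-property user-1000.slice CPUAccounting=true

    # check that it took
    systemctl show user-1000.slice -p CPUAccounting

As I understand it, set-property persists by default (use --runtime if you don't want that), so this is also a quick way to experiment before committing to a global default.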
Now, I haven't tested the following, but from reading the documentation
it seems that what you need to do to get fair share scheduling for
users is to enable DefaultCPUAccounting and DefaultBlockIOAccounting
for all user units by creating an appropriate file in
/etc/systemd/user.conf.d, as covered in the systemd-user.conf
manpage
and the systemd.resource-control manpage.
You probably don't want to turn this on for system units, or at least
I wouldn't.
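A minimal sketch of such a drop-in (the file name is my invention; any .conf file in that directory should work):

    # /etc/systemd/user.conf.d/accounting.conf (hypothetical name)
    [Manager]
    DefaultCPUAccounting=yes
    DefaultBlockIOAccounting=yes

As usual with systemd configuration changes, I believe this only affects things going forward, so existing user sessions may need to end (or the machine be rebooted) before you see the effects everywhere.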
I don't think there's any point in turning on DefaultMemoryAccounting.
As far as I can see there is no kernel control that limits a cgroup's
share of RAM, just the total amount of RAM it can use, so cgroups
just can't enforce fair share scheduling of RAM the way you can
for CPU time (unless I've overlooked something here). Unfortunately,
missing fair share memory allocation definitely hurts the overall
usefulness of fair share scheduling; if you want to ensure that no
user can take an 'unfair' share of the machine, it's often just as
important to limit RAM as CPU usage.
(Having discovered this memory limitation, I suspect that we won't bother trying to enable fair share scheduling in our Ubuntu 16.04 installs.)