A problem with using old OmniOS versions: disconnection from the community
One of the less obvious problems with us probably never doing another OmniOS upgrade is that I'm clearly going to become more and more disconnected from the OmniOS community. This is only natural, since most or almost all of the community is using recent versions; as time goes on, those versions and the version we're running are only going to drift more and more apart.
(It's true that OmniOS r151014 is an OmniOS LTS release, supported through early 2018 per here. But in practice I expect that most OmniOS people will be running one of the more up to date stable releases instead, since they won't have our upgrade concerns.)
Being disconnected from the community makes me sad, because the OmniOS community is one of the great parts about OmniOS. There are several dimensions to this disconnection. First, the more disconnected I am from the community, the less I'll be able to give back to it, the less I can contribute answers or information or whatever. Giving back to the community is something that I would like to do for all sorts of reasons (including that I plain like being able to contribute).
Obviously, the more distant we are from what the community is running the less the community can help us with advice and information and all of that if we run into issues or just have questions about how best to do something or what the community's experiences are. At best they may be able to tell us how things would look or would be done on a newer version of OmniOS. Of course, some things only change slowly, but I suspect that there is only going to be more and more of a gap here over time. I don't want to put too much weight on this; I'm very grateful for the help that the community has given us, but at the same time it's not help that I think we should count on and significantly factor into our plans.
(To put it one way, community help comes from the goodness of its heart and is best considered a pleasant surprise instead of a guarantee or an entitlement. I don't know if all of this makes sense to anyone but me, though.)
Finally, I'll just plain be paying less attention to the community and drifting away from it. It's inevitable; more and more, community discussions will be about things that aren't relevant to our version and that I can't contribute to. If people have problems or questions, I'll only have outdated information or more and more uninformed opinions. That's a recipe for disengagement, even from a nice community.
Having written all of this, I think that what I should do is build one experimental OmniOS server to keep up to date. It doesn't have to use our fileserver hardware; for a lot of things, any old server running OmniOS will serve to keep me at least somewhat current. As a bonus it will provide me with a platform to test things on the current OmniOS version (whatever that is at the time).
(We have enough spare SSDs for our current fileservers so that I could take the test fileserver and build a system SSD set for the current OmniOS, just so I have it around. We did this sort of back and forth OmniOS version testing during our transition to r151014, so we actually have a template for it.)
Your overall anti-spam system should have manual emergency blocks
We mostly rely on a commercial anti-spam system for our incoming spam filtering (as described here), and many other people rely on a variety of open source options for their spam filtering. This generally works very well, with us (and you) getting to offload the work of maintaining a high quality anti-spam system to other people (and it's certainly a lot of work). But not always (and not just because it malfunctions). The realities of life are that sooner or later you will be hit by a spam run that your anti-spam system doesn't recognize, either because the spam run is really new or because it's pretty specific to you.
Much of the time, you can shrug your shoulders and let this go. No anti-spam system is perfect and one of the tradeoffs you make when relying on a third-party system is that it's broadly out of your hands (sometimes this is an advantage). But some of the time this isn't going to be good enough; either the volume or the threat to your users will be so high that you can't just sit on your hands.
(Modern ransomware is making this clear by creating a potentially very high cost of allowing some things through.)
When this day comes to pass, you'll want to have the ability to step in and block the traffic even though your automated anti-spam system is happy with it. This can take many forms, depending on how you want to handle it; you could figure out how to write custom rules for your anti-spam system (so you can outright block certain sorts of files or certain URLs or whatever), or you can build blocking features into your mailer configuration itself, or any number of other options.
Having been through having to do this on the fly during an emergency, my strong suggestion is that you build the infrastructure for these manual blocks now, before you need them. It's some additional up front work and if you're lucky you may never need it, but doing it now when you have time to plan and test and figure out the best way to do things beats having to do it on the fly, under pressure.
Sidebar: What I think you should have manual blocks for
On the one hand attacker ingenuity is very deep, but on the other hand certain patterns repeat over and over again. So my view is that you can probably cover most ground with the ability to put in place manual blocks against sending IPs, sending domains, file extensions (including inside file containers like ZIP files), and whole and partial URLs (for phishing campaigns). You might also want a general message header and body regular expression matching system, but that's starting to feel like scope creep to me.
(Of course real scope creep would be to start by creating a general, generic framework for writing relatively arbitrary manual blocks on message attributes.)
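To make the sidebar a bit more concrete, here is a minimal sketch of what such manual blocks could look like, assuming you can hook a Python check into your mail path somewhere. The class name `ManualBlocks` and all of its methods are made up for illustration; a real version would live in your mailer's own configuration language or in your anti-spam system's custom rules.

```python
import io
import ipaddress
import os
import zipfile


class ManualBlocks:
    """Hypothetical holder for hand-maintained emergency blocks."""

    def __init__(self, networks=(), domains=(), extensions=(), url_fragments=()):
        self.networks = [ipaddress.ip_network(n) for n in networks]
        self.domains = {d.lower() for d in domains}
        self.extensions = {e.lower() for e in extensions}
        self.url_fragments = [u.lower() for u in url_fragments]

    def ip_blocked(self, ip):
        # True if the sending IP falls inside any blocked network.
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.networks)

    def domain_blocked(self, domain):
        # Match the domain itself or any parent domain that is listed.
        labels = domain.lower().rstrip(".").split(".")
        return any(".".join(labels[i:]) in self.domains
                   for i in range(len(labels)))

    def _ext(self, name):
        return os.path.splitext(name.lower())[1]

    def attachment_blocked(self, filename, data=b""):
        # Check the attachment's own extension, then look one level
        # inside ZIP files at the names of their members.
        ext = self._ext(filename)
        if ext in self.extensions:
            return True
        if ext == ".zip":
            try:
                with zipfile.ZipFile(io.BytesIO(data)) as zf:
                    return any(self._ext(n) in self.extensions
                               for n in zf.namelist())
            except zipfile.BadZipFile:
                return False
        return False

    def url_blocked(self, url):
        # Whole or partial URL match, done as a plain substring test.
        url = url.lower()
        return any(frag in url for frag in self.url_fragments)
```

This deliberately stops short of header and body regular expressions. It also only looks one level deep into ZIP files; nested containers would need more work.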
Why SELinux is inherently complex
The root of SELinux's problems is that SELinux is a complex security mechanism that is hard to get right. Unfortunately this complexity is not (just) simply an implementation artifact of the current SELinux code; instead, it's inherent in what SELinux is trying to do.
What SELinux is trying to do is understand 'valid' program behavior and confine programs to it at a fine grained level in an environment where all of the following are true:
- Programs are large, complex, and can legitimately do many things
(this is especially so because we are really talking about entire
assemblages of programs, not just single binaries). After all,
SELinux is intended to secure things like web servers, database
engines, and mailers, all of which have huge amounts of functionality.
- Programs legitimately access things that are spread all over the
system and intermingled tightly with things that they should not
be able to touch. This requires fine-grained selectivity about
what programs can and cannot access.
- Programs use and rely on outside libraries that can have unpredictable, opaque, and undocumented internal behavior, including about what resources those libraries access. Since we're trying to confine all of the program's observed behavior, this necessarily includes the behavior of the libraries that it uses.
All of this means that thoroughly understanding program behavior is very hard, yet such a thorough understanding is the core prerequisite for a SELinux policy that is both correct and secure. Even when you've got a thorough understanding once, the issue with libraries means that it can be kicked out from underneath you by a library update.
(Such insufficient understanding of program behavior is almost certainly the root cause of a great many of the SELinux issues that got fixed here.)
This complexity is inherent in trying to understand program behavior
in the unconfined environment of a general Unix system, where
programs can touch devices in /dev, configuration files under /etc,
run code from libraries in /lib, run helper programs from /usr/bin,
poke around in files in various places in /var, maybe read things
from network calls to various services, and so on. All the while
they're not supposed to be able to look at many things from those
places or do many 'wrong' operations. Your program that does DNS
lookups likely needs to be able to make TCP connections to port 53,
but you probably don't want it to be able to make TCP connections to
port 25 (or 22). And maybe it needs to make some additional
connections to local services, depending on what NSS libraries got
loaded by
glibc when it parsed
(Cryptography libraries have historically done some really creative
and crazy things on startup in the name of trying to get some
additional randomness, including reading /etc/passwd and running
netstat. Yes, really (via).)
SELinux can be simple, but it requires massive reorganization of a
typical Linux system and application stack. For example, life would
be much simpler if all confined services ran inside defined directory
trees and had no access to anything outside their tree (ie everything
chroot()'d or close to it); then you could write
really simple file access rules (or at least start with them).
Similar things could be done with services provided to applications
(for example, 'all logging must be done through this interface'),
requirements to explicitly document required incoming and outgoing
network traffic, and so on.
(What all of these do is make it easier to understand expected program behavior, either by limiting what programs can do to start with or by requiring them to explicitly document their behavior in order to have it work at all.)
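To illustrate how simple a file access rule becomes once a service lives entirely inside one directory tree, here is a minimal sketch. It is plain Python, not SELinux policy, and the function name `inside_tree` and the example paths are hypothetical.

```python
import os


def inside_tree(path, root):
    """Return True if path resolves to somewhere under root.

    Both paths are resolved first so that '..' components and
    symlinks can't be used to step outside the tree.
    """
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(root)
    return os.path.commonpath([real_path, real_root]) == real_root
```

With this layout the whole file policy is one question, 'is it inside the tree?', instead of a long list of individual files scattered across /etc, /var, /usr, and so on.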
Sidebar: the configuration change problem
The problem gets much worse when you allow system administrators to substantially change the behavior of programs in unpredictable ways by changing their configurations. There is no scalable automated way to parse program configuration files and determine what they 'should' be doing or accessing based on the configuration, so now you're back to requiring people to recreate that understanding of program behavior, or at least a fragment of it (the part that their configuration changes affected).
This generously assumes that all points where sysadmins can change program configuration come prominently marked with 'if you touch this, you need to do this to the SELinux setup'. As you can experimentally determine today, this is not the case.
SELinux is beyond saving at this point
SELinux has problems. It has a complexity problem (in that it is quite complex), it has technical problems with important issues like usability and visibility, it has pragmatic problems with getting in the way, and most of all it has a social problem. At this point, I no longer believe that SELinux can be saved and become an important part of the Linux security landscape (at least if Linux remains commonly used).
The fundamental reason why SELinux is beyond saving at this point is that after something like a decade of SELinux's toxic mistake, the only people who are left in the SELinux community are the true believers, the people who believe that SELinux is not a sysadmin usability nightmare, that those who disable it are fools, and so on. That your community narrows is what naturally happens when you double down on calling other people things; if people say you are an idiot for questioning the SELinux way, well, you generally leave.
If the SELinux community was going to change its mind about these
issues, the people involved have had years of opportunities to do
so. Yet the SELinux ship sails on pretty much as it ever has. These
people are never going to consider anything close to what I once
suggested in order to change course; instead, I
confidently expect them to ride the 'SELinux is totally fine' train
all the way into the ground. I'm sure they will be shocked and upset
when something like OpenBSD's
pledge() is integrated either in Linux
libraries or as a kernel security module (or both) and people start
switching to it.
(As always, real security is people, not math. A beautiful mathematical security system that people don't really use is far less useful and important than a messy, hacky one that people do use.)
(As for why I care about SELinux despite not using it and thinking it's the wrong way, see this. Also, yes, SELinux can do useful things if you work hard enough.)
How fast fileserver failover could matter to us
Our current generation fileservers don't have any kind of failover system, just like our original generation. A few years ago I wrote that we don't really miss failover, although I allowed that I might be overlooking situations where we'd have used failover if we had it. So, yeah, about that: on reflection, I think there is a relatively important situation where we could really use fast, reliable cooperative failover (when both the old and new hosts of a virtual fileserver are working properly).
Put simply, the advantage of fast cooperative failover is that it makes a number of things a lot less scary, because you can effectively experiment (assuming that the failover is basically user transparent). For instance, trying a new version of OmniOS in production, where it's very unlikely that it will crash outright but possible that we'll experience performance problems or other anomalies. With fast failover, we could roll a virtual fileserver on to a server running the new OmniOS, watch it, and have an immediate and low impact way out if something comes up.
(At one point this would have made our backups explode because the backups were tied to the real hosts involved. However we've changed that these days and backups are relatively easy to shift around.)
It's possible that we should take another look at failover in our current environment, since a lot of water has gone under the bridge since we last gave up on it. This also sparks a more radical thought; if we're going to use failover mostly as a way to do experiments, perhaps we should reorganize things so that some of our virtual fileservers are smaller than they are now so moving one over affects fewer people. Or at least we could have a 'sysadmin virtual fileserver' so we can test it with ourselves only at first.
(Our current overall architecture is sort of designed with the idea that a host has only one virtual fileserver and virtual fileservers don't really share disks with other ones, but we might be able to do some tweaks.)
All of this is a bit blue sky, but at the very least we should do a bit of testing to see how much time a cooperative fileserver failover might take in our current environment. I should also keep an eye out for future OmniOS changes that might improve it.
(As usual, re-checking one's core assumptions periodically is probably a good idea. Ideally we would have done some checking of this when we were initially testing each OmniOS version, but well. Hopefully next time, if there is one.)
Our problem with OmniOS upgrades: we'll probably never do any more
Our current fileserver infrastructure is running OmniOS r151014, and I have recently crystallized the realization that we will probably not upgrade it to a newer version of OmniOS over the remaining lifetime of this generation of server hardware (which I optimistically project to be another two to three years). This is kind of a problem for a number of reasons (and yes, beyond the obvious), but my pessimistic view right now is that it's an essentially intractable one for us.
The core issue with upgrades for us is that in practice they are extremely risky. Our fileservers are a core and highly visible service in our environment; downtime or problems on even a single production fileserver directly impact the ability of people here to get their work done. And we can't even come close to completely testing a new fileserver outside of production. Over and over, we have only found problems (sometimes serious ones) under our real and highly unpredictable production load.
(We can do plenty of fileserver testing outside of production and we do, but testing can't show that production fileservers will be problem free, it can only find (some) problems before production.)
Since upgrades are risky, we need fairly strong reasons to do them. When our existing fileservers are working reasonably well, it's not clear where such strong reasons would come from (barring a few freak events, like a major ixgbe improvement, or the discovery of catastrophic bugs in ZFS or NFS service or the like). On the one hand this is a testimony to OmniOS's current usefulness, but on the other hand, well.
I don't have any answers to this. There probably really aren't any, and I'm wishing for a magic solution to my problems. Sometimes that's just how it goes.
(I'm assuming for the moment that we could do OmniOS version upgrades through new boot environments. We might not be able to, for various reasons (we couldn't last time), in which case the upgrade problem gets worse. Actual system reinstalls, hardware swaps, or other long-downtime operations crank the difficulty of selling upgrades up even more. Our round of upgrades to OmniOS r151014 took about six months from the first server to the last server, for a whole collection of reasons including not wanting to do all servers at once in case of problems.)
My view of Barracuda's public DNSBL
In a comment on this entry, David asked, in part:
Have you tried the Barracuda and Hostkarma DNSBLs? [...]
I hadn't heard of Hostkarma before, so I don't have anything to say about it. But I am somewhat familiar with Barracuda's public DNSBL and based on my experiences I'm not likely to use it any time soon. As for why, well, David goes on to mention:
[...] Barracuda in particular lists more aggressively and is willing to punish lower volume relays that fail to mitigate spammer exploitations. [...]
That's one way to describe what Barracuda does. Another way to put it is that in my experience, Barracuda is pretty quick to list any IP address that has even a relatively brief burst of outgoing spam, regardless of the long term spam-to-ham ratio of that IP address. Or to put it another way, whenever we have one of our rare outgoing spam incidents, we can count on the outgoing IP involved to get listed and for some amount of our entirely legitimate email to start bouncing as a result.
As a result I expect that any attempt to use it in our anti-spam system would have far too high a false positive rate to be acceptable to our users. Given this I haven't attempted any sort of actual analysis of comparing sender IPs of accepted and rejected email against the Barracuda list; it's too much work for too little return.
My suspicion is that this is likely to be strongly influenced by your overall rate of ham to spam, for standard mathematical reasons. If most of your incoming email is spam anyways and you don't often receive email from places that are likely to be compromised from time to time by spammers, its misfires are not likely to matter to you. This does not describe our mail environment, however, either in ham/spam levels or in the type of sources we see.
(To put it one way, universities are reasonably likely to get one of their email systems compromised from time to time and we certainly get plenty of legitimate email from universities.)
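The 'standard mathematical reasons' can be sketched with some arithmetic. Every number here is invented purely for illustration (I have not measured Barracuda's actual hit or misfire rates), and the function name is hypothetical; the point is only how the same DNSBL behaves under different ham-to-spam mixes.

```python
def rejected_ham_share(ham, spam, spam_hit_rate, ham_misfire_rate):
    """Fraction of all DNSBL rejections that were actually ham.

    ham, spam: message counts in the incoming mail stream.
    spam_hit_rate: fraction of spam that comes from listed IPs.
    ham_misfire_rate: fraction of ham that comes from listed IPs.
    """
    rejected_ham = ham * ham_misfire_rate
    rejected_spam = spam * spam_hit_rate
    return rejected_ham / (rejected_ham + rejected_spam)


# Same made-up DNSBL behaviour, two different mail environments.
mostly_spam = rejected_ham_share(ham=1000, spam=9000,
                                 spam_hit_rate=0.5, ham_misfire_rate=0.01)
mostly_ham = rejected_ham_share(ham=9000, spam=1000,
                                spam_hit_rate=0.5, ham_misfire_rate=0.01)
print("mostly spam: %.1f%% of rejections are ham" % (mostly_spam * 100))
print("mostly ham:  %.1f%% of rejections are ham" % (mostly_ham * 100))
```

With these assumed rates, a mostly-spam environment sees well under one percent of its rejections being legitimate email, while a mostly-ham environment sees roughly one in seven, even though the DNSBL itself hasn't changed at all.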
On my personal sinkhole spamtrap, I could probably use the Barracuda list (and the psky RBL) as a decent way of getting rid of known and thus probably uninteresting sources of spam in favour of only having to deal with (more) interesting ones. But obviously this spamtrap gets only spam, so false positives are not exactly a concern. Certainly a significant number of recently trapped messages there are from IPs that are on one or the other lists (and sometimes both), although obviously I'm taking a post-facto look at the hit rate.
Please stop the Python 2 security scaremongering
Let's start with Aaron Meurer's Moving Away from Python 2 in which I read, in passing:
- Python 2.7 support ends in 2020. That means all updates, including security updates. For all intents and purposes, Python 2.7 becomes an insecure language to use at that point in time.
There is no nice way to put it: this is security scaremongering.
It is security scaremongering for three good reasons. First, by 2020 Python 2.7 is very likely to be an extremely stable piece of code that has already been picked over heavily for security issues. Even today Python 2.7 security issues are fairly rare, and we still have four more years for people to apply steadily improving analysis and fuzzing tools to Python 2.7 to find anything left. As such, the practical odds that people will find any significant security issues in Python 2.7 after it stops being supported seem fairly low.
Second, it is not as if Python 2.7 will be unsupported in 2020. Oh, sure, the main Python team will not support it, but there are plenty of OS vendors (especially Linux vendors) that either do have or likely will have supported OS versions with officially supported Python 2.7 versions. These vendors themselves are going to fix any security issues found in 2.7. As 2020 approaches, it's very likely that you'll be using a vendor version of 2.7 and so be covered by their security teams. If you're building 2.7 yourself, well, you can copy their work.
(By the way, this means that a bunch of security teams have a good motive to fuzz and attack Python 2.7 now, while the Python core team will still fix any problems they find.)
Finally, a potentially significant amount of Python code is not even running in a security sensitive setting in the first place. If your Python code is processing trusted input in a trusted environment, any potential security issues in Python 2.7 are basically irrelevant. Not all Python code is running websites, to put it one way.
To imply that using Python 2.7 after support ends in 2020 will immediately endanger people is scaremongering. The reality is that it's extremely likely that Python 2.7 after 2020 will be just as secure and stable as it was before 2020, and it's very likely that any issues found after 2020 will be promptly fixed by OS vendors.
(A much more likely security issue with Python 2.7 even before 2020 is framework, library, and package authors abandoning all support for 2.7 versions of their code. If Django is no longer getting security fixes on 2.7, it doesn't really matter that the CPython interpreter itself is still secure.)
By the way, I'm entirely neglecting alternate Python implementations here. These have historically targeted Python 2, not Python 3, and their status of supporting Python 3 (only) is often what you could call 'uncertain'. It seems entirely possible that, say, PyPy might wind up supporting Python 2.7.x well after the main CPython team drops support for it, and of course PyPy would likely fix any security issues that were uncovered in their implementation.
Sidebar: Vendor support periods and Python 2.7
In already released Linux distributions, Ubuntu 16.04 LTS has just been released with Python 2.7.11; it will be supported for five years, until April 2021 or so. Red Hat Enterprise Linux 7 (and CentOS 7) has Python 2.7 and will be supported until midway through 2024 (cf).
(Which version of Python 2.7 RHEL 7 has is sort of up in the air. It is officially '2.7.5', but it has additional RHEL patches and RHEL does backport security fixes as needed and so on.)
In future releases, it seems pretty likely that Ubuntu will release 18.04 LTS in April 2018, it will come with a fully supported Python 2.7, and be supported for five years, through 2023. Red Hat will probably release a new version of RHEL before 2020, will likely include Python 2.7, and if so will be supporting it for ten years from the release, which will take practical 2.7 support well into the late 2020s.
Some notes on abusing the pexpect Python module
What you are theoretically supposed to use pexpect for is to have your program automatically interact with interactive programs. When they produce certain sorts of output, you recognize it and take action; when you see prompts, you can automatically answer them. Pexpect is often used this way to automate things that expect to be operated manually by a real person.
This is not what I'm using pexpect for. What I'm using it for is to start a program in what it thinks is an interactive environment, capture its output if all goes well, and if things go wrong allow a human operator to step in and interact with the program (all the while still capturing the output). This means that I'm ignoring almost all of pexpect's functionality and abusing parts of the rest in ways that it was probably not designed for.
Before I start, I need to throw in a disclaimer. There are multiple versions of pexpect out there; my impression is that development stalled for a while and then picked up recently. As I write this, the pexpect documentation talks about 4.0.1, but what I've used is no later than 3.1. Pexpect 4 may fix some of the issues I'm going to grumble about.
Supposing that my case is what you want to do, you start out by spawning a command:
    import pexpect
    child = pexpect.spawn(YOURCOMMAND, args=args, timeout=None)
It's important to set a timeout of
None as the starting timeout.
If you want to have a timeout at all, for example to detect that
the remote end has gone silent, you want to control it on a call
by call basis.
Now you want to collect output from the child command:
    res = []
    while not child.closed and child.isalive():
        try:
            r = child.read_nonblocking(size=16*1024, timeout=YOURTIMEOUT)
            res.append(r)
        except pexpect.EOF:
            # expected, just stop
            break
        except pexpect.TIMEOUT:
            # do whatever you want to recover
            return recover_child(child, res)
You might as well set
size to large here. Although the documentation
doesn't tell you this, it is just the maximum amount of data your
read can ever return; it doesn't block until that much data is
available. My principle is 'if the command generates a lot of output,
let's read it in big blocks'.
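For comparison, the same 'read in big blocks with a per-read timeout' pattern can be sketched with only the standard library. To be clear, this is not pexpect and not my actual code; `run_on_pty` is a made-up name, and I'm assuming a Unix system where the `pty` module works. As with pexpect, the block size is only the maximum that a single read can return.

```python
import errno
import os
import pty
import select
import subprocess


def run_on_pty(argv, timeout=5.0, blocksize=16 * 1024):
    """Run argv on a pseudo-terminal and collect its output.

    Returns (output_bytes, exit_code, timed_out).
    """
    master, slave = pty.openpty()
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave, stderr=slave)
    os.close(slave)  # the child has its own copies now
    chunks = []
    timed_out = False
    try:
        while True:
            # Wait up to 'timeout' seconds for the child to say something.
            ready, _, _ = select.select([master], [], [], timeout)
            if not ready:
                timed_out = True
                break
            try:
                # Returns whatever is available, up to blocksize bytes.
                data = os.read(master, blocksize)
            except OSError as e:
                # On Linux, the child closing the pty shows up as EIO.
                if e.errno == errno.EIO:
                    break
                raise
            if not data:  # some systems report EOF as an empty read
                break
            chunks.append(data)
    finally:
        os.close(master)
    if timed_out:
        proc.kill()
    return b"".join(chunks), proc.wait(), timed_out
```

Because the child is writing to a terminal, line endings in the captured output will normally be `\r\n` instead of `\n`.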
We're not done once pexpect has raised an EOF. We need to do some cleanup to make sure that the child's exit status is available:
    # Some of this is probably superstition
    if not child.closed and child.isalive():
        child.wait()
    return (res, child.status)
Pexpect 3.1's documentation is not entirely clear on what you have
to check when in order to see if the child is alive or not. Note that
.isalive() has the (useful) side effect of harvesting the
child's exit status if the child is not alive. It's helpfully not
valid to call
.wait() on a dead child, at least in 3.1, so you
have to check carefully first.
As pexpect documents, it splits the actual OS process exit status into
child.exitstatus and child.signalstatus (and various things
return one or the other). The whole status is available as
child.status, but you may find one or the other variant more
useful (for example if you're really only interested in 'did the
command exit with status 0 or did something go boom').
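If you only have the whole status, you can pull it apart yourself with the standard `os` functions, which is my understanding of roughly what pexpect does internally. This is a minimal sketch for Unix; the helper names `describe_status` and `went_boom` are made up.

```python
import os


def describe_status(status):
    """Split a raw wait()-style status into (kind, value)."""
    if os.WIFEXITED(status):
        # Normal exit; value is the exit code.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # Killed by a signal; value is the signal number.
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


def went_boom(status):
    """True unless the command exited normally with status 0."""
    return describe_status(status) != ("exited", 0)
```

On Unix, `os.system()` returns the same sort of raw status, which makes it a convenient way to experiment with this outside of pexpect.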
Allowing the user to interact with the child is somewhat more
involved. Fundamentally we call child.interact(),
but there are a bunch of things that you need to do around this.
    def talkto(child):
        # Set up to log interactive output
        res = []
        def save_output(data):
            if data:
                res.append(data)
            return data

        while not child.closed and child.isalive():
            try:
                child.interact(output_filter=save_output)
            except OSError as e:
                # Usually an EOF from the command.
                # Complain somehow.
                break
            # If the child is alive here, the user has
            # typed a ^] to escape from interact().
            # What happens next is up to you.
Yes, you read that right. Uniquely, pexpect's interact()
does not raise pexpect.EOF on EOF from the child; instead it
generally passes through an underlying OSError that it got (my
notes don't say what that OSError usually is). In general, if you
get an OSError here you have to assume that the session is dead,
although pexpect doesn't necessarily know it yet.
child.interact() sets things up so that control characters
and so on that the user types are normally passed through directly
to the child process instead of affecting your Python program. This
means that under normal circumstances, if you type eg ^C your Python
code won't get hit with a
SIGINT; it'll go through to the child
program and the child program will do whatever it does in reaction.
What you do if the user chooses to use ^] to exit from interact()
is up to you. Note that you can allow them to resume the interaction;
just go back through your loop to call interact() again.
If you allow the user to abandon the child and exit your
function (you probably want to), you need to do some more cleanup
of the child:
    # after interact() returns, try to
    # read anything left over, then close the child.
    try:
        r = child.read_nonblocking(size=128*1024, timeout=0)
        res.append(r)
    except (pexpect.EOF, pexpect.TIMEOUT, OSError):
        pass
    child.close(force=True)
A read_nonblocking() with timeout=0 means what you think
it does; it's a non-blocking read of whatever (final) data is
available right now, with no waiting for anything more to come in
from the child.
At least in pexpect 3.1, you basically should call child.close() with
force=True or you will get a pexpect error if the child stays
alive, which it may. Setting force winds up hitting the child with
SIGKILL if nothing else seems to work, which is relatively
(Although the documentation doesn't mention it, if the child is
alive it always gets sent
SIGHUP and then
SIGINT first. Well,
this happens in older versions of pexpect; the 4.0.1 code is a bit
different and I haven't dug through it.)
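The general shape of this escalation can be sketched with the standard library's `subprocess` module. This is an illustration of the idea as I've described it, not pexpect's actual code; the function name `close_forcefully` and the `delay` between signals are assumptions of mine.

```python
import signal
import subprocess


def close_forcefully(proc, delay=0.1):
    """Try SIGHUP, then SIGINT, then SIGKILL until proc has exited.

    proc is a subprocess.Popen object. Returns its return code,
    which is negative if the process was killed by a signal.
    """
    for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGKILL):
        if proc.poll() is not None:
            break  # already dead, nothing more to do
        proc.send_signal(sig)
        try:
            proc.wait(timeout=delay)
        except subprocess.TimeoutExpired:
            continue  # still alive, escalate to the next signal
    # SIGKILL can't be caught or ignored, so this will return.
    return proc.wait()
```

A child that ignores SIGHUP and SIGINT (for instance because it inherited ignored signals from its parent) will only go away at the SIGKILL step, which is why the forceful option matters.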
Possibly there is a better Python module for this sort of interaction in general. If so, it is too late for me; I've already written all of this code and I hope to not have to touch it again before we have to port it to Python 3 (if ever).
(My impression is that you should try to use pexpect 4 if you can, as the code has been overhauled and the documentation at least somewhat improved.)
Some basic data on the hit rate of the Spamhaus DBL here
After my previous exploration of the Spamhaus DBL, I wound up adding it as another DNS blocklist in our overall spam filtering setup. Because we don't have a mandate for it, none of our DNS blocklists apply to all email, only to email for people who have opted in to some amount of server side spam filtering. Because the DBL applies on a per-recipient basis, the comparison I'm going to use here is against the overall recipient count (not the overall message count). I'm also going to use the past nine days, so I can sort of compare this to my estimated hit rate.
So, over the past nine days, we have had:
- 106,837 accepted MAIL FROMs and 106,835 accepted RCPT TOs, which means that almost all of our accepted messages have been delivered to a single destination address.
- 29,194 accepted RCPT TOs for IPs listed in one of the Spamhaus DNSBLs. Since these were accepted, these are recipients who have not opted into any amount of our server-side spam filtering.
- 7,685 accepted RCPT TOs for domains listed in the DBL. A quick check suggests that about 6,390 of these came from IP addresses that were in the Spamhaus DNSBLs.
RCPT TOs that were rejected because the sender IP was in one of the Spamhaus DNSBLs. This is checked before the DBL.
- Only 346 RCPT TOs that were rejected because the sender domain was in the DBL.
On the one hand, this doesn't look too great for the DBL; despite my initial estimate, we aren't getting many rejections from checking the DBL. On the other hand, when I look at the source addresses of those rejections, something jumps out right away: just over half of them come from one system.
Specifically, over half of them come from the mail server for another (sub)domain on campus, one where a number of our users have accounts and forward (all of) their email from that system to us. What we've effectively done with the DBL is to add an additional SMTP-time defense to reject forwarded spam. In fact there are a number of 'forwarded from another campus mail system' DBL rejections in the past nine days from other sources.
My personal view is that these rejections are valuable ones (partly because I've observed our commercial anti-spam system not doing so well with forwarded spam in the past). So on the whole I'm happy with what the DBL is doing here, and also happy that now I have better numbers on what it could be doing if more people opted in to server-side spam filtering.
(Despite my bright words here, I'm also disappointed that adding
the DBL isn't rejecting more messages. I guess this is partly down
to how a lot of spam with DBL domains comes from IPs that are already
blocked on their own. Note that we're using the DBL in its most
basic and limited mode, where we check it against the
domain; you're really supposed to use it to check domains mentioned
in the body of email messages.)
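The overlap between the DBL and the IP-based DNSBLs can be worked out from the accepted-recipient figures above. This is just arithmetic on the numbers already given (remember that the 6,390 is itself only a quick estimate); the variable names are mine.

```python
# Figures from the nine-day counts above.
accepted_rcpts = 106_835          # all accepted RCPT TOs
accepted_dbl = 7_685              # accepted, sender domain in the DBL
accepted_dbl_and_ip = 6_390       # ...of which the IP was also listed

# Accepted DBL hits that only the DBL would have caught.
dbl_only = accepted_dbl - accepted_dbl_and_ip

overlap_share = accepted_dbl_and_ip / accepted_dbl
dbl_only_share = dbl_only / accepted_rcpts

print("DBL-only accepted recipients:", dbl_only)
print("share of DBL hits already covered by IP DNSBLs: %.0f%%"
      % (overlap_share * 100))
print("DBL-only share of all accepted recipients: %.1f%%"
      % (dbl_only_share * 100))
```

So among accepted recipients, a bit over four fifths of the DBL hits came from IPs that the Spamhaus IP DNSBLs already list. That is consistent with the DBL adding relatively few rejections of its own once the IP check has run first.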