2015-05-30
What I'm doing in reaction to Logjam (for HTTPS, SSH, and IKE)
Logjam covers two weaknesses in Diffie-Hellman key exchanges: the ability to downgrade TLS sessions to use extremely weak DH key exchanges, and the potential ability to break DH key exchanges using common, known primes of 1024 bits and below. Logjam affects at least TLS, SSH, and IPSec's IKE protocol, all of which I use. Since Logjam was announced I've been working on figuring out what I can and should do in reaction to it, which in part involved looking at my situation and software, and I think I've come to answers now.
For TLS on my personal site I used the Logjam sysadmin recommendations to generate my own 2048-bit prime for DH key exchange. I haven't put their lighttpd cipher suite suggestion into place because I don't fully trust it to be at least as good as my current set and also every time I touch lighttpd's cipher selection it's a pain in the rear. Sometime I will switch to Apache and then I'll adopt whatever Mozilla's current SSL configuration recommendations are.
(My server already scores decently high on the Qualys SSL server test and doesn't have any export ciphers allowed.)
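For reference, the weakdh.org sysadmin guide's recipe boils down to 'openssl dhparam -out dhparams.pem 2048' and then pointing your web server at the resulting file (for lighttpd, the ssl.dh-file directive). As an illustration only, and not what I actually ran, here's a sketch of generating an equivalent PEM file in Python with the third-party cryptography package; the output file name is just an example:
#!/usr/bin/env python3
# Illustrative sketch: generate a unique 2048-bit DH group and write it out
# as a PEM file that a web server's DH parameter setting can point at.
# Requires the third-party 'cryptography' package; the file name is an example.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import dh

# Generating a 2048-bit safe prime is deliberately slow; expect a wait.
params = dh.generate_parameters(generator=2, key_size=2048)
pem = params.parameter_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.ParameterFormat.PKCS3,
)
with open("dhparams.pem", "wb") as fp:
    fp.write(pem)
print("wrote dhparams.pem")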
For SSH, things get more complicated. SSH is not subject to downgrade
attacks and OpenSSH already puts the vulnerable DH key exchange
algorithm last in the preference list, so the only connections
vulnerable to this issue are ones where either the client or the
server doesn't support anything else. Unfortunately we sort of have
some of those; Illumos aka Solaris SSH
is so old that it only supports the vulnerable algorithm and
'diffie-hellman-group-exchange-sha1', which uses custom moduli of
some size (hinted by the client, see RFC 4419). If I am reading 'ssh -vv'
debug output correctly, modern OpenSSH clients ask for and get DH
primes that are larger than 1024 bits even here, so we're safe. If
we're not safe, there's probably nothing I can do about it.
(The theoretically 'custom' moduli of d-h-g-e-sha1 may be relatively
standardized by mechanisms such as OpenSSH shipping a precreated
/etc/ssh/moduli, but if so I'm still safe since connections seem
to be using more than 1024 bits.)
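Since I was doing this by eyeballing 'ssh -vv' output, here's the sort of throwaway helper I could use for it; it's just a sketch that runs ssh with verbose debugging against each host and filters out the key exchange related debug lines so you can see what got negotiated. It deliberately doesn't try to parse the messages, since their exact wording varies between OpenSSH versions:
#!/usr/bin/env python3
# Sketch of a helper for checking what key exchange an SSH server ends up
# using: run 'ssh -vv' against each host given on the command line and print
# only the debug lines that mention the key exchange. The filtering is
# deliberately dumb because the debug messages vary between OpenSSH versions.
import subprocess
import sys

def kex_debug_lines(host):
    # BatchMode stops ssh from prompting for a password; key exchange happens
    # before authentication, so a failed login still shows what we want.
    proc = subprocess.run(
        ["ssh", "-vv", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
         host, "true"],
        capture_output=True, text=True)
    # All of ssh's debugging output goes to stderr.
    return [ln for ln in proc.stderr.splitlines() if "kex" in ln.lower()]

for host in sys.argv[1:]:
    print("==", host)
    for line in kex_debug_lines(host):
        print("  ", line)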
So at the moment I don't intend to make any SSH configuration changes. I'm satisfied that any client connecting to our servers that can do better will be doing better, and I don't know if we have any remaining clients that literally can't manage anything better. My client connections will do better to any server where it's possible and I don't want to lose access to anything around here with a really terrible SSH server (most likely some embedded lights out management system from the dark ages).
The situation for IKE turns out to be similar. IKE is apparently not vulnerable to downgrade attacks in normal operation (at least in my setup) and Libreswan, the IKE software I'm using, defaults to using strong primes when talking to itself (per their FAQ answer, also). In this it turns out to be a good thing that I specifically insist on my configuration using IKEv2, which has stronger defaults. Since my only (current) use of IKE IPSec is for two strongly configured hosts to talk to each other, I don't need to do anything in order to keep being secure.
(We have a L2TP VPN server for our users. I suspect that it's already choosing strong defaults when it can, ie when clients support it, but we may need to do some configuration tweaks to it at some point. Disabling any current IKE modp1024 support is probably not viable because it runs the risk of cutting off clients in order to deal with a relatively low risk.)
2015-05-10
Our mail submission system winds up handling two sorts of senders
Yesterday I mentioned that while in theory our mail submission
system could use sender verification to check whether a MAIL FROM
address at an outside domain was valid, I didn't feel this was
worth it. One of the reasons I feel this way is that I don't think
this check would fail very often for most outside domains, and to
explain why I need to talk about how we have two sorts of senders:
real people and machines.
Real people are, well, real people with a MUA who are sending email
out through us. My view is that while real people may send out email
using outside domains in their From: address, it's extremely
likely that this address will be correct; if it's not correct, the
person is probably going to either notice it or get told by people
they are trying to talk to through some out of band mechanism.
Unless you're very oblivious and closed off, you're just not going
to spend very long with your MUA misconfigured this way. On top of
it, real people have to explicitly configure their address in their
MUA, which means there is a whole class of problems that get avoided.
Machines are servers and desktops and everything we have sitting
around on our network that might want to send status email, report
in to its administrator, spew out error reports to warn people of
stuff, and so on. Email from these machines is essentially
unidirectional (it goes out from the machine but not back), may not
be particularly frequent, and is often more or less automatically
configured. All of this makes it very easy for machines to wind up
with bad or bogus MAIL FROMs. Often you have to go out of your
way during machine setup in order to not get this result.
(For instance, many machines will take their default domain for
MAIL FROMs from DNS PTR results, which malfunctions in the presence
of internal private zones.)
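As a concrete sketch of that PTR-derived failure mode (illustrative code only, not anything any particular machine actually runs):
#!/usr/bin/env python3
# Sketch of the characteristic mistake: a machine deriving its default mail
# origin domain from its own reverse DNS (PTR) name. If the PTR record lives
# in an internal private zone, the resulting MAIL FROM is meaningless (and
# unverifiable) anywhere outside that zone.
import socket

def naive_origin_address(user="root"):
    # Find the IP address this host would use to talk to the outside world.
    # Connecting a UDP socket sends no traffic; it just picks a local address.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("192.0.2.1", 53))
        my_ip = s.getsockname()[0]
    # Use whatever the PTR lookup says our name is, private zone or not.
    try:
        ptr_name = socket.gethostbyaddr(my_ip)[0]
    except OSError:
        ptr_name = socket.getfqdn()
    return "%s@%s" % (user, ptr_name)

# On a machine whose PTR is in a private internal zone this prints something
# like 'root@somehost.sandbox.internal', which nothing outside can verify.
print(naive_origin_address())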
Most broken machine origin addresses are easily recognized, because they involve certain characteristic mistakes (eg using DNS PTR results as your origin domain). Many of these addresses cannot be definitively failed with sender verification because, for example, the machine doesn't even run an SMTP listener that you can talk to.
You can mostly use sender verification for addresses from real people, but even ignoring the other issues there's little point because they'll almost never fail. Real people will almost always be using sender addresses from outside domains, not from internal hostnames.
2015-05-09
What addresses we accept and reject during mail submission
Like many places, our mail setup
includes a dedicated mail submission machine (or two). I mentioned yesterday
that this submission machine refuses some MAIL FROM addresses,
so today I want to talk about what we accept and refuse during mail
submission and why.
When we were designing our mail submission configuration many years
ago, our starting point was that we didn't expect clients to deal
very well with the submission server giving them a failure response.
What you'd like is for the MUA to notice the error, report it, give
you a chance to re-edit the email addresses involved, and so on and
so forth. What we actually expected to happen was some
combination of lost email, partially delivered email (if some RCPT
TOs failed but others succeeded), and awkward interfaces for dealing
with failed email sending. So a big guiding decision was that our
mail submission machine should accept the email if at all possible,
even if we knew that it would partially or completely fail delivery.
It was better to accept the email and send a bounce rather than
count on all of the MUAs that our users use to get it right.
(Some but not all RCPT TO addresses failing during SMTP is a
somewhat challenging problem for any MUA to deal with. How do you
present this to the user, and what do you want to do when the user
corrects the addresses? For example, if the user corrects the
addresses and resends, should it be resent to all addresses or just
the corrected ones? There's all sorts of UI issues involved.)
Given that our recovery method for bad destination addresses is
sending a bounce, we need to have what at least looks like a valid
MAIL FROM to send the bounce back to; if we don't we can't send
bounces, so we're better off rejecting during SMTP and hoping that
the MUA will do something sensible. For email addresses in outside
domains, the practical best we can do is verify that the domain
exists. For email addresses in our own domain, we can check that
the local part is valid (using our list of valid local parts), so we do.
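Our real checks live in our Exim configuration, but as a rough sketch of what these two checks amount to (the domain and the list of local parts here are made-up examples, and the DNS lookups use the third-party dnspython package):
#!/usr/bin/env python3
# Sketch of the two MAIL FROM checks described above; our real versions live
# in our Exim configuration. The domain and the local part list are made-up
# examples, and the DNS lookups use the third-party dnspython (2.x) package.
import dns.exception
import dns.resolver

OUR_DOMAIN = "ourdomain.example.com"        # a made-up stand-in for our domain
VALID_LOCAL_PARTS = {"fred", "postmaster"}  # really generated from alias data

def domain_exists(domain):
    # The practical best we can do for an outside domain: does it have MX
    # records or at least an address record?
    for rtype in ("MX", "A", "AAAA"):
        try:
            dns.resolver.resolve(domain, rtype)
            return True
        except dns.resolver.NXDOMAIN:
            return False        # the name doesn't exist at all
        except (dns.resolver.NoAnswer, dns.resolver.NoNameservers,
                dns.exception.Timeout):
            continue            # no usable answer for this type; try the next
    return False

def acceptable_mail_from(address):
    if address == "":           # the null sender (used by bounces) is fine
        return True
    local, _, domain = address.rpartition("@")
    if domain.lower() == OUR_DOMAIN:
        # Our own domain: we know the full list of valid local parts.
        return local.lower() in VALID_LOCAL_PARTS
    # An outside domain: the best we can do is check that the domain exists.
    return domain_exists(domain)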
(We also do some basic safety checks for certain sorts of bad
characters and bad character sequences in MAIL FROM and RCPT TO
addresses. These probably go beyond what the RFCs require and may
not be doing anything useful these days; we basically inherited
them from the stock Ubuntu configuration of close to a decade ago.)
We allow people to use MAIL FROM addresses that are not in our
domain in part because some people in the department have a real
need to do this as part of their work. In general we log enough
source information that if anyone abuses this we can find them and
deal with it.
(You might say 'but what about spammers compromising accounts and sending spam through you with forged origin addresses?' My answer is that that's a feature.)
PS: In theory checking outside domain MAIL FROM addresses is one
place where sender verification has a real justification, and you
can even legitimately use the null sender address for it. In practice there are all sorts of
failure modes that seem likely to cause heartburn and it's just not
worth it in my opinion.
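For completeness, here's roughly what such a null-sender callout involves at the SMTP level. This is a sketch and not something we run; the HELO name is a made-up example, and it cheerfully ignores all of the failure modes (greylisting, servers that dislike callouts, slow timeouts, and so on) that make it not worth it:
#!/usr/bin/env python3
# Sketch of a sender verification callout using the null sender: ask the
# address's most preferred MX whether it would accept mail for the address,
# without ever sending a message. Not something we actually do; the HELO
# name is a made-up example. Requires dnspython for the MX lookup.
import smtplib
import dns.resolver

def sender_verify(address, helo_name="submit.example.com", timeout=30):
    domain = address.rpartition("@")[2]
    answers = dns.resolver.resolve(domain, "MX")
    # The lowest preference value is the most preferred MX.
    mx_host = str(min(answers, key=lambda r: r.preference).exchange).rstrip(".")
    with smtplib.SMTP(mx_host, 25, local_hostname=helo_name,
                      timeout=timeout) as smtp:
        smtp.ehlo()
        code, _ = smtp.mail("<>")       # MAIL FROM:<>, the null sender
        if code // 100 != 2:
            return False
        code, _ = smtp.rcpt(address)    # does it accept RCPT TO:<address>?
        smtp.rset()
        return code // 100 == 2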
2015-05-08
Sometimes it's useful to have brute force handy: an amusing IPMI bug
Once upon a time we had gotten in some new servers. These servers
had an IPMI and the IPMI could be configured to send out email
alerts if something happened, like a fan stopping or a power supply
losing power. Getting such alerts (where possible) seemed like a
good idea, so I dutifully configured this in the IPMI's web interface.
Sensibly, the IPMI needed me to set the origin address for the
email, so I set it to sm-ipmi@<us> (and then made sure there was
an sm-ipmi alias, so our mail submission machine would accept the
email).
Of course, configurations are never quite done until they're tested.
So I poked the IPMI to send me some test email. No email arrived.
When I went off to our mail submission machine to look at its logs,
I got rather a surprise; the logs said the machine had dutifully
rejected a message that claimed a MAIL FROM address of
=sm-ipmi@<us>.
While the insides of an IPMI's embedded software are inscrutable
(at least to lazy sysadmins who are not security researchers), this
smells like some form of classic data storage mismatch bug. The web
interface thinks the email address should be stored with an '=' in
front, maybe as an 'X=Y' thing, whereas whatever is actually using
the address either has an off-by-one parsing bug or doesn't
expect the extra leading '=' that the web interface adds when it
stores the address.
There are probably a bunch of ways we could have dealt with this.
As it happens our mail system is flexible enough to let us do the
brute force approach: we just defined an alias called '=sm-ipmi'.
Our mail system is willing to accept an '=' in local parts, even
at the start, so that's all it took to make everything happy. It
looks a little bit peculiar in the actual email messages, but that's
just a detail.
A more picky email system would have given us more heartburn here.
In a way we got quite lucky that none of the many levels of checks
and guards we have choked on this. Our alias generation system was
willing to see '=' as a valid character, even at the start; the
basic syntax checks we do on MAIL FROM didn't block a = at the
start; Exim itself accepts such a MAIL FROM local part and can
successfully match it against things. I've used mail systems in the
past that were much more strict about this sort of stuff and they'd
almost certainly have rejected such an address out of hand or at
least given us a lot of trouble over it.
(I don't even know if such an address is RFC compliant.)
The whole situation amuses me. The IPMI has a crazy, silly bug that should never have slipped through development and testing, and we're dealing with it by basically ignoring it. We can do that because our mail system is itself willing to accept a rather crazy local part as actually existing and being valid, which is kind of impressive considering how many different moving parts are involved.
PS: I call this the brute force solution because 'make an alias with a funny character in it' is more brute force than, say, 'figure out how to use sender address rewriting to strip the leading = that the IPMI is erroneously putting in there'.
PPS: Of course, some day maybe we'll update the IPMI firmware and
suddenly find the notification mail not going through because the
IPMI developers noticed the bug and fixed it. I suppose I should
add the 'sm-ipmi' alias back in, just in case.
2015-05-04
Monitoring tools should report timestamps (and what they're monitoring)
This is a lesson learned, not quite the hard way but close to it. A fairly long time ago now, I wrote some simple tools to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this:
40.33 MB/s RX 56.54 MB/s TX packets/sec: 50331 RX 64482 TX
I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me.
Recently we've been looking at an NFS fileserver lockup problem, and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) a particular set of output was from, well, I wound up having to rely on window titles.
So now I have a version of these tools with a somewhat different output format:
e1000g1 23:10:08 14.11 MB/s RX 77.40 MB/s TX packets/sec: 37791 RX 66359 TX
Now this output is more or less self-identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines).
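As an illustration of how little the new format takes, here's a minimal from-scratch Linux-only sketch (not my actual tool) that reads /proc/net/dev once a second and prints output in roughly this style:
#!/usr/bin/env python3
# A from-scratch, Linux-only sketch of a bandwidth monitor with the improved
# output format: interface name and timestamp first, then MB/s and packets/sec
# in both directions. Not my actual tool, just an illustration of the format.
import sys
import time

def read_counters(iface):
    # Return (rx_bytes, rx_packets, tx_bytes, tx_packets) from /proc/net/dev.
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                return (int(fields[0]), int(fields[1]),
                        int(fields[8]), int(fields[9]))
    raise SystemExit("no such interface: " + iface)

def main(iface, interval=1.0):
    prev = read_counters(iface)
    while True:
        time.sleep(interval)
        cur = read_counters(iface)
        rxb, rxp, txb, txp = (c - p for c, p in zip(cur, prev))
        prev = cur
        now = time.strftime("%H:%M:%S")
        mb = 1024 * 1024
        print(f"{iface} {now} "
              f"{rxb / interval / mb:6.2f} MB/s RX "
              f"{txb / interval / mb:6.2f} MB/s TX "
              f"packets/sec: {rxp / interval:.0f} RX {txp / interval:.0f} TX",
              flush=True)

if __name__ == "__main__":
    # The interface name defaults to eth0 purely for the sake of the example.
    main(sys.argv[1] if len(sys.argv) > 1 else "eth0")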
I was lucky here in that adding a timestamp plus typical interface
names still keeps output lines under 80 characters. But even in cases
where adding this information would widen the output lines, well,
I can widen my xterm windows and it's better to have this information
than to have to reconstruct it afterwards. So in the future I think
all of my monitoring tools are at least going to have an option to
add a timestamp and similar information, and they might print it all
the time if it fits (as it does here).
PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry.
PPS: This idea is not a miracle invention of mine by any means. In
fact I shamelessly copied it from how useful the timestamps printed
out by tools like arcstat are. When I noticed how much I was using
those timestamps and how nice it was to be able to scroll back,
spot something odd, and say 'ah, this happened at ...' right away,
I smacked myself in the forehead and did it for all of the monitoring
commands I was using. Fortunately many OmniOS commands like vmstat
already have an option to add timestamps, although it's sometimes
kind of low-rent (eg vmstat prints the timestamp on a separate
line, which doubles how many lines of output it produces and thus
halves the effective size of my scrollback buffer).
2015-05-03
Sometimes knowing causes does you no good (and sensible uses of time)
Yesterday, I covered our OmniOS fileserver problem with overload and mentioned that the core problem seems to be (kernel) memory exhaustion. Of course once we'd identified this I immediately started coming up with lots of theories about what might be eating up all the memory (and then not giving it back), along with potential ways to test these theories. This is what sysadmins do when we're confronted with problems, after all; we try to understand them. And it can be peculiarly fun and satisfying to run down the root cause of something.
(For example, one theory is 'NFS TCP socket receive buffers', which would explain why it seems to need a bunch of clients all active.)
Then I asked myself an uncomfortable question: was this going to actually help us? Specifically, was it particularly likely to get us any closer to having OmniOS NFS fileservers that did not lock up under surges of too-high load? The more I thought about that, the more gloomy I felt, because the cold hard answer is that knowing the root cause here is unlikely to do us any good.
Some issues are ultimately due to simple and easily fixed bugs, or turn out to have simple configuration changes that avoid them. It seems unlikely that either is the case here; instead it seems much more likely to be a misdesigned or badly designed part of the Illumos NFS server code. Fixing bad designs is never a simple code change and they can rarely be avoided with configuration changes. Any fix is likely to be slow to appear and require significant work on someone's part.
This leads to the really uncomfortable realization that it is probably not worth spelunking this issue to explore and test any of these theories. Sure, it'd be nice to know the answer, but knowing the answer is not likely to get us much closer to a fix to a long-standing and deep issue. And what we need is that fix, not to know what the cause is, because ultimately we need fileservers that don't lock up every so often if things go a little bit wrong (because things go a little bit wrong on a regular basis).
This doesn't make me happy, because I like diagnosing problems and finding root causes (however much I gripe about it sometimes); it's neat and gives me a feeling of real accomplishment. But my job is not about feelings of accomplishment, it's about giving our users reliable fileservice, and it behooves me to spend my finite time on things that are most likely to result in that. Right now that does not appear to involve diving into OmniOS kernel internals or coming up with clever ways to test theories.
(If we had a lot of money to throw at people, perhaps the solution would be 'root cause the problem then pay Illumos people to do the kernel development needed to fix it'. But we don't have anywhere near that kind of money.)