Wandering Thoughts archives

2015-07-05

Sysadmin use of email is often necessarily more or less interrupt driven

One of the things that people commonly do with virtual screens is to put their email client off in a separate virtual screen so they can ignore it and avoid having it interrupt them. As I mentioned when I wrote up my virtual screen usage, I don't try to cordon off email this way. Fundamentally this is because as a sysadmin, I feel my use of email is necessarily interrupt driven.

Allowing email to interrupt me certainly can derail my chain of thought when I'm coding or working on a hard problem. But at the same time it's absolutely necessary, because that email may carry news of an emergency or a high priority issue that I need to handle more or less right away. I almost never have the option of ignoring even the possibility of such things, so almost all of the time I have to allow email to interrupt me. The best I can do is contrive low distraction email monitoring so that when I'm in a flow state it distracts me as little as possible.

So I can stop there, right? No, not so fast. What this really argues is that email is a bad way of receiving high priority information like alerts. Because it mixes high priority information with much less important messages, I have to allow even unimportant things to interrupt me at least a bit just so I can figure out whether or not I can ignore them. If alerts and priority items came in through another channel, I could readily ignore email during high focus times.

(There are always going to be days where all I do is fiddle around with stuff and swat things as they come up; on those days, I'd read and handle everything right away.)
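
Given that, the best I can do short of a separate channel is to at least split the streams inside email itself. As a sketch, assuming your alert generators tag their Subject: lines with something like '[ALERT]' (a convention I'm inventing here, not something we actually have), a procmail recipe can divert those messages into their own mailbox for a separate, prominent monitor to watch:

  # Divert tagged alerts into their own 'alerts' mailbox;
  # everything else falls through to the normal inbox.
  :0:
  * ^Subject:.*\[ALERT\]
  alerts

This doesn't fix the fundamental mixing problem, but it does mean that the thing that's allowed to interrupt you carries only the high priority stream.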

Of course the problem is that there is no good other channel today, at least for us. Oh, with work you can build such a thing (possibly outsourcing parts of it to companies who specialize in it), but there's very little in the way of a canned, out of the box solution. Plus there's the problem of getting people to use your new 'urgent things' channel when they have an urgent thing and of course not using it when they don't (with the associated issue of people knowing whether or not their issue is urgent).

(Life is likely somewhat easier if you can assume that everyone has a smartphone, perhaps by issuing them one, but that is not something that's true in our environment.)

InterruptDrivenEmail written at 02:51:58

2015-06-09

How I use virtual screens in my desktop environment

Like many people, I use a (Unix) desktop environment that supports what gets called 'virtual screens' or 'virtual desktops' (my window manager actually has both). One common approach for using them is to dedicate particular virtual screens to particular programs or groups of programs based on your purpose (this goes especially well with full screen apps). You might have one virtual screen more or less devoted to your mail client, one to your editor or IDE, one to status monitoring of your systems, and so on.

(If you have multiple displays or a big enough display that filling all of it with eg your mail client is absurd, you might reserve the remaining space on your mail desktop as 'overflow' space for browser windows or whatever that you need in the course of dealing with your mail.)

This is not how I've wound up using my virtual screens, at least most of the time. Instead (as you can tell from the nearly empty additional virtual screens on my desktop) I use them primarily as overflow space. Almost all of the time I'm using what I consider my primary virtual screen (the top left one) and everything goes in it. However, sometimes I'm trying to do too many space-consuming things at once, or I just want a cleaner place to do something (often something big), one without all of the usual clutter on my primary virtual screen. That's when I use my additional virtual screens; I switch to a new one, possibly drag some existing windows over from the primary screen, and go for it.

(I'm especially likely to do this if what I want to do is both going to last for a while and take up a bunch of screen space with its window or windows.)

My virtual screens are arranged in a 3 wide by 2 deep grid. Since the easiest screens to reach are the ones right next to the top left primary screen, the screens immediately to its right and immediately below it are the usual overflow or temporary work targets. However, space-consuming long-lived stuff tends to get put one screen further away (the screen diagonally down and over from the primary), because that way I keep the more convenient closer screens free for other stuff.

(When we were chasing our OmniOS NFS overload problem, I wound up carpeting this virtual screen with xterms that were constantly running vmstat and so on. I wasn't paying any attention to them until the server locked up, but my use of an xterm feature meant that I couldn't just iconify them. Anyways, leaving them open made them easier to keep track of and tell apart, partly because I'm big on using spatial organization for things.)

I've found it quite handy to have a mouse binding that flips between the current virtual screen and the previous one. That way I can rapidly flip between my primary screen and whatever other virtual screen I'm doing things on. In practice this makes it a lot more convenient to use another virtual screen, because I wind up flipping back to the primary screen for a lot of stuff.
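
For what it's worth, in an fvwm-style window manager (mine is in that family) this kind of binding is a one-liner in the configuration file. The following is a sketch from memory rather than my exact setup:

  # Mouse button 3 on the root window (R), with any modifiers (A),
  # jumps back to the previously visited page (ie virtual screen).
  Mouse 3 R A GotoPage prev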

(I often flip back even for stuff that I could do on the new virtual screen just because I 'know' that eg I read mail on the primary screen. I justify this as an anti-distraction measure in that the non-primary screen should not be used for things unrelated to its purpose.)

I have a small number of things that are permanently present but that I don't interact with or look at regularly. These things get exiled off to the very furthest away virtual screen. Typical occupants of this screen are iconified 'master' Firefox and Chrome windows, used purely to keep the browsers running all the time so that I have fast access to new Firefox and Chrome windows.

Sidebar: Me and 'dedicated purpose' virtual screens

Although I've never tried to work in the 'dedicated purpose' style of virtual screen usage, I'm pretty sure that I would rapidly get angry at constantly flipping back and forth between eg the mail virtual screen and my working virtual screen. The reality of my computer usage is that I very rarely concentrate on a single thing for a significant time; it's much more common for me to be moving back and forth between any number of activities and fidgets.

If I were a developer I could see this changing, and in fact that would probably be a good thing. Then it would be an advantage that I had to go through the effort of changing to the mail screen to check my email, because I'd be that much less likely to interrupt my programming to do so.

MyVirtualScreenUsage written at 01:11:09

2015-06-06

The security danger of exploitable bugs in file format libraries

Lately there has been a raft of security bugs of the form 'the standard open source library for dealing with file format X can be made to do bad things if it opens a specially crafted file in that format'. Some of the time 'bad things' means 'run arbitrary code'. A bunch of these bugs have been found with the new fuzzer afl; see its 'bug-o-rama' trophy case.
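
If you haven't seen afl in action, part of why it's finding so much is that it's very easy to point at a file-parsing program. Given a directory of sample input files and a target that takes a file argument (the image parser here is hypothetical), a run looks something like:

  # afl mutates the seed files in testcases/, writing crashes
  # and hangs to findings/; @@ is replaced by each mutated file.
  afl-fuzz -i testcases -o findings ./parse_image @@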

At one level, that these bugs exist in many libraries for handling various file formats is not too surprising. A great many of them are old libraries, and such libraries have generally assumed that they are not being run in any security sensitive context; instead they assumed you were using them on your own files, and if you want to run arbitrary code as yourself, well, you already can. This can lead to these bugs not seeming too alarming.

There are two reasons to be worried about this today. First, in practice you get a lot of files from other people over the Internet; your browser downloads them (and often tries to display them), your mail client gets them in mail (and often tries to display them), and so on. However this is mostly a desktop risk and is relatively well understood (and many browsers and mail clients are using hardened libraries, although people keep finding new attack points).

Unfortunately there is another risk on Unix systems, and that is smart services that attempt to do content type detection and then content conversion for you. The dangerous poster child for this is the CUPS printer system, but there are probably others out there. In normal default setups, CUPS will try very hard to take random files that users hand it and turn them into something printable. This process involves both questionable content sniffing and, obviously, reading and interpreting all sorts of file formats. CUPS almost certainly uses standard libraries and programs for all of this, which means that exploitable vulnerabilities in these libraries can be used to break into the CUPS user on any system where CUPS is doing these conversions (and CUPS likes doing them on the print server).

(Another possible attack vector is email anti-spam and anti-virus systems. These almost certainly open .zip files using some library, and they may try to do things like peer inside PDFs and various '* Office' file formats to look for bad things.)

In general we've had a whole parade of troubles with any system that reads attacker-supplied input. We really should be viewing such things with deep suspicion and limiting their deployment, even if it's too late in the case of CUPS.

FormatLibCodeExecDanger written at 01:13:18

2015-06-01

The problem with 'what is your data worth?'

Every so often, people who are suggesting that you spend money on something will use the formulation of 'what is your data worth?' or 'your data should be worth ...' (often 'at least this much' follows in some variation). Let's ignore the FUD issues involved and talk only about another problem with this: it puts the cart before the horse by assuming that the data comes first and then the money arrives afterwards. Given data, you are called on to spend as much as required in order to deal with it however people think you're supposed to.

At least around here in the university, this is almost always exactly backwards. In reality the money comes first and we get however much data storage will fit into it. If not enough fits, people will compromise on attributes of the storage, such as the redundancy level, expensive features that are not absolutely essential, and even performance. In extreme cases, people take a deep breath and have less data. What they basically never do is find more money so they can have better storage.

(Sometimes this works in reverse when the costs shift in our favour. Then we wind up with lots of storage and can shift some of the money to better, less compromised features. This is how we went from RAID 5 to RAID 1 storage.)

One part of this is almost certainly that we basically have no ROI. As part of this, the storage we tend to be buying is vague and fuzzy storage without firm metrics for things like performance and durability attached to it. Sure, more performance would be nice, but broadly there's nothing that you can point to to say 'our vital website/database/etc is not running well enough, this must be better'.

(Nor can we establish such metrics out of the air in any meaningful way. Real SLAs must come from business needs because that is the only way that money will be spent in order to satisfy them.)

I suspect that this situation is not entirely unique to us and universities. Businesses undertake any number of 'would be nice to have' things, and also they ultimately have constraints on how much money they can spend on even important things.

PS: there are limits to this 'whatever we can afford is acceptable' attitude, of course, but they tend to be comparatively way out in left field. Fundamentally there is no magic pot of money that we can get if we just make big enough puppydog eyes, so getting significantly more money basically requires a situation where there is a clear and pressing problem that is obvious to everyone, whether that is space or performance or redundancy or whatever.

DataWorthIsBackwards written at 23:44:00

2015-05-30

What I'm doing in reaction to Logjam (for HTTPS, SSH, and IKE)

Logjam covers two weaknesses in Diffie-Hellman key exchanges: the ability to downgrade TLS sessions to use extremely weak DH key exchanges, and the potential ability to break DH key exchanges using common, known primes of 1024 bits and below. Logjam affects at least TLS, SSH, and IPSec's IKE protocol, all of which I use. Since Logjam was announced I've been working on figuring out what I can and should do in reaction to it, which in part involved looking at my situation and software, and I think I've come to answers now.

For TLS on my personal site I used the Logjam sysadmin recommendations to generate my own 2048-bit prime for DH key exchange. I haven't put their lighttpd cipher suite suggestion into place because I don't fully trust it to be at least as good as my current set and also every time I touch lighttpd's cipher selection it's a pain in the rear. Sometime I will switch to Apache and then I'll adopt whatever Mozilla's current SSL configuration recommendations are.

(My server already scores decently high on the Qualys SSL server test and doesn't have any export ciphers allowed.)
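
For the record, the mechanics of generating a unique DH group are simple; the file paths here are illustrative rather than necessarily my real ones:

  # generate a unique 2048-bit DH group:
  openssl dhparam -out /etc/lighttpd/dhparams.pem 2048

  # then, in lighttpd's configuration:
  ssl.dh-file = "/etc/lighttpd/dhparams.pem"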

For SSH, things get more complicated. SSH is not subject to downgrade attacks and OpenSSH already puts the vulnerable DH key exchange algorithm last in the preference list, so the only connections vulnerable to this issue are ones where either the client or the server doesn't support anything else. Unfortunately we sort of have some of those; Illumos aka Solaris SSH is so old that it only supports the vulnerable algorithm and 'diffie-hellman-group-exchange-sha1', which uses custom moduli of some size (hinted by the client, see RFC 4419). If I am reading 'ssh -vv' debug output correctly, modern OpenSSH clients ask for and get DH primes that are larger than 1024 bits even here, so we're safe. If we're not safe, there's probably nothing I can do about it.

(The theoretically 'custom' moduli of d-h-g-e-sha1 may be relatively standardized by mechanisms such as OpenSSH shipping a precreated /etc/ssh/moduli, but if so I'm still safe since connections seem to be using more than 1024 bits.)
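
For completeness: if we did someday want to lock the vulnerable algorithm out entirely, at the cost of cutting off anything that supports nothing better, it would just take a KexAlgorithms line in sshd_config (or ~/.ssh/config on the client side). This is my understanding of the Logjam-era suggestion, so treat it as a sketch:

  # Prefer group exchange with SHA-256, keep group14 as the
  # lowest common denominator, and drop group1 entirely.
  KexAlgorithms diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha1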

So at the moment I don't intend to make any SSH configuration changes. I'm satisfied that any client connecting to our servers that can do better will be doing better, and I don't know if we have any remaining clients that literally can't manage anything better. My client connections will do better to any server where it's possible and I don't want to lose access to anything around here with a really terrible SSH server (most likely some embedded lights out management system from the dark ages).

The situation for IKE turns out to be similar. IKE is apparently not vulnerable to downgrade attacks in normal operation (at least in my setup) and Libreswan, the IKE software I'm using, defaults to using strong primes when talking to itself (per their FAQ answer, also). In this it turns out to be a good thing that I specifically insist on my configuration using IKEv2, which has stronger defaults. Since my only (current) use of IKE IPSec is for two strongly configured hosts to talk to each other, I don't need to do anything in order to keep being secure.
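
To illustrate, pinning this down in Libreswan takes only a few lines in a connection definition in ipsec.conf. The fragment below is a sketch with made-up names, not my actual configuration:

  conn ourtunnel
      # refuse to fall back to IKEv1 at all
      ikev2=insist
      # insist on a 2048-bit DH group for both phases
      ike=aes256-sha2_256;modp2048
      esp=aes256-sha2_256;modp2048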

(We have a L2TP VPN server for our users. I suspect that it's already choosing strong defaults when it can, ie when clients support it, but we may need to do some configuration tweaks to it at some point. Disabling any current IKE modp1024 support is probably not viable because it runs the risk of cutting off clients in order to deal with a relatively low risk.)

LogjamMyReactions written at 01:28:14

2015-05-10

Our mail submission system winds up handling two sorts of senders

Yesterday I mentioned that while in theory our mail submission system could use sender verification to check whether a MAIL FROM address at an outside domain is valid, I didn't feel this was worth it. One of the reasons I feel this way is that I don't think this check would fail very often for most outside domains; to explain why, I need to talk about how we have two sorts of senders: real people and machines.

Real people are, well, real people with a MUA who are sending email out through us. My view is that while real people may send out email using outside domains in their From: address, it's extremely likely that this address will be correct; if it's not, the person is probably going to either notice it or get told by the people they're trying to talk to through some out of band mechanism. Unless you're very oblivious and closed off, you're just not going to spend very long with your MUA misconfigured this way. On top of that, real people have to explicitly configure their address in their MUA, which means that a whole class of problems gets avoided.

Machines are servers and desktops and everything we have sitting around on our network that might want to send status email, report in to its administrator, spew out error reports to warn people of stuff, and so on. Email from these machines is essentially unidirectional (it goes out from the machine but not back), may not be particularly frequent, and is often more or less automatically configured. All of this makes it very easy for machines to wind up with bad or bogus MAIL FROMs. Often you have to go out of your way during machine setup in order to not get this result.

(For instance, many machines will take their default domain for MAIL FROMs from DNS PTR results, which malfunctions in the presence of internal private zones.)

Most broken machine origin addresses are easily recognized, because they involve certain characteristic mistakes (eg using DNS PTR results as your origin domain). Many of these addresses cannot be definitively rejected with sender verification because, for example, the machine doesn't even run a SMTP listener that you can talk to.

You can mostly use sender verification for addresses from real people, but even ignoring the other issues there's little point because they'll almost never fail. Real people will almost always be using sender addresses from outside domains, not from internal hostnames.

MailSubmissionTwoSenders written at 01:45:03

2015-05-09

What addresses we accept and reject during mail submission

Like many places, our mail setup includes a dedicated mail submission machine (or two). I mentioned yesterday that this submission machine refuses some MAIL FROM addresses, so today I want to talk about what we accept and refuse during mail submission and why.

When we were designing our mail submission configuration many years ago, our starting point was that we didn't expect clients to deal very well if the submission server gave them a failure response. What you'd like is for the MUA to notice the error, report it, give you a chance to re-edit the email addresses involved, and so on and so forth. What we actually expected would happen would be some combination of lost email, partially delivered email (if some RCPT TOs failed but others succeeded), and awkward interfaces for dealing with failed email sending. So a big guiding decision was that our mail submission machine should accept the email if at all possible, even if we knew that it would partially or completely fail delivery. It was better to accept the email and send a bounce rather than count on all of the MUAs that our users use to get it right.

(Some but not all RCPT TO addresses failing during SMTP is a somewhat challenging problem for any MUA to deal with. How do you present this to the user, and what do you want to do when the user corrects the addresses? For example, if the user corrects the addresses and resends, should it be resent to all addresses or just the corrected ones? There's all sorts of UI issues involved.)

Given that our recovery method for bad destination addresses is sending a bounce, we need to have what at least looks like a valid MAIL FROM to send the bounce back to; if we don't we can't send bounces, so we're better off rejecting during SMTP and hoping that the MUA will do something sensible. For email addresses in outside domains, the practical best we can do is verify that the domain exists. For email addresses in our own domain, we can check that the local part is valid (using our list of valid local parts), so we do.
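
Conveniently, Exim (which we use) can express both of these checks with a single ACL condition, because verifying a sender address routes it: for our own domain that runs the local part through our routers and alias lists, while for an outside domain it effectively just checks that the domain resolves. Here is a minimal sketch of such a MAIL FROM ACL, in the spirit of what's described above rather than our literal configuration:

  # in the ACL named by acl_smtp_mail:
  deny    message = sender address could not be verified
          !verify = sender
  accept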

(We also do some basic safety checks for certain sorts of bad characters and bad character sequences in MAIL FROM and RCPT TO addresses. These probably go beyond what the RFCs require and may not be doing anything useful these days; we basically inherited them from the stock Ubuntu configuration of close to a decade ago.)

We allow people to use MAIL FROM addresses that are not in our domain in part because some people in the department have a real need to do this as part of their work. In general we log enough source information that if anyone abuses this we can find them and deal with it.

(You might say 'but what about spammers compromising accounts and sending spam through you with forged origin addresses?' My answer is that that's a feature.)

PS: In theory checking outside domain MAIL FROM addresses is one place where sender verification has a real justification, and you can even legitimately use the null sender address for it. In practice there are all sorts of failure modes that seem likely to cause heartburn and it's just not worth it in my opinion.

MailSubmissionAcceptReject written at 00:58:08

2015-05-08

Sometimes it's useful to have brute force handy: an amusing IPMI bug

Once upon a time we had gotten in some new servers. These servers had an IPMI and the IPMI could be configured to send out email alerts if something happened, like a fan stopping or a power supply losing power. Getting such alerts (where possible) seemed like a good idea, so I dutifully configured this in the IPMI's web interface. Sensibly, the IPMI needed me to set the origin address for the email, so I set it to sm-ipmi@<us> (and then made sure there was an sm-ipmi alias, so our mail submission machine would accept the email).

Of course, configurations are never quite done until they're tested. So I poked the IPMI to send me some test email. No email arrived. When I went off to our mail submission machine to look at its logs, I got rather a surprise; the logs said the machine had dutifully rejected a message that claimed a MAIL FROM address of =sm-ipmi@<us>.

While the insides of an IPMI's embedded software are inscrutable (at least to lazy sysadmins who are not security researchers), this smells like some form of classic data storage mismatch bug. The web interface thinks the email address should be stored with an '=' in front, maybe as an 'X=Y' thing, whereas whatever is actually using the address either has an off by one character parsing bug or doesn't want the extra leading = that the web interface is adding when it stores it.

There are probably a bunch of ways we could have dealt with this. As it happens our mail system is flexible enough to let us do the brute force approach: we just defined an alias called '=sm-ipmi'. Our mail system is willing to accept an '=' in local parts, even at the start, so that's all it took to make everything happy. It looks a little bit peculiar in the actual email messages, but that's just a detail.
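
In file form the brute force solution looks almost anticlimactic. Here it is sketched as /etc/aliases style entries with a made-up destination (not our real alias file):

  # the '=' is just another character in the local part
  =sm-ipmi:  sysadmins@example.org
  sm-ipmi:   sysadmins@example.org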

A more picky email system would have given us more heartburn here. In a way we got quite lucky that none of the many levels of checks and guards we have choked on this. Our alias generation system was willing to see '=' as a valid character, even at the start; the basic syntax checks we do on MAIL FROM didn't block a = at the start; Exim itself accepts such a MAIL FROM local part and can successfully match it against things. I've used mail systems in the past that were much more strict about this sort of stuff and they'd almost certainly have rejected such an address out of hand or at least given us a lot of trouble over it.

(I don't even know if such an address is RFC compliant.)

The whole situation amuses me. The IPMI has a crazy, silly bug that should never have slipped through development and testing, and we're dealing with it by basically ignoring it. We can do that because our mail system is itself willing to accept a rather crazy local part as actually existing and being valid, which is kind of impressive considering how many different moving parts are involved.

PS: I call this the brute force solution because 'make an alias with a funny character in it' is more brute force than, say, 'figure out how to use sender address rewriting to strip the leading = that the IPMI is erroneously putting in there'.

PPS: Of course, some day maybe we'll update the IPMI firmware and suddenly find the notification mail not going through because the IPMI developers noticed the bug and fixed it. I suppose I should add the 'sm-ipmi' alias back in, just in case.

IPMIEmailBug written at 02:18:46

2015-05-04

Monitoring tools should report timestamps (and what they're monitoring)

This is a lesson learned, not quite the hard way but close to it. A fairly long time ago now, I wrote some simple tools to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this:

 40.33 MB/s RX  56.54 MB/s TX   packets/sec: 50331 RX 64482 TX

I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me.

Recently we've been looking at a NFS fileserver lockup problem, and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) that a particular set of output was from, well, I wound up having to rely on window titles.

So now I have a version of these tools with a somewhat different output format:

e1000g1 23:10:08  14.11 MB/s RX  77.40 MB/s TX   packets/sec: 37791 RX 66359 TX

Now this output is more or less self identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines).

I was lucky here in that adding a timestamp plus typical interface names still keeps output lines under 80 characters. But even in cases where adding this information would widen the output lines, well, I can widen my xterm windows, and it's better to have this information than to have to reconstruct it afterwards. So in the future I think all of my monitoring tools are at least going to have an option to add a timestamp and similar information, and they might print it all the time if it fits (as it does here).
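
One handy thing is that you don't even have to modify the tools themselves to get this; a tiny shell filter can decorate any line-at-a-time output. The names here (stampwith, netbw) are made up for illustration:

  #!/bin/sh
  # stampwith: prefix each line of stdin with a label and the
  # current time.  Usage: netbw e1000g1 | stampwith e1000g1
  label="$1"
  while IFS= read -r line; do
      printf '%s %s %s\n' "$label" "$(date +%H:%M:%S)" "$line"
  done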

PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry.

PPS: This idea is not a miracle invention of mine by any means. In fact I shamelessly copied it from how useful the timestamps printed out by tools like arcstat are. When I noticed how much I was using those timestamps and how nice it was to be able to scroll back, spot something odd, and say 'ah, this happened at ...' right away, I smacked myself in the forehead and did it for all of the monitoring commands I was using. Fortunately many OmniOS commands like vmstat already have an option to add timestamps, although it's sometimes kind of low-rent (eg vmstat prints the timestamp on a separate line, which doubles how many lines of output it produces and thus halves the effective size of my scrollback buffer).
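
For reference, the timestamp option on OmniOS and other Solaris derivatives is -T, which takes d for human-readable dates or u for seconds since the epoch:

  vmstat -T d 1
  iostat -T d -xn 5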

ReportTimeAndId written at 23:58:52

2015-05-03

Sometimes knowing causes does you no good (and sensible uses of time)

Yesterday, I covered our OmniOS fileserver problem with overload and mentioned that the core problem seems to be (kernel) memory exhaustion. Of course once we'd identified this I immediately started coming up with lots of theories about what might be eating up all the memory (and then not giving it back), along with potential ways to test these theories. This is what sysadmins do when we're confronted with problems, after all; we try to understand them. And it can be peculiarly fun and satisfying to run down the root cause of something.

(For example, one theory is 'NFS TCP socket receive buffers', which would explain why it seems to need a bunch of clients all active.)

Then I asked myself an uncomfortable question: was this going to actually help us? Specifically, was it particularly likely to get us any closer to having OmniOS NFS fileservers that did not lock up under surges of too-high load? The more I thought about that, the more gloomy I felt, because the cold hard answer is that knowing the root cause here is unlikely to do us any good.

Some issues are ultimately due to simple and easily fixed bugs, or turn out to have simple configuration changes that avoid them. It seems unlikely that either is the case here; instead this seems much more likely to be a misdesigned or badly designed part of the Illumos NFS server code. Fixing bad designs is never a simple code change, and they can rarely be avoided with configuration changes. Any fix is likely to be slow to appear and to require significant work on someone's part.

This leads to the really uncomfortable realization that it is probably not worth spelunking this issue to explore and test any of these theories. Sure, it'd be nice to know the answer, but knowing the answer is not likely to get us much closer to a fix to a long-standing and deep issue. And what we need is that fix, not to know what the cause is, because ultimately we need fileservers that don't lock up every so often if things go a little bit wrong (because things go a little bit wrong on a regular basis).

This doesn't make me happy, because I like diagnosing problems and finding root causes (however much I gripe about it sometimes); it's neat and gives me a feeling of real accomplishment. But my job is not about feelings of accomplishment, it's about giving our users reliable fileservice, and it behooves me to spend my finite time on things that are most likely to result in that. Right now that does not appear to involve diving into OmniOS kernel internals or coming up with clever ways to test theories.

(If we had a lot of money to throw at people, perhaps the solution would be 'root cause the problem then pay Illumos people to do the kernel development needed to fix it'. But we don't have anywhere near that kind of money.)

KnowingCausesIsNoCure written at 01:32:45

