Wandering Thoughts

2017-08-08

We care more about long term security updates than full long term support

We like running so-called 'LTS' (Long Term Support) releases of any OS that we use, and more broadly of any software that we care about, because using LTS releases allows us to keep running the same version for a fairly long time. This is generally due to pragmatics on two levels. First, testing and preparing a significant OS upgrade simply takes time and there's only so much time available. Second, upgrades generally represent some amount of increased risk over our existing environment. If our existing environment is working, why would we want to mess with that?

(Note that our general environment is somewhat unusual. There are plenty of places where you simply can't stick with kernels and software that is more than a bit old, for various reasons.)

But the general idea of 'LTS' is a big tent and it can cover many things (as I sort of talked about in an entry on what supporting a production OS means to me). As I've wound up mentioning in passing recently (eg here), the thing that we care about most is security updates. Sure, we'd like to get our bugs fixed too, but we consider this less crucial for at least two reasons.

First and most importantly, we can reasonably hope to not hit any important bugs once we've tested an OS release (or at least had it in production for an initial period), so if things run okay now they'll keep running decently into the future even if we do nothing to them. This is very much not true of security problems, for obvious reasons; to put it one way, attackers hit your security exposures for you and there's not necessarily much you can do to stop them short of updating. Running an OS without current security updates is getting close to being actively dangerous; running without the possibility of bug fixes is usually merely inconvenient at most.

(There can be data loss bugs that will shift the calculations here, but we can hope that they're far less common than security issues.)

Second, I have to admit that we're more or less making a virtue of necessity, because we generally can't get general updates and bug fixes in the first place. For one big and quite relevant example, Ubuntu appears to fix only unusually egregious bugs in their LTS releases. If you're affected by mere ordinary bugs and issues, you're stuck. This is one of the tradeoffs you get to make with Ubuntu LTS releases; in exchange for a large package set, you effectively only get security updates (and it has been this way for a long time). More broadly, no LTS vendor promises to fix every bug that every user finds, only the sufficiently severe and widely experienced ones. So just because we run into a bug doesn't mean that it's going to get fixed; it may well not be significant enough to be worth the engineering effort and risk of an update on the vendor's part.

(There is also the issue that if we hit a high-impact bug, we can't wait for a fix to be developed upstream and slowly pushed down to us. If we have systems falling over, we need to solve our problems now, in whatever way it takes. Sometimes LTS support can come through with a prompt fix, but more often than not you're going to be waiting too long.)

LongtermSecurityVersusSupport written at 01:27:28; Add Comment

2017-08-06

Our decision to restrict what we use for developing internal tools

A few years ago, we (my small sysadmin group) hit a trigger point where we realized that we were writing internal sysadmin tools, including web things, in a steadily increasing collection of programming languages, packages, and environments. Each choice was fine individually but together they created a collective problem, because in theory we want everyone to be able to at least start to support and troubleshoot everything we have running. The more languages and environments we use across all of our tools, the harder this gets. As things escalated and got more exotic, my co-workers objected quite strongly and, well, they're right.

The result of this was that we decided to standardize on using only a few languages and environments for our internally developed tools, web things, and so on. Our specific choices are not necessarily the most ideal choices and they're certainly a product of our environment, both in what people already knew and what we already had things written in. For instance, given that I've written a lot of tools in Python, it would have been relatively odd to exclude it from our list.

Since the whole goal of this is to make sure that co-workers don't need to learn tons of things to work on your code, we're de facto limiting not just the basic choice of language but also what additional packages, libraries, and so on you use with it. If I load my Python code down with extensive use of additional modules, web frameworks, and so on, it's not enough for my co-workers to just know Python; I've also forced them to learn all those packages. Similar things hold true for any language, including (and especially) shell scripts. Of course sometimes you absolutely need additional packages (eg), but if we don't absolutely need them our goal is to stick to doing things with only core stuff, even if the result is a bit longer and more tedious.

(It doesn't really matter if these additional modules are locally developed or come from the outside world. If anything, outside modules are likely to be better documented and better supported than ones I write myself. Sadly this means that the Python module I put together myself to do simple HTML stuff is now off the table for future CGI programs.)

I don't regret our overall decision and I think it was the right choice. I had already been asking my co-workers if they were happy with me using various things, eg Go, and I think that the tradeoffs we're making here are both sensible and necessary. To the extent that I regret anything, I mildly regret that I've not yet been able to talk my co-workers into adding Go to the mix.

(Go has become sort of a running joke among us, and I recently got to cheerfully tell one of my co-workers that I had lured him into using and even praising my call program for some network bandwidth testing.)

Note that this is, as mentioned, just for my small group of sysadmins, what we call Core in our support model. The department as a whole has all sorts of software and tools in all sorts of languages and environments, and as far as I know there has been no department-wide effort to standardize on a subset there. My perception is that part of this is that the department as a whole does not have the cross-support issue we do in Core. Certainly we're not called on to support other people's applications; that's not part of our sysadmin environment.

Sidebar: What we've picked

We may have recorded our specific decision somewhere, but if so I can't find it right now. So off the top of my head, we picked more or less:

  • Shell scripts for command line tools, simple 'display some information' CGIs, and the like, provided that they are not too complex.
  • Python for command line tools.
  • Python with standard library modules for moderately complicated CGIs.
  • Python with Django for complicated web apps such as our account request system.
  • Writing something in C is okay for things that can't be in an interpreted language, for instance because they have to be setuid.

We aren't actively rewriting existing things that fall outside these choices, for various reasons; at least, not as long as they don't need any maintenance, which they mostly don't.

(We have a few PHP things that I don't think anyone is all that interested in rewriting in Python plus Django.)

LimitingToolDevChoices written at 02:03:10; Add Comment

2017-07-29

The differences between how SFTP and scp work in OpenSSH

Although I normally only use scp, I'm periodically reminded that OpenSSH actually supports two file transfer mechanisms, because there's also SFTP. If you are someone like me, you may eventually wind up wondering if these two ways of transferring files with (Open)SSH fundamentally work in the same way, or if there is some real difference between them.

I will skip to the end: sort of yes and sort of no. As usually configured, scp and SFTP wind up working in the same way on the server side but they get there through different fundamental mechanisms in the SSH protocol and thus they take somewhat different paths in OpenSSH. What happens when you use scp is simpler to explain, so let's start there.

The way scp works is the same as the way rsync does. When you do 'scp file apps0:/tmp/', scp uses ssh to connect to the remote host and run a copy of scp there with a special undocumented flag that means 'the other end of this conversation is another scp, talk to it with your protocol'. You can see this in ps output on your machine, where it will look something like this:

/usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -- apps0 scp -t /tmp/

(This is how the traditional BSD rcp command works under the hood, and the HISTORY section of the scp manpage says that scp was originally based on rcp.)
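
(You can also see this from the client side: scp's -v option prints the ssh command it is about to run before it actually runs it. The exact wording varies between OpenSSH versions, but it's roughly:)

  scp -v file apps0:/tmp/ 2>&1 | grep '^Executing:'
  # The 'Executing: ...' line shows the ssh program being invoked and the
  # remote 'scp -t /tmp/' command that the server will be asked to run.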

By contrast, SFTP works as what is called an SSH subsystem, which is a specific part of the SSH connection protocol. More specifically, it is the "sftp" subsystem, for which there is actually a draft protocol specification (with OpenSSH extensions). Since the client explicitly asks the server for SFTP (instead of just saying 'please run this random Unix command'), the server knows what is going on and can implement support for its end of the SFTP protocol in whatever way it wants to.

As it happens, the normal OpenSSH server configuration implements the "sftp" subsystem by running sftp-server (this is configured in /etc/ssh/sshd_config). For various reasons it does so via your login shell, so if you peek at your server's process list while you're running an sftp session, it will look like this:

 cks   25346 sshd: cks@notty
  |    25347 sh -c /usr/lib/openssh/sftp-server
   |   25348 /usr/lib/openssh/sftp-server
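
(The sshd_config line behind this is the Subsystem directive; the sftp-server path here matches the process listing above, but it varies between distributions and Unixes:)

  # In /etc/ssh/sshd_config
  Subsystem       sftp    /usr/lib/openssh/sftp-server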

On your local machine, the OpenSSH sftp command doesn't bother to have its own implementation of the SSH protocol and so on; instead it runs ssh in a special mode to invoke a remote subsystem instead of a remote command:

/usr/bin/ssh -oForwardX11 no -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -oProtocol 2 -s -- apps0 sftp

However, this is not a universal property of SFTP client programs. An SFTP client program may embed its own implementation of SSH, and this implementation may support different key exchange methods, ciphers, and authentication methods than the user's regular full SSH does.

(We ran into a case recently where a user had an SFTP client that only supported weak Diffie-Hellman key exchange methods that modern versions of OpenSSH sshd don't normally support. The user's regular SSH client worked fine.)

So in the end, scp and SFTP both wind up running magic server programs under shells on the SSH server, and they both run ssh on the client. They give slightly different arguments to ssh and obviously run different programs (with different arguments) on the server. However, SFTP makes it more straightforward for the server to implement things differently because the client explicitly asks 'please talk this documented protocol with me'; unlike with scp, it is the server that decides to implement the protocol by running 'sh -c sftp-server'. OpenSSH sshd has an option to implement SFTP internally, and you could easily write an alternate SSH daemon that handled SFTP in a different way.
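
As a hedged sketch (this is not our configuration), the in-sshd version and an SFTP-only setup look something like the following in sshd_config; the 'sftponly' group name is just an illustration, and ChrootDirectory has ownership requirements covered in sshd_config(5):

  # Handle the "sftp" subsystem inside sshd itself, with no external program:
  Subsystem sftp internal-sftp

  # Optionally confine a group of users to chrooted, SFTP-only access:
  Match Group sftponly
      ChrootDirectory /home/%u
      ForceCommand internal-sftp
      AllowTcpForwarding no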

It's theoretically possible to handle scp in a different way in your SSH server, but you would have to recognize scp by knowing that a request to run the command 'scp -t <something>' was special. This is not unheard of; git operates over SSH by running internal git commands on the remote end (cf), and so if you want to provide remote git repositories without exposing full Unix accounts you're going to have to interpret requests for those commands and do special things. Github does this along with other magic, especially since everyone uses the same remote SSH login (that being git@github.com).
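
To illustrate the general pattern (this is not how GitHub actually implements it), a forced command in a server's authorized_keys can look at what the client asked to run and only allow the git commands; sshd exposes the requested command line in the SSH_ORIGINAL_COMMAND environment variable. A minimal sketch:

  #!/bin/sh
  # Hypothetical wrapper, installed via command="/path/to/this-script" on a
  # key in authorized_keys. $SSH_ORIGINAL_COMMAND is set by sshd to whatever
  # command the SSH client asked the server to run.
  case "$SSH_ORIGINAL_COMMAND" in
      "git-upload-pack "*|"git-receive-pack "*)
          # A real implementation would also validate the repository path.
          exec $SSH_ORIGINAL_COMMAND
          ;;
      *)
          echo "only git operations are allowed on this account" >&2
          exit 1
          ;;
  esac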

SSHHowScpAndSFTPWork written at 01:06:13; Add Comment

2017-07-28

Our (Unix) staff groups problem

Last week, I tweeted:

The sysadmin's lament: I swear, I thought this was a small rock when I started turning it over.

As you might suspect, there is a story here.

Our central Unix systems have a lot of what I called system continuity; the lineage of some elements of what we have today traces back more than 25 years (much like our machine room). One such element is our logins and groups, because if you're going to run Unix for a long time with system continuity, those are a core part of it.

(Actual UIDs and GIDs can get renumbered, although it takes work, but people know their login names. Forcing changes there is a big continuity break, and it usually has no point anyway if you're going to keep on running Unix.)

Most people on our current Unix systems have a fairly straightforward Unix group setup for various reasons. The big exception is what can broadly be called 'system staff', where we have steadily accumulated more and more groups over time. Our extended system continuity means that we have lost track of the (theoretical) purpose of many groups, which have often wound up with odd group membership as well. Will something break if we add or remove some people from a group that looks almost right for what we need now? We don't know, so we make a new group; it's faster and simpler than trying to sort things out. Or in short, we've wound up with expensive group names.

This was the apparently small rock that I was turning over last week. The exact sequence is beyond the scope of this entry, but it started with 'what group should we put this person in now', escalated to 'this is a mess, let's reform things and drop some of these obsolete groups', and then I found myself writing long worklog messages about peculiar groups I'd found lurking in our Apache access permissions, groups that actually seemed to duplicate other groups we'd created later.

(Of course we use Unix groups in Apache access permissions. And in Samba shares. And in CUPS print permissions. And our password distribution system. And probably other places I haven't found yet.)
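
(If you want to do a similar hunt, a crude starting point is just to grep your configuration areas for a group's name. This is only a sketch and the paths are assumptions; your Apache, Samba, and CUPS configuration lives wherever it lives, and this won't find group checks buried in local scripts:)

  #!/bin/sh
  # Crude sketch: list config files that mention a given Unix group name.
  group="$1"
  grep -rl -- "$group" /etc/apache2 /etc/httpd /etc/samba /etc/cups 2>/dev/null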

OurStaffGroupsProblem written at 02:49:44; Add Comment

2017-07-26

The continuity of broad systems or environments in system administration

What I'm going to call system continuity, for lack of a better phrase, is the idea that some of the time in some places, you can trace bits of your current environment back through your organization's history. You're likely not still using the same physical hardware, you're probably not using the same OS version and possibly not even the same OS, and the software may have changed, as may how you administer it, but you can point at elements of how things work today and say 'they're this way because N years ago ...'. To put it one way, system continuity means that things have a lineage.

As an example, you have some system continuity in your email system if you're still supporting and using people's email addresses from the very first mail system you set up N years ago, even though you moved from a basic Unix mailer to Exchange and now to a cloud-hosted setup. You don't have system continuity here if at some point people said 'we're changing everyone's email address and the old ones will stop working a year from now'.

You can have system continuity in all sorts of things, and you can lack it in all sorts of things. One hallmark of system continuity is automatic or even transparent migrations as far as users are concerned; one marker for a lack of it is manual migrations. If you say 'we've built a new CIFS server environment, here are your new credentials on it, copy data from our old one yourself if you want to keep it', you probably don't have much system continuity there. System continuity can be partial (or perhaps 'fragmented'); you might have continuity in login credentials but not in the actual files, which you have to copy to the new storage environment yourself.

(It's tempting to say that some system continuities are stacked on top of each other, but this is not necessarily the case. You can have a complete change of email system, including new login credentials, but still migrate everyone's stored mail from the old system to the new one so that people just reconfigure their IMAP client and go on.)

Not everyone has system continuity in anything (or at least anything much). Some places just aren't old enough to have turned over systems very often; they're still on their first real system for many things and may or may not get system continuity later. Some places don't try to keep anything much from old systems for various reasons, including that they're undergoing a ferocious churn in what they need from their systems as they grow or change directions (or both at once). Some places explicitly decide to discard some old systems because they feel they're better off re-doing things from scratch (sometimes this was because the old systems were terrible quick hacks). And of course some organizations die, either failing outright or being absorbed by other organizations that have their own existing systems that you get to move to. Especially in today's world, it probably takes an unusually stable and long-lived organization to build up much system continuity. Unsurprisingly, universities can be such a place.

(Within a large organization like a big company or a university, system continuity is probably generally associated with the continuity of a (sub) group. If your group has its own fileservers or mail system or whatever, and your group gets dissolved or absorbed by someone else, you're likely going to lose continuity in those systems because your new group probably already has its own versions of those. Of course, even if groups stay intact there can be politics over where services should be provided and who provides them that result in group systems being discarded.)

ContinuityOfSystems written at 22:34:40; Add Comment

2017-07-12

Understanding the .io TLD's DNS configuration vulnerability

First there was Matthew Bryant's The .io Error - Taking Control of All .io Domains With a Targeted Registration, about a configuration error that allegedly allowed you to take over control of some .io nameservers, and then there was a response to it, Matt Pounsett's The .io Error: A Problem With Bad Optics, But Little Substance, which argued that this was much ado about nothing much. While I agree that the consequences are less severe than Bryant thought, I think that Pounsett's article understates the risks itself (and I believe doesn't correctly explain what's going on in the DNS here). In any case, the whole thing confused me and other people, so I'm going to write my understanding of things up here.

Let's start with the basics of compromising a domain through dangling nameserver delegation. Suppose you find a domain barney.io that lists ns1.fred.ly as one of its two nameservers, and fred.ly is not registered (worse nameserver mistakes happen). To attack barney.io you register fred.ly and create a ns1.fred.ly A record that points to a nameserver that you're running. Some portion of the people looking up information in barney.io will wind up querying your nameserver, and at that point you can give them whatever answers you want. If they're asking their original question, you can directly lie to them (telling people that all MX entries in barney.io point to harvestmail.fred.ly, for example). If they're making NS queries to check for zone delegation, you can just give them NS records that point to you and start lying some more when they follow those NS records.

(You can then increase how many people will talk to ns1.fred.ly by DOSing the other barney.io DNS server off the Internet.)

This is more or less what the setup was for .io. Among .io's nameservers were ns-a1.io through ns-a4.io, and all of those names could be registered as domains in .io and then given A records in your DNS data for your new domain(s) (and Matthew Bryant did just this with ns-a1.io). However, there was an important difference that made this less severe than my example, and that's that .io had active glue records in the root zone for those names that pointed people to the IP addresses of the real nameservers. With these glue records present, a client didn't talk to Matthew Bryant's DNS server just because it decided to use ns-a1.io as part of resolving a .io name; if it believed and used the glue records, it would wind up talking to the real nameserver. Your query was only diverted to Bryant's DNS server if your resolver decided to send a query to ns-a1.io but looked up its IP address directly instead of using the one from the glue record.
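
(You can see the root zone glue yourself by asking a root server for the .io delegation; the NS records come back as a referral and the A/AAAA glue records for the nameserver names inside .io come back in the additional section:)

  # A non-recursive query to a root server for the .io delegation;
  # the glue addresses show up in the ADDITIONAL section of the reply.
  dig +norecurse @a.root-servers.net io. NS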

Using data from glue records instead of looking things up yourself is common but not mandatory, and there are various reasons why a resolver would not do so. Some recursive DNS servers will deliberately try to check glue record information as a security measure; for example, Unbound has the harden-referral-path option (via Tony Finch). Since the original article reported seeing real .io DNS queries being directed to Bryant's DNS server, we know that a decent number of clients were not using the root zone glue records. Probably a lot more clients were still using the glue records, though.
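
(In unbound.conf terms that option is a one-line server setting; it's off by default, because the extra verification lookups cost something:)

  server:
      harden-referral-path: yes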

(There are a bunch of uncertainties about just what DNS data was being returned by who during the incident. The original article shows a reply from a root server and that probably didn't change, but we don't know what the official .io servers themselves started returning as glue records for .io during the time that ns-a1.io was active as a domain registration. I will decline to speculate on what was the likely result here.)

Given my history with glue record hell, it amuses me that this is a case where dangling glue records helped instead of hurt, making a problem less severe than it would otherwise have been. Had there been no glue records or incomplete glue records for the .io zone, there would have been more danger (or at least the danger would have been clearer).

(In this case the presence of the glue records was mandatory, since these were NS names inside the zone itself. Without glue records in the root zone, you would have a chicken and egg problem in getting the IP address of, say, a0.nic.io.)

PS: As far as I can see from Bryant's article, he didn't realize that the root zone glue records would cause many clients to not query his DNS servers, significantly reducing the severity of someone having control over the names of four of the seven .io DNS servers. As far as Pounsett's article goes, he appears to more or less spot the issue with root glue but doesn't explain it and appears to expect all clients to use the glue all of the time (which is demonstrably not the case). I think he may also be confusing the data in the .io zone with the root zone glue for .io. Note that it's not necessary to get your IP address for ns-a1.io included in the .io zone; to make some clients start talking to you, it's sufficient for NS records for ns-a1.io to show up and ideally to occlude the A and AAAA records.

(We know that Bryant's NS records showed up in the .io zone. We don't know if they occluded the A record for ns-a1.io that was there, but it seems likely that they did.)

Sidebar: What I suspect went wrong in .io's procedures

It seems quite likely that ns-a1.io through ns-a4.io were intended to be purely host names of DNS servers, not domain names, much like my example of ns1.fred.ly. However, they were placed directly in the apex of a zone (.io) that allows people to register domains, and I suspect that the people running the IO zone forgot to tell the people running the IO registry that these names existed in the zone as host names and should be locked out from domain registration. That's been fixed now, obviously, and WHOIS tells me they're 'Reserved by Registry'.

(This is thus a different failure mode than having NS records for your domain or TLD that point to hosts in entirely unregistered domains. That's a pure failure, since the names don't exist at all except perhaps through lingering glue records. Here the names existed entirely properly, it's just that the IO registry was allowed to override them with new data.)

The problem doesn't come up for the other .io nameservers, which are all under nic.io, since nic.io is already a registered domain in .io.

UnderstandingIODNSIssue written at 23:49:57; Add Comment

Recursive DNS servers send the whole original query to authoritative servers

As a long term sysadmin, I usually feel that I have a solid technical grasp of DNS (apart from DNSSEC, which I ignore on principle). Then every so often, I get to find out that I'm wrong. Today is one of those days.

Before today, if you had asked me how a recursive DNS server did a lookup from authoritative servers, I would have told you what is basically the standard story. If you're looking up the A record for fred.blogs.example.com, your local recursive server first asks a random root server for the NS records for .com, then asks one of the .com DNS servers for the NS records for example.com, then asks one of those theoretically authoritative DNS servers, and so on. Although this describes the chain of NS delegations that your recursive DNS server typically gets back, it turns out that this doesn't accurately describe what your server usually sends as its queries. The normal practice today is that your recursive DNS server sends the full original query to each authoritative server. It doesn't ask the root servers 'what is the NS for .com'; instead it asks the root servers 'what is the A record for fred.blogs.example.com', and they send back an answer that is basically 'I have no idea, ask one of these .com nameservers'.
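
(You can reproduce this step by hand with dig. If you send the full question to a root server without asking for recursion, you get back an empty answer section and a referral to the .com nameservers in the authority section:)

  # Send the whole question to a root server, with recursion turned off.
  # The reply has no answer, just NS records for .com (plus glue addresses).
  dig +norecurse @a.root-servers.net fred.blogs.example.com. A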

Once I thought about it, this behavior made a lot of sense, because DNS clients don't know in advance where zone delegation boundaries are. It's common for there to be a zone boundary at each '.', but it's not always the case; you can certainly have zones where the boundaries are further apart. You can even have zones where it varies by the name. Consider a hypothetical .ca zone operator who allows registration of both <domain>.<province>.ca (eg fred.on.ca) and <domain>.ca (eg bob.ca), and does not have separate <province>.ca zones; they just carry all of the data in one zone without internal NS records. Here bob.ca has an NS record but on.ca doesn't, and your client certainly can't know which is which in advance. When the client has no idea where the zone boundaries are, the simple thing to do is to send the whole original query off to each step of the delegation chain and see what they say. This way you don't have to try any sort of backtracking when you ask for the NS records for on.ca and get a 'no data' answer back.

Now, you might ask if sending the full query to all DNS servers in the chain like this has privacy implications. Why yes, it does, and there are proposed ways to work around it, such as RFC 7816 query minimization. Some DNS servers are already taking steps here; for example, current versions of Unbound have the qname-minimisation option in unbound.conf.
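
(In unbound.conf it's again a one-line server option:)

  server:
      qname-minimisation: yes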

(My discovery is due to reading this article. I believe that the article overstates things a bit itself, but that's another entry (or see this Twitter thread).)

DNSRecursivesMakeFullQueries written at 01:15:35; Add Comment

2017-07-02

Moving to smaller fileservers for us probably means no more iSCSI SAN

In our environment, one major thing that drives us towards relatively big fileservers is aggregating and lowering the overhead of servers. Regardless of how big or small it is, any server has a certain minimum overhead cost due to needing things like a case and power supply, a motherboard, and a CPU. The result of this per-server overhead is economies of scale; a single server with 16 disk bays almost certainly costs less than two servers with 8 disk bays each.

We have a long history of liking to split our fileservers from our disk storage. Our current fileservers and our past generation of fileservers have both used iSCSI to talk to backend disk enclosures, and the generation of fileservers before them used Fibre Channel to talk to FC hardware RAID boxes. Splitting up the storage from the fileservice this way requires buying extra machines, which costs more; what has made this affordable is aggregating a fairly decent amount of disks in each box, so we don't have to buy too many extra ones.

If we're going to have smaller fileservers, as I've come to strongly believe we want, we're going to need more of them. If we're going to keep a similar design to our current setup, we would need more iSCSI backends to go with them. All of this means more machines and more costs. In theory we could lower costs by continuing to use 16-disk backends and share them between (smaller) fileservers (so two new fileservers would share a pair of backends), but in practice this would make our multi-tenancy issues worse and we would likely resist the idea fairly strongly. And we'd still be buying more fileservers.

If we want to buy a similar number of machines in total for our next generation fileservers but shrink the number of disks and the amount of storage space supported by each fileserver, the obvious conclusion is that we must get rid of the iSCSI backends. Hosting disks on the fileservers themselves has some downsides (per my entry about our SAN tradeoffs), but at a stroke it cuts the number of machines per fileserver from three to one. We could double the number of fileservers and still come out ahead on raw machine count. In a 10G environment, it also eliminates the need for two expensive 10G switches for the iSCSI networks themselves (and we'd want to go to 10G for the next generation of fileservers).

If we want to reduce the size of our fileservers but keep an iSCSI environment, we're almost certainly going to be faced with unappetizing tradeoffs. Considering the cost of 10G switch ports as well as everything else, our most likely choice would be to stop using two backends per fileserver; instead each fileserver would talk to a single 16-disk iSCSI backend (still using mirrored pairs of disks). This would increase the overall number of servers, but not hugely (we would go from 9 servers total for our HD-based production fileservers to 12 servers; the three fileservers would become six, and then we'd need six backends to go with them).

(It turns out that I also wrote about this issue a couple of years ago. At the time we weren't as totally convinced that our current fileservers are too big as designed, although we were certainly thinking about it, and I was less pessimistic about the added costs for extra servers if we shrink how big each fileserver is and so need more of them. (Or maybe the extra costs just hadn't struck me yet.))

SmallFileserversAndISCSI written at 01:22:44; Add Comment

2017-06-30

Our current generation fileservers have turned out to be too big

We have three production hard drive based NFS fileservers (and one special fileserver that uses SSDs). As things have turned out, usage is not balanced evenly across all three; one of them has by far the most space assigned and used on it and unsurprisingly is also the most active and busy fileserver.

(In retrospect, putting all of the pools for the general research group that most heavily uses our disk space on the same fileserver was perhaps not our best decision ever.)

It has been increasingly obvious to us for some time that this fileserver is simply too big. It hosts too much storage that is too actively used; as a result, it's the server that most frequently encounters serious system issues, and in general it runs close enough to its performance limits that a little extra load can push it over the edge. Even when everything is going well it's big enough to be unwieldy; scheduling anything involving it is hard, for example, because so many people use it.

(This fileserver also suffers the most from our multi-tenancy, since so many of its disks are used for so many active things.)

However this fileserver is not fundamentally configured any differently than the other two. It doesn't have less memory or more disks; it simply makes more use of them than the other two do. This means that all three of our fileservers are too big as designed. The only reason the other two aren't also too big in operation today is that not enough people have been interested in using them, so they don't have anywhere near as much space used and aren't as active in handling NFS traffic.

Now, how we designed our fileservers is not quite how they've wound up being operated, since they're running at 1G for all networking instead of 10G. It's possible that running at 10G would make this fileserver not too big, but I'm not particularly confident about that. The management issues would still be there, there would still be a large impact on a lot of people if (and when) the fileserver ran into problems, and I suspect that we'd run into limitations on disk IOPS and how much NFS fileservice a single machine can do even if we went to the extreme where all the disks were local disks instead of iSCSI. So I believe that in our environment, it's most likely that any fileserver with that much disk space is simply too big.

As a result of our experiences with this generation of fileservers, our next generation is all but certain to be significantly smaller, just so something like this can't possibly happen with them. This probably implies a number of other significant changes, but that's going to be another entry.

FileserversDesignedTooBig written at 23:03:27; Add Comment

Why big Exim queues are a problem for us in practice

In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.

(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)

In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits like overly-large directories that slow down everything that touches them or queue runners that are spending far too long scanning through hundreds of thousands of messages looking for ones to retry.

(A runaway client at this level might seem absurd, but with scripts, crontab, and other mistakes you can have a client generate tens of complaint messages a second. Every second.)
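
For what it's worth, one way to create some backpressure against this sort of runaway submitter is Exim's ratelimit ACL condition, which can temporarily defer messages from a host that is submitting implausibly fast. What follows is only an illustrative sketch, not our configuration; the numbers and the ACL placement are made up:

  # Hypothetical fragment of an Exim RCPT-time ACL; the limits are invented.
  acl_check_rcpt:
    # Temporarily refuse hosts that have submitted more than 100 messages
    # in the past hour; they'll retry later instead of flooding the queue.
    defer message   = too many messages from $sender_host_address, try later
          ratelimit = 100 / 1h / per_mail / strict / $sender_host_address

    # ... followed by the usual relay checks, verification, and accepts.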

In our environment in specific, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.

(A surge of delivery to /var/mail is more tolerable for various reasons, and a surge of delivery to external addresses is pretty close to 'we don't care unless the queue becomes absurdly large'. Well, apart from the bit where it might be spam and high outgoing volumes might get our outgoing email temporarily blacklisted in general.)

Ironically this is another situation where Exim's great efficiency is working against us. If Exim was not as fast as it is, it would not be able to process so many deliveries in such a short amount of time and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.

EximWhyBigQueuesProblem written at 00:47:24; Add Comment
