We care more about long term security updates than full long term support
We like running so-called 'LTS' (Long Term Support) releases of any OS that we use, and more broadly of any software that we care about, because using LTS releases allows us to keep running the same version for a fairly long time. This is generally due to pragmatics on two levels. First, testing and preparing a significant OS upgrade simply takes time and there's only so much time available. Second, upgrades generally represent some amount of increased risk over our existing environment. If our existing environment is working, why would we want to mess with that?
(Note that our general environment is somewhat unusual. There are plenty of places where you simply can't stick with kernels and software that is more than a bit old, for various reasons.)
But the general idea of 'LTS' is a big tent and it can cover many things (as I sort of talked about in an entry on what supporting a production OS means to me). As I've wound up mentioning in passing recently (eg here), the thing that we care about most is security updates. Sure, we'd like to get our bugs fixed too, but we consider this less crucial for at least two reasons.
First and most importantly, we can reasonably hope to not hit any important bugs once we've tested an OS release (or at least had it in production for an initial period), so if things run okay now they'll keep running decently into the future even if we do nothing to them. This is very much not true of security problems, for obvious reasons; to put it one way, attackers hit your security exposures for you and there's not necessarily much you can do to stop them short of updating. Running an OS without current security updates is getting close to being actively dangerous; running without the possibility of bug fixes is usually merely inconvenient at most.
(There can be data loss bugs that will shift the calculations here, but we can hope that they're far less common than security issues.)
Second, I have to admit that we're making a virtue of more or less necessity, because we generally can't actually get general updates and bug fixes in the first place. For one big and quite relevant example, Ubuntu appears to fix only unusually egregious bugs in their LTS releases. If you're affected by mere ordinary bugs and issues, you're stuck. This is one of the tradeoffs you get to make with Ubuntu LTS releases; you trade off a large package set for effectively only getting security updates (and it has been this way for a long time). More broadly, no LTS vendor promises to fix every bug that every user finds, only the sufficiently severe and widely experienced ones. So just because we run into a bug doesn't mean that it's going to get fixed; it may well not be significant enough to be worth the engineering effort and risk of an update on the vendor's part.
(There is also the issue that if we hit a high-impact bug, we can't wait for a fix to be developed upstream and slowly pushed down to us. If we have systems falling over, we need to solve our problems now, in whatever way that takes. Sometimes LTS support can come through with a prompt fix, but more often than not you're going to be waiting too long.)
Our decision to restrict what we use for developing internal tools
A few years ago, we (my small sysadmin group) hit a trigger point where we realized that we were writing internal sysadmin tools (including web things) in a steadily increasing collection of programming languages, packages, and environments for doing things like web pages and apps. This was fine individually but created a collective problem, because in theory we want everyone to be able to at least start to support and troubleshoot everything we have running. The more languages and environments we use across all of our tools, the harder this gets. As things escalated and got more exotic, my co-workers objected quite strongly and, well, they're right.
The result of this was that we decided to standardize on using only a few languages and environments for our internally developed tools, web things, and so on. Our specific choices are not necessarily the most ideal choices and they're certainly a product of our environment, both in what people already knew and what we already had things written in. For instance, given that I've written a lot of tools in Python, it would have been relatively odd to exclude it from our list.
Since the whole goal of this is to make sure that co-workers don't need to learn tons of things to work on your code, we're de facto limiting not just the basic choice of language but also what additional packages, libraries, and so on you use with it. If I load my Python code down with extensive use of additional modules, web frameworks, and so on, it's not enough for my co-workers to just know Python; I've also forced them to learn all those packages. Similar things hold true for any language, including (and especially) shell scripts. Of course sometimes you absolutely need additional packages (eg), but if we don't absolutely need it our goal is to stick to doing things with only core stuff even if the result is a bit longer and more tedious.
(It doesn't really matter if these additional modules are locally developed or come from the outside world. If anything, outside modules are likely to be better documented and better supported than ones I write myself. Sadly this means that the Python module I put together myself to do simple HTML stuff is now off the table for future CGI programs.)
I don't regret our overall decision and I think it was the right choice. I had already been asking my co-workers if they were happy with me using various things, eg Go, and I think that the tradeoffs we're making here are both sensible and necessary. To the extent that I regret anything, I mildly regret that I've not yet been able to talk my co-workers into adding Go to the mix.
(Go has become sort of a running joke among us, and I recently got to cheerfully tell one of my co-workers that I had lured him into using and even praising my call program for some network bandwidth monitoring.)
Note that this is, as mentioned, just for my small group of sysadmins, what we call Core in our support model. The department as a whole has all sorts of software and tools in all sorts of languages and environments, and as far as I know there has been no department-wide effort to standardize on a subset there. My perception is that part of this is that the department as a whole does not have the cross-support issue we do in Core. Certainly we're not called on to support other people's applications; that's not part of our sysadmin environment.
Sidebar: What we've picked
We may have recorded our specific decision somewhere, but if so I can't find it right now. So off the top of my head, we picked more or less:
- Shell scripts for command line tools, simple 'display some information' CGIs, and the like, provided that they are not too complex.
- Python for command line tools.
- Python with standard library modules for moderately complicated CGIs.
- Python with Django for complicated web apps such as our account request system.
- Writing something in C is okay for things that can't be in an interpreted language, for instance because they have to be setuid.
We aren't actively rewriting existing things that go outside this, for various reasons. Well, at least if they don't need any maintenance, which they mostly don't.
(We have a few PHP things that I don't think anyone is all that interested in rewriting in Python plus Django.)
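To make the 'standard library only' style concrete, here is a hedged sketch of what a minimal stdlib-only Python CGI can look like; the greeting and the parameter name are made up for illustration, and none of this is our actual code:

```python
#!/usr/bin/env python3
# A minimal 'core stuff only' CGI sketch: no web framework, no extra
# modules, just the standard library. Illustrative, not production code.
import os
from html import escape
from urllib.parse import parse_qs

def main():
    # CGI hands us the query string through the environment.
    params = parse_qs(os.environ.get("QUERY_STRING", ""))
    who = escape(params.get("who", ["world"])[0])
    print("Content-Type: text/html")
    print()
    print(f"<html><body><p>Hello, {who}.</p></body></html>")

if __name__ == "__main__":
    main()
```

The result is a bit longer and more tedious than a framework version, but anyone who knows Python can read it.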
The differences between how SFTP and scp work in OpenSSH
Although I normally only use scp, I'm periodically reminded that OpenSSH actually supports two file transfer mechanisms, because there's also SFTP. If you are someone like me, you may eventually wind up wondering if these two ways of transferring files with (Open)SSH fundamentally work in the same way, or if there is some real difference between them.
I will skip to the end: sort of yes and sort of no. As usually used, scp and SFTP wind up working in the same way on the server side, but they get there through different fundamental mechanisms in the SSH protocol and thus they take somewhat different paths in OpenSSH. What happens when you use scp is simpler to explain, so let's start there.
How scp works is the same as how rsync does. When you do an scp copy to or from a remote host, scp runs ssh to connect to that host and start a copy of scp there with a special undocumented flag that means 'the other end of this conversation is another scp, talk to it with your protocol'. You can see this in ps output on your machine, where it will look something like this:
/usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -- apps0 scp -t /tmp/
(This is how the traditional BSD rcp command works under the hood, and the HISTORY section of the scp manual page notes that scp was originally based on rcp.)
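As a concrete sketch of this mechanism, here is roughly how you could reconstruct the ssh command line that scp uses, based on the ps output above. The function and its name are my own illustration, not anything from the OpenSSH source:

```python
# A sketch of the ssh invocation that scp makes under the hood, modeled on
# the ps output shown above. Illustration only, not OpenSSH's actual code.
def scp_ssh_argv(host, remote_dir):
    """Build the argv that scp effectively uses to start its remote end."""
    return [
        "/usr/bin/ssh",
        "-x",                          # disable X11 forwarding
        "-oForwardAgent=no",
        "-oPermitLocalCommand=no",
        "-oClearAllForwardings=yes",
        "--",
        host,
        # 'scp -t DIR' is the undocumented 'talk the scp protocol' mode.
        "scp", "-t", remote_dir,
    ]

print(" ".join(scp_ssh_argv("apps0", "/tmp/")))
```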
By contrast, how SFTP works is that it is what is called a SSH subsystem, which is a specific part of the SSH connection protocol. More specifically it is the "sftp" subsystem, for which there is actually a draft protocol specification (with OpenSSH extensions). Since the client explicitly asks the server for SFTP (instead of just saying 'please run this random Unix command'), the server knows what is going on and can implement support for its end of the SFTP protocol in whatever way it wants to.
As it happens, the normal OpenSSH server configuration implements the "sftp" subsystem by running sftp-server (this is configured in /etc/ssh/sshd_config). For various reasons it does so via your login shell, so if you peek at your server's process list while you're running a sftp session, it will look like this:
cks  25346  sshd: cks@notty
cks  25347  sh -c /usr/lib/openssh/sftp-server
cks  25348  /usr/lib/openssh/sftp-server
On your local machine, the OpenSSH sftp command doesn't bother to have its own implementation of the SSH protocol and so on; instead it runs ssh in a special mode to invoke a remote subsystem instead of a remote command:
/usr/bin/ssh -oForwardX11 no -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -oProtocol 2 -s -- apps0 sftp
However, this is not a universal property of SFTP client programs. A SFTP client program may embed its own implementation of SSH, and this implementation may support different key exchange methods, ciphers, and authentication methods than the user's regular full SSH does.
(We ran into a case recently where a user had a SFTP client that
only supported weak Diffie-Hellman key exchange methods
that modern versions of OpenSSH
sshd don't normally support. The
user's regular SSH client worked fine.)
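If you are ever forced to accommodate such a client on your own server (and decide to accept the security tradeoff), modern OpenSSH lets you add legacy key exchange methods back in sshd_config; a hypothetical fragment might look like this, where the specific method is just an example of an older DH method:

```
# Re-enable an older Diffie-Hellman key exchange method in addition to the
# defaults ('+' appends to the default list). A deliberate security tradeoff.
KexAlgorithms +diffie-hellman-group14-sha1
```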
So in the end, scp and SFTP both wind up running magic server programs under shells on the SSH server, and they both run ssh on the client; they just give slightly different arguments to ssh and obviously run different programs (with different arguments) on the server. However, SFTP makes it more straightforward for the server to implement things differently, because the client explicitly asks 'please talk this documented protocol with me'; unlike with scp, it is the server that decides to implement the protocol by running 'sh -c sftp-server'. OpenSSH sshd has an option to implement SFTP internally, and you could easily write an alternate SSH daemon that handled SFTP in a different way.
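For instance, switching the stock configuration over to sshd's built-in SFTP implementation is a one-line sshd_config change:

```
# Use sshd's in-process SFTP implementation instead of running the
# external sftp-server binary through the user's login shell.
Subsystem       sftp    internal-sftp
```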
It's theoretically possible to handle scp in a different way in your SSH server, but you would have to recognize scp by knowing that a request to run the command 'scp -t <something>' was special. This is not unheard of; git operates over SSH by running internal git commands on the remote end (cf), and so if you want to provide remote git repositories without exposing full Unix accounts you're going to have to interpret requests for those commands and do special things. Github does this along with other magic, especially since everyone uses the same remote SSH login (that being git@github.com).
Our (Unix) staff groups problem
Last week, I tweeted:
The sysadmin's lament: I swear, I thought this was a small rock when I started turning it over.
As you might suspect, there is a story here.
Our central Unix systems have a lot of what I called system continuity; the lineage of some elements of what we have today traces back more than 25 years (much like our machine room). One such element is our logins and groups, because if you're going to run Unix for a long time with system continuity, those are a core part of it.
(Actual UIDs and GIDs can get renumbered, although it takes work, but people know their login names. Forcing changes there is a big continuity break, and it usually has no point anyway if you're going to keep on running Unix.)
Most people on our current Unix systems have a fairly straightforward Unix group setup for various reasons. The big exception is what can broadly be called 'system staff', where we have steadily accumulated more and more groups over time. Our extended system continuity means that we have lost track of the (theoretical) purpose of many groups, which have often wound up with odd group membership as well. Will something break if we add or remove some people from a group that looks almost right for what we need now? We don't know, so we make a new group; it's faster and simpler than trying to sort things out. Or in short, we've wound up with expensive group names.
This was the apparently small rock that I was turning over last week. The exact sequence is beyond the scope of this entry, but it started with 'what group should we put this person in now', escalated to 'this is a mess, let's reform things and drop some of these obsolete groups', and then I found myself writing long worklog messages about peculiar groups I'd found lurking in our Apache access permissions, groups that actually seemed to duplicate other groups we'd created later.
(Of course we use Unix groups in Apache access permissions. And in Samba shares. And in CUPS print permissions. And our password distribution system. And probably other places I haven't found yet.)
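One way to start turning this rock over systematically is to look for groups whose membership is identical, since those are candidates for being accidental duplicates. A sketch with made-up group data:

```python
from collections import defaultdict

# Scan /etc/group style data for groups with exactly the same members,
# which may be duplicates of each other. Sketch with invented data.
def duplicate_groups(group_lines):
    """Return lists of group names that share identical (non-empty) membership."""
    by_members = defaultdict(list)
    for line in group_lines:
        if not line.strip() or line.startswith("#"):
            continue
        name, _pw, _gid, members = line.strip().split(":")
        mset = frozenset(m for m in members.split(",") if m)
        if mset:
            by_members[mset].append(name)
    return [names for names in by_members.values() if len(names) > 1]

sample = [
    "wheel:x:10:alice,bob",
    "oldstaff:x:200:alice,bob",   # same members as wheel: candidate duplicate
    "www:x:300:carol",
]
print(duplicate_groups(sample))
```

Identical membership doesn't prove two groups serve the same purpose, but it tells you where to start reading worklogs.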
The continuity of broad systems or environments in system administration
What I'm going to call system continuity for my lack of a better phrase is the idea that some of the time in some places, you can trace bits of your current environment back through your organization's history. You're likely not still using the same physical hardware, you're probably not using the same OS version and possibly even the OS, the software may have changed as may have how you administer it, but you can point at elements of how things work today and say 'they're this way because N years ago ...'. To put it one way, system continuity means that things have a lineage.
As an example, you have some system continuity in your email system if you're still supporting and using people's email addresses from the very first mail system you set up N years ago, even though you moved from a basic Unix mailer to Exchange and now to a cloud-hosted setup. You don't have system continuity here if at some point people said 'we're changing everyone's email address and the old ones will stop working a year from now'.
You can have system continuity in all sorts of things, and you can lack it in all sorts of things. One hallmark of system continuity is automatic or even transparent migrations as far as users are concerned; one marker for a lack of it is manual migrations. If you say 'we've built a new CIFS server environment, here are your new credentials on it, copy data from our old one yourself if you want to keep it', you probably don't have much system continuity there. System continuity can be partial (or perhaps 'fragmented'); you might have continuity in login credentials but not in the actual files, which you have to copy to the new storage environment yourself.
(It's tempting to say that some system continuities are stacked on top of each other, but this is not necessarily the case. You can have a complete change of email system, including new login credentials, but still migrate everyone's stored mail from the old system to the new one so that people just reconfigure their IMAP client and go on.)
Not everyone has system continuity in anything (or at least anything much). Some places just aren't old enough to have turned over systems very often; they're still on their first real system for many things and may or may not get system continuity later. Some places don't try to keep anything much from old systems for various reasons, including that they're undergoing a ferocious churn in what they need from their systems as they grow or change directions (or both at once). Some places explicitly decide to discard some old systems because they feel they're better off re-doing things from scratch (sometimes this was because the old systems were terrible quick hacks). And of course some organizations die, either failing outright or being absorbed by other organizations that have their own existing systems that you get to move to. Especially in today's world, it probably takes an unusually stable and long-lived organization to build up much system continuity. Unsurprisingly, universities can be such a place.
(Within a large organization like a big company or a university, system continuity is probably generally associated with the continuity of a (sub) group. If your group has its own fileservers or mail system or whatever, and your group gets dissolved or absorbed by someone else, you're likely going to lose continuity in those systems because your new group probably already has its own versions of those. Of course, even if groups stay intact there can be politics over where services should be provided and who provides them that result in group systems being discarded.)
Understanding the .io TLD's DNS configuration vulnerability
First there was Matthew Bryant's The .io Error - Taking Control of All .io Domains With a Targeted Registration, about a configuration error that allegedly allowed you to take over control of some .io nameservers, and then there was a response to it, Matt Pounsett's The .io Error: A Problem With Bad Optics, But Little Substance, which argued that this was much ado about nothing much. While I agree that the consequences are less severe than Bryant thought, I think that Pounsett's article understates the risks itself (and I believe doesn't correctly explain what's going on in the DNS here). In any case, the whole thing confused me and other people, so I'm going to write my understanding of things up here.
Let's start with the basics of compromising a domain through dangling nameserver delegation. Suppose you find a domain barney.io that lists ns1.fred.ly as one of its two nameservers, and fred.ly is not registered (worse nameserver mistakes happen). To attack barney.io, you register fred.ly and create a ns1.fred.ly A record that points to a nameserver that you're running. Some portion of the people looking up information in barney.io will wind up querying your nameserver, and at that point you can give them whatever answers you want. If they're asking their original question, you can directly lie to them (telling people that all MX entries in barney.io point to harvestmail.fred.ly, for example). If they're making NS queries to check for zone delegation, you can just give them NS records that point to you and start lying some more when they follow those records.

(You can then increase how many people will talk to your nameserver by DOSing the other barney.io DNS server off the Internet.)
This is more or less what the setup was for ns-a1.io through ns-a4.io, and all of those names could be registered as domains in .io and then given A records in your DNS data for your new domain(s) (and Matthew Bryant did just this with ns-a1.io). However, there was an important difference that made this less severe than my example, and that's that .io had active glue records in the root zone for those names that pointed people to the IP addresses of the real nameservers. With these glue records present, a client didn't talk to Matthew Bryant's DNS server just because it decided to query ns-a1.io as part of resolving a .io name; if it believed and used the glue records, it would wind up talking to the real nameserver. You only had your query diverted to Bryant's DNS server if you decided to send a query to ns-a1.io but not use the IP from the glue record and instead look it up directly.
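A toy model of the client's choice here (with entirely made-up IP addresses) shows why the glue mattered:

```python
# Toy model of the situation described above, with invented addresses.
# A client that trusts the root zone glue reaches the real nameserver;
# one that re-resolves ns-a1.io gets the attacker's freshly created A record.
ROOT_GLUE = {"ns-a1.io": "194.0.1.1"}        # made-up 'real' IP from glue
IO_ZONE_A = {"ns-a1.io": "203.0.113.66"}     # made-up attacker A record

def nameserver_ip(use_glue):
    if use_glue and "ns-a1.io" in ROOT_GLUE:
        return ROOT_GLUE["ns-a1.io"]         # believed and used the glue
    return IO_ZONE_A["ns-a1.io"]             # looked the name up directly

print(nameserver_ip(use_glue=True))
print(nameserver_ip(use_glue=False))
```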
Using data from glue records instead of looking things up yourself is common but not mandatory, and there are various reasons why a resolver would not do so. Some recursive DNS servers will deliberately try to check glue record information as a security measure; for example, Unbound has the harden-referral-path option (via Tony Finch). Since the original article reported seeing real .io DNS queries being directed to Bryant's DNS server, we know that a decent number of clients were not using the root zone glue records. Probably a lot more clients were still using the glue records, though.
(There are a bunch of uncertainties about just what DNS data was being returned by who during the incident. The original article shows a reply from a root server and that probably didn't change, but we don't know what the official .io servers themselves started returning as glue records for .io during the time that ns-a1.io was active as a domain registration. I will decline to speculate on what was the likely result here.)
Given my history with glue record hell, it amuses me that this is a case where dangling glue records helped instead of hurt, making a problem less severe than it would otherwise have been. Had there been no glue records or incomplete glue records for the .io zone, there would have been more danger (or at least the danger would have been clearer).

(In this case the presence of the glue records was mandatory, since .io uses NS names inside the zone itself. Without glue records in the root zone, you would have a chicken and egg problem in getting the IP address of, say, ns-a1.io.)
PS: As far as I can see from Bryant's article, he didn't realize that the root zone glue records would cause many clients to not query his DNS servers, significantly reducing the severity of someone having control over the names of four of the seven .io DNS servers.

As far as Pounsett's article goes, he appears to more or less spot the issue with root glue but doesn't explain it and appears to expect all clients to use the glue all of the time (which is demonstrably not the case). I think he may also be confusing the data in the .io zone with the root zone glue for .io. Note that it's not necessary to get your IP address for ns-a1.io into the .io zone; to make some clients start talking to you, it's enough for your NS records for ns-a1.io to show up and ideally to occlude the existing A record.

(We know that Bryant's NS records showed up in the .io zone. We don't know if they occluded the A record for ns-a1.io, if one was there, but it seems likely that they did.)
Sidebar: What I suspect went wrong in .io

It seems quite likely that the ns-a* names were intended to be purely host names of DNS servers, not domain names, much like my example of ns1.fred.ly. However, they were placed directly in the apex of a zone (.io) that allows people to register domains, and I suspect that the people running the IO zone forgot to tell the people running the IO registry that these names existed in the zone as host names and should be locked out from domain registration. That's been fixed now, obviously, and WHOIS tells me they're 'Reserved by Registry'.
(This is thus a different failure mode than having NS records for your domain or TLD that point to hosts in entirely unregistered domains. That's a pure failure, since the names don't exist at all except perhaps through lingering glue records. Here the names existed entirely properly; it's just that the IO registry was allowed to override them with new data.)

The problem doesn't come up for the other .io nameservers, which are all under nic.io, and nic.io is already a registered domain.
Recursive DNS servers send the whole original query to authoritative servers
As a long term sysadmin, I usually feel that I have a solid technical grasp of DNS (apart from DNSSEC, which I ignore on principle). Then every so often, I get to find out that I'm wrong. Today is one of those days.
Before today, if you had asked me how a recursive DNS server did a lookup from authoritative servers, I would have told you what is basically the standard story. If you're looking up the A record for fred.blogs.example.com, your local recursive server first asks a random root server for the NS records for .com, then asks one of the .com DNS servers for the NS records for example.com, then asks one of those theoretically authoritative DNS servers, and so on. Although this describes the chain of NS delegations that your recursive DNS server typically gets back, it turns out that this doesn't accurately describe what your server usually sends as its queries. The normal practice today is that your recursive DNS server sends the full original query to each authoritative server. It doesn't ask the root servers 'what is the NS for .com'; instead it asks the root servers 'what is the A record for fred.blogs.example.com', and they send back an answer that is basically 'I have no idea, ask one of these .com nameservers'.
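The behavior can be sketched as the sequence of questions actually sent; the server names here are illustrative, and real resolvers pick among many servers at each level:

```python
# Toy illustration of the behavior described above: at every step of the
# delegation chain, the recursive server asks the full original question.
def query_sequence(qname, delegation_chain):
    """Return (server, question) pairs; the question never shrinks."""
    return [(server, qname) for server in delegation_chain]

chain = ["a.root-servers.net", "a.gtld-servers.net", "ns1.example.com"]
for server, question in query_sequence("fred.blogs.example.com", chain):
    print(f"ask {server}: A record for {question}?")
```

Each server that isn't authoritative for the answer replies with a referral, but the question itself is the same at every step.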
Once I thought about it, this behavior made a lot of sense, because DNS clients don't know in advance where zone delegation boundaries are. It's common for there to be zone boundaries at each '.' in a name, but it's not always the case; you can certainly have zones where the zone boundaries are further apart. You can even have zones where it varies by the name. Consider a hypothetical .ca registry operator who allows registration of both <domain>.<province>.ca (eg fred.on.ca) and <domain>.ca (eg bob.ca), and does not have separate <province>.ca zones; they just carry all of the data in one zone without internal NS records. Here bob.ca has a NS record but on.ca doesn't, and your client certainly can't know which is which in advance.
When the client has no idea where the zone boundaries are, the simple
thing to do is to send the whole original query off to each step of the
delegation chain and see what they say. This way you don't have to try
any sort of backtracking when you ask for a NS for
.on.ca and get a
no data answer back.
Now, you might ask if sending the full query to all DNS servers in the chain like this has privacy implications. Why yes, it does, and there are proposed ways to work around it, such as RFC 7816 query minimization. Some DNS servers are already taking steps here; for example, current versions of Unbound have the qname-minimisation option in unbound.conf.
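For example, a fragment of unbound.conf that turns this on looks like:

```
server:
    # Send minimal question names upstream (RFC 7816 query minimization)
    # instead of the full original query.
    qname-minimisation: yes
```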
Moving to smaller fileservers for us probably means no more iSCSI SAN
In our environment, one major thing that drives us towards relatively big fileservers is aggregating and lowering the overhead of servers. Regardless of how big or small it is, any server has a certain minimum overhead cost due to needing things like a case and power supply, a motherboard, and a CPU. The result of this per-server overhead is economies of scale; a single server with 16 disk bays almost certainly costs less than two servers with 8 disk bays each.
We have a long history of liking to split our fileservers from our disk storage. Our current fileservers and our past generation of fileservers have both used iSCSI to talk to backend disk enclosures, and the generation of fileservers before them used Fibre Channel to talk to FC hardware RAID boxes. Splitting up the storage from the fileservice this way requires buying extra machines, which costs more; what has made this affordable is aggregating a fairly decent amount of disks in each box, so we don't have to buy too many extra ones.
If we're going to have smaller fileservers, as I've come to strongly believe we want, we're going to need more of them. If we're going to keep a similar design to our current setup, we would need more iSCSI backends to go with them. All of this means more machines and more costs. In theory we could lower costs by continuing to use 16-disk backends and share them between (smaller) fileservers (so two new fileservers would share a pair of backends), but in practice this would make our multi-tenancy issues worse and we would likely resist the idea fairly strongly. And we'd still be buying more fileservers.
If we want to buy a similar number of machines in total for our next generation fileservers but shrink the number of disks and the amount of storage space supported by each fileserver, the obvious conclusion is that we must get rid of the iSCSI backends. Hosting disks on the fileservers themselves has some downsides (per my entry about our SAN tradeoffs), but at a stroke it cuts the number of machines per fileserver from three to one. We could double the number of fileservers and still come out ahead on raw machine count. In a 10G environment, it also eliminates the need for two expensive 10G switches for the iSCSI networks themselves (and we'd want to go to 10G for the next generation of fileservers).
If we want to reduce the size of our fileservers but keep an iSCSI environment, we're almost certainly going to be faced with unappetizing tradeoffs. Considering the cost of 10G switch ports as well as everything else, our most likely choice would be to stop using two backends per fileserver; instead each fileserver would talk to a single 16-disk iSCSI backend (still using mirrored pairs of disks). This would increase the overall number of servers, but not hugely (we would go from 9 servers total for our HD-based production fileservers to 12 servers; the three fileservers would become six, and then we'd need six backends to go with them).
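To make the machine counts concrete, here is the arithmetic from the text in sketch form:

```python
# Machine-count arithmetic from the text. Current HD-based production
# setup: 3 fileservers, each with 2 iSCSI backends. Smaller-fileserver
# alternatives: 6 fileservers with 1 backend each, or 6 with local disks.
current       = 3 + 3 * 2   # 3 fileservers + 6 backends  = 9 machines
smaller_iscsi = 6 + 6 * 1   # 6 fileservers + 6 backends  = 12 machines
smaller_local = 6           # 6 fileservers, no backends   = 6 machines
print(current, smaller_iscsi, smaller_local)
```

Even with double the fileservers, dropping the backends comes out ahead of today's raw machine count.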
(It turns out that I also wrote about this issue a couple of years ago. At the time we weren't as totally convinced that our current fileservers are too big as designed, although we were certainly thinking about it, and I was less pessimistic about the added costs for extra servers if we shrink how big each fileserver is and so need more of them. (Or maybe the extra costs just hadn't struck me yet.))
Our current generation fileservers have turned out to be too big
We have three production hard drive based NFS fileservers (and one special fileserver that uses SSDs). As things have turned out, usage is not balanced evenly across all three; one of them has by far the most space assigned and used on it and unsurprisingly is also the most active and busy fileserver.
(In retrospect, putting all of the pools for the general research group that most heavily uses our disk space on the same fileserver was perhaps not our best decision ever.)
It has been increasingly obvious to us for some time that this fileserver is simply too big. It hosts too much storage that is too actively used and as a result it's the server that most frequently encounters serious system issues and in general it frequently runs sufficiently close to its performance edge that a little extra load can push it over the edge. Even when everything is going well it's big enough to be unwieldy; scheduling anything involving it is hard, for example, because so many people use it.
(This fileserver also suffers the most from our multi-tenancy, since so many of its disks are used for so many active things.)
However this fileserver is not fundamentally configured any differently than the other two. It doesn't have less memory or more disks; it simply makes more use of them than the other two do. This means that all three of our fileservers are too big as designed. The only reason the other two aren't also too big in operation today is that not enough people have been interested in using them, so they don't have anywhere near as much space used and aren't as active in handling NFS traffic.
Now, how we designed our fileservers is not quite how they've wound up being operated, since they're running at 1G for all networking instead of 10G. It's possible that running at 10G would make this fileserver not too big, but I'm not particularly confident about that. The management issues would still be there, there would still be a large impact on a lot of people if (and when) the fileserver ran into problems, and I suspect that we'd run into limitations on disk IOPS and how much NFS fileservice a single machine can do even if we went to the extreme where all the disks were local disks instead of iSCSI. So I believe that in our environment, it's most likely that any fileserver with that much disk space is simply too big.
As a result of our experiences with this generation of fileservers, our next generation is all but certain to be significantly smaller, just so something like this can't possibly happen with them. This probably implies a number of other significant changes, but that's going to be another entry.
Why big Exim queues are a problem for us in practice
In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.
(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)
In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits like overly-large directories that slow down everything that touches them or queue runners that are spending far too long scanning through hundreds of thousands of messages looking for ones to retry.
(A runaway client at this level might seem absurd, but with scripts, crontab, and other mistakes you can have a client generate tens of complaint messages a second. Every second.)
In our environment in specific, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.
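To put hypothetical numbers on this (these are illustrative figures, not our real measurements):

```python
# Illustrative arithmetic only; the capacity and per-delivery costs are
# assumptions, not measured values from our fileservers.
fileserver_ops_per_sec = 1000   # assumed total NFS operations/sec capacity
nfs_ops_per_delivery = 20       # assumed NFS ops for one filtered delivery
saturating_rate = fileserver_ops_per_sec / nfs_ops_per_delivery
print(saturating_rate)          # deliveries/sec that consume all capacity
```

With numbers anywhere in this neighborhood, a mail surge that Exim can deliver at hundreds of messages a second will monopolize the fileserver long before the queue itself becomes a problem.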
(A surge of delivery to /var/mail is more tolerable for various reasons, and a surge of delivery to external addresses is pretty close to 'we don't care unless the queue becomes absurdly large'. Well, apart from the bit where it might be spam and high outgoing volumes might get our outgoing email temporarily blacklisted in various places.)
Ironically this is another situation where Exim's great efficiency is working against us. If Exim was not as fast as it is, it would not be able to process so many deliveries in such a short amount of time and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.