Plan for manual emergency blocks for your overall mail system
Last year, I wrote about how your overall anti-spam system should have manual emergency blocks. At the time I was only thinking about incoming spam, but after some recent experiences here, let me extend that and say that all entry points into your overall mail system should have emergency manual blocks. This isn't just about spam or bad mail from the outside, or preventing outgoing spam, although those are important things. It's also because sometimes systems just freak out and explode, and when this happens your mail system can get deluged as a result. Perhaps a monitoring system starts screaming in email, sending thousands of messages over a short span of time. Perhaps someone notices that a server isn't running a mailer and starts it, only to unleash several months worth of queued email alerts from said server. Perhaps some outside website's notification system malfunctions and deluges some of your users (or many of them) with thousands of messages.
(There are even innocent cases. Email between some active upstream email source (GMail, your organization's central email system, etc) and your systems might have clogged up, and now that the clog has been cleared the upstream is trying to unload all of that queued email on you as fast as it can. You may want some mechanisms in place to let you slow down that incoming flood once you notice it.)
We now have an initial set of blocks, but I'm not convinced that they're exactly what you should have; our current blocks are partly a reaction to the specific incidents that happened to us and partly guesswork about what we might want in the future. Since anticipating the exact form of future explosions is somewhat challenging, our guesswork is probably going to be incomplete and imperfect. Still, it beats nothing and there's value in being able to stop a repeat incident.
(Our view is that we've built are some reasonably workable but crude tools for emergency use, tools that will probably require on the spot adjustment if and when we have to turn them on. We haven't tried to build reliable, always-on mechanisms similar to our anti-spam internal ratelimits.)
We have a reasonably complicated mail system with multiple machines running MTAs; there's our inbound MX gateway, two submission servers, a central mail processing server, and some other bits and pieces. One of the non-technical things we've done in the fallout from the recent incidents is to collect in one spot the information about what you can do on each of them to block email in various ways. Hopefully we will keep this document updated in the future, too.
(You may laugh, but previously the information was so dispersed that I actually forgot that one blocking mechanism already existed until I started doing the research to write all of this up in one place. This can happen naturally if you develop things piecemeal over time, as we did.)
Sidebar: Emergency tools versus routine mechanisms
Some people would take this as a sign that we should have always-on mechanisms (such as ratelimits) that are designed to automatically keep our mail system from being overwhelmed no matter what happens. My own view is that designing such mechanisms can be pretty hard unless you're willing to accept ones that are set so low that they have a real impact in normal operation if you experience a temporary surge.
Actually, not necessarily (now that I really think about it). It is in our environment, but that's due to the multi-machine nature of our environment combined with some of our design priorities and probably some missing features in Exim. But that's another entry.