We need to deploy anti-spam precautions even if they're a bit imperfect
A few years ago we had a local spam incident. In its wake, we made some configuration changes and started exploring things like ratelimiting outgoing email. Our first step in this was to set our Exim configuration to track rate limits without enforcing them, so that we could figure out what limits to set that would stop spammers without causing problems for our users.
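To make the 'track without enforcing' idea concrete, here is a minimal Python sketch of a per-sender limiter with a switch between the two modes. It is an illustration only: the class and its names are hypothetical, it uses a simple sliding window, and it is not Exim's own rate smoothing or configuration syntax.

```python
from collections import defaultdict, deque


class SenderRateLimiter:
    """Hypothetical sliding-window limiter that either only logs or also refuses."""

    def __init__(self, max_messages, window_seconds, enforce=False):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.enforce = enforce
        self.sent = defaultdict(deque)  # sender -> timestamps of accepted messages
        self.warnings = []              # stand-in for a real log file

    def allow(self, sender, now):
        """Return True if a message from `sender` at time `now` is accepted."""
        times = self.sent[sender]
        # Forget messages that have aged out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_messages:
            # Over the limit: always record it, but only refuse when enforcing.
            self.warnings.append((sender, now, len(times) + 1))
            if self.enforce:
                return False
        times.append(now)
        return True
```

With `enforce=False` every message is still accepted and the overruns only show up in `warnings`; flipping the flag to `True` is the whole difference between observing a limit and applying it.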
At one level, this was a sensible decision. Causing disruptions to our users might create political pressure that would stop us from taking any precautions against future spam runs from compromised local accounts. But, well, we never really found a limit that our users didn't exceed once in a blue moon, and when overruns only happen once in a blue moon it's hard to iterate on tuning limits. And so things sat there from 2012 until today.
Today we had another little local spam incident (as people on Twitter might have guessed). Our ratelimit tracking code dutifully logged that this was happening and that our hypothetical ratelimits were being exceeded, and by quite a bit too:
[...] Warning: SENDER RATE LIMIT HIT: 27943.5 / 60m max 200 / [...]
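For a sense of scale, a small Python sketch can pull the numbers out of a warning like this one. The message is our own custom log text, so the field meanings here (measured rate, period, then the configured maximum) are an assumption from reading the line, and the function name is hypothetical.

```python
import re

# Assumed layout: "<measured rate> / <period> max <limit>".
PATTERN = re.compile(
    r"SENDER RATE LIMIT HIT: (?P<rate>[\d.]+) / (?P<period>\w+) max (?P<limit>[\d.]+)"
)


def parse_rate_warning(line):
    """Return the rate, period, limit and overrun ratio, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rate = float(match.group("rate"))
    limit = float(match.group("limit"))
    return {
        "rate": rate,
        "period": match.group("period"),
        "limit": limit,
        "ratio": rate / limit,  # how many times over the limit we were
    }
```

Under that reading, the line above works out to roughly 140 times the hypothetical limit.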
It may not surprise you to hear that we now have some active ratelimits; in fact we more or less simply turned our previous tracking ratelimits into enforcing ones. They're undoubtedly not perfect ratelimits, and in fact I'm fairly sure that within six months someone here sending out an entirely legitimate burst of email will run into them. But as usual the perfect is the enemy of the good. Our quest to deploy only perfect anti-spam precautions that would never inconvenience our users resulted in us deploying almost no anti-spam precautions, with regrettable results.
(Nor did we avoid inconveniencing users, since some of them had email bounce due to the machine in question temporarily picking up a bad sender reputation.)
We don't want to deploy significantly imperfect anti-spam precautions, for obvious reasons. Something that gets in the way of our users on a frequent basis is no good. But I've come around to the view that we need to be more willing to deploy things that are a bit imperfect and then sort out the problems when they happen. Otherwise, well, we may be looking at something like this happening all over again.
(One such precaution may be some sort of scanning of our outgoing email, or at least some of it. Despite my historical reservations, I now think it's possible to do this in a good way, and I think that the risk of false positives may be one of those 'a bit imperfect' things we can live with, at least initially. But right now I'm kind of thinking out loud in the immediate aftermath of an incident, which gives me some biases.)