2023-04-29
A crontab related mistake you can make with internal email ratelimits
Due to past painful experiences, we've given our email system a collection of internal ratelimits on various things, such as how much email a single machine can send at a time. When a ratelimit is hit, Exim temporarily rejects the email with an SMTP 4xx series error, so that (in theory) no email is actually lost, only delayed (and someone who's caused their machine to suddenly send them thousands of email messages has a chance to fix it before being overwhelmed). When I set up these ratelimits in Exim, I set them to what seemed to be a perfectly reasonable limit of '60 messages in 60 minutes', which averages to one message every minute while allowing for bursts of sending activity (this is a tradeoff you make with the ratelimit duration). Today, I discovered that this is a little bit of a mistake and that we actually want to set our ratelimits a bit higher than one email message a minute.
Suppose, not entirely hypothetically, that something has a crontab job that runs once a minute and that has started to generate output every time it runs, which cron will email to the crontab's owner (or to its MAILTO setting). This means that this machine is now running right at the edge of its sending ratelimit; if it generates even one more email message (for example from some other cron job that runs once a day to notify you about pending package updates on that machine), it will hit the ratelimit and have an email message stalled. Once even a single message stalls, this machine will never recover for as long as the once-a-minute job keeps sending email, because the ratelimit leaves no spare capacity to drain the backlog; there will always be something in its local mail queue. If it sends a second extra email, you'll wind up with the local mail queue always having two things waiting, and so on.
(These may or may not wait for very long, depending on how the machine's local mailer behaves.)
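To make the stuck queue concrete, here is a rough Python sketch of the situation. It is not how Exim actually computes ratelimits; it assumes a simple sliding window ('at most N accepted messages in the last 60 minutes') and a local mailer that retries its whole queue once a minute. The function name simulate and all of its parameters are made up for this illustration.

```python
from collections import deque


def simulate(limit, period_min=60, total_min=240, extra_at=(120,)):
    """Return the local mail queue length at the end of each minute.

    Assumptions (simplifications, not Exim's real algorithm):
    - the ratelimit is a sliding window of `limit` messages per `period_min`
    - a cron job generates exactly one message every minute
    - one extra message is generated at each minute listed in `extra_at`
    - the local mailer retries everything in its queue every minute
    """
    accepted = deque()  # minutes at which messages were accepted, oldest first
    queued = 0          # messages waiting in the local queue
    extras = set(extra_at)
    backlog = []
    for minute in range(total_min):
        # Forget accepted messages that have aged out of the window.
        while accepted and accepted[0] <= minute - period_min:
            accepted.popleft()
        queued += 1  # the once-a-minute cron job's email
        if minute in extras:
            queued += 1  # some other, occasional email
        # Send as much of the queue as the ratelimit currently allows.
        while queued and len(accepted) < limit:
            accepted.append(minute)
            queued -= 1
        backlog.append(queued)
    return backlog


print("60 / 60m, one extra message: ", simulate(60)[-1], "left queued")
print("60 / 60m, two extra messages:", simulate(60, extra_at=(120, 180))[-1], "left queued")
print("70 / 60m, one extra message: ", simulate(70)[-1], "left queued")
```

Under these assumptions, the '60 / 60m' limit leaves one message permanently queued after a single extra email and two after a second one, while a limit of 70 never builds a backlog at all.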
In our environment, we want our machines to clear their local mail queues unless there's been an explosion. In a relatively 'normal' situation like this, we'd prefer that the email get delivered rather than have potentially random email get delayed on the machine. As a result, we've discovered that we need to raise our ratelimits so that they're a bit above one message a minute on average. For now, we're using 70 messages in 60 minutes ('70 / 60m' in Exim's format for ratelimits).
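As a quick sanity check on how much slack a given setting leaves, here is a small Python sketch that converts a 'count / period' string into messages per hour and subtracts a steady sending rate. The helpers per_hour and headroom_per_hour are hypothetical names, and the parser only handles a simple subset of the format (an integer count and a single-unit period such as '60m' or '1h'), not everything Exim accepts.

```python
import re

# Seconds per time unit, for single-unit periods like '60m' or '1h'.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def per_hour(spec):
    """Convert a 'count / period' string such as '70 / 60m' to messages per hour."""
    m = re.fullmatch(r"\s*(\d+)\s*/\s*(\d+)([smhdw])\s*", spec)
    if m is None:
        raise ValueError(f"unsupported ratelimit spec: {spec!r}")
    count, length, unit = int(m.group(1)), int(m.group(2)), m.group(3)
    return count * 3600 / (length * _UNIT_SECONDS[unit])


def headroom_per_hour(spec, steady_per_hour):
    """Messages per hour still available on top of a steady sending rate."""
    return per_hour(spec) - steady_per_hour


# A once-a-minute cron job sends 60 messages an hour.
print(headroom_per_hour("60 / 60m", 60))  # 0.0
print(headroom_per_hour("70 / 60m", 60))  # 10.0
```

With a once-a-minute cron job, '60 / 60m' leaves no room for anything else, while '70 / 60m' leaves room for ten other messages an hour on average.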
(This elaborates on a Fediverse post.)