Wandering Thoughts archives


Our current generation fileservers have turned out to be too big

We have three production hard drive based NFS fileservers (and one special fileserver that uses SSDs). As things have turned out, usage is not balanced evenly across all three; one of them has by far the most space assigned and used on it and unsurprisingly is also the most active and busy fileserver.

(In retrospect, putting all of the pools for the general research group that most heavily uses our disk space on the same fileserver was perhaps not our best decision ever.)

It has been increasingly obvious to us for some time that this fileserver is simply too big. It hosts too much storage that is too actively used and as a result it's the server that most frequently encounters serious system issues and in general it frequently runs sufficiently close to its performance limits that a little extra load can push it over the edge. Even when everything is going well it's big enough to be unwieldy; scheduling anything involving it is hard, for example, because so many people use it.

(This fileserver also suffers the most from our multi-tenancy, since so many of its disks are used for so many active things.)

However this fileserver is not fundamentally configured any differently than the other two. It doesn't have less memory or more disks; it simply makes more use of them than the other two do. This means that all three of our fileservers are too big as designed. The only reason the other two aren't also too big in operation today is that not enough people have been interested in using them, so they don't have anywhere near as much space used and aren't as active in handling NFS traffic.

Now, how we designed our fileservers is not quite how they've wound up being operated, since they're running at 1G for all networking instead of 10G. It's possible that running at 10G would make this fileserver not too big, but I'm not particularly confident about that. The management issues would still be there, there would still be a large impact on a lot of people if (and when) the fileserver ran into problems, and I suspect that we'd run into limitations on disk IOPS and how much NFS fileservice a single machine can do even if we went to the extreme where all the disks were local disks instead of iSCSI. So I believe that in our environment, it's most likely that any fileserver with that much disk space is simply too big.

As a result of our experiences with this generation of fileservers, our next generation is all but certain to be significantly smaller, just so something like this can't possibly happen with them. This probably implies a number of other significant changes, but that's going to be another entry.

FileserversDesignedTooBig written at 23:03:27; Add Comment

Why big Exim queues are a problem for us in practice

In light of my recent entry on how our mail system should probably be able to create backpressure, you might wonder why we even need to worry about 'too large' queue sizes in the first place. Exim generally performs quite well under load and doesn't have too many problems dealing with pretty large queues (provided that your machines have enough RAM and fast enough disks, since the queue lives on disk in multiple files). Even in our own mail system we've seen queues of a few thousand messages be processed quite fast and without any particular problem.

(In some ways this speed is a disadvantage. If you have an account compromise, Exim is often perfectly capable of spraying out large amounts of spam email much faster than you can catch and stop it.)

In general I think you always want to have some sort of maximum queue size, because a runaway client machine can submit messages (and have Exim accept them) at a frightening speed. Your MTA can't actually deliver such an explosion anywhere near as fast as the client can submit more messages, so sooner or later you will run into inherent limits like overly-large directories that slow down everything that touches them or queue runners that are spending far too long scanning through hundreds of thousands of messages looking for ones to retry.

(A runaway client at this level might seem absurd, but with scripts, crontab, and other mistakes you can have a client generate tens of complaint messages a second. Every second.)

In our environment in specific, the problem is local delivery, especially people who filter local delivery for some messages into their home directories. Our NFS fileservers can only do so many operations a second, total, and when you hit that limit everyone starts being delayed, not just the MTA (or the server the MTA is running on). If a runaway surge of email is all directed to a single spot or to a small number of spots, we've seen the resulting delivery volume push an already quite busy NFS fileserver into clear overload, which ripples out to many of our machines. This means that a surge of email doesn't just affect the target of the surge, or even our mail system in general; under the wrong circumstances, it can affect our entire environment.

(A surge of delivery to /var/mail is more tolerable for various reasons, and a surge of delivery to external addresses is pretty close to 'we don't care unless the queue becomes absurdly large'. Well, apart from the bit where it might be spam and high outgoing volumes might get our outgoing email temporarily blacklisted in general.)

Ironically this is another situation where Exim's great efficiency is working against us. If Exim was not as fast as it is, it would not be able to process so many deliveries in such a short amount of time and thus it would not be hitting our NFS fileservers as hard. A mailer that maxed out at only a few local deliveries a second would have much less impact here.

EximWhyBigQueuesProblem written at 00:47:24; Add Comment


Our MTAs should probably be able to create backpressure

Suppose that you have a complicated mail system, as we do, one with several MTAs and machines involved and mail coming in from multiple directions (inside users, inside machines, outside machines, etc). You would like some always-on precautions (such as ratelimits) that will keep you from being overwhelmed by genuine problems while not harming either normal operations or temporary surges. This sounds hard, but I recently realized that there is probably a general mechanism that will work for it.

If you look at it from the right angle, a multi-machine, multi-MTA environment is a bunch of distributed queues. At an abstract level email moves through this system in much the same way that things move through other multi-hop queue-based systems; everything is queues. We have a lot of experience with queues in programming and system design, and to condense things a lot, we have learned the hard way that simple queue-based systems can easily fly to pieces under overload (for example, see this article).

One of the classical ways of protecting queueing systems from explosive failure under overload conditions is backpressure. When one queue is overloaded, it pushes back by no longer accepting new queue entries. This pressure may ripple back immediately through the system, or it may lead to other queues hitting overload later and pushing back. Backpressure works, and it's often considered essential in robust queue-based distributed systems.
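Abstractly, the core mechanism is tiny. Here's an illustrative sketch in Python (invented names, nothing to do with any real MTA) of a bounded queue that applies backpressure by refusing new entries once it's over its limit:

```python
# Illustrative sketch of backpressure: a bounded queue that refuses new
# entries once it is at its limit, the way an MTA would answer SMTP 4xx
# when its queue is abnormally large.
from collections import deque

class BackpressureQueue:
    def __init__(self, limit):
        self.limit = limit
        self.items = deque()

    def submit(self, item):
        """Return True if accepted, False if deferred (backpressure)."""
        if len(self.items) >= self.limit:
            return False          # 'SMTP 4xx': try again later
        self.items.append(item)
        return True

    def deliver_one(self):
        """Remove and return the oldest entry, or None if the queue is empty."""
        return self.items.popleft() if self.items else None
```

The caller's reaction to a False from submit() is the interesting part; in SMTP terms it's a 4xx 'try again later', which pushes the problem of holding the message back to whoever is upstream.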

You can see the conclusion here: our multi-MTA setup should be able to apply backpressure. When the queue on one machine gets to be abnormally big, it should start deferring all attempted incoming email (with SMTP 4xx responses). When this happens on an edge machine, such as the external MX gateway, this will naturally push back against the ultimate source of the traffic. When it happens on our central mail handling machine it will probably wind up causing queues on the edge machines to start filling up, which will prompt them to put backpressure on incoming messages. If we set the queue sizes reasonably right, we should be able to not block ordinary mail, deal with temporary surges, and automatically slow down real problems (especially sudden bursts of messages).

Or at least that's the theory. In practice there are tradeoffs involved in not holding email yourself. For the external MX gateway, applying backpressure to senders leaves us at the mercy of their policies on retries and message expiry. On the optimistic side, from the sender's viewpoint backpressure is basically the same as graylisting, and given that graylisting is reasonably common it's very likely that senders cope with it these days. For mail submission servers the issues are more complicated, because mail clients may have various issues if we don't accept their email. However we're already sort of prepared to deal with them since we've already deployed some ratelimiting. We could also limit this backpressure to the mail submission machine that handles unauthenticated connections, since today these tend to be from internal machines (users should be switching to our authenticated SMTP submission machine). Internal machines are historically the most likely source of a sudden flood of email that we want to push back against, and they should mostly handle backpressure. Only mostly, though; there are probably some systems that are not actually using mailers and so will probably just drop deferred email on the floor.

(Clearly the solution for those special systems is a special mail submission machine or IP address that's only used by them and by nothing else. This machine would always accept email, no matter how broken.)

On top of these general practical issues, there is the issue that I don't think Exim has any easy way to do things based on the current queue size. Exim has some load limiters, but these are based on system load average and connection counts. Nor does it look like Exim exposes the current queue size as a variable (or condition). One could probably ${run} a command to look at it, but that seems like a hack at best.
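As an untested illustration of that ${run} hack, an RCPT ACL could defer everything when 'exim -bpc' (which prints the number of messages currently in the queue) reports too many. The 5000 threshold and the paths here are invented, and shelling out on every RCPT TO has obvious costs:

```
acl_check_rcpt:
  # Defer all incoming mail when the queue is abnormally large. Exim
  # expands ${run{...}} to the command's standard output, and 'exim -bpc'
  # prints the current count of queued messages.
  defer
    message   = Temporarily overloaded, please try again later
    condition = ${if >{${run{/usr/sbin/exim -bpc}}}{5000}}

  # ... the rest of the normal RCPT checks go here ...
```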

MTABackpressureNeed written at 21:53:00; Add Comment


Thinking through issues a mail client may have with SMTP-time rejections

In response to my entry on who holds problematic email, Evaryont left a comment lamenting the general 'don't trust the user's mail client' approach of mail submission servers accepting more or less everything from clients and bouncing things later. This has prompted me to try to think about the issues involved, so today you get this entry in reaction.

First, the basics. We already do our best to verify SMTP sender addresses and immediately reject bad ones, for good reasons, and I hope that there's general agreement on this for sender addresses, so this is only about how we (and other people) should react to various sorts of bad destination addresses (SMTP RCPT TOs). For 'bad' destination addresses, there are two different cases; sometimes you know that the destination address is bad and could give a permanent SMTP rejection, and sometimes you can't verify the destination address and would give an SMTP temporary failure (a 4xx deferral).

For permanent rejections, the question is whether the user's mail client will do a better job of telling them about these bad addresses than your bounce message does. This is a straightforward question of UI design (and whether the mail client expects rejection at all, which it really should and these days almost certainly does). In theory a mail client can do a substantially better job of helping the user deal with bad addresses than a bounce can; for example, the mail client could immediately abort message submission entirely, report to the user 'these addresses are bad, I have marked them in red for you', and give the user the opportunity to correct or remove them before (re-)sending the message. In practice a bounce may give the user a better record of the failures than, say, a temporary popup dialog box about 'these addresses failed' that gives them no way to record the information or react to it.

(Correcting or removing the bad addresses before the message is sent at all is an overall better experience for everyone involved in the email; consider future replies to all addresses, for example. Bounces are also much less convenient for correcting bad addresses and resending your message, since there's no straightforward path from the bounce to sending a new copy of the original to corrected addresses.)

For temporary deferrals, things get a lot more complicated in both mail client handling and UI design. Some temporary deferrals will be cured in time; if they are pushed back to the client, the client must maintain its own queue of messages and addresses to be re-submitted, manage scheduling further delivery attempts, and decide on warning messages after sufficient time. For many clients this is going to be complicated by intermittent and perhaps variable connectivity (where you're on the Internet but you can't talk to the original mail submission server). Some temporary deferrals will never be cured in time, and for them the client also has to eventually give up and somehow present this to the user to do something (alternately, the client can just let the user keep trying endlessly until the user themselves clicks a 'stop trying, give up' button). Notifying the user at all about initial temporary deferrals is potentially a bad idea, especially with an intrusive alert; unlike permanent rejections, this is not something the user really needs to deal with right away.

(The mail client could also immediately abort message submission when it gets temporary deferrals and give the user a chance to change or remove addresses, but it's not clear that this is the right choice. There are a lot of things that can cause curable temporary deferrals (in fact some may be cured in seconds, when DNS results finally show up and so on), and you probably don't want to not send your message to such addresses.)

Reliably maintaining queues and handling retries is fairly complicated, especially for a mail client that may only be run intermittently and have network connectivity only some of the time. My guess is that mail servers are probably in a much better position to do this most of the time, and for temporary deferrals that will be rapidly cured (for example, ones caused by slow to respond DNS servers) a mail server will probably get the message delivered sooner. Also, when the mail client is the one to handle temporary deferrals, it's going to wind up having to send (much) more data over its connection, especially if the message has multiple temporarily deferred destinations and they cure themselves at different times. Having the server handle all retries means that the server holds the message and the mail client only has to upload it to the server once.

(On mobile devices these extra message submissions are also going to burn more battery power, as Aristotle Pagaltzis noted in the context of excessive web page fetching in a comment on this recent entry.)

MUAIssuesWithRejection written at 23:28:02; Add Comment

One tradeoff in email system design is who holds problematic email

When you design parts of a mail system, for example a SMTP submission server that users will send their email out through or your external MX gateway for inbound email, you often face a choice of whether your systems should accept email aggressively or be conservative and leave email in the hands of the sender. For example, on a submission server should you accept email from users with destination addresses that you know are bad, or should you reject such addresses during the SMTP conversation?

In theory, the SMTP RFCs combined with best practices give you an unambiguous answer; here, the answer would be that clearly the submission server should reject known-bad addresses at SMTP time. In practice things are not so simple; generally you want problematic email handled by the system that can do the best job of dealing with it. For instance, you may be extremely dubious about how well your typical mail client (MUA) will handle things like permanent SMTP rejections on RCPT TO addresses, or temporary deferrals in general. In this case it can make a lot of sense to have the submission machine accept almost everything and sort it out later, sending explicit bounce messages to users if addresses fail. That way at least you know that users will get definite notification that certain addresses failed.

A similar tradeoff applies on your external MX gateway. You could insist on 'cut-through routing', where you don't say 'yes' during the initial SMTP conversation until the mail has been delivered all the way to its eventual destination; if there's a problem at some point, you give a temporary failure and the sender's MTA holds on to the message. Or you could feel it's better for your external MX gateway to hold inbound email when there's some problem with the rest of your mail system, because that way you can strongly control stuff like how fast email is retried and when it times out.

Our current mail system (which is mostly described here) has generally been biased towards holding the email ourselves. In the case of our user submission machines this was an explicit decision because at the time we felt we didn't trust mail clients enough. Our external MX gateway accepted all valid local destinations for multiple reasons, but a sufficient one is that Exim didn't support 'cut-through routing' at the time so we had no choice. These choices are old ones, and someday we may revisit some of them. For example, perhaps mail clients today have perfectly good handling of permanent failures on RCPT TO addresses.

(An accept, store, and forward model exposes some issues you might want to think about, but that's a separate concern.)

(We haven't attempted to test current mail clients, partly because there are so many of them. 'Accept then bounce' also has the benefit that it's conservative; it works with anything and everything, and we know exactly what users are going to get.)

WhoHoldsEmailTradeoffs written at 01:01:13; Add Comment


How I'm currently handling the mailing lists I read

I recently mentioned that I was going to keep filtering aside email from the few mailing lists that I'm subscribed to, instead of returning it to being routed straight into my inbox. While I've kept to my decision, I've had to spend some time fiddling around with just how I was implementing it in order to get a system that works for me in practice.

What I did during my vacation (call it the vacation approach) was to use procmail recipes to put each mailing list into a file. I'm already using procmail, and in fact I was already recognizing mailing lists (to ensure they didn't get trapped by anti-spam stuff), so this was a simple change:

# deliver matching messages by appending to the mbox file 'somelist'
# (the file name is illustrative; the second ':' asks for a lockfile)
:0:
* ^From somelist-owner@...
somelist

This worked great during my vacation, when I basically didn't want to pay attention to mailing lists at all, but once I came back to work I found that filing things away this way made them too annoying to deal with in my mail environment. Because MH doesn't deal directly with mbox format files, I needed to go through a whole dance with inc and then rescanning my inbox and various other things. It was clear that this wasn't the right way to go. If I wanted it to be convenient to read this email (and I did), incoming mailing list messages had to wind up in MH folders. Fortunately, procmail can do this if you specify '/some/directory/.' as the destination (the '/.' is the magic). So:

# deliver into the MH folder directory (the path is illustrative); the
# trailing '/.' tells procmail to store one message per file, MH style
:0
* ^From somelist-owner@...
$HOME/Mail/inbox/somelist/.

(This is not quite a complete implementation, because it doesn't do things like update MH's unseen sequence for the folder. If you want these things, you need to pipe messages to rcvstore instead. In my case, I actually prefer not having an unseen sequence be maintained for these folders for various reasons.)

The procmail stuff worked, but I rapidly found that I wanted some way to know which of these mailing list folders actually had pending messages in them. So I wrote a little command which I'm calling 'mlists'. It goes through my .procmailrc to find all of the MH destinations, then uses ls to count how many message files there are and reports the whole thing as:

:; mlists
+inbox/somelist: 3

If there's enough accumulated messages to make looking at the folder worthwhile, I can then apply standard MH tools to do so (either from home with standard command line MH commands, or with exmh at work).
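A stripped-down sketch of such an mlists command might look like the following shell script (with a simplified report format and hypothetical paths; my real version differs in details):

```shell
#!/bin/sh
# Scan a procmailrc for MH folder destinations (absolute paths ending in
# '/.') and report how many messages each folder holds. MH keeps one
# message per numbered file, so counting all-digit file names in the
# folder directory counts the messages.
mlists() {
    rcfile="$1"
    grep '^/.*/\.$' "$rcfile" 2>/dev/null | sed 's;/\.$;;' |
    while read -r dir; do
        count=$(ls "$dir" 2>/dev/null | grep -c '^[0-9][0-9]*$')
        if [ "$count" -gt 0 ]; then
            echo "+$(basename "$dir"): $count"
        fi
    done
}

mlists "${1:-$HOME/.procmailrc}"
```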

It's early days with this setup, but so far I feel satisfied. The filtering and filing stuff works and the information mlists provides is enough to be useful but sufficiently minimal to push me away from actually looking at the mailing lists for much of the time, which is basically the result that I want.

PS: there's probably a way to assemble standard MH commands to give you a count of how many messages are in a folder. I used ls because I couldn't be bothered to read through MH manpages to work out the MH way of doing it, and MH's simple storage format makes this kind of thing easy.

MailingListsHandling-2017-06 written at 00:17:57; Add Comment


Why filing away mailing lists for a while has improved my life

I've been on vacation for the past little while. As part of this vacation, I carried out my plans to improve my vacations, part of which was using procmail to divert messages from various mailing lists off to files instead of having them delivered to my inbox as I usually do. I started out only doing this to mailing lists for work software, like Exim and OmniOS, but as my vacation went on I added the mailing lists for other things that I use. As I hoped and expected, this worked out quite well; I soon got over my urge to check in on the mailing lists and mostly ignored them.

Recently I came to a realization about why this feels so good. It's not specifically that it's reduced the volume of email in my inbox; instead, the really important thing it's done is that right now, pretty much everything that shows up in my inbox is actually important to me. It's email from friends and family, notifications that I care about getting, and so on.

(Coming to this realization and writing it up has sharpened my awareness that some of the remaining email going to my inbox doesn't make this bar, and thus should also be filed away on future breaks and vacations.)

There's nothing wrong with the emails from those mailing lists. They're generally perfectly interesting. But right now (and in general) the mailing list email is not important in that way. It's not something that I care about. When it all was going into my inbox, a significant amount of my inbox was stuff that I didn't really care about. That doesn't feel good (and has other effects). Now my inbox is very pared down; it's either silent and empty, or the new email is something that I actively want to read because it matters to me.

(In other words, it's not just that processing my inbox is faster now, it's that the payoff from doing so is much higher. And when there is no payoff, there's no email.)

If I'm being honest about these mailing lists, most of this is going to be true even when I go back to work tomorrow morning. Sure, if I've just asked a question or gotten into a conversation, reading the mailing list immediately usually has a relatively high payoff. But at other times, the payoff is much lower and having the mailing lists go straight to my inbox is just giving me a slow drizzle of low-priority, low-payoff email that I wind up having to pay some attention to.

In fact I think a drizzle is a good analogy here. Like the moment to moment experience of biking in a light drizzle, the individual emails are not particularly onerous or bad. But the cumulative result of staying out in that light drizzle is that you quietly wind up soaked, bit by bit by bit. So I think it's time for me to get out of the email drizzle for a while, at least to see what it's like on an ongoing basis.

(I intend to still read these mailing list emails periodically, but I'm going to do it in big batches and at a time of my choosing. Over a coffee at the end (or start) of a day at work, perhaps. I'll have to see.)

EmailGettingOutOfTheDrizzle written at 23:21:51; Add Comment


My .procmailrc has quietly sort of turned into a swamp

As part of trying to not read some mailing lists for a while, I was recently going through my .procmailrc. Doing this was eye-opening. It's not that my .procmailrc was messy as such, because I don't have rules that are sophisticated enough to get messy (just a bunch of 'if mail is like <X>, put it into file Y' filtering rules). Instead, mostly what it had was a whole lot of old, obsolete rules that haven't been relevant for years.

Looking back, it's easy to see how these have quietly accreted over time. Like many configuration files, I almost never look at my .procmailrc globally, scanning through the whole thing. Instead, when I have a new filtering rule I want to add, I jump to what seems to be the right place (often the bottom) and edit the new rule in. If I notice in passing what might be an obsolete filtering rule for a type of email that I don't get any more, usually I ignore it, because investigating is more work and wasn't part of my goal when I did 'vi .procmailrc'.

(The other thing that a strictly local view of changes has done to my .procmailrc is create a somewhat awkward structure for the sequence of rules. This resulted in some semi-duplication of rules and one bit of recent misclassification, when I got the ordering of two rules wrong because I didn't even realize there was an ordering dependency.)

As a result of stubbing my toe on this, I now have two issues (or problems) I'm considering. The first is what to do about those obsolete rules. Some of them are completely dead and can just be deleted, but others are for things that just might come back to life, even if it hasn't happened for years. There is a part of me that wants to preserve those rules somewhere, just in case I want them again some day. This is probably foolish. Perhaps what I should do is save a backup copy somewhere (or just check my .procmailrc into RCS first).

The second is how much of a revision to do. Having now actively looked at the various things I'm doing and want to do in my .procmailrc, there's a temptation to completely restructure it by splitting my rules into multiple files and then including them in the right spots. This would make where to put new rules clearer to future me, make the overall structure much clearer, and make it simpler to do global things like temporarily divert almost all the mailing lists I get off to files (all those rules would be in one file, so I'd either include it or not include it in my .procmailrc). On the other hand, grand reforms are arguably second system enthusiasm showing. It might be that I'd spend a bunch of time fiddling around with my mail filtering and wind up with a much more complicated, harder to follow setup that basically did the same thing.

(Over-engineering in a fit of enthusiasm is a real hazard.)

PS: applications to other configuration files you might have lying around are left as an exercise for the reader, but I'm certainly suspecting that this is not the only file I have (or that we have) that exhibits this 'maintained locally but not globally' slow, quiet drift into a swamp.

ProcmailrcSwamp written at 01:19:42; Add Comment


In practice, putting SSDs into 3.5" drive bays is a big hassle

When I talked about how we failed at making all our servers have SSD system disks, I briefly talked about how one issue was that SSDs are not necessarily easily compatible with 3.5" drive bays. If you have never encountered this issue, you may be scratching your head, because basic spacers to let you put 2.5" drives (SSDs included) into 3.5" drive bays are widely available and generally dirt cheap. Sure, you have to screw some extra things on your SSDs, but unless you're working at a much bigger scale than we are, this doesn't really matter.

The problem is that this doesn't always work in servers, depending on how their drive bays work. The fundamental issue is that a 3.5" SATA HD has its power and SATA connectors at the bottom left edge of the drive, as does a 2.5" SSD, and a simple set of spacers can't position the SSD so that both the connectors and the screw holes line up where they need to be. In servers where you manually insert the SATA and power cables and the provided cables are long enough, you can stretch things to make simple spacers work. In servers with exact-length cables or with hot-swap bays that you slide drives into (either with or without carriers), simple spacers don't work and you need much more expensive sleds (such as IcyDock's).

(Sleds are not very expensive in small quantities, but if you're looking at giving a bunch of servers dual SSD system disks and you're planning to use inexpensive SSDs, adding a $15 part to each disk adds up fast.)

We sort of knew about this issue when we started, but we thought it wasn't going to be a big deal. We were wrong. It adds cost and just as important, it adds friction; it's an extra part to figure out, to qualify, to stock, and to reorder when you start running low. You can't just grab a SSD or two and stick them in a random server, even if you have the SSDs; you have to figure out what you need to get the SSDs mounted, perhaps see if you have one or two sleds left, and so on and so forth.

The upshot of all of this is that we're now highly motivated to get 2.5" drive bays in our next set of basic 1U servers, at least for servers with only two drive bays. As a pleasant side benefit, this would basically give us no choice but to use SSDs in these servers, since we don't have any random old 2.5" HDs and we're unlikely to buy new 2.5" HDs.

(Sadly, this issue is basically forced by the constraints of 3.5" and 2.5" HDs. The power and SATA connectors are at the edge of each because that's where the circuit board goes, and it goes on the outside of the drive in order to leave as much space as possible for the platters, the motors, and so on.)

SSDIn3.5DriveBayProblem written at 02:44:37; Add Comment
