Wandering Thoughts archives

2010-10-29

What problems the Maildir mail storage format solves

There is a relatively widespread view that the Maildir mail storage format is the solution to any issue that one is having with traditional mail message storage (cf the comment here). This is not exactly so, and as a result it is important to understand which problems Maildir actually solves, which ones it does nothing about, and for that matter which problems it can make worse.

(In this entry I'm going to compare Maildir against the traditional Unix mbox format.)

Maildir excels at all single-message updates, including deleting single messages and moving them into another folder. These can be done with relatively inexpensive filesystem operations or at the worst, rewriting a single message; by contrast, in mbox format a true delete or significant message modification generally requires copying the entire rest of the mailbox, and indexes can't help mbox at all.

(You can get fast 'delete' in mbox by just marking the message as deleted, but this doesn't reclaim any space and reclaiming space may be important to the user.)
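
To make the difference concrete, here is a minimal sketch using Python's standard mailbox module; the mailbox paths are assumptions, not anything from a real system. The delete looks the same at the API level, but the work behind it is very different in the two formats.

    import mailbox

    def delete_first_message(box):
        """Delete the first message in a mailbox, whatever its format."""
        box.lock()
        try:
            for key in box.iterkeys():
                box.discard(key)   # Maildir: this just unlinks one file
                break
            box.flush()            # mbox: this rewrites the whole mailbox file
        finally:
            box.unlock()
            box.close()

    # Hypothetical locations; the Maildir delete is a single file removal,
    # while the mbox delete has to rewrite the mailbox file.
    delete_first_message(mailbox.Maildir("/home/user/Maildir", create=False))
    delete_first_message(mailbox.mbox("/var/mail/user", create=False))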

If a message has known flags, Maildir gives you immediate access to it because you can open it by filename. If the message has unknown flags but you know its identifier, you need to scan the directory; this is likely to still be better than a linear read through the mbox file. Indexes can make mbox just as fast as Maildir, though.
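
As an illustration, here is a minimal sketch of the filename convention involved (the directory and the unique message name are made up): a Maildir message is stored as '<unique>:2,<flags>', so known flags give you the full filename directly, while unknown flags force a scan of the directory.

    import os

    curdir = "/home/user/Maildir/cur"        # hypothetical folder
    unique = "1288316645.12345.mailhost"     # hypothetical unique message name

    # Known flags ('S' for seen): the filename is fully determined, so we
    # can open the message immediately.
    path = os.path.join(curdir, unique + ":2,S")

    # Unknown flags: we have to scan the directory for whatever name the
    # message currently has.
    def find_message(curdir, unique):
        for name in os.listdir(curdir):
            if name.split(":", 1)[0] == unique:
                return os.path.join(curdir, name)
        return None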

Maildir is at best a modest help with any sort of sequential scan situation. If your mail client wants to see four headers from every message in a folder, you still have to open each message and read the first chunk of it. Whether reading the first chunk of a bunch of files is faster than a linear read through an mbox format file is an interesting question, and certainly Maildir isn't going to be faster than doing the equivalent with an indexed mbox file. Reading most of each message is all but certain to be worse than working on an mbox file, because using separate files defeats a lot of OS level readahead and the file data may well be more spread out across the disk than it would have been in a single file.
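
Here is a minimal sketch of what that scan looks like for a Maildir folder (the folder path is an assumption); the point is simply that every message means another file to open and read, however little of it you actually want.

    import os
    from email.parser import BytesHeaderParser

    def scan_headers(curdir, wanted=("From", "To", "Subject", "Date")):
        parser = BytesHeaderParser()
        results = []
        for name in sorted(os.listdir(curdir)):
            with open(os.path.join(curdir, name), "rb") as f:
                head = f.read(8192)      # the first chunk is usually enough
            msg = parser.parsebytes(head)
            results.append({h: msg.get(h) for h in wanted})
        return results

    for hdrs in scan_headers("/home/user/Maildir/cur"):
        print(hdrs)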

(If you want to accelerate this sort of scan, you need the relevant headers indexed and cached in some more conveniently digestible format. This is true of both mbox and Maildir.)

All of this assumes that you have either small directories (ie small folders) or non-linear directories and thus can assume that looking up a file in a directory is a very cheap operation. If you're on a Unix that still uses linear directories and you have large folders, simple filename lookup and manipulation can become quite an expensive operation itself (as can scanning the directory).

WhenMaildirWins written at 00:24:05

2010-10-18

The cache eviction death spiral

Here's a cache behavior I've touched on in passing several times but feel like covering explicitly now. It's what I've taken to calling the cache eviction death spiral, where cache evictions start you down a path to grinding death.

Cache evictions of useful data have an obvious problem; they slow down the program when you have to pull the needed data back into cache. But in a contended cache under pressure from multiple sources, they have a second effect as well; because your program is now running slower, it is touching the data that it needs less often, 'cooling down' the data and making it more likely to be evicted from the cache in turn. Of course this then repeats, worse.

(Technically this only happens in caches that are based on some variety of least-recently-used or least-frequently-used. That would be pretty much all of them.)

This is the cache eviction death spiral. It can happen in any situation where the cache is being contended over (as opposed to the single process case where the cache just isn't big enough), and does not necessarily result in a winner; if you have enough sources of contention, they can all thrash each other to death, with everyone too busy stealing entries from other processes in order to get their own entries back into memory to do any real work.

An especially dangerous situation for a cache eviction death spiral is when some users of the cache have a much easier time keeping their data hot (or in general claiming cache space) than other users. This can create a brutal feedback cycle that rapidly more or less freezes the slow users out of the cache entirely (or at least confines them to whatever small portion the fast users don't need). Generally this results in them grinding to a complete halt (although performance monitoring may claim that they are very active).
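
To make this concrete, here is a deliberately artificial sketch (all of the sizes and costs are made-up toy numbers) of two clients cycling over their own working sets through one shared LRU cache. The client whose working set fits keeps its entries constantly hot; the other client misses, runs slower because of the misses, touches its own entries less often as a result, and so keeps losing them to eviction.

    import heapq
    from collections import OrderedDict

    CACHE_SIZE, HIT_COST, MISS_COST = 100, 1, 50
    cache = OrderedDict()                   # ordered least to most recently used

    def touch(key):
        if key in cache:
            cache.move_to_end(key)
            return True
        cache[key] = None
        if len(cache) > CACHE_SIZE:
            cache.popitem(last=False)       # evict the least recently used entry
        return False

    # 'small' can keep its whole working set in the cache; 'large' cannot.
    working_set = {"small": 40, "large": 70}
    position = {"small": 0, "large": 0}
    hits = {"small": 0, "large": 0}
    misses = {"small": 0, "large": 0}
    events = [(0, "small"), (0, "large")]   # (time this client next runs, client)

    while events:
        now, who = heapq.heappop(events)
        if now > 100000:
            break
        key = (who, position[who] % working_set[who])
        position[who] += 1
        if touch(key):
            hits[who] += 1
            heapq.heappush(events, (now + HIT_COST, who))
        else:
            misses[who] += 1
            heapq.heappush(events, (now + MISS_COST, who))

    # Once the cache warms up, 'small' hits on essentially every access while
    # 'large' misses on essentially every access, even though 'large' only
    # needs 70 of the 100 cache entries.
    print(hits, misses)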

If you're designing a cache in a system, it pays to consider whether you're susceptible to this issue, especially if your system can wind up starving some users. (I don't know if the literature has good ways around this, but if I were searching I would start with the operating system people; this crops up a lot in operating system caching.)

CacheEvictionDeathSpiral written at 00:10:13

2010-10-17

Read IO is generally synchronous (unlike write IO)

One of the important general differences between read IO and write IO is that read IO is generally synchronous (in its normal form), and not just because the operating system interfaces are usually synchronous. Read IO is synchronous in general because the program is almost always going to immediately look at and do something with the data that it's read, and it can't go on until it's done this. It's a relatively rare program that doesn't look at the data but instead either discards it or turns around and hands it to another system interface.

(Okay, it used to be a relatively rare program that did this. In the modern web world there are a number of cases where the program will indeed never look at a lot of the data itself and will instead just pass it along to somewhere else.)

This implies that regardless of the interface you offer to programs and regardless of how you implement it behind the scenes, one way or another you have to get that data off those rusty platters before the program can go on to do anything else. Cleverness in mmap() or zero-copy read IO or the like only gets you so far, because you have to materialize the data either way. The only way to get out of this is to somehow read the data off the platters before the program tries to look at it, using some sort of readahead; whether this is done by the program (with asynchronous IO or something similar) or by the operating system (by noticing access patterns) is somewhat of an implementation detail.
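
For example, here is a minimal sketch of program-directed readahead using posix_fadvise() (the file name is an assumption and do_other_work() is a hypothetical stand-in); the hint only starts the disk IO early, and the eventual read() still blocks if the data hasn't arrived yet.

    import os

    def do_other_work():
        pass        # hypothetical: whatever useful work we can do meanwhile

    fd = os.open("/var/tmp/somedata", os.O_RDONLY)
    size = os.fstat(fd).st_size

    # Tell the kernel we intend to read the whole file soon, so that it can
    # start pulling the data in now.
    os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)

    do_other_work()

    data = os.read(fd, size)   # still synchronous; blocks if readahead hasn't finished
    os.close(fd)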

Explicit asynchronous read IO interfaces are still useful for two reasons. First, the program may know more about its access patterns than the operating system can feasibly deduce. Second, there are cases where the program is working on several things at once and could proceed on whichever one of them gets off the disk first.
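
Here is a minimal sketch of that second case, faked with threads rather than a real asynchronous IO interface (the file names are assumptions); the program simply works on whichever file's data arrives first.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def read_file(path):
        with open(path, "rb") as f:
            return path, f.read()

    paths = ["/var/tmp/a", "/var/tmp/b", "/var/tmp/c"]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_file, p) for p in paths]
        for fut in as_completed(futures):   # whichever read finishes first
            path, data = fut.result()
            print(path, len(data))          # stand-in for real per-file work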

In theory write IO is equally synchronous, in that programs often want to reuse the buffers that they just wrote without changing what they wrote. In practice this is usually effectively asynchronous IO, because programs usually don't have to wait until the data gets all of the way down to the spinning rust; it's enough that the operating system has taken it off their hands.
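
A minimal sketch of the difference (the file name is an assumption): write() returns once the kernel has copied the data out of our buffer, and only something like fsync() makes the program wait for the disk.

    import os

    fd = os.open("/var/tmp/outfile", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    buf = bytearray(b"x" * 65536)

    os.write(fd, buf)        # returns once the kernel has taken the data
    buf[:] = b"y" * 65536    # so we can immediately reuse the buffer

    os.fsync(fd)             # this is where we would actually wait for the platters
    os.close(fd)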

ReadIOIsSynchronous written at 03:47:33

2010-10-12

Why non-linear directories have not been popular in filesystems

You might wonder why non-linear directories are so complicated that they haven't been used for most of Unix's history.

Computer science has a lot of data structures that are much better for both lookup and updates than unordered linear arrays. However, pretty much all of them are designed with the assumption that they're living in RAM, and even today RAM has very different characteristics from disks. Specifically, in RAM both fine-grained lookups and fine-grained shuffling of entries are relatively inexpensive. Neither is true for things that are on disk, so on-disk data structures need to be designed to minimize the amount of IO, especially the amount of random IO. You can do this, but for much of Unix's life it meant that the job was more than just implementing a simple, well-known data structure.

(One wrinkle here is that these data structures should keep working well even in a deliberately hostile environment, one where an attacker is trying to create a worst-case set of directory contents. This is a situation generally not faced by in-RAM algorithms.)

Next come the complications. Pretty much any efficient disk based fast lookup data structure is going to have rebalancing requirements, where adding or removing an entry may require shuffling other entries around. The first complication is that filesystems have to worry about untimely interruptions when they are partway through such rebalancing operations. In fact, they have to stay broadly consistent if such interruptions happen; after all, users will be very unhappy if 'touch file1' can result in the total loss of a large directory because the filesystem was halfway through a major reorganization when a power failure hit.

Second, a filesystem probably wants to allow concurrent updates to directories while a program such as ls is iterating through the directory. Since any update can cause a rebalancing, you then have to at least make sure that the iterating process does not wind up reading from, say, a now-released directory data block. Ideally you want the program to get sensible results despite the actual data structure changing significantly underneath it.

(This is somewhat challenging even for copy on write filesystems because you don't want ls to see a filename in a directory after the file has been deleted or renamed.)

The next question is whether non-linear directories are worth it in general on typical systems. A contiguous linear directory of even semi-modest size can often be completely read in a single disk IO, and after it's in memory a linear scan usually takes trivial time compared to disk IO; the degenerate case is a single-block directory. This means that there is a minimum directory size (in number of entries and on-disk size) below which non-linear directories get you no benefit because searching them takes just as much disk IO as searching a linear directory. How many directories on a typical system are over this size, and how much are they over? In addition Unix systems have usually had very high cache hit ratios for 'hot' directories, which reduces the IO advantage that non-linear directories have.
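
As a back-of-the-envelope sketch of 'semi-modest size', assuming ext2-style directory entries (8 bytes of fixed fields plus the name, rounded up to a multiple of 4 bytes) and a 4 KB directory block:

    def entry_size(name_len, fixed=8, align=4):
        raw = fixed + name_len
        return raw + (-raw % align)     # round up to the alignment

    block = 4096
    for name_len in (8, 16, 32):
        print(name_len, "character names:", block // entry_size(name_len),
              "entries per", block, "byte block")

Even with 32-character names that works out to over a hundred entries in a single block, which covers a lot of the directories on a typical Unix system.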

(There are certainly situations where non-linear directories win big, but these situations tend to be uncommon precisely because those situations have been big performance losers for much of Unix's life.)

Linear directories are not as efficient as better data structures, but they are much, much simpler to implement in a filesystem. Many of these issues have straightforward answers and often simple code will do the right thing. Non-linear directories are generally going to be much more complicated, with much more code and many more issues, and they don't get you very much except in uncommon situations. It should not be a surprise that most Unix filesystem implementors have had higher priorities and most filesystems have used some form of linear directories.

(Honesty in blogging time: I'm not a filesystem designer and I haven't gone looking to find out what the state of the art is here. It's possible that modern designs have found simple and effective solutions to these issues, so the problem isn't as hard as I think it is.)

HardNonlinearDirectories written at 01:15:48

2010-10-07

In universities, you often buy hardware when you can

I like to dream that in a real company, you can go out and just buy server hardware when it's needed. Around here, sadly, it doesn't work that way.

At least in my section of the university, one of the problems is that money for hardware can be what I'll call 'bursty'. We don't have an ongoing hardware budget; instead, we usually get one-time money more or less at random, when funds can be found or liberated from grants. Over the long term this averages out acceptably, but it's quite possible for there to be dry spells in the medium term for various reasons.

(Oh, if there is a real need or an emergency, money can be found. But even for a real need it may take a while to get the money, and we may irritate a number of important people in the process.)

One of the consequences of this is that we tend to buy hardware whenever we have the budget for it if we have any plausible future need for it, even if we don't specifically need the hardware right now. Otherwise, when we actually need it in six months or a year or even more we may find that there is no budget for it and we're stuck (barring a real emergency), or at least we have to deal with various sorts of awkward delays.

The downside of buying hardware early, even necessary hardware, is that we'd usually get a better deal if we waited; computer hardware generally drops in price (and improves) over time. But from our perspective it's much better to have hardware in reserve so that we can roll out capacity expansions smoothly as they're needed even if it results in us getting an objectively worse deal on, eg, hard disks.

(Note that we are generally not doing things that demand leading edge CPU performance or the like, which is one reason this is a viable thing to do. Most of our servers are years old anyways and work just fine for pretty much everything we deal with. Research groups doing heavy computation have much different needs and thus much different buying patterns.)

There are some aspects of this that we could handle better. For example, given that a certain amount of the hardware we buy will sit around unused for significant amounts of time, we should probably make a habit of unboxing everything when it comes in and at least turning it on to make sure it is not dead on arrival. Otherwise we'll someday open a box and be embarrassed to discover that the hardware is DOA and also old enough that the warranty has already expired.

Sidebar: the other reason to buy when you can

The other reason to buy hardware when you can is that it may not be available any more by the time you decide you want more. Even for normal hardware this can be annoying, because you have to go through a whole evaluation and qualification process to find a replacement. If you actively liked the hardware in question, it's even more irritating. And of course you wind up with a mixed environment instead of a uniform one, which complicates your life in various ways.

(Even if you have no formal requirement for competitive hardware purchasing, you can't buy new hardware and deploy it to production without testing it to make sure that it works.)

As a result, we have a tendency to load up on hardware we really like when we find some (especially if the vendor runs an end of quarter bargain sale; we love those, at least when we have the spare budget). We also like building a good spares pool for things that we're likely to expand in the future, such as network switches or the components of our fileserver environment.

(Network switches are one of those things where you will hate yourself a lot if you wind up with fifty different models and varieties.)

UniversitiesBuyWhenYouCan written at 00:02:54

