A surprise about random IO versus unnecessary IO

September 29, 2010

Here is a tough trick question: which is better, a mailbox format where all the messages are just concatenated together (eg, traditional Unix mbox) or a mailbox format where everything is concatenated together but you have an index of the starting byte offsets of every message's headers and body (an indexed mbox format)?

(Assume that there is no overhead for maintaining the index.)

Some years ago, I would have confidently told you that the indexed mbox format was clearly better, and it was obviously more efficient to read the message headers (when, for example, an IMAP client scans a mailbox) by reading the index, then doing lseek() + read() of the message headers for every message. After all, disk IO is expensive and this was doing as little of it as possible.

Then I had an interesting experience and suddenly wound up disagreeing with my old self.

If the email messages are not too big I now strongly believe that it can often be better to read the entire mailbox sequentially, even at the cost of throwing away much of the data that got read. What is expensive on modern disks is (as always) the seeks, not the reads; reading 4K or 64K after a seek is almost the same cost. And the great benefit of reading sequentially is that it will activate readahead prefetching (even prefetching that forces disk seeks), which can drastically reduce the latency of read() requests and thus the overall time the whole scan takes. Effectively the lseek() + read() sequence reduces you to synchronous disk IO, while allowing readahead to kick in means that you are more and more asynchronous.

(This is really going to matter during significant load, where becoming synchronous will make you very slow indeed; if reading every message header takes a seek, typical disks will mean that everyone together scans 100 messages or so per disk per second.)

This creates an obvious model for when you win and when you lose. You win when reading the extra data (with system readahead happening) can complete in less time than waiting for a seek to finish. Short email messages will let you win big; long email messages will be bad. Lack of readahead will probably cause you to lose unless the mailbox is laid out sequentially on the disk and the messages are small enough that you can read several for essentially free.

(Note that there are some modern filesystems that always read in relatively large contiguous blocks, even if you only asked for a relatively small chunk of data; I believe that ZFS is one of them. These filesystems change this model a bit, since they sometimes convert what looks like a lseek() + read() sequence into 'readahead'.)

A clever program can of course use the index to be smart about this. It doesn't necessarily know what's going on at the OS and filesystem layout level, but it can use the index information to more or less know when reading is going to be ineffective because it has to read too much to get to the next message header. If it wants, it can make this partly dynamic by monitoring the sequential read bandwidth it's getting and guessing (or measuring) how much the occasional seek costs it.

Written on 29 September 2010.
« Why there is no POSIX standard for a Unix GUI
A lot of my bugs are conceptual oversights »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Sep 29 00:26:01 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.