A surprise about random IO versus unnecessary IO
Here is a tough trick question: which is better, a mailbox format where all the messages are just concatenated together (eg, traditional Unix mbox) or a mailbox format where everything is concatenated together but you have an index of the starting byte offsets of every message's headers and body (an indexed mbox format)?
(Assume that there is no overhead for maintaining the index.)
Some years ago, I would have confidently told you that the indexed mbox
format was clearly better, and it was obviously more efficient to read
the message headers (when, for example, an IMAP client scans a mailbox)
by reading the index, then doing
read() of the message
headers for every message. After all, disk IO is expensive and this was
doing as little of it as possible.
Then I had an interesting experience and suddenly wound up disagreeing with my old self.
If the email messages are not too big I now strongly believe that it can
often be better to read the entire mailbox sequentially, even at the
cost of throwing away much of the data that got read. What is expensive
on modern disks is (as always) the seeks, not the reads; reading 4K
or 64K after a seek is almost the same cost. And the great benefit of
reading sequentially is that it will activate readahead prefetching
(even prefetching that forces disk seeks), which can drastically reduce
the latency of
read() requests and thus the overall time the whole
scan takes. Effectively the
read() sequence reduces you to
synchronous disk IO, while allowing readahead to kick in means that you
are more and more asynchronous.
(This is really going to matter during significant load, where becoming synchronous will make you very slow indeed; if reading every message header takes a seek, typical disks will mean that everyone together scans 100 messages or so per disk per second.)
This creates an obvious model for when you win and when you lose. You win when reading the extra data (with system readahead happening) can complete in less time than waiting for a seek to finish. Short email messages will let you win big; long email messages will be bad. Lack of readahead will probably cause you to lose unless the mailbox is laid out sequentially on the disk and the messages are small enough that you can read several for essentially free.
(Note that there are some modern filesystems that always read in
relatively large contiguous blocks, even if you only asked for a
relatively small chunk of data; I believe that ZFS is one of them.
These filesystems change this model a bit, since they sometimes convert
what looks like a
read() sequence into 'readahead'.)
A clever program can of course use the index to be smart about this. It doesn't necessarily know what's going on at the OS and filesystem layout level, but it can use the index information to more or less know when reading is going to be ineffective because it has to read too much to get to the next message header. If it wants, it can make this partly dynamic by monitoring the sequential read bandwidth it's getting and guessing (or measuring) how much the occasional seek costs it.
Comments on this page:Written on 29 September 2010.