Wandering Thoughts archives

2010-09-29

A surprise about random IO versus unnecessary IO

Here is a tough trick question: which is better, a mailbox format where all the messages are just concatenated together (eg, traditional Unix mbox) or a mailbox format where everything is concatenated together but you have an index of the starting byte offsets of every message's headers and body (an indexed mbox format)?

(Assume that there is no overhead for maintaining the index.)

Some years ago, I would have confidently told you that the indexed mbox format was clearly better, and it was obviously more efficient to read the message headers (when, for example, an IMAP client scans a mailbox) by reading the index, then doing lseek() + read() of the message headers for every message. After all, disk IO is expensive and this was doing as little of it as possible.
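To make the comparison concrete, here is a rough Python sketch of that indexed approach. The index format (a list of per-message header-start and body-start byte offsets) is just something I'm assuming for illustration; this is not a real mbox parser.

    import os

    def scan_headers_indexed(mbox_path, index):
        """Read just the headers of every message via a byte-offset index.

        'index' is assumed to be a list of (header_start, body_start)
        offsets, one pair per message."""
        headers = []
        fd = os.open(mbox_path, os.O_RDONLY)
        try:
            for header_start, body_start in index:
                # One seek plus one read per message; each read can only be
                # issued after the previous one has finished, so with a cold
                # cache this is fully synchronous disk IO.
                os.lseek(fd, header_start, os.SEEK_SET)
                headers.append(os.read(fd, body_start - header_start))
        finally:
            os.close(fd)
        return headers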

Then I had an interesting experience and suddenly wound up disagreeing with my old self.

If the email messages are not too big, I now strongly believe that it can often be better to read the entire mailbox sequentially, even at the cost of throwing away much of the data that gets read. What is expensive on modern disks is (as always) the seeks, not the reads; reading 64K after a seek costs almost the same as reading 4K. And the great benefit of reading sequentially is that it activates readahead prefetching (even prefetching that forces disk seeks), which can drastically reduce the latency of read() requests and thus the overall time the whole scan takes. Effectively, the lseek() + read() sequence reduces you to synchronous disk IO, while letting readahead kick in makes you more and more asynchronous.
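For contrast, here is an equally rough sketch of the sequential scan, using the same assumed index. It slurps the whole mailbox into memory and then throws the bodies away, so it only makes sense for mailboxes of moderate size.

    def scan_headers_sequential(mbox_path, index, chunk=64 * 1024):
        """Read the entire mailbox front to back and keep only the headers.

        Reading sequentially in large chunks lets the OS's readahead kick
        in, so most read() calls come back from the page cache instead of
        waiting for the disk."""
        pieces = []
        with open(mbox_path, 'rb') as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                pieces.append(block)
        whole = b''.join(pieces)
        # Slice out the headers afterwards; the body bytes are simply
        # discarded, which is the 'unnecessary IO' we deliberately did.
        return [whole[h:b] for h, b in index]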

(This is really going to matter under significant load, where becoming synchronous will make you very slow indeed; if reading every message header takes a seek, a typical disk limits everyone put together to scanning only 100 or so messages per second.)

This creates an obvious model for when you win and when you lose. You win when reading the extra data (with system readahead happening) can complete in less time than waiting for a seek to finish. Short email messages will let you win big; long email messages will be bad. Lack of readahead will probably cause you to lose unless the mailbox is laid out sequentially on the disk and the messages are small enough that you can read several for essentially free.
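To put very rough numbers on this model (the seek time and streaming bandwidth here are plausible guesses, not measurements of any particular disk):

    def sequential_wins(extra_bytes, seek_ms=8.0, stream_mb_per_s=80.0):
        """Crude break-even test: reading the unwanted bytes between one
        header and the next beats seeking past them if it takes less time
        than a single seek would."""
        read_ms = extra_bytes / (stream_mb_per_s * 1024 * 1024) * 1000.0
        return read_ms < seek_ms

    # With these made-up figures the break-even point is around 650 KB of
    # unwanted data per message; short messages win easily, huge ones lose.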

(Note that there are some modern filesystems that always read in relatively large contiguous blocks, even if you only asked for a relatively small chunk of data; I believe that ZFS is one of them. These filesystems change this model a bit, since they sometimes convert what looks like a lseek() + read() sequence into 'readahead'.)

A clever program can of course use the index to be smart about this. It doesn't necessarily know what's going on at the OS and filesystem layout level, but it can use the index information to more or less know when reading is going to be ineffective because it has to read too much to get to the next message header. If it wants, it can make this partly dynamic by monitoring the sequential read bandwidth it's getting and guessing (or measuring) how much the occasional seek costs it.
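Here is a sketch of what such an index-aware scanner might look like. The threshold stands in for whatever the program measures or guesses about its seek cost and sequential read bandwidth; everything about it is hypothetical.

    def scan_headers_adaptive(mbox_path, index, seek_cost_bytes=512 * 1024):
        """Use the index to decide, message by message, whether to read
        straight through the intervening body bytes or to seek past them.

        'seek_cost_bytes' is how much we think we can read sequentially in
        the time one seek takes; a real program would derive it from the
        read bandwidth it is actually seeing."""
        headers = []
        with open(mbox_path, 'rb') as f:
            pos = 0
            for header_start, body_start in index:
                gap = header_start - pos
                if gap > seek_cost_bytes:
                    # Too much unwanted data; jump over it instead.
                    f.seek(header_start)
                elif gap > 0:
                    # Cheap enough to read through and discard.
                    f.read(gap)
                headers.append(f.read(body_start - header_start))
                pos = body_start
        return headers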

SeekingVsReadingSurprise written at 00:26:01

2010-09-13

Why you want a recursive-forwarding caching nameserver

One of the things that I mentioned I wanted in a caching nameserver was the ability to forward queries to other nameservers both recursively and non-recursively. A commenter asked, effectively, why you needed to forward recursive queries. As it turns out, I did not have a handy answer at the time because I couldn't remember off the top of my head where I'd needed it. Well, I think I've now come up with the answer.

The reason you need this is zone splicing in upstream caching DNS servers. Suppose that your organization operates some official caching nameservers for everyone inside the organization to query, and further suppose that it has a mixture of internal and external domains. One way to set this up is split horizon DNS, but another way is to simply splice your internal zones into the configuration of your official caching nameservers by directly specifying master or forwarding servers for them.

(This is perfectly safe if your caching nameservers are already only accessible to internal machines.)

When you splice zones in this way, they do not necessarily have NS records (and even if they do, there are situations where clients may not be able to directly query the master nameservers for the zone). Instead, everything works because the official nameserver simply forwards queries and glues everything together for clients.

However, if you send a non-recursive query for one of those spliced zones to the upstream official nameserver, the upstream will not necessarily forward your query to the real DNS servers; after all, you told it not to do that. The result is that your query fails if the answer is not already cached in the upstream.
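You can see this difference from the client side by toggling the 'recursion desired' flag on your queries. Here is a small sketch using the dnspython module; the upstream address and the name in the spliced zone are placeholders, not anything real.

    import dns.flags
    import dns.message
    import dns.query

    UPSTREAM = '10.0.0.53'          # placeholder: the official caching nameserver
    NAME = 'host.internal.example'  # placeholder: a name in a spliced internal zone

    # Recursive query: RD (recursion desired) is set by default, so the
    # upstream will follow its splice/forwarding configuration to get an
    # answer even if nothing is cached yet.
    recursive = dns.message.make_query(NAME, 'A')
    print(dns.query.udp(recursive, UPSTREAM, timeout=5))

    # Non-recursive query: with RD cleared, the upstream will only answer
    # from what it already has, which for a spliced zone may be nothing.
    non_recursive = dns.message.make_query(NAME, 'A')
    non_recursive.flags &= ~dns.flags.RD
    print(dns.query.udp(non_recursive, UPSTREAM, timeout=5))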

So if you are putting together a local caching nameserver, you really want it to be able to send recursive queries to the official machine. If it will only send non-recursive queries, you need to obtain, duplicate, and keep current all of the zone splices that the upstream nameserver does in order to get the same results. With recursive queries, you just need to recursively forward the top level splice point and you're done, which is a lot easier to set up and maintain.

(As it happens, we have just such a zone splicing official caching nameserver locally. When I was trying to do this with djbdns's dnscache it was a pain in the rear to set everything up, and that was despite my being one of the people maintaining the official caching nameserver.)

PS: hopefully this makes sense. I have not been able to actually try suitable dig queries to verify this behavior (although it would be unusual for BIND to invent phantom NS records at zone splice points), and I am slightly under the weather at the moment.

WhyNameserverRecursiveCaching written at 03:39:53

2010-09-12

Keyword expansion: why RCS rewrites files when you commit them

Given my previous entry, you might wonder why RCS does such a crazy thing as deleting and recreating files when you just make a commit (in RCS terms, a checkin), since modern version control systems don't do this. The answer turns out to be pretty simple: keyword expansion. Specifically, RCS supports keyword expansion and modern version control systems don't.

RCS has a feature where you can embed magic keywords in your source code that are automatically expanded into various pieces of information when RCS checks out your file. The classic example is identifying information about the file's version; the traditional use for this is embedding it in C source as a static string, so that it will appear in the binary and can later be extracted to determine which source version was used to build some random binary.

(Assuming, that is, that the binary was built from unedited source files, with no uncommitted edits. You may be starting to see the problems here.)
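For illustration, here is roughly what this looks like. I'm showing it in a Python file rather than the traditional C static string, and the expanded version is just an example of the general format RCS uses, not real values.

    # What you write and commit:
    rcsid = "$Id$"

    # What RCS turns it into when it checks the file out (example values):
    # rcsid = "$Id: notify.py,v 1.7 2010/09/12 01:25:44 someuser Exp $"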

This sounds like a very convenient feature, but it has a cost; it means that the file's proper contents change every time you commit it. Thus, every time you commit a file you (may) need to update the checked out version in order to make its keyword expansions have their proper value. Which means deleting and recreating files a lot, at least if you are RCS.

(In theory RCS could notice that the committed file does not have any keywords to expand and thus doesn't need to be recreated, which is the usual case these days. In practice it is not that smart.)

This also complicates various other aspects of RCS; the obvious one is checking for differences between live files and the repository. This needs to scan the live file for keywords and de-expand them in order to avoid spurious differences, which naturally slows down and complicates the process.
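A rough sketch of that de-expansion step (the keyword list here is partial and the pattern is simplified compared to what RCS actually handles):

    import re

    # Matches both the bare form ($Id$) and the expanded form ($Id: ... $)
    # for a few common RCS keywords.
    _KEYWORD_RE = re.compile(r'\$(Id|Revision|Date|Author|Header)(?::[^$\n]*)?\$')

    def deexpand_keywords(text):
        """Collapse expanded RCS keywords back to their bare $Keyword$ form
        so that a comparison against the repository copy doesn't show a
        spurious difference on every keyword line."""
        return _KEYWORD_RE.sub(lambda m: '$' + m.group(1) + '$', text)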

These complications are a large part of why modern version control systems have by and large strongly rejected keyword expansion (cf Mercurial or git). Not having keyword expansion is occasionally inconvenient for users, but it makes modern VCSes significantly faster and more attractive.

VCSKeywordImpact written at 01:25:44

2010-09-05

A plan to deal with my feed reader problem

I have a feed reader problem, one that long ago reached epic levels: in practice, I'm not actually reading feed entries. For years, Liferea has been telling me that I have thousands of unread entries and I have been ignoring them. I think it's time to declare feed reader bankruptcy (which is much like email bankruptcy) and deal honestly with the results.

(This will be a bit traumatic, because I'm somewhat obsessive about some things. It hurts to consciously and deliberately throw away unread entries.)

In thinking about this, I have realized that I follow two sorts of feeds: casual reading feeds, which I keep around so that I have something to browse when I'm feeling bored and want to poke at their topic, and feeds that I am strongly interested in and want to read all or almost all of, even if it takes me a while. If I'm being honest about it, almost all of the feeds I currently have in Liferea are casual reading feeds (which is one reason I keep not reading them).

So here's my current plan for dealing with all of this:

A certain number of the casual feeds are simply going to be discarded (a process that I've already started); I'll trust that anything worth reading that they produce will show up on the usual link sources that I browse (such as Hacker News). The rest of them will go into Google Reader, because Google Reader will quietly expire old unread entries for me. Throwing away old entries to keep the volume manageable is exactly the behavior I want for casual feeds.

(Google Reader is also better for casual browsing because I can use it from anywhere. Liferea is tied to a particular machine.)

My important feeds will stay in Liferea where I can exert more control over them, for example deciding exactly when they expire (or don't). I will probably also find some feeds that are more convenient to read in Liferea than in Google Reader. If I do this right I will have only a relatively small number of feeds in Liferea, and they will generally not have many unread entries.

I'm not sure that this will actually work, but I'll have to see how it goes. Something certainly needs to change; thousands of unread entries that just keep expiring off the bottom of feeds don't work.

(They 'work' in one sense, but they create a kind of mental pressure that makes me avoid having much to do with them. Right now I avoid entire categories of feeds in Liferea because of all of the unread entries.)

PS: if I'm being honest with myself, I should probably throw away at least half of my casual feeds. Many of them were added because they looked sort of interesting, way back in my early days of feed reading enthusiasm when I felt that I had a lot more time for this. Rather than putting them in Google Reader only to ignore them, I should just save them in a file somewhere.

(This reminds me rather vividly of mailing lists, and if I go far enough back, Usenet. I went through much the same pattern with them that I am going through with feeds now, and if I got into something like Twitter I suspect that I would go through the same pattern with it too.)

DealingWithMyFeeds written at 00:30:37

