Wandering Thoughts archives


Modern email is hard to search, which encourages locked up silos

Today, I tweeted:

One of the things I don't like about modern email is how you can't really grep it in raw as-received form, because too many emails are encoded in eg base-64. Before you can thoroughly search you must parse and de-MIME everything (and store it that way).

Once upon a time, email was plain text (or at least mostly). This had the useful consequence that you could dump it in one or more files (even one file per email message) and then do basic searches through it with any tool that could search through plain text. There are a lot of tools that search through plain text, especially on Unix.

(Unless you only had one message per file, it was never quite true that you could do good searching without parsing the email structure in any way. If you wanted to search for two things being mentioned in the same email message, you needed something that could understand message boundaries. But this was not that much work, and you could construct it with brute force if you had to.)
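As an illustration of the brute-force approach, here is a minimal sketch that splits mbox-style text at lines starting with "From " and reports the messages that mention every search term. The function names and the sample data are made up for this example, and it assumes body lines that begin with "From " have already been escaped (traditionally as ">From "), which real mailboxes don't always guarantee.

```python
def split_mbox(text):
    """Split mbox-style text into messages at lines starting with 'From '."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        # A 'From ' separator line starts a new message.
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages


def messages_mentioning_all(text, terms):
    """Return the messages that contain every term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return [m for m in split_mbox(text)
            if all(t in m.lower() for t in lowered)]


# Hypothetical two-message mailbox, purely for illustration.
SAMPLE = (
    "From alice@example.org Mon Jan  1 00:00:00 2018\n"
    "Subject: disks\n\nThe ZFS pool needs a new disk.\n\n"
    "From bob@example.org Mon Jan  1 00:05:00 2018\n"
    "Subject: lunch\n\nNothing about ZFS here, only lunch.\n\n"
)

if __name__ == "__main__":
    for message in messages_mentioning_all(SAMPLE, ["zfs", "disk"]):
        print(message.splitlines()[0])
```

This only works because the messages themselves are plain text; it does nothing about encoded content.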

Those days are of course long gone. A lot of email today is encoded (typically in base64 or quoted-printable), for example because it contains some UTF-8 characters and email is still theoretically seven-bit ASCII only. This means that in order to do a good job of searching email, you must be able to decode all of this, which requires being able to parse the MIME structure of email messages.
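To make the problem concrete, here is a small sketch using Python's standard `email` package. It builds a message whose UTF-8 body is base64-encoded, shows that a plain search of the raw bytes misses a word from the body, and then finds that word after parsing and decoding. The addresses, subject, and helper names are invented for the example, and forcing base64 with `cte="base64"` is my choice here to mimic what many mail systems produce.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_raw_message():
    """Build a sample message the way many mailers would send it."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Café plans"
    # Force base64 so the body is unreadable in the raw form.
    msg.set_content("Let's meet at the café on Tuesday.\n",
                    charset="utf-8", cte="base64")
    return msg.as_bytes()


def decoded_text(raw):
    """Parse the MIME structure and return the subject plus text/plain parts."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts = [str(msg["Subject"])]
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            parts.append(part.get_content())
    return "\n".join(parts)


if __name__ == "__main__":
    raw = build_raw_message()
    print(b"Tuesday" in raw)                # what grep on the raw file sees
    print("Tuesday" in decoded_text(raw))   # after parsing and de-MIMEing
```

A real decoder would also have to cope with HTML-only messages, attachments, and malformed MIME, all of which this sketch ignores.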

Parsing MIME and decoding email messages is not the fastest thing in the world (and it's also not the easiest). If you want to do fast searching or use general text searching tools, you can no longer store email in its raw, as-received state; you need to come up with some partially or completely decoded format. There's no standard storage format for this, so everyone makes up their own, then doesn't document it or commit to preserving backwards compatibility with it in the future. This restricts what tools can be used to do even basic text searches on your archived email, and is part of what encourages custom archiving formats.
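As a sketch of what such a made-up decoded format might look like, the following writes one plain-text file per message, containing a few decoded headers and the decoded text/plain body, so that ordinary tools like grep can search the result. The file naming, the choice of headers, and the function names are all assumptions of mine for this example, not any existing standard, which is rather the point.

```python
import base64
from email import message_from_bytes, policy
from pathlib import Path


def to_searchable_text(raw):
    """Decode one raw message into headers plus the text/plain body."""
    msg = message_from_bytes(raw, policy=policy.default)
    lines = [f"{name}: {msg[name]}"
             for name in ("From", "To", "Subject", "Date")
             if msg[name] is not None]
    lines.append("")
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        lines.append(body.get_content())
    return "\n".join(lines)


def export_for_grep(raw_messages, out_dir):
    """Write one decoded .txt file per message and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, raw in enumerate(raw_messages):
        path = out / f"{i:06d}.txt"
        path.write_text(to_searchable_text(raw), encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical raw message with a base64-encoded body, for illustration.
SAMPLE_RAW = (
    b"From: alice@example.org\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode("grep can find this now\n".encode("utf-8"))
    + b"\r\n"
)
```

Calling `export_for_grep([SAMPLE_RAW], "decoded")` would produce `decoded/000000.txt`, which grep can search directly. The cost is that you now have a second copy of your mail in a private format that you have to keep in sync with the raw original.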

The result is that to do a good job of searching modern email, you need to use a relatively narrow range of tools instead of having your pick of anything that can search or index text. Often your tools will be restricted to whatever's built into your mail client.

(The extreme case of this is web-based mail systems, where you don't normally get the text form of the mail at all, just a rendered version, and all of the searching happens on the server with whatever features, tools, and decoding they choose to support.)

tech/ModernEmailSearchingProblem written at 00:18:43

