Modern email is hard to search, which encourages locked up silos

July 28, 2021

Today, I tweeted:

One of the things I don't like about modern email is how you can't really grep it in raw as-received form, because too many emails are encoded in eg base-64. Before you can thoroughly search you must parse and de-MIME everything (and store it that way).

Once upon a time, email was plain text (or at least mostly). This had the useful consequence that you could dump it in one or more files (even one file per email message) and then do basic searches through it with any tool that could search through plain text. There are a lot of tools that search through plain text, especially on Unix.

(Unless you only had one message per file, it was never quite true that you could do good searching without parsing the email structure in any way. If you wanted to search for two things being mentioned in the same email message, you needed something that could understand message boundaries. But this was not that much work, and you could construct it with brute force if you had to.)
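As a rough sketch of that brute-force approach, assuming the mail sits in a traditional mbox file where each message begins with a line starting with `From ` (and where the delivery software has escaped such lines inside message bodies), something like the following would do. The function names `split_mbox` and `messages_mentioning_all` are made up for illustration.

```python
def split_mbox(text):
    """Split mbox-style text into messages at lines that start with 'From '.

    This is deliberately naive: it assumes body lines starting with 'From '
    were already escaped (commonly as '>From ') when the mail was stored.
    """
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages


def messages_mentioning_all(text, terms):
    """Return the messages that contain every term, ignoring case."""
    lowered = [term.lower() for term in terms]
    return [
        message
        for message in split_mbox(text)
        if all(term in message.lower() for term in lowered)
    ]
```

Calling `messages_mentioning_all(open("archive.mbox").read(), ["nvme", "firmware"])` on a hypothetical `archive.mbox` would return only the messages that mention both words. This only works on plain text mail, which is exactly the limitation discussed next.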

Those days are of course long gone. A lot of email today is encoded, for example because it contains some UTF-8 characters and email is still theoretically seven-bit ASCII only. This means that in order to do a good job of searching email, you must be able to decode all of this, which requires being able to parse the MIME structure of email messages.
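To make the problem concrete, here is a minimal sketch using Python's standard `email` package. The sample message and the helper name `decoded_text` are invented for illustration. A plain substring search of the raw message finds nothing, because the body is base64, while the same search succeeds once the message has been parsed and decoded.

```python
import base64
from email import message_from_string, policy

# A made-up message whose body is base64 encoded, as a mail client might
# send it once the text contains a non-ASCII character.
body = "The café meeting moved to Friday.\n"
raw_message = (
    "From: a@example.org\n"
    "To: b@example.org\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(body.encode("utf-8")).decode("ascii")
    + "\n"
)


def decoded_text(raw):
    """Parse a raw message and return the decoded content of its text parts."""
    msg = message_from_string(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Skip multipart containers and non-text parts such as attachments.
        if part.get_content_maintype() == "text":
            parts.append(part.get_content())
    return "\n".join(parts)


print("Friday" in raw_message)                # the grep-style search misses it
print("Friday" in decoded_text(raw_message))  # the decoded search finds it
```

Real mail is messier than this sample (unknown or mislabeled character sets, malformed MIME structure, HTML-only bodies), so a robust searcher needs considerably more error handling than the sketch shows.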

Parsing MIME and decoding email messages is not the fastest thing in the world (and it's also not the easiest). If you want to do fast searching or use general text searching tools, you can no longer store email in its raw, as received state; you need to come up with some partially or completely decoded format. There's no standard storage format for this, so everyone makes up their own, then doesn't document it or commit to preserving backwards compatibility with it in the future. This restricts what tools can be used to do even basic text searches on your archived email, and is part of what encourages custom archiving formats.
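As an illustration of what such a decoded format might look like, the sketch below flattens a raw message into plain text that ordinary tools could search. The layout, the choice of headers, and the name `to_searchable_text` are all arbitrary assumptions, which is rather the point: every tool ends up inventing its own version of this.

```python
from email import message_from_bytes, policy

# Which headers to keep is an arbitrary choice made for this sketch.
HEADERS_TO_KEEP = ("From", "To", "Date", "Subject")


def to_searchable_text(raw_bytes):
    """Render a raw message as decoded headers plus decoded text bodies."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    lines = []
    for name in HEADERS_TO_KEEP:
        value = msg[name]
        if value is not None:
            # With policy.default, RFC 2047 encoded words are decoded here.
            lines.append(f"{name}: {value}")
    lines.append("")
    for part in msg.walk():
        # Keep inline text parts; skip containers and attachments.
        if part.get_content_maintype() == "text" and not part.is_attachment():
            lines.append(part.get_content().rstrip("\n"))
    return "\n".join(lines) + "\n"
```

Writing the output of `to_searchable_text()` to one file per message would give something grep can work on, at the cost of storing the mail twice and of having a private format that no other tool knows how to read.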

The result is that to do a good job of searching modern email, you need to use a relatively narrow range of tools instead of having your pick of anything that can search or index text. Often your tools will be restricted to whatever's built into your mail client.

(The extreme case of this is web-based mail systems, where you don't normally get the text form of the mail at all, just a rendered version, and all of the searching happens on the server with whatever features, tools, and decoding they choose to support.)


Comments on this page:

Interesting to see your perspective on this, because I feel almost the opposite.

Email is my favourite format to search because Notmuch is such a great tool. It indexes mail, and then provides a really nice command line search interface. It plays well with other tools for further mail processing as well, like mblaze, by having an option where it will just print out the file paths of all the matching messages.

It's so much better than anything else I've ever used, for any other format including plain text, that I've tried to shoehorn as much data as possible into MIME format just so I can search it with Notmuch.

(Various mail clients also have nice integration with Notmuch, like Emacs and NeoMutt.)

By Walex at 2021-07-28 10:56:40:

Among the "relatively narrow range of tools", I strongly recommend Recoll, which is a general purpose indexer with a large range of content extractors, including one that can process email pretty well (it keeps the indexed email text cached in plain form for preview purposes).

It is very robust, and its search accuracy seems very good to me, whether with simple queries, the more powerful search form, or the sophisticated query language.

Its only downside is that its index takes up 5%-10% of the size of the indexed data, depending on the percentage of indexable content.

The search tool that I miss most for GNU/Linux is an image similarity search tool (Geeqie has an image similarity deduplication tool). There used to be imgSeek here: https://sourceforge.net/projects/imgseek/ but it has been unmaintained for over ten years.

My inbox is about 150k messages in 20 GB, spanning ~25 years. Content is just about everything, including text, PDFs, Office documents, ZIP files etc.

On this journey I used a lot of MUAs: Pegasus Mail, mutt, Thunderbird, Outlook, The Bat!, and maybe a handful of others I forget. Searching through everything and finding stuff in the archives has always been an issue (I also run a business).

Just recently I imported everything into Dovecot with Roundcube as frontend. The magic search sauce is https://github.com/grosjo/fts-xapian. It will index text and binary formats (which will take a while) and I can now search through those 20 GB, with excellent results returned within a second or two. For me that is Google lightning fast territory, powered by a low-end Xeon next to my desk. This setup will find strings in PDFs where Outlook and Thunderbird failed.

Really great piece of open source software, if you dare to self-host or put this into production at your place of employment.

Written on 28 July 2021.

Last modified: Wed Jul 28 00:18:43 2021