My brute force email archive

February 7, 2011

Years ago, I had a brainwave about archiving my email. The brainwave was simple: 'disk is cheap'. So I changed my .forward to save a copy of all of my email to a file, in addition to the other filtering I was doing with it. I don't point a mail client at the file or otherwise use it for anything in my regular email setup; it is purely a backup and completely separate from my regular email client.

(To be honest, it might have evolved out of my caution when I started using procmail. Since I didn't entirely trust procmail, I think that I set up my .forward to save a backup copy of all of my email in a file. At some point I then realized that disk space was cheap and didn't actually clear the file; I just let it accumulate.)

More recently I realized that it needed one more thing to be really complete and useful; it needed to get a copy of my outgoing email, not just my incoming email. Thus over the past few years I've switched to cc'ing myself on everything I write (my MUA generally does this automatically, in place of saving sent messages itself).

There are two important attributes of this brute force archive that make it so useful. First, it is truly comprehensive; it has everything, not just the things that I thought I was going to want (or need) later. I wouldn't say that I'm bad at picking what I'll need later, but I'm not completely accurate at it. Having a complete archive as a backup means that I don't have to be; my accuracy is more a matter of convenience than of necessity.

Second, it's separate from my regular mail environment so that my full archive doesn't clutter up (and slow down) my normal mail folders. This matters because how I want to use my regular folders is very different from how I use a comprehensive archive. If I tried to use only a comprehensive archive, I would immediately start losing important things in it; there would be so much volume and even so many false positives in searches that it would be pretty much useless. I need my regular folders to be curated and sorted (and, sometimes, pruned), to contain the things that I think matter and that I want to be paying attention to. This is nothing like a comprehensive archive.

In theory I could do all of this within a single mail environment. I would just have to be very disciplined about always saving a copy of every message (both received and sent) to my special 'archive' set of folders, no matter how trivial the message was, and then also having it in my regular set of folders as I processed it and perhaps saved it again.

In practice, having the system handle it all by just writing everything to a file is simpler and more reliable (and it's grunt work; computers exist to automate grunt work). Also, since this file exists entirely outside of any MUA I may use, I know for sure that no MUA is touching it and doing things to the messages; they are archivally perfect, exactly as originally received (and exactly in the order they were originally received).
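Because the archive is a plain file in mbox format (which is what a file in a .forward produces), it can be read by simple tools without involving any MUA. What follows is a minimal sketch of that idea, assuming Python's standard mailbox module. The function name find_by_subject and the idea of searching by subject are hypothetical illustrations, not part of the setup described here; the code only reads messages and never adds, removes, or rewrites them.

```python
import mailbox


def find_by_subject(archive_path, needle):
    """Return (index, sender, subject) for messages whose Subject contains needle.

    Hypothetical helper: it iterates over the mbox file in stored order and
    only reads from it.
    """
    needle = needle.lower()
    # create=False: fail loudly instead of creating an empty file if the path is wrong.
    box = mailbox.mbox(archive_path, create=False)
    matches = []
    try:
        for index, message in enumerate(box):
            # Headers can be missing; fall back to an empty string.
            subject = str(message.get("Subject", ""))
            if needle in subject.lower():
                sender = str(message.get("From", ""))
                matches.append((index, sender, subject))
    finally:
        box.close()
    return matches
```

Calling find_by_subject with the path to the archive file and a word such as "backup" returns the matching messages in the order they were delivered, since mbox simply stores messages one after another.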

PS: when I say everything I really do mean everything. So yes, this does mean that I get kind of irritated at people who email us ten megabyte files out of the blue (which happens every so often). But disk space really is cheap, and if I need to I can always bzip2 the archives.
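As a rough illustration of the bzip2 fallback, here is a minimal sketch using Python's standard bz2 module. The function name compress_archive is hypothetical, and it assumes it is run on an archive file that mail delivery is no longer appending to (for example, an old archive that has been rotated out); it writes a compressed copy and leaves the original alone.

```python
import bz2
import shutil


def compress_archive(src_path, dest_path=None):
    """Write a bzip2-compressed copy of src_path and return the copy's path.

    Hypothetical helper: the original file is not modified or removed, so it
    can be deleted by hand once the compressed copy has been checked.
    """
    if dest_path is None:
        dest_path = src_path + ".bz2"
    # Stream the data in chunks so a large archive is never held in memory.
    with open(src_path, "rb") as src, bz2.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```

Mail archives are mostly repetitive text, so they generally compress well this way, and the result is still readable with standard tools such as bzcat.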


Comments on this page:

From 208.87.59.12 at 2011-02-07 02:46:03:

1. Are you saving in mbox or maildir?

2. Are you using mairix for searching, or something else?

Phil Hollenback
www.hollenback.net

By cks at 2011-02-07 11:49:37:

I'm saving in mbox format because that's what happens when you just put a file in your .forward. I'm not doing anything organized for searching; if I need to find something, mostly I use less to look at and search through the raw file. Searching could definitely be improved, but that would take more work than a backup archive normally justifies.

From 76.113.53.175 at 2011-02-07 14:27:27:

I heard about it from John Mashey at USENIX Tech 1999. He said that the comprehensive archive "helped to settle a few lawsuits", so presumably he started doing it way before 1999.

By gsauthof at 2011-02-07 17:50:05:

Are backups going to be a problem with one big archive mbox file?

I guess most backup tools just look at the mtime, i.e. with such a backup tool it is guaranteed that each incremental backup stores the complete mail-archive.

By cks at 2011-02-07 18:06:04:

From my perspective, backups aren't a problem; it's just one semi-big file, and we have plenty of backup capacity.

(My obligatory disclaimer is that I don't get sent lots of big things in email. People who routinely get sent 10 MB binary blobs of various sorts may need a somewhat different scheme.)

By Dan McD. at 2015-01-18 17:11:38:

These days, wouldn't you stick this on a ZFS filesystem with lz4 turned on?

By cks at 2015-01-18 19:14:05:

Some form of transparent compression would certainly save some space; after all my archive's almost all 7-bit text and often highly repetitive text at that. For here in specific we have our own reasons not to use ZFS's compression and while I could override that on my own home directory filesystem it's not worth the various sorts of hassle; we're not short of disk space.

Written on 07 February 2011.

Last modified: Mon Feb 7 01:21:43 2011