What makes an email MIME part an attachment?

June 30, 2016

If you want to know what types of files your users are getting in email, it's likely that an important prerequisite is being able to recognize attachments in the first place. In a sane and sensible world, this would be easy; an attachment would just be any MIME part with a Content-Disposition header of attachment.

I regret to tell you that this is not a sane world. There are mail clients that give every MIME part an inline Content-Disposition, so naturally this means that most mail clients can't trust an inline C-D and make their attachment versus non-attachment decisions based on other things. (I expect that there are mail clients that ignore a C-D of attachment, too, and will display some of those parts inline if they feel like it, but for logging we don't care much about that.)

MIME parts may have (proposed, nominal) filenames associated with them, from either Content-Type or Content-Disposition. However, neither the presence nor the absence of a MIME filename determines something's attachment status. Real attachments may have no proposed filename, and there are mail clients that attach filenames to things like inline images. And really, I can't argue with them; if the user told you that this (inline) picture is mydog.jpg, you're certainly within your rights to pass this information on in the MIME headers.
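As a rough sketch of where those proposed filenames come from, here is how they could be pulled out with Python's standard email package (my choice for illustration; nothing above depends on it). The sample message and its filenames are made up. Message.get_filename() looks at the filename parameter of Content-Disposition first and falls back to the name parameter of Content-Type, and get_content_disposition() needs Python 3.5 or later.

```python
from email import message_from_string

# Invented sample message: an inline image that carries a filename and a
# real attachment that carries none.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline

(image data would go here)
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="report.zip"

(zip data would go here)
--XX
Content-Type: application/pdf
Content-Disposition: attachment

(pdf data would go here)
--XX--
"""

msg = message_from_string(RAW)
rows = []
for part in msg.walk():
    if part.is_multipart():
        continue  # skip the multipart container itself
    rows.append((
        part.get_content_type(),
        part.get_content_disposition(),  # 'inline', 'attachment', or None
        part.get_filename(),             # from either header, or None
    ))

for row in rows:
    print(row)
```

The inline image reports a filename and the PDF attachment reports none, which is exactly why the filename alone can't settle attachment status.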

The MIME Content-Type provides at least hints, in that you can probably assume that most mail clients will treat things with any application/* C-T as attachments and not try to show them inline. And if you over-report here (logging information on 'attachments' that will really be shown inline), it's relatively harmless. However, it's possible that mail clients do some degree of content sniffing, so the C-T is not necessarily going to determine how a mail client processes a MIME part.

(At one point web browsers were infamous for being willing to do content sniffing on HTTP replies, so that what you served as eg text/plain might not be interpreted that way by some browsers. One can hope that mail clients are more sane, but I'm not going to hold my breath there.)

One caution here: trying to make decisions based on things having specific Content-Type values is a mug's game. For example, if you're trying to pick out ZIP files based on them having a C-T of application/zip, you're going to miss a ton of them; actual real email has ZIP files with all sorts of MIME types (including the catch-all value of application/octet-stream). My impression is that the most reliable predictor of how a mail client will interpret an attachment is actually the extension of its MIME filename.
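To illustrate the difference, here is a small sketch using Python's standard email package that compares matching on the Content-Type with matching on the filename extension. The helper name filename_extension, the test message, and the filename Invoice.ZIP are all invented for the example; the payload is not a real archive.

```python
import os.path
from email.message import EmailMessage

def filename_extension(part):
    """Return the lowercased extension of the part's MIME filename, or ''."""
    name = part.get_filename()
    if not name:
        return ""
    return os.path.splitext(name)[1].lower()

# A ZIP file sent with the catch-all type instead of application/zip.
msg = EmailMessage()
msg.set_content("See attached.")
msg.add_attachment(
    b"PK\x03\x04 placeholder bytes, not a real archive",
    maintype="application",
    subtype="octet-stream",
    filename="Invoice.ZIP",
)

zip_by_type = []
zip_by_ext = []
for part in msg.walk():
    if part.is_multipart():
        continue
    if part.get_content_type() == "application/zip":
        zip_by_type.append(part.get_filename())
    if filename_extension(part) == ".zip":
        zip_by_ext.append(part.get_filename())

print("matched by Content-Type:", zip_by_type)
print("matched by extension:  ", zip_by_ext)
```

Lowercasing the extension matters in practice, since nothing stops a sender from calling the file Invoice.ZIP.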

(The gold standard for figuring out if something is a ZIP file or whatever is actually looking at the data for the MIME part. If you do that, please don't use file (or libmagic) for general file classification.)

One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and we can sort it out later. The drawback of this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.

The current approach I'm testing is to use a collection of signs to pick out attachment-like things, with some heuristics attached. Does a MIME part declare a MIME filename ending in .zip? Then we'll log some information about it. Ditto if it has a Content-Disposition of attachment, ditto if it has a Content-Type of application/*, and so on. I'm probably logging information about some things that mail clients will display inline, but it's better than logging too little and missing things.
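A minimal sketch of this sort of sign-collecting, written against Python's standard email package, might look like the following. This is an illustration and not the actual logging code; the function names, the extension list, and the sample message are all assumptions made up for the example.

```python
import os.path
from email import message_from_string

# Hypothetical list of extensions worth logging; adjust to taste.
INTERESTING_EXTENSIONS = {".zip", ".exe", ".js", ".doc", ".docm"}

def attachment_signs(part):
    """Return the reasons (if any) to treat this MIME part as attachment-like."""
    signs = []
    if part.get_content_disposition() == "attachment":
        signs.append("disposition=attachment")
    if part.get_content_maintype() == "application":
        signs.append("type=" + part.get_content_type())
    name = part.get_filename()
    if name:
        ext = os.path.splitext(name)[1].lower()
        if ext in INTERESTING_EXTENSIONS:
            signs.append("extension=" + ext)
    return signs

def attachment_like_parts(msg):
    """Collect (content type, filename, signs) for parts showing any sign."""
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        signs = attachment_signs(part)
        if signs:
            found.append((part.get_content_type(), part.get_filename(), signs))
    return found

# Invented sample: two inline parts that should be skipped, two to log.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Disposition: inline

Hello.
--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline

(image data)
--XX
Content-Type: application/octet-stream; name="stuff.zip"

(zip data)
--XX
Content-Type: text/plain
Content-Disposition: attachment; filename="notes.txt"

Some notes.
--XX--
"""

found = attachment_like_parts(message_from_string(SAMPLE))
for ctype, name, signs in found:
    print(ctype, name, ", ".join(signs))
```

Recording which signs fired, rather than a bare yes or no, makes it easier to go back later and see which heuristics are pulling in things that were really inline.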

(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)
