What makes an email MIME part an attachment?
If you want to know what types of files your users are getting
in email, it's likely that an important
prerequisite is being able to recognize attachments in the first
place. In a sane and sensible world, this would be easy; it would
just be any MIME part with a Content-Disposition header of
attachment.
I regret to tell you that this is not a sane world. There are mail clients that
give every MIME part an inline Content-Disposition, so naturally
this means that most mail clients can't trust an inline C-D and
make their attachment versus non-attachment decisions based on other
things. (I expect that there are mail clients that ignore a C-D of
attachment, too, and will display some of those parts inline if
they feel like it, but for logging we don't care much about that.)
MIME parts may have (proposed, nominal) filenames associated with
them, from either Content-Type or Content-Disposition. However,
neither the presence nor the absence of a MIME filename determines
something's attachment status. Real attachments may have no proposed
filename, and there are mail clients that attach filenames to things
like inline images. And really, I can't argue with them; if the user
told you that this (inline) picture is mydog.jpg, you're certainly
within your rights to pass this information on in the MIME headers.
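To see what a given message actually declares, here is a minimal sketch that lists the Content-Type, Content-Disposition, and proposed filename of each leaf MIME part. It assumes Python's standard email package; the describe_parts name and the SAMPLE message are made up for illustration.

```python
from email import message_from_bytes, policy

def describe_parts(raw: bytes):
    """Return (content_type, disposition, filename) for each leaf MIME part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only hold other parts
        rows.append((
            part.get_content_type(),
            part.get_content_disposition(),  # 'inline', 'attachment', or None
            part.get_filename(),  # C-D filename first, then the C-T name parameter
        ))
    return rows

# Hypothetical message: an inline image that has a filename, and a real
# attachment that has none.
SAMPLE = b"""\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

See the picture.
--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline
Content-Transfer-Encoding: base64

/9j/4AAQ
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment

AAAA
--XX--
"""

for row in describe_parts(SAMPLE):
    print(row)
```

In this sample the inline image reports a filename and the attachment reports None, which is exactly the situation where a filename test alone gives the wrong answer.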
The MIME Content-Type provides at least hints, in that you can
probably assume that most mail clients will treat things with any
application/* C-T as attachments and not try to show them inline.
And if you over-report here (logging information on 'attachments'
that will really be shown inline), it's relatively harmless. That said,
mail clients may do some degree of content sniffing, so the C-T will
not necessarily determine how a mail client processes a MIME part.
(At one point web browsers were infamous for being willing to
do content sniffing on HTTP replies, so that what you served
as, e.g., text/plain might not be interpreted that way by some
browsers. One can hope that mail clients are more sane, but
I'm not going to hold my breath there.)
One caution here: trying to make decisions based on things having
specific Content-Type values is a mug's game. For example, if you're
trying to pick out ZIP files based on them having a C-T of
application/zip, you're going to miss a ton of them; actual real
email has ZIP files with all sorts of MIME types (including the
catch-all value of application/octet-stream). My impression is that
the most reliable predictor of how a mail client will interpret an
attachment is actually the extension of its MIME filename.
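As a rough illustration of going by the filename extension instead of the Content-Type, the sketch below picks out a ZIP that was sent as application/octet-stream. It assumes Python's standard email package; part_extension and looks_like_zip are hypothetical helper names, and the message is built in memory purely as test data.

```python
import os
from email.message import EmailMessage

def part_extension(part):
    """Lower-cased extension of the part's proposed filename, or ''."""
    name = part.get_filename()
    if not name:
        return ""
    return os.path.splitext(name.strip())[1].lower()

def looks_like_zip(part):
    # Decide on the filename extension, ignoring the declared Content-Type.
    return part_extension(part) == ".zip"

# Test data: a ZIP-named attachment with the catch-all MIME type.
msg = EmailMessage()
msg.set_content("Files attached.")
msg.add_attachment(
    b"PK\x03\x04",
    maintype="application",
    subtype="octet-stream",
    filename="Report.ZIP",
)

zip_types = [p.get_content_type() for p in msg.walk() if looks_like_zip(p)]
print(zip_types)
```

A check for a C-T of application/zip would have skipped this part entirely.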
(While the gold standard for figuring out if something is a ZIP
file or whatever is actually looking at the data for the MIME part,
please don't use the file command (or libmagic) for general file classification.)
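For the narrow question of "is this part's data a ZIP archive", one possible sketch is to decode the part and hand the bytes to the standard zipfile module. This is an assumption about how you might do it in Python, not a general classifier; is_zip_payload is a made-up name, and the message is built in memory only as test data.

```python
import io
import zipfile
from email.message import EmailMessage

def is_zip_payload(part):
    """True if the part's decoded body looks like a ZIP archive."""
    data = part.get_payload(decode=True)  # None for multipart containers
    if not data:
        return False
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a small real ZIP in memory, then attach it under a name and
# Content-Type that give no hint of what it is.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")

msg = EmailMessage()
msg.set_content("Not a ZIP.")
msg.add_attachment(
    buf.getvalue(),
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)

zip_parts = [p.get_filename() for p in msg.walk() if is_zip_payload(p)]
print(zip_parts)
```

Note that zipfile.is_zipfile looks for the ZIP end-of-archive record, so it can also say yes to other files that have a ZIP archive appended to them.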
One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and sort it out later. The drawback of this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.
The current approach I'm testing is to use a collection of signs
to pick out attachment-like things, with some heuristics attached.
Does a MIME part declare a MIME filename ending in .zip? Then
we'll log some information about it. Ditto if it has a Content-Disposition
of attachment, ditto if it has a Content-Type of application/*,
and so on. I'm probably logging information about some things that
mail clients will display inline, but it's better than logging too
little and missing things.
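The collection of signs described above could be sketched roughly as follows. This is an illustration using Python's standard email package, not the actual logging code; INTERESTING_EXTENSIONS, attachment_signs, and loggable_parts are hypothetical names, and the extension set contains only the one example mentioned here.

```python
import os
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical list; only .zip comes from the text, extend it as needed.
INTERESTING_EXTENSIONS = {".zip"}

def attachment_signs(part):
    """Return the reasons (possibly none) to treat this part as attachment-like."""
    signs = []
    name = part.get_filename() or ""
    ext = os.path.splitext(name)[1].lower()
    if ext in INTERESTING_EXTENSIONS:
        signs.append("extension " + ext)
    if part.get_content_disposition() == "attachment":
        signs.append("C-D attachment")
    if part.get_content_maintype() == "application":
        signs.append("C-T " + part.get_content_type())
    return signs

def loggable_parts(raw: bytes):
    """Return (content_type, filename, signs) for parts showing at least one sign."""
    msg = message_from_bytes(raw, policy=policy.default)
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        signs = attachment_signs(part)
        if signs:
            found.append((part.get_content_type(), part.get_filename(), signs))
    return found

# Test data: body text, a ZIP sent as octet-stream, and an inline image.
demo = EmailMessage()
demo.set_content("hello")
demo.add_attachment(b"PK\x03\x04", maintype="application",
                    subtype="octet-stream", filename="stuff.zip")
demo.add_attachment(b"\xff\xd8\xff", maintype="image", subtype="jpeg",
                    filename="mydog.jpg", disposition="inline")

for entry in loggable_parts(bytes(demo)):
    print(entry)
```

Recording which signs fired, rather than a bare yes or no, makes it easier to sort out later which heuristics are pulling in things that mail clients show inline.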
(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)