2016-07-12
How we do MIME attachment type logging with Exim
Last time around I talked about the options you have for how to log attachment information in an Exim environment. Out of our possible choices, we opted to do attachment logging using an external program that's run through Exim's MIME ACL, and to report the result to syslog in the program. All of this is essentially the least-effort choice. Exim parses MIME for us, and having the program do the logging means that it gets to make the decisions about just what to log.
However, the details are worth talking about, so let's start with the actual MIME ACL stanza we use:
# used only for side effects
warn
  # only act on potentially interesting parts
  condition = ${if or { \
      {and{{def:mime_content_disposition}{!eq{$mime_content_disposition}{inline}}}} \
      {match{$mime_content_type}{\N^(application|audio|video|text/xml|text/vnd)\N}} \
    } }
  #
  decode = default
  # set a dummy variable to get ${run} executed
  set acl_m1_astatus = ${run {/etc/exim4/alogger/alogger.py \
      --subject ${quote:$header_subject:} \
      --csdnsbl ${quote:$header_x-cs-dnsbl:} \
      $message_exim_id \
      ${quote:$mime_content_type} \
      ${quote:$mime_content_disposition} \
      ${quote:$mime_filename} \
      ${quote:$mime_decoded_filename} }}
(See my discussion of quoting for ${run} for what's happening here.)
The initial 'condition =' is an attempt to only run our external
program (and write decoded MIME parts out to disk) for MIME parts
that are likely to be interesting. Guessing what is an attachment
is complicated and the program makes the final decision, but we can
pre-screen some things. The parts we consider interesting are any
MIME parts that explicitly declare themselves as non-inline, plus
any inline MIME parts that have a Content-Type that's not really an
inline thing.
There is one complication here, which is our check that
$mime_content_disposition is defined. You might think that
there's always going to be some content-disposition, but it turns
out that when Exim says the MIME ACL is invoked on every MIME part
it really means every part. Specifically, the MIME ACL is also
invoked on the message body in a MIME email that is not a multipart
(just, eg, a text/plain or text/html message). These single-part
MIME messages can be detected because they don't have a defined
content-disposition; we consider this to basically be an implicit
'inline' disposition and thus not interesting by itself.
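To make the pre-screen logic concrete, here is a minimal Python sketch of the same decision the ACL condition makes. It is only an illustration of the Exim expression, not code from our logger; the function name is made up, and I'm treating an empty or missing disposition the way def: does (as not defined).

```python
import re

# Same content-type prefixes as the match{} in the ACL condition.
_NON_INLINE_TYPES = re.compile(r"^(application|audio|video|text/xml|text/vnd)")


def potentially_interesting(content_type, content_disposition):
    """Mirror the ACL pre-screen for a single MIME part.

    content_disposition is None or "" when the part has no
    Content-Disposition, the analogue of def: being false in Exim.
    """
    # First branch of the 'or': disposition is defined and is not 'inline'.
    explicit_non_inline = bool(content_disposition) and content_disposition != "inline"
    # Second branch: the content type is not really an inline sort of thing.
    suspicious_type = _NON_INLINE_TYPES.search(content_type) is not None
    return explicit_non_inline or suspicious_type
```

With this, a bare text/plain body with no disposition is skipped, while an 'inline' application/zip part or anything marked as an attachment is passed on.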
The entire warn stanza exists purely to cause the ${run} to
execute (this is a standard ACL trick; warn stanzas are often
used just as a place to put ACL conditions and modifiers). The
easiest way to get that to happen is to (nominally) set the value
of an ACL variable, as we do here. Setting an ACL variable makes
Exim do string expansion in a harmless context that we can basically
make into a no-op, which is what we need here.
(Setting a random ACL variable to cause string expansion to be done for its side effects is a useful Exim pattern in general. Just remember to add a comment saying it's deliberate that this ACL variable is never used.)
The actual attachment logger program is written in Python because
basically the moment I started writing it, it got too complicated
to be a shell script. It looks at the content type, the content
disposition, and any claimed MIME filename in order to decide whether
this part should be logged or ignored (using the set of
heuristics I outlined here).
It uses the decoded content to sniff for ZIP and RAR archives and
get their filenames (slightly recursively). We could have run more external
programs for this, but it turns out that there are handy Python
modules (eg the zipfile module) that will do
the work for us. Working in pure Python probably doesn't perform
as well as some of the alternatives, but it works well enough for
us with our current load.
(In accord with my general principles, the program is careful to minimize the information it logs. For instance, we log only information about extensions, not filenames.)
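As an illustration of the ZIP side of this, here is a minimal sketch using only the standard zipfile module. It is not our actual logger: the function names, the recursion depth, and the size limit are all assumptions made for the example. It sniffs the decoded content, reports only file extensions, and looks inside nested ZIPs to a limited depth. RAR is left out because the standard library has no module for it.

```python
import io
import os
import zipfile

MAX_DEPTH = 2                # hypothetical limit on nested ZIP recursion
MAX_MEMBER_SIZE = 10 << 20   # hypothetical cap on nested members we unpack


def _extension(name):
    """Return the lowercased extension of a member name, or 'none'."""
    ext = os.path.splitext(name)[1].lower()
    return ext if ext else "none"


def zip_extensions(data, depth=0):
    """List the extensions of the files inside in-memory ZIP data.

    Returns None if the data does not look like a ZIP archive.
    Extensions from a nested ZIP follow the '.zip' entry holding them.
    """
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return None
    exts = []
    with zipfile.ZipFile(buf) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            ext = _extension(info.filename)
            exts.append(ext)
            # Only look inside small nested ZIPs, and only so deep.
            if (ext == ".zip" and depth < MAX_DEPTH
                    and info.file_size <= MAX_MEMBER_SIZE):
                inner = zip_extensions(zf.read(info), depth + 1)
                if inner:
                    exts.extend(inner)
    return exts
```

A caller would read the file named by the decoded filename argument and pass its bytes to zip_extensions(); only the resulting extension list, never the member filenames, would go into the log message.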
The program is also passed the contents of some of the email headers
so that it can add important information from them to the log
message. Our anti-spam system adds a spam or virus marker to the
Subject: header for recognized bad stuff, so we look for that
marker and log if the attachment is part of a message scored that
way. This is important for telling apart file types in real email
that users actually care about from file types in spam that users
probably don't.
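The Subject: check could look something like the following sketch. Everything specific here is an assumption for illustration: the marker string, the function names, and the layout of the log line are invented, since the real ones aren't shown in this entry. The syslog call uses the standard Python syslog module.

```python
# Hypothetical marker; the real tag our anti-spam system adds is not shown here.
SPAM_MARKER = "[SPAM]"


def is_marked_spam(subject):
    """True if the Subject: header carries the anti-spam marker."""
    return SPAM_MARKER in subject


def format_log_line(message_id, content_type, extension, subject):
    """Build one log message for a MIME part (the field layout is made up)."""
    fields = [
        message_id,
        "content-type=" + content_type,
        "ext=" + extension,
        "spam=" + ("yes" if is_marked_spam(subject) else "no"),
    ]
    return " ".join(fields)


def log_to_syslog(line):
    """Send the line to syslog; imported lazily since syslog is Unix-only."""
    import syslog
    syslog.openlog("alogger", syslog.LOG_PID, syslog.LOG_MAIL)
    syslog.syslog(syslog.LOG_INFO, line)
```

Putting the spam flag in the same log line as the attachment type is what lets you later split file types seen in real email from file types seen only in spam.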
(We've found it useful to log attachment type information on inbound email both before and after it passes through our anti-spam system. The 'before' view gives us a picture of what things look like before virus attachment stripping and various rejections happen, while the 'after' view is what our users actually might see in their mailboxes, depending on how they filter things marked as spam.)
Sidebar: When dummy variables aren't
I'll admit it: our attachment logger program prints out a copy of
what it logs and our actual configuration uses $acl_m1_astatus
later, which winds up containing this copy. We currently immediately
reject all messages with ZIP files with .exes in them, and rather
than parse MIME parts twice it made more sense to reuse the attachment
logger's work by just pattern-matching its output.