2016-06-30
Some advantages of using argparse to handle arguments as well as options
I started using Python long enough ago that there was only the getopt module, which was okay because that's what I was used to from C and other Unix languages (shell, Perl, etc), and then evolved for a bit through optparse; I only started using argparse relatively recently. As a result of all of this background, I'm used to thinking of 'argument handling' as only processing command line switches and their arguments for you, and giving you back basically a list of the remaining arguments. It's then your responsibility to check how many there are, parse them, and so on.
Despite being very accustomed to working this way, I'm starting to abandon it when using argparse. Part of this is what I discovered the first time I used argparse, namely that it's the lazy way. But I've now used argparse a second time and I'm feeling that there are real advantages to letting it handle as many positional arguments as possible in as specific a way as possible.
For instance, suppose that you're writing a Python program that takes exactly five positional arguments. The lazy way to handle this is simply:
parser.add_argument("args", metavar="ARGS", nargs=5)
If you take exactly five arguments, they probably mean different things. So the better way is to add them separately:
parser.add_argument("eximid", metavar="EXIMID")
parser.add_argument("ctype", metavar="CONTENT-TYPE")
parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION")
parser.add_argument("mname", metavar="MIMENAME")
parser.add_argument("file", metavar="FILE")
Obviously this gives you easy separate access to each argument in
your program, but the really nice thing this does is that it adds
some useful descriptive context to your program's usage message.
If you choose the metavar values well, your usage message will
strongly hint at what needs to be supplied as each argument. But
we can do better, because argparse is perfectly happy to let you
attach help to positional arguments as well as to switches (and it
will then print it out again in the usage message, all nicely
formatted and so on).
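A minimal sketch of what attaching help to positional arguments might look like. The program name and all of the help strings here are invented for illustration; only the argument names and metavars come from the example above.

```python
import argparse


def build_parser():
    # "logpart" is a hypothetical program name, used only for this sketch.
    parser = argparse.ArgumentParser(
        prog="logpart",
        description="Log information about one MIME part of a message.",
    )
    # Each positional argument gets its internal name, its visible name,
    # and its help text in one place. The help strings are assumptions.
    parser.add_argument("eximid", metavar="EXIMID",
                        help="Exim message ID of the message")
    parser.add_argument("ctype", metavar="CONTENT-TYPE",
                        help="Content-Type of the MIME part")
    parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION",
                        help="Content-Disposition of the MIME part")
    parser.add_argument("mname", metavar="MIMENAME",
                        help="MIME filename of the part, if any")
    parser.add_argument("file", metavar="FILE",
                        help="file holding the decoded part")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.eximid, args.ctype, args.cdisp, args.mname, args.file)
```

Running this with -h prints the usage line with all five metavars followed by a formatted list of the positional arguments and their help text; running it with the wrong number of arguments produces an error and the usage line.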
You can do the same thing by hand, of course; there's nothing preventing you from writing the same documentation with manual argument parsing and printing it out appropriately (although argparse does do a good job of formatting it). But it feels easier with argparse and it feels more natural, because argparse lets me put everything to do with a positional argument in one spot; I can name the internal variable, specify its visible short name, and then add help, all at once. If nothing else, this is likely to keep all of these things in sync with each other.
(And I'm not going to underestimate the importance of automatic good formatting, because that removes a point of friction in writing the help message for a given positional argument.)
The result of all of this is that using argparse for positional arguments in my latest program has effortlessly given me not just a check for having the right number of positional arguments but a bunch of useful help text as well. Since I frequently don't touch programs for a year or two, I foresee this being a useful memory jog for future me.
In summary, if I can get argparse to handle my positional arguments in future Python programs, I'm going to let it. I've become convinced that it's not just the lazy way, it's the better way.
(This is where some Python people may laugh at me for having taken so long to start using argparse. In my vague defense, we still have some machines without Python 2.7.)
What makes an email MIME part an attachment?
If you want to know what types of files your users are getting
in email, it's likely that an important
prerequisite is being able to recognize attachments in the first
place. In a sane and sensible world, this would be easy; it would
just be any MIME part with a Content-Disposition header of
attachment.
I regret to tell you that this is not a sane world. There are mail clients that
give every MIME part an inline Content-Disposition, so naturally
this means that most mail clients can't trust an inline C-D and
make their attachment versus non-attachment decisions based on other
things. (I expect that there are mail clients that ignore a C-D of
attachment, too, and will display some of those parts inline if
they feel like it, but for logging we don't care much about that.)
MIME parts may have (proposed, nominal) filenames associated with
them, from either Content-Type or Content-Disposition. However,
neither the presence nor the absence of a MIME filename determines
something's attachment status. Real attachments may have no proposed
filename, and there are mail clients that attach filenames to things
like inline images. And really, I can't argue with them; if the user
told you that this (inline) picture is mydog.jpg, you're certainly
within your rights to pass this information on in the MIME headers.
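As an illustration of how these pieces look from code, here is a small sketch using Python's standard email package. The sample message is made up; get_filename() looks at the filename parameter of Content-Disposition first and falls back to the name parameter of Content-Type.

```python
from email import message_from_string

# A made-up three-part message: plain text, an inline image that still
# carries a filename, and a real attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

Hello.
--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline
Content-Transfer-Encoding: base64

/9j/4AAQ
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="report.zip"
Content-Transfer-Encoding: base64

UEsDBA==
--XX--
"""


def describe_parts(msg):
    """Return (content type, disposition, filename) for each leaf part."""
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        rows.append((part.get_content_type(),
                     part.get_content_disposition(),
                     part.get_filename()))
    return rows


if __name__ == "__main__":
    for row in describe_parts(message_from_string(RAW)):
        print(row)
```

Here the inline image reports a filename even though it is not an attachment, which is exactly why the filename alone can't settle the question.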
The MIME Content-Type provides at least hints, in that you can
probably assume that most mail clients will treat things with any
application/* C-T as attachments and not try to show them inline.
And if you over-report here (logging information on 'attachments'
that will really be shown inline), it's relatively harmless. It's
possible that mail clients do some degree of content sniffing, so
the C-T is not necessarily going to determine how a mail client
processes a MIME part.
(At one point web browsers were infamous for being willing to
do content sniffing on HTTP replies, so that what you served
as eg text/plain might not be interpreted that way by some
browsers. One can hope that mail clients are more sane, but
I'm not going to hold my breath there.)
One caution here: trying to make decisions based on things having
specific Content-Type values is a mug's game. For example, if you're
trying to pick out ZIP files based on them having a C-T of
application/zip, you're going to miss a ton of them; actual real
email has ZIP files with all sorts of MIME types (including the
catch-all value of application/octet-stream). My impression is that
the most reliable predictor of how a mail client will interpret an
attachment is actually the extension of its MIME filename.
(While the gold standard for figuring out if something is a ZIP
file or whatever is actually looking at the data for the MIME part,
please don't use the file command (or libmagic) for general file
classification.)
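For the narrow case of ZIP files, one way to look at the data itself in Python is the standard zipfile module. This is only a sketch of the idea, not a description of what any particular logging setup does; payload_is_zip and sample_zip are hypothetical helper names, and the data is assumed to be the already-decoded bytes of the MIME part.

```python
import io
import zipfile


def payload_is_zip(data: bytes) -> bool:
    """Return True if the decoded bytes of a MIME part look like a ZIP archive."""
    # is_zipfile() accepts a file-like object as well as a filename.
    return zipfile.is_zipfile(io.BytesIO(data))


def sample_zip() -> bytes:
    """Build a tiny in-memory ZIP archive to try the check on."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello\n")
    return buf.getvalue()


if __name__ == "__main__":
    print(payload_is_zip(sample_zip()))
    print(payload_is_zip(b"this is just some plain text, not an archive"))
```

This ignores both the declared Content-Type and the MIME filename, so it works no matter what the sending mail client claimed about the part.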
One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and we can sort it out later. The drawback of this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.
The current approach I'm testing is to use a collection of signs
to pick out attachment-like things, with some heuristics attached.
Does a MIME part declare a MIME filename ending in .zip? Then
we'll log some information about it. Ditto if it has a
Content-Disposition of attachment, ditto if it has a Content-Type
of application/*,
and so on. I'm probably logging information about some things that
mail clients will display inline, but it's better than logging too
little and missing things.
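A rough sketch of this kind of 'collection of signs' check, using Python's standard email package. The function name, the set of interesting extensions, and the wording of the reasons are all assumptions made for illustration, not the actual code being tested.

```python
import os.path
from email import message_from_string

# Hypothetical list of filename extensions worth logging; extend as needed.
INTERESTING_EXTENSIONS = {".zip"}


def attachment_signs(part):
    """Return the reasons (possibly none) that a MIME part looks attachment-like."""
    signs = []
    filename = part.get_filename()
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        if ext in INTERESTING_EXTENSIONS:
            signs.append("filename ends in " + ext)
    if part.get_content_disposition() == "attachment":
        signs.append("Content-Disposition is attachment")
    if part.get_content_maintype() == "application":
        signs.append("Content-Type is application/*")
    return signs


if __name__ == "__main__":
    part = message_from_string(
        "Content-Type: application/octet-stream\n"
        'Content-Disposition: attachment; filename="a.zip"\n'
        "\n"
        "data"
    )
    # Log the part if any sign fired; an empty list means "skip it".
    print(attachment_signs(part))
```

Any one sign is enough to log the part, which deliberately errs on the side of over-reporting.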
(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)