Wandering Thoughts archives


Some advantages of using argparse to handle arguments as well as options

I started using Python long enough ago that there was only the getopt module, which was fine because it was what I was used to from C and other Unix languages (shell, Perl, etc). I then moved on to optparse for a while, and only started using argparse relatively recently. As a result of all of this background, I'm used to thinking of 'argument handling' as only processing command line switches and their arguments, and then handing you back a list of the remaining arguments. Checking how many of those there are, parsing them, and so on is your responsibility.

Despite being very accustomed to working this way, I'm starting to abandon it when using argparse. Part of this is what I discovered the first time I used argparse, namely that it's the lazy way. But I've now used argparse a second time and I'm feeling that there are real advantages to letting it handle as many positional arguments as possible in as specific a way as possible.

For instance, suppose that you're writing a Python program that takes exactly five positional arguments. The lazy way to handle this is simply:

parser.add_argument("args", metavar="ARGS", nargs=5)

If you take exactly five arguments, they probably mean different things. So the better way is to add them separately:

parser.add_argument("eximid", metavar="EXIMID")
parser.add_argument("ctype", metavar="CONTENT-TYPE")
parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION")
parser.add_argument("mname", metavar="MIMENAME")
parser.add_argument("file", metavar="FILE")

Obviously this gives you easy separate access to each argument in your program, but the really nice thing is that it adds useful descriptive context to your program's usage message. If you choose the metavar values well, your usage message will strongly hint at what needs to be supplied for each argument. But we can do better, because argparse is perfectly happy to let you attach help to positional arguments as well as to switches (and it will then print that help in the full help output, all nicely formatted and so on).
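A minimal sketch of what attaching help to the positional arguments might look like. The help strings, the description, and the prog name here are invented for illustration; only the argument names and metavars come from the example above.

```python
import argparse


def build_parser():
    # "logattach" is a hypothetical program name, used only for this sketch.
    parser = argparse.ArgumentParser(
        prog="logattach",
        description="Log information about one MIME part of a message.",
    )
    # Each positional argument keeps its internal name, its visible
    # metavar, and its help text together in one place.
    parser.add_argument("eximid", metavar="EXIMID",
                        help="Exim message ID of the message")
    parser.add_argument("ctype", metavar="CONTENT-TYPE",
                        help="Content-Type of the MIME part")
    parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION",
                        help="Content-Disposition of the MIME part")
    parser.add_argument("mname", metavar="MIMENAME",
                        help="proposed MIME filename of the part")
    parser.add_argument("file", metavar="FILE",
                        help="file containing the decoded MIME part")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.eximid, args.ctype, args.cdisp, args.mname, args.file)
```

Running this with -h prints the usage line with all five metavars, followed by a formatted list of the positional arguments and their help text.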

You can do the same thing by hand, of course; there's nothing preventing you from writing the same documentation with manual argument parsing and printing it out appropriately (although argparse does do a good job of formatting it). But it feels easier with argparse and it feels more natural, because argparse lets me put everything to do with a positional argument in one spot; I can name the internal variable, specify its visible short name, and then add help, all at once. If nothing else, this is likely to keep all of these things in sync with each other.

(And I'm not going to underestimate the importance of automatic good formatting, because that removes a point of friction in writing the help message for a given positional argument.)

The result of all of this is that using argparse for positional arguments in my latest program has effortlessly given me not just a check for having the right number of positional arguments but a bunch of useful help text as well. Since I frequently don't touch programs for a year or two, I foresee this being a useful memory jog for future me.
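To illustrate the argument count check, here is a small sketch with only two of the positional arguments (the prog name and the try_parse helper are made up for this example). When the count is wrong, argparse prints the usage message to standard error and exits with status 2, which shows up in Python as a SystemExit exception.

```python
import argparse

# A cut-down, hypothetical parser with just two positional arguments.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument("eximid", metavar="EXIMID")
parser.add_argument("file", metavar="FILE")


def try_parse(argv):
    """Return the parsed namespace, or the exit status if argparse bails out."""
    try:
        return parser.parse_args(argv)
    except SystemExit as e:
        # argparse has already printed the usage and an error to stderr.
        return e.code


ok = try_parse(["1abc", "/tmp/part"])
too_few = try_parse(["1abc"])
too_many = try_parse(["1abc", "/tmp/part", "extra"])
```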

In summary, if I can get argparse to handle my positional arguments in future Python programs, I'm going to let it. I've become convinced that it's not just the lazy way, it's the better way.

(This is where some Python people may laugh at me for having taken so long to start using argparse. In my vague defense, we still have some machines without Python 2.7.)

python/ArgparseForArgsToo written at 23:22:40

What makes an email MIME part an attachment?

If you want to know what types of files your users are getting in email, it's likely that an important prerequisite is being able to recognize attachments in the first place. In a sane and sensible world, this would be easy; it would just be any MIME part with a Content-Disposition header of attachment.

I regret to tell you that this is not a sane world. There are mail clients that give every MIME part an inline Content-Disposition, so naturally this means that most mail clients can't trust an inline C-D and make their attachment versus non-attachment decisions based on other things. (I expect that there are mail clients that ignore a C-D of attachment, too, and will display some of those parts inline if they feel like it, but for logging we don't care much about that.)

MIME parts may have (proposed, nominal) filenames associated with them, from either Content-Type or Content-Disposition. However, neither the presence nor the absence of a MIME filename determines something's attachment status. Real attachments may have no proposed filename, and there are mail clients that attach filenames to things like inline images. And really, I can't argue with them; if the user told you that this (inline) picture is mydog.jpg, you're certainly within your rights to pass this information on in the MIME headers.
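As a sketch of the two places a filename can come from, here is how this looks with Python's standard email package, whose get_filename() method checks the filename parameter of Content-Disposition first and then falls back to the name parameter of Content-Type. The sample message and the part_filenames helper are invented for illustration.

```python
from email import message_from_string

# A made-up two-part message: an inline image that carries a name, and
# a real attachment that has no proposed filename at all.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline

(data)
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment

(data)
--XX--
"""


def part_filenames(msg):
    """Return (content type, proposed filename or None) for each leaf part."""
    return [(part.get_content_type(), part.get_filename())
            for part in msg.walk()
            if not part.is_multipart()]


msg = message_from_string(RAW)
for ctype, fname in part_filenames(msg):
    print(ctype, fname)
```

Here the inline image reports mydog.jpg and the attachment reports None, which is exactly the situation where a filename tells you nothing about attachment status.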

The MIME Content-Type provides at least hints, in that you can probably assume that most mail clients will treat things with any application/* C-T as attachments and not try to show them inline. And if you over-report here (logging information on 'attachments' that will really be shown inline), it's relatively harmless. However, it's possible that mail clients do some degree of content sniffing, so the C-T is not necessarily going to determine how a mail client processes a MIME part.

(At one point web browsers were infamous for being willing to do content sniffing on HTTP replies, so that what you served as eg text/plain might not be interpreted that way by some browsers. One can hope that mail clients are more sane, but I'm not going to hold my breath there.)

One caution here: trying to make decisions based on things having specific Content-Type values is a mug's game. For example, if you're trying to pick out ZIP files based on them having a C-T of application/zip, you're going to miss a ton of them; actual real email has ZIP files with all sorts of MIME types (including the catch-all value of application/octet-stream). My impression is that the most reliable predictor of how a mail client will interpret an attachment is actually the extension of its MIME filename.

(While the gold standard for figuring out if something is a ZIP file or whatever is actually looking at the data for the MIME part, please don't use file (or libmagic) for general file classification.)
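For the specific case of ZIP files, looking at the data can be sketched with the standard zipfile module instead of file or libmagic. This is only an illustration under some assumptions: the looks_like_zip helper and the sample message are made up, and zipfile.is_zipfile() works by looking for the ZIP end-of-central-directory record, so it is a check for 'plausibly a ZIP archive' rather than a general classifier.

```python
import io
import zipfile
from email.message import EmailMessage


def looks_like_zip(part):
    """Check a MIME part's decoded payload for ZIP structure."""
    data = part.get_payload(decode=True)  # None for multipart containers
    if not data:
        return False
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a tiny in-memory ZIP archive to use as sample attachment data.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hi")

# A made-up message whose attachment hides the ZIP behind a generic
# Content-Type and an unhelpful filename.
msg = EmailMessage()
msg.set_content("hello")
msg.add_attachment(buf.getvalue(), maintype="application",
                   subtype="octet-stream", filename="stuff.bin")

results = [(part.get_content_type(), looks_like_zip(part))
           for part in msg.walk()
           if not part.is_multipart()]
```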

One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and we can sort it out later. The drawback of this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.

The current approach I'm testing is to use a collection of signs to pick out attachment-like things, with some heuristics attached. Does a MIME part declare a MIME filename ending in .zip? Then we'll log some information about it. Ditto if it has a Content-Disposition of attachment, ditto if it has a Content-Type of application/*, and so on. I'm probably logging information about some things that mail clients will display inline, but it's better than logging too little and missing things.
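A rough sketch of this sort of 'collection of signs' approach, using Python's standard email package. This is not the actual logging code; the function names, the list of interesting extensions, and the sample message are all assumptions made for illustration.

```python
from email import message_from_string

# Hypothetical list of filename extensions we want to hear about.
INTERESTING_EXTENSIONS = (".zip",)


def attachment_signs(part):
    """Return the list of reasons this part looks attachment-like."""
    signs = []
    fname = part.get_filename()
    if fname and fname.lower().endswith(INTERESTING_EXTENSIONS):
        signs.append("extension")
    if part.get_content_disposition() == "attachment":
        signs.append("disposition")
    if part.get_content_maintype() == "application":
        signs.append("content-type")
    return signs


def log_worthy_parts(msg):
    """Return (content type, signs) for each leaf part with any sign."""
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        signs = attachment_signs(part)
        if signs:
            found.append((part.get_content_type(), signs))
    return found


# A made-up message: inline text, a named inline image, a part whose
# only sign is its filename, and an ordinary PDF attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Disposition: inline

hello
--XX
Content-Type: image/jpeg; name="mydog.jpg"
Content-Disposition: inline

(data)
--XX
Content-Type: text/plain; name="stuff.ZIP"

(data)
--XX
Content-Type: application/pdf
Content-Disposition: attachment; filename="report.pdf"

(data)
--XX--
"""

results = log_worthy_parts(message_from_string(RAW))
```

Any single sign is enough to get a part logged here, which deliberately errs on the side of over-reporting.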

(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)

spam/KnowingWhatIsAnAttachment written at 01:03:10


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.