Wandering Thoughts archives

2016-04-06

How options in my programs conflict, and where argparse falls short

In my recent entry on argparse I mentioned that it doesn't have really top notch handling of conflicting options; instead it only has relatively basic support for this. You might reasonably wonder what it's missing, and thus what top notch argument conflict handling would look like.

My programs tend to wind up with three sorts of options (command line switches):

  • general switches that affect almost everything
  • mode-selection switches that pick a major mode of operation
  • mode-modifying switches that change how one or more particular major modes work

General switches sometimes conflict with each other (eg --quiet versus --verbose), but apart from that they're applicable all or almost all the time. This is easily represented in argparse with a mutually exclusive group.
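For example, a minimal sketch of such a group, using the --quiet and --verbose pair from above:

import argparse

p = argparse.ArgumentParser()
g = p.add_mutually_exclusive_group()
g.add_argument("--quiet", action="store_true")
g.add_argument("--verbose", action="store_true")
opts = p.parse_args()
# 'program --quiet --verbose' now fails with a clear argparse error.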

Mode selection switches conflict with each other because it normally only makes sense to pick one mode of operation. Some people would make them sub-commands instead (so instead of 'program -L ...' and 'program -Z ...' you'd have 'program op1 ...' and 'program op2 ...'), but I'm a Unix traditionalist and I mostly don't like that approach. Also, often my programs have a default mode where it is just 'program ...'. You can relatively easily represent this in argparse, again with a mutually exclusive group.

(You can't easily handle the case where a general switch happens to be inapplicable to a particular mode, though.)

Mode modifying switches go with one particular mode (or sometimes a couple) and don't make sense without that mode being picked. They logically group with their mode selection switch, so it should be an error to specify them without it. You can't represent this in argparse today; instead you have to check manually, or more likely just allow but silently ignore those switches (because the code paths the program will use don't even look at their values).
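To illustrate, the manual check might look something like this; the -L mode switch and its --sort modifier are made up for the example:

import argparse

p = argparse.ArgumentParser()
p.add_argument("-L", dest="listmode", action="store_true")
p.add_argument("--sort", action="store_true")
opts = p.parse_args()

# argparse can't express '--sort requires -L', so check by hand:
if opts.sort and not opts.listmode:
   p.error("--sort only makes sense with -L")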

(And of course you can have nested situations, where some mode modifying switches conflict with each other or spawn sub-modes or whatever.)

It's not hard to see why argparse punts on this. The general case is clearly pretty complicated; you basically need to be able to form multiple arbitrary groups of conflicting arguments and 'these options require this one' sets, and options can be present in multiple groups. Then argparse would have to evaluate all of these constraints to detect conflicts and ideally produce sensible messages about them, which is probably much harder than it looks if there are multiple conflicts.

If I really care about this, I should probably get used to the now fairly common sub-commands approach. Since you can only specify one sub-command you naturally avoid conflicts between them, and argparse can set things up so each sub-command has its own set of options and so on.
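A quick sketch of that approach, reusing the hypothetical op1 and op2 operations from earlier:

import argparse

p = argparse.ArgumentParser()
sub = p.add_subparsers(dest="mode")
op1 = sub.add_parser("op1")
op1.add_argument("--frob", action="store_true")  # an op1-only modifier
op2 = sub.add_parser("op2")
opts = p.parse_args()
# 'program op2 --frob' is now automatically rejected.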

ArgparseAndHowOptionsConflict written at 01:06:50

2016-03-30

I've now used Python's argparse module and I like it

The argparse module is the Python 2.7+ replacement for the old optparse module, which itself was the successor to the basic getopt module. A number of years ago, back before I could really start using argparse, I took a look at its documentation and wound up with rather negative feelings about it. Having now written a program or two that uses argparse, I'm taking those old views back.

I don't yet have any opinion on argparse as more than an optparse replacement for putting together quick argument handling for simple commands, but there are a number of things that I like about it for that role. In no particular order:

  • argparse doesn't produce warnings from pychecker. I know, this is a petty thing, but it's still nice to be able to just run 'pychecker program.py' without having to carefully guard 'import optparse' with various magic bits of code.

  • It's nice to be able to skip setting a default value for boolean flags with a store_true or store_false action. One less bit of code noise.

  • argparse gives you a simple way to define conflicting options. It isn't all that general but just having it there means that my programs have somewhat better option error checking. If I had to do it by hand, I might be tempted to not bother.

    (Because of the lack of generality, argparse doesn't give you top notch handling of conflicting arguments; if you want to do a really good job in even moderately complicated situations, you'll have to at least partially roll your own. But argparse is good enough for handling obvious cases in a simple program that you don't expect to be misused except by accident.)

  • It's conveniently lazy to let argparse handle positional arguments too. You can just tell it that there must be exactly N, or at least one, or whatever, and then continue onwards knowing that argparse will take care of all of the error checking and problem reporting and so on (see the sketch just after this list). If it gets to your code, you have at least the right number of arguments and you can pull them off the Namespace object it returns.

    (If you want to go a little bit crazy you can do a bunch of argument type validation as argparse processes the arguments. I'm not convinced that this is worth it for simple programs.)
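Here's a sketch of that positional argument handling; the 'group' argument matches the opts.group that shows up in the main() below:

import argparse

p = argparse.ArgumentParser()
p.add_argument("group", nargs="+", help="group(s) to process")
opts = p.parse_args()
# with nargs="+", argparse itself rejects an empty argument list;
# by the time we get here, opts.group is a non-empty list.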

The result of all of this is to reduce the amount of more or less boilerplate code that a simple argparse-using program needs to contain. Today I wrote one where the main function reduced down to:

def main():
   p = setup_args()
   opts = p.parse_args()
   for grname in opts.group:
      process(grname, opts)

All of the 'must have at least one positional argument' and 'some options obviously conflict' and so on error handling was entirely done for me in the depths of parse_args, so my code here didn't even have to think about it.

(I've historically shoved all of the argument parser setup off into a separate function. It's sufficiently verbose that I prefer to keep it out of the way of the actual logic in my main() function; otherwise it can be too hard to see the logic forest for the argument setup trees. With a separate setup_args() function, I can just skip over it entirely when reading the code later.)

ArgparseBriefPraise written at 23:41:54

2016-03-16

How 'from module import ...' may not be doing what you expect

There are a number of reasons to avoid things like 'from module import *'; for instance, it can be confusing later on and you can import more than you expect. But if you're doing this in the context of, say, just splitting a big source file apart it's tempting to say that these are not really problems. You're not going to be confused about where things come from because you're only importing everything from your own source files (and you're not even thinking of them as modules), and it's perfectly okay for there to be namespace contamination because that's kind of the point. But even then there are traps, because 'from module import ...' is not really doing what you might think it's doing.

There are two possible misconceptions here. If you're doing 'from module import *' within your own code, often what you want is for there to be one conjoined namespace where everything lives, both stuff from the other 'module' (really just a file) and stuff from your 'module' (the current file). If you're doing 'from module import A', it's easy (and tempting) to think that when you write plain A in your code, Python is basically automatically rewriting it to really be 'module.A' for you. Neither is what actually goes on in Python, although things can often look like it.

What a 'from module' import really does is copy things from one module namespace into another. More specifically, it copies the current bindings of names. You can think of 'from module import *' as doing something roughly like this:

import module
_this = globals()
for n in dir(module):
    _this[n] = getattr(module, n)

del _this
del module

(This code does not avoid internal names, doesn't respect __all__, and so on. It's a conceptual illustration.)

There are still two completely separate module namespaces, yours and the namespace of module; you have just copied a bunch of things from the module namespace into yours under the same name (or just some things, if you're doing 'from module import A, B'). Functions and classes from module are still using their module namespace, even if a reference to some or all of them has been copied into your module.

(As a corollary to this, things from module mostly can't refer to anything from your module namespace. This is easy to see since you can't have circular imports; if you're importing module to get at its namespace, it can't be importing you to get at yours. (Yes, there are odd ways around this.))

One reason why this matters is that if functions or classes from module update things in their module namespace, you may or may not pick up the changes in your own module. For example, consider the following code in some other module:

gvar = 10
def setit(newval):
  global gvar
  gvar = newval

The gvar that you see in your own module will forever be '10', no matter what calls to setit() have been made. However, code in the other module will see a different value for gvar.
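To make this concrete, here's a minimal demonstration, assuming the snippet above lives in a file called module.py:

import module
from module import gvar, setit

setit(20)
print(gvar)         # still 10; our copied binding never changed
print(module.gvar)  # 20; the binding in module's namespace was rebound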

Not all sorts of updates will do this, of course. If gvar is a dictionary and code just adds, changes, and deletes keys in it, everyone will see the same gvar. The illusion of a shared namespace can hold up, but it is ultimately only an illusion and it can be fragile. (And unless you already know Python well, it isn't necessarily easy to see where and when it's going to break down.)

Sidebar: An additional bit of possible weirdness

There are some situations where a module's namespace is more or less overwritten wholesale; the obvious case is reload() of the module. If you reload() a module that has been the subject of 'from module import ...', all of those bare imports are now broken, or at least not updated themselves. You can get into very odd situations this way (especially considering what reloading a module really does).

FromImportBindingIssue written at 23:59:45

I wish I could split up code more easily in Python

This really starts with some tweets:

This Python program has grown to almost 1500 lines. I think I need an intervention, or better data structures, or something.
I also wish it was easier and more convenient to split up a Python program across multiple source files (it's one way Go wins).

The best way to split up a big program is to genuinely modularize it. In other words, find separate pieces of functionality that can be cleanly extracted and turn them into Python modules, in separate files. There are still issues with your main program actually finding the modules, but this can be worked around (even though it is and remains annoying).

However, this assumes that you have a modular structure to start with, with things sensibly separated. If your program started off as a little 200 line thing and then grew step by step into a 1500 line monster, you may not necessarily have this. That's where Python makes things a little bit awkward. Splitting things up into separate files fundamentally puts them in separate modules and thus separate namespaces; in order to split, you need to be able to pull your code apart in this way. If your code isn't in this state already, you have some degree of rewriting ahead of you, and in the meantime you have a 1500 line Python file.

(In theory you can do 'from modname import *'. In practice this is only faking a single namespace and the fakery can break down in various ways.)

Go may be less elegant here (and Go certainly makes it harder to have separate namespaces), but you can slice a big source file up into several separate ones while keeping them all co-mingled as one module, all using bits and pieces from each other. Sometimes this is more convenient and expedient, even if it may be uglier.

With that said, Python has excellent reasons to require every separate file to be a separate module. To summarize very quickly, it's tied to how you don't just load a file of Python source code, you run it (with things like function and class definitions actually being executable statements, and possibly other interesting things happening). This is a straightforward model that's quite appropriate for an interpreted language, but it imposes certain constraints.

SplittingProgramProblems written at 01:43:45

2016-03-04

Some notes on supporting readline (tab) completion in your Python program

Adding basic readline-style line editing to a Python program that reads input from the user is very simple; as the readline module documentation says, simply importing the module activates this without you having to call anything. However, adding completion is less well documented, so here are some notes about it.

First, you need both a readline completion binding and to register a completion function. The easiest way to get a completion binding is just to set it up explicitly:

readline.parse_and_bind("tab: complete")

You may also want to change the delimiter characters with readline.set_completer_delims. In my own code, I reduced the delimiters to space, tab, and newline. Note that if you have possible completions that include delimiter characters, nothing complains and things sort of work, but not entirely.
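That reduction is a one-liner:

readline.set_completer_delims(" \t\n")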

So, now we get to completion functions. Readline needs a completion function, and it's easiest to show you how a simple one works:

import readline

comps = ["abc", "abdef", "charlie", "horse",]

def complete(text, state):
   # generate candidate completion list
   if text == "":
      matches = comps
   else:
      matches = [x for x in comps if x.startswith(text)]

   # return the state'th completion match, if any
   if state >= len(matches):
      return None
   else:
      return matches[state]

readline.set_completer(complete)

You are passed the current 'word' being completed and a 'state', which is a 0-based index. Your completion function's job is to return the state'th completion for the current word, or something other than a string if you've run out of completions, and you'll actually be called with ever-increasing state values until you declare 'no more'. As we see here, the list of completions that you return does not have to be in alphabetical order. Obviously it really should be a stable order for any particular input word; otherwise things will probably get confused.

By the way, readline will completely swallow any exceptions raised by your complete() function. The only symptom of major errors may be that you get fewer completions than you expect, or none at all.

Of course it's common to want to be a little smarter about possible completions based on the context. For instance, you might be completing a command line where the first word is a command and then following words are various sorts of arguments, and it'd be nice not to offer as completions things that would actually be errors when entered. To do this, you often want to know what is before the current word being completed:

def get_cur_before():
   # return everything on the line before the start of
   # the word currently being completed
   idx = readline.get_begidx()
   full = readline.get_line_buffer()
   return full[:idx]

Because words being completed stop at delimiter characters, anything in this before-the-word text is what readline considers a full word (or words); otherwise it would be part of the word currently being completed. If you want to know what the first complete word of the line is, you can thus do something like:

   pref = get_cur_before()
   n = pref.split()
   cmd = n[0] if len(n) > 0 else ""

You can then use cmd to decide what set of completions to use. Other options are possible with the use of various additional readline functions, but this is all I've needed to use so far for the completions in my code.

Given that your complete() function is called repeatedly every time the user hits TAB, and that it redoes all of this examination, selection, and matching on every call, you might worry about performance; it sure seems like there's a lot of duplicated work. The good news is that modern computers are very fast, so you probably aren't going to notice. If you do worry about this, what the rlcompleter module does is generate the list of matches when state is 0, cache it, and then use the cached list whenever state is non-zero. You can probably count on this approach to keep working in the future.
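Here is a sketch of that caching pattern, reusing the comps list from before:

import readline

comps = ["abc", "abdef", "charlie", "horse",]

class CachingCompleter:
   def __init__(self, words):
      self.words = words
      self.matches = []

   def complete(self, text, state):
      if state == 0:
         # compute the candidate list once per completion attempt
         self.matches = [w for w in self.words if w.startswith(text)]
      try:
         return self.matches[state]
      except IndexError:
         return None

readline.set_completer(CachingCompleter(comps).complete)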

Speaking from personal experience, it was not all that much work to add readline completion to my program once I worked out what I actually needed to do, and having readline tab completion available is surprisingly fun.

(And of course it's handy. There's a reason everyone loves tab-completing things when they can.)

PS: remember to turn off readline completion if and when it's no longer applicable, such as when you're getting other input from the user (perhaps a yes/no approval). Otherwise things can get at least puzzling. This can be done with readline.set_completer(None).

ReadlineCompletionNotes written at 01:14:10

2016-02-12

Adding a new template filter in Django 1.9, and a template tag annoyance

As the result of my discovery about Django's timesince introducing nonbreaking spaces, I wanted to fix this. Fixing it requires coding up a new template filter and then wiring it into Django, which took a little bit of flailing around. I specifically picked Django 1.9 as my target, because 1.9 supports making your new template filters and tags available by default, without a '{% load ... %}' statement, and this matters to us.

When you are load'ing new template widgets, your files have to go in a specific and somewhat annoying place in your Django app. Since I wasn't doing this, I was free to shove my code into a normal .py file. My minimal filter is:

from django import template
from django.template.defaultfilters import stringfilter

register = template.Library()

@register.filter
@stringfilter
def denonbreak(value):
   """Replace non-breaking spaces with plain spaces."""
   return value.replace(u"\xa0", u" ")

The resulting filter is called denonbreak. Although the documentation doesn't say so explicitly, you are specifically handed a Unicode string, so interacting with it using plain (byte) strings may not be reliable (or work at all). I suppose this is not surprising (and people using Python 3 expect it anyway).

To add your filter(s) and tag(s) as builtins, you make use of a new Django 1.9 feature in the normal template backend when setting things up in settings.py. This is easiest to show:

TEMPLATES = [
  {
    'BACKEND': 'django.template.backends.django.DjangoTemplates',
    [...]
    'OPTIONS': {
       'builtins': ['accounts.tmplfilters'],
       [...]

(Do not get diverted to 'libraries'; it is for something else.)

At this point you might ask why I care about not needing to {% load %} my filter. The answer is a long-standing limitation of Django templates: there is no good way to suppress the newline at the end of a template directive.

Suppose you have a template where you want to use your new tag:

{% load something %}
The following pending account requests haven't been
handled for at least {{cutoff|timesince|denonbreak}}:
[...]

Django will remove the {% load %}, but it won't remove the newline after it. Thus your rendered template will wind up starting with a blank line. In HTML this is no problem; surplus blank lines silently disappear when the browser renders the page. But in plain text it's another story, because now that newline is sticking around, clearly visible and often ugly. To fix it you must stick the {% load %} at the start of the first real line of text, which looks ugly in the actual template.
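So in practice the template winds up looking like this:

{% load something %}The following pending account requests haven't been
handled for at least {{cutoff|timesince|denonbreak}}:
[...]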

({% if %} is another template tag that will bite you in plaintext because of this. Basically any structuring tag will. I really wish Django had an option to suppress the trailing newline in these cases, but as far as I know it doesn't.)

This issue is why I was willing to jump to Django 1.9 and use the 'builtins' feature, despite what everyone generally says about making custom things be builtins. I just hate what happens to plaintext templates otherwise. Ours are ugly enough as it is because of other tags with this issue.

Django19NewTemplateFilter written at 01:19:34

2016-02-03

Django, the timesince template filter, and non-breaking spaces

Our Django application uses Django's templating system for more than just generating HTML pages. One of the extra things is generating the text of some plaintext email messages. This trundled along for years, and then a Django version or two ago I noticed that some of those plaintext emails had started showing up not as plain ASCII but as quoted-printable with some embedded characters that did not cut and paste well.

(One reason I noticed is that I sometimes scan through my incoming email with plain less.)

Here's an abstracted version of such an email message, with the odd bits italicized:

The following pending account request has not been handled for at least 1 week.

  • <LOGIN> for Some Person <user@somewhere>
    Sponsor: A professor
    Unhandled for 1 week, 2 days (since <date>)

In quoted-printable form the spaces in the italicized bits were =C2=A0 (well, most of them).

I will skip to the punchline: these durations were produced by the timesince template filter, and the =C2=A0 is the utf-8 representation of a nonbreaking space, U+00A0. Since either 1.5 or 1.6, the timesince filter and a couple of others now use nonbreaking spaces after numbers. This change was introduced in Django issue #20246, almost certainly by a developer who was only thinking about the affected template filters being used in HTML.

In HTML, this change is unobjectionable. In plain text, it does any number of problematic things. Of course there is no option to change this or to control this behavior. As the issue itself cheerfully notes, if you don't like this change or it causes problems, you get to write your own filter to reverse it. Nor is this documented (and the actual examples of timesince output in the documentation use real spaces).

Perhaps you might say that documenting this is unimportant. Wrong. In order to find out why this was happening to my email, I had to read the Django source. Why did I have to do that? Because in a complex system there are any number of places where this might have been happening and any number of potential causes. Django has both localization and automatic safe string quotation for things you insert in templates, so maybe this could have been one or both in action, not a deliberate but undocumented feature in timesince. In the absence of actual documentation to read, the code is the documentation and you get to read it.

(I admit that I started with the timesince filter code, since it did seem like the best bet.)

Is the new template filter I've now written sufficient to fix this? Right now, yes, but of course not necessarily in general in the future. Since all of this is undocumented, Django is not committed to anything here. It could decide to change how it generates non-breaking spaces, switch to some other Unicode character for this purpose, or whatever. Since this is changing undocumented behavior Django wouldn't even have to say anything in the release notes.

(Perhaps I should file a Django bug over at least the lack of documentation, but it strikes me as the kind of bug report that is more likely to produce arguments than fixes. And I would have to go register for the Django issue reporting system. Also, clearly this is not a particularly important issue for anyone else, since no one has reported it despite it being a three year old change.)

DjangoTimesinceNBSpaces written at 23:42:32

2016-01-28

Modern Django makes me repeat myself in the name of something

One of the things that basically all web frameworks do is URL routing, where they let you specify how various different URL patterns are handled by various different functions, classes, or whatever. Once you have URL routing, you inevitably wind up wanting reverse URL routing: given a handler function or some abstract name for it (and perhaps some parameters), the framework will generate the actual URL that refers to it. This avoids forcing you to hard-code URLs into both code (for eg HTTP redirections) and templates (for links and so on), which is bad (and annoying) for all sorts of reasons. As a good framework, Django of course has both powerful URL routing and powerful reverse URL generation.

Our Django web app is now about five years old and has not been substantially revised since it was initially written for Django 1.2. Back in Django 1.2, you set up URL routing and reversed routing something like this:

urlpatterns = patterns('',
   (r'^request/$', 'accounts.views.makerequest'),
   [...]

And in templates, you got reverse routing as:

<a href="{% url "accounts.views.makerequest" %}"> ... </a>

Here accounts.views.makerequest is the actual function that handles this particular view. This is close to the minimum amount of information that you have to give the framework, since the framework has to know the URL pattern and what function (or class or etc) handles it.

Starting in Django 1.8 or so, Django changed its mind about how URL reversing should work. The modern Django approach is to require that all of your URL patterns be specifically named, with no default. This means that you now write something like this:

urlpatterns = [
   url(r'^request/$', accounts.views.makerequest, name="makerequest"),
   [...]

And in templates and so on you now use the explicit name, possibly with various levels of namespaces.

Now, Django has a few decent reasons for wanting support for explicit reverse names on URL patterns; you can have different URL patterns map to the same handler function, for example, and you may care about which one your template reverses to. But in the process of supporting this it has thrown the baby out with the bathwater, because it has made there be no default value for the name. If you want to be able to do reverse URL mapping for a pattern, you must explicitly name it, coming up with a name and adding an extra parameter.

Our Django web app has 38 different URL patterns, and I believe all of them get reversed at some point or another. All of them have unique handler functions, which means that all of them have unique names that are perfectly good. But modern Django has no defaults, so now Django wants me to add a completely repetitive name= parameter to every one of them. In short, modern Django wants me to repeat myself.

(It also wants me to revise all of the templates and other code to use the new names. Thanks, Django.)

As you may guess, I am not too happy about this. I am in fact considering ugly hacks simply because I am that irritated.

(The obvious ugly hack is to make a frontend function for url() that generates the name by looking up the __name__ of the handler function.)
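That hack might look something like this (nurl is a made-up name):

from django.conf.urls import url

def nurl(regex, view, **kwargs):
   # default the reverse name to the view function's own name,
   # so we don't have to repeat ourselves on every pattern.
   kwargs.setdefault("name", view.__name__)
   return url(regex, view, **kwargs)

urlpatterns = [
   nurl(r'^request/$', accounts.views.makerequest),
   [...]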

PS: Django 1.8 technically still supports the old approach, but it's now officially deprecated and will be removed in Django 1.10.

DjangoUrlReversingRepeatingMyself written at 00:12:58

2016-01-25

A Python wish: an easy, widely supported way to turn a path into a module

Years ago I wrote a grumpy entry about Django 1.4's restructured directory layout and mentioned that I was not reorganizing our Django web app to match. In the time since then, it has become completely obvious that grimly sticking to my guns here is not a viable answer over the long term; sooner or later, ideally sooner, I need to restructure the app into what is now the proper Django directory layout.

One of the reasons that I objected to this (and still do) is the problem of how you make a directory into a module; simply adding the parent directory to $PYTHONPATH has several limitations. Which is where my wish comes in.

What I wish for is a simple and widely supported way to say 'directory /some/thing/here/fred-1 is the module fred'. This should be supported on the Python command line, in things like mod_wsgi, and in Python code itself (so you could write code that programmatically added modules this way, similar to how you can extend sys.path in code). The named module is not imported immediately, but if later code does 'import fred' (or any of its variants) it will be loaded from /some/thing/here/fred-1 (assuming that there is no other fred module found earlier on the import path). All of the usual things work from there, such as importing submodules ('import fred.bob') and so on. The fred-1 directory would have a __init__.py and so on as normal.

(Note that it is not necessary for the final directory name to be the same as the module name. Here we have a directory being fred-1 but the module is fred.)

Given PEP 302, I think it's probably possible to implement this in Python code. However, both PEP 302 and the entire current Python import process make my head hurt so I'm not sure (and there are probably differences between Python 2 and Python 3 here).
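As an untested Python 3 sketch of the idea, using a custom meta path finder with the fred mapping hard-coded:

import importlib.util
import os
import sys

class DirModuleFinder:
   # map explicit module names to directories, PEP 302 style
   def __init__(self, mapping):
      self.mapping = mapping

   def find_spec(self, name, path=None, target=None):
      if name not in self.mapping:
         return None
      dirpath = self.mapping[name]
      return importlib.util.spec_from_file_location(
         name, os.path.join(dirpath, "__init__.py"),
         submodule_search_locations=[dirpath])

# appending (instead of prepending) means that a real 'fred' found
# on the normal import path still wins, as the wish above wants.
sys.meta_path.append(DirModuleFinder({"fred": "/some/thing/here/fred-1"}))

import fred      # now loads from /some/thing/here/fred-1
import fred.bob  # submodules work too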

(I wrote some notes to myself about Python packaging a while back, which is partly relevant to this quest. I don't think .egg and .zip files let me do what I want here, even if I was willing to pack things up in .zips, since I believe their filenames are bound to the package/module name.)

PathIntoModuleWish written at 01:38:42

2015-12-13

I still believe in shimming modules for tests

A commentator on my 2008 entry about shimming modules for testing recently asked if I still liked this approach today. My answer is absolutely yes; I continue to love that Python lets me do this, and I feel that it (or a variation on it) is the best approach for mocking out parts of standard modules. There are two reasons for this.

The first is that I continue to believe in what I wrote in my old entry about why I do this; I would much rather have clean code and dirty tests than clean tests and dirty code. I consider code that is full of artificial dependency injection to be dirty. For instance, it's hard to think of a reason why you'd need to do DI for socket module functions apart from being able to inject fakes during testing. Artificially contorting my code to test it bugs me enough that I basically don't do it for my own programming.

The second reason follows on from the first reason, and it is that monkey patching modules this way is an excellent way of exactly simulating or exactly replaying the results you would get from them in the real world under various circumstances. If you discover that some tricky real world scenario gives your code problems, you can capture the low level results of interacting with the outside world and then use them for your future tests. You don't need some cooperative outside entity that fails in a specific controlled way, because you can just recreate it internally.
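As a concrete illustration of this, a shimmed test for a hypothetical myprog.lookup_host() might look like the following, with the canned getaddrinfo() result notionally captured from a real call:

import socket
import unittest

import myprog  # hypothetical module under test

# a real-world getaddrinfo() result, captured once and replayed exactly
CANNED_GAI = [(socket.AF_INET, socket.SOCK_STREAM, 6, '',
               ('127.0.0.1', 80))]

class TestLookup(unittest.TestCase):
   def setUp(self):
      self._real = socket.getaddrinfo
      socket.getaddrinfo = lambda *args, **kwargs: CANNED_GAI

   def tearDown(self):
      socket.getaddrinfo = self._real

   def test_lookup(self):
      self.assertEqual(myprog.lookup_host("www.example.org"),
                       "127.0.0.1")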

Without some way of doing this 'exact replay' style of injecting results, what I at least wind up with is tests that can have subtle failures. Synthetic high level data can be quietly wrong, and while synthetic low level data can be wrong too, my view is that I'm much more likely to notice because I know exactly what, eg, a DNS lookup should return.

(If I don't know exactly what a low level thing should return, I'm likely to actually test it and record the results. There are ways for this to go wrong, for example if I can't naturally create some malfunction that I want to test against, but I think it's at least somewhat less likely.)

Finally, I simply feel happier if the code I'm testing uses code paths that are as close as possible to what it will use outside of testing. With monkey patching modules for tests, the code paths are authentic right down until they hit my monkey patched modules. With dependency injection, some amount of code is not being tested because it's the code involved with creating and injecting the real dependencies. Probably I will find out right away if this code has some problem, but I can imagine ways to subtly break things, and that makes me a bit nervous (somewhat like my issues with complex mocks).

ShimmingModulesForTestsII written at 02:48:12

