2007-02-28
Using Unix domain sockets from Python
Using Unix domain sockets from Python's socket module is pretty easy, but underdocumented. Since I've done it (and I may want to do it again sometime):
- you need to create the socket with an explicit family of AF_UNIX. I usually go whole-hog and explicitly specify SOCK_STREAM as well.
- the address argument to .bind() and .connect() is the path of the Unix socket file. For .bind(), the file can't already exist, which means that it has to be in a directory that you have write permissions on and you probably want to try to remove it if it's there already. Note that the address argument is restricted to be fairly short, well below the maximum Unix path or filename limits.
- nothing removes the socket file when the socket closes down, hence the probable need to remove it before you try to .bind(). (Yes, on an orderly shutdown you can remove it. And when your daemon doesn't shut down in an orderly way?)
- on Linux, but not necessarily elsewhere, people who want to talk to your daemon need write permission on the socket file. Since .bind() creates the file, you are going to need to use either os.umask() or os.chmod() if you want other UIDs to be able to talk to it; the latter is the simpler option. (This implies that to portably restrict access to your daemon, you need to put the socket file in a mode 0700 directory that you own. And to portably open access to everyone, you need to both put the socket file in an open directory and make it mode 0666 or so.)
- Unix domain sockets have no peername. Don't bother trying.
Moral: Unix domain socket portability is hard. On the other hand, you do get UID-based access restrictions for free.
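Pulling those points together, here is a minimal sketch of the server side; the socket path and permission bits are illustrative assumptions, not anything canonical:

```python
import os
import socket

# Illustrative path; remember the length limit mentioned above.
SOCK_PATH = "/tmp/mydaemon.sock"

# Remove any stale socket file from a previous unclean shutdown,
# since .bind() fails if the file already exists.
if os.path.exists(SOCK_PATH):
    os.remove(SOCK_PATH)

# Explicit family and type, as discussed.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK_PATH)

# On Linux, clients need write permission on the socket file;
# chmod after .bind() is the simpler way to open access up.
os.chmod(SOCK_PATH, 0o666)

srv.listen(5)

# A client connects with the same family, type, and path:
#   cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
#   cli.connect(SOCK_PATH)
```

(The chmod-after-bind approach does leave a small window where the file has restrictive permissions; os.umask() closes it at the cost of fiddling with process-wide state.)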
Using Unix domain sockets from C is similar but more annoying. You use
a struct sockaddr_un, setting the sun_family to AF_UNIX and
then memcpy()'ing the path of the socket file into sun_path. (Be a
good program and check for buffer overflows first.)
2007-02-25
Ordered lists with named fields for Python
I periodically find myself dealing with structures that are basically ordered lists with named fields, where elements 0, 1, and 2 are naturally named 'a', 'b', and 'c' and sometimes you want to refer to them by name instead of having to remember their position. This pattern even crops up in the standard Python library, often with functions that started out just returning an ordered list and grew the named fields portion later as people discovered how annoying it was to have to remember that the hour was field 3.
This being Python, I've built myself some general code to add named
fields on top of sequence types like list or tuple. For maximum
generality my code supports using field names both as attribute names
and as indexes, so you can use both obj.field and obj["field"], and
you can even do crazy things like obj["field":-1]. The code:
class GetMixin(object):
fields = {}
def _mapslice(self, key):
s, e, step = key.start, key.stop, key.step
if s in self.fields:
s = self.fields[s]
if e in self.fields:
e = self.fields[e]
return slice(s, e, step)
def _mapkey(self, key):
if isinstance(key, tuple):
pass
elif isinstance(key, slice):
key = self._mapslice(key)
elif key in self.fields:
key = self.fields[key]
return key
def __getitem__(self, key):
key = self._mapkey(key)
return super(GetMixin, self).__getitem__(key)
def __getattr__(self, name):
if name in self.fields:
return self[self.fields[name]]
raise AttributeError, \
"object has no attribute '%s'" % name
class SetMixin(GetMixin):
def __setitem__(self, key, value):
key = self._mapkey(key)
super(SetMixin, self).__setitem__(key, value)
def __setattr__(self, name, value):
if name in self.fields:
o = self.fields[name]
self[o] = value
else:
self.__dict__[name] = value
class Example(SetMixin, list):
fields = {'a': 0, 'b': 1, 'c': 2}
The fields class variable is a dictionary mapping the names of the
fields to their index offsets; it need not include all fields, and
not all named fields necessarily have values for a particular list
(since nothing checks the list length).
GetMixin just lets you read the named fields and can be mixed in with
tuples; SetMixin lets you write to them by name too, and so needs to
be mixed in with lists or other writable sequence types.
The easiest way to generate the fields value for the usual case
of sequential field names starting from the first element of the list
is to use a variant of the enumerate function from the itertools
recipes:
from itertools import *
def enum_args(*args):
return izip(args, count())
class Example(SetMixin, list):
fields = dict(enum_args('a', 'b', 'c'))
(If you're going to do this a lot, make a version of enum_args that
does the dict() step too.)
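Putting it all together, here is how the resulting objects behave. This restates the classes above in one self-contained piece so it can be run directly, with modern spellings (raise as a call, and zip plus itertools.count instead of izip):

```python
from itertools import count

class GetMixin(object):
    fields = {}
    def _mapslice(self, key):
        # Map field names in slice bounds to their index offsets.
        s, e, step = key.start, key.stop, key.step
        if s in self.fields:
            s = self.fields[s]
        if e in self.fields:
            e = self.fields[e]
        return slice(s, e, step)
    def _mapkey(self, key):
        if isinstance(key, tuple):
            pass
        elif isinstance(key, slice):
            key = self._mapslice(key)
        elif key in self.fields:
            key = self.fields[key]
        return key
    def __getitem__(self, key):
        return super(GetMixin, self).__getitem__(self._mapkey(key))
    def __getattr__(self, name):
        if name in self.fields:
            return self[self.fields[name]]
        raise AttributeError("object has no attribute '%s'" % name)

class SetMixin(GetMixin):
    def __setitem__(self, key, value):
        super(SetMixin, self).__setitem__(self._mapkey(key), value)
    def __setattr__(self, name, value):
        if name in self.fields:
            self[self.fields[name]] = value
        else:
            self.__dict__[name] = value

def enum_args(*args):
    return zip(args, count())

class Example(SetMixin, list):
    fields = dict(enum_args('a', 'b', 'c'))

e = Example([10, 20, 30])
print(e.a, e['b'])    # attribute and index access -> 10 20
print(e['a':'c'])     # named slicing -> [10, 20]
e.c = 99              # attribute assignment through SetMixin
print(e[2])           # -> 99
```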
Inheriting from list, tuple, etc does have one practical wart: you
probably want to avoid using field names that are the names of methods
that you want to use, because you won't be able to use the obj.field
syntax for accessing them. Amusingly, you will be able to set them
using that syntax, because __setattr__ gets called for everything,
existing attributes included (which is why it needs the dance at the
end with the instance's __dict__).
(This code is not quite neurotically complete; truly neurotically
complete code would make the available fields appear in dir()'s
output. But I don't want to try to think what sort of hacks that
would take, since I am seeing visions of dancing metaclasses that
automatically create properties for each field name.)
2007-02-24
Some things to remember when implementing __getitem__
I've recently been doing some work on classes derived from list and
tuple that fiddle with the behavior of __getitem__, and ran into
a couple of surprises that I am going to write down so that I remember
them in the future:
- for the simple 'obj[i:j]' case, __getitem__ is not called if the class has a __getslice__ method. list and tuple both do, despite the method being labeled as deprecated since Python 2.0.
- the moment you use an i or j that is not a simple integer (including floating point numbers, but not including long integers), it turns into a call to __getitem__ with a slice object as the key. (This is simple slicing.)
Supporting the full slice syntax in a __getitem__
implementation makes my head hurt; you can get handed a slice object,
or a tuple that can contain at least slice objects, ellipsis objects,
and numbers (and probably more that I don't know about). Just throw
TypeError instead; it's what lists and tuples do.
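A sketch of the throw-TypeError approach, for a hypothetical list subclass that only wants to support plain integer subscripts (the class name is made up for illustration):

```python
class IntIndexOnly(list):
    def __getitem__(self, key):
        # Accept only plain integer subscripts; punt on slices,
        # tuples, Ellipsis, and anything else exotic.
        if not isinstance(key, int):
            raise TypeError("unsupported subscript type: %r" % (type(key),))
        return super(IntIndexOnly, self).__getitem__(key)

x = IntIndexOnly([1, 2, 3])
print(x[1])    # -> 2
# x[1:2] raises TypeError here
```

(On Python 2 you would also need to deal with __getslice__, inherited from list, for the reason covered in the first point above; on later Pythons everything goes through __getitem__.)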
Checking that your __getitem__ has not been called with such things,
so that you can throw TypeError appropriately, is harder than it should
be. I personally wish that __getitem__ wasn't so generic; it seems
un-Pythonic to have to inspect the type of your argument to figure out
what to do with it.
(The better way would be to have one method to implement plain subscripting, one for simple slices, and a third one for extended slicing. Unfortunately it's too late for that now.)
2007-02-22
A simplified summary of Python's method resolution order
Crudely summarized, method resolution order is how Python decides where to look for a method (or any other attribute) on a class that inherits from several classes. Python actually has two; a simple one for old style classes and a rather complicated one for new style classes.
(Technically, method resolution order applies even for single inheritance classes, it's just that they have a very boring one.)
With the old style class:
class Foo(A, B, C):
pass
Python will look up Foo.foo by looking for foo on A and all of its
ancestors, then B and all its ancestors, and finally C and all its
ancestors; that is, the method resolution order is left to right, depth
first.
This order is nice and simple but blows up if A and B have a common
ancestor with behavior that B wants to override, which is why new style
classes need a different scheme. (All new style classes are ultimately
rooted at object, which defines various default actions for things
like getting and setting attributes.)
The complete description of the method resolution order for new style classes is somewhat complicated. For simple class structures, it can be summarized as left to right and depth first but common ancestor classes are only checked after all of their children have been checked. Thus, with new style classes:
class A1(object): pass
class A2(object): pass
class A3(object): pass
class B2(A2): pass
class B3(A3): pass
class C3(B3): pass
class Foo(C3, A1, B2): pass
The method resolution order for Foo.foo is Foo, C3, B3, A3, A1, B2,
A2, and then object; as the common ancestor, object is checked
only after all of its children have been. Note that the MRO can vary
drastically between a parent and a child class; C3's MRO is just C3, B3,
A3, and object.
In case of doubt or curiosity you can find the MRO of any new style
class in its __mro__ attribute. Normally you don't need to care
about its value and should use the super() builtin if you need to find
the next class up in the context of a particular object or class.
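You can check the linearization above directly; this reproduces the example classes and inspects __mro__:

```python
class A1(object): pass
class A2(object): pass
class A3(object): pass
class B2(A2): pass
class B3(A3): pass
class C3(B3): pass
class Foo(C3, A1, B2): pass

print([c.__name__ for c in Foo.__mro__])
# -> ['Foo', 'C3', 'B3', 'A3', 'A1', 'B2', 'A2', 'object']
print([c.__name__ for c in C3.__mro__])
# -> ['C3', 'B3', 'A3', 'object']
```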
(This is the kind of entry I write partly to make sure I have all this straight in my own head.)
A note about the ordering of mixin classes
In Python, when you have a class that inherits from both a primary class and some mixin classes (for example, if you're using the SocketServer stuff), it's conventional to declare your class's inheritance list with the primary class first:
class Real(primary, mixin1, mixin2):
....
However, an important safety tip: if your mixin class overrides methods of the primary class, it has to be first. Failure to observe this safety tip can cause head-scratching bugs followed by head-smacking embarrassment.
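A tiny demonstration of why the order matters; the class names are invented for the example:

```python
class Primary(object):
    def greet(self):
        return "primary"

class Mixin(object):
    def greet(self):
        return "mixin"

# Mixin listed first: its override wins, as intended.
class Right(Mixin, Primary):
    pass

# Primary listed first: the mixin's override is never found.
class Wrong(Primary, Mixin):
    pass

print(Right().greet())   # -> mixin
print(Wrong().greet())   # -> primary
```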
(Since I was mixing stuff in to standard types like list and tuple
and str, I spent a certain amount of time wondering if the interpreter
had special direct magic for them that meant I couldn't hijack and
augment their standard behavior. I felt somewhat foolish when there
turned out to be a much simpler explanation.)
2007-02-21
Fixing Python's string .join()
The thing that has always irritated me about string .join() is that it
doesn't stringify its arguments; if one of the things in the sequence to
be joined isn't a string, .join() doesn't call str() on it, it just
pukes. This is periodically annoying and inconvenient.
It recently occurred to me that this can be fixed, like so:
class StrifyingStr(str):
def join(self, seq):
s2 = [str(x) for x in seq]
return super(StrifyingStr, self).join(s2)
def str_join(js, seq):
return StrifyingStr(js).join(seq)
(A similar version for Unicode strings is left as an exercise for the reader.)
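Usage looks like this; the separator and values are arbitrary (the code above is repeated so the example is self-contained):

```python
class StrifyingStr(str):
    def join(self, seq):
        # Stringify everything before handing off to str.join().
        s2 = [str(x) for x in seq]
        return super(StrifyingStr, self).join(s2)

def str_join(js, seq):
    return StrifyingStr(js).join(seq)

print(str_join(", ", [1, "two", 3.0]))  # -> 1, two, 3.0
print(str_join("-", range(4)))          # -> 0-1-2-3
```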
You might think that a generator expression would be more efficient
than a list comprehension here; in fact, that's what my first version
used. Then I actually timed it, and found out that regardless of whether
or not .join() was passed a list or an iterator, and for sizes of
the list (or iterator) from 10 elements to 10,000, doing the list
comprehension was slightly faster.
Now that I have this I can think of a number of places where I may wind up using it, which kind of makes me wish I'd scratched this irritation before now.
2007-02-17
Programming fun
Programming fun is spending a couple of hours writing, revising, and tuning a DWiki feature that I'm not sure I'm actually going to like well enough to keep.
I'm serious, not being sarcastic; I like programming, but not all ideas for improving a program pan out. An idea that seems great in my mind can be less attractive once I've made it concrete and explored all of the bits that I could gloss over when it was just thought-stuff. So I can spend a pleasant and enthused couple of hours and wind up with something I may end up just throwing away (well, sticking on a back shelf in case I get a clever idea about how to improve it).
Python is a good language for doing this sort of thing in, since it requires very little annoying make-work. It's one thing to materialize my ideas and discover that they don't pan out, but it'd be an entirely different thing to have to slog through a lot of grit for no purpose. (Or to put it another way: the more un-fun involved, the more I'd need to get something useful out of it.)
Sidebar: today's nascent DWiki feature
I'm tired of having to write x86_64 as [[x86_64|]] most of the time, and yesterday's entry has raw text that looks like:
[[atkbd_connect|]] {{C:rarr}}
[[atkbd_activate|]] {{C:rarr}}
[[i8042_interrupt|]] {{C:rarr}}
...
One of my guiding principles for DWikiText is that it has to look attractive and natural in raw text; this is clearly not happening with either example. My first idea is 'processing notes', in a form stolen from *roff:
.pn no _ sub -> {{C:rarr}}
atkbd_connect -> atkbd_activate -> i8042_interrupt -> ...
I'm not sure that I like this format for processing notes (or even the name), plus it doesn't really improve writing x86_64 once or twice in an entry, plus it turns out to have some unaesthetic consequences; the only good way to do the text substitutions right now makes them affect things like URLs (which are in the text but not actually text).
2007-02-13
On Python's grammar
Python's grammar looks imposingly complex from the outside, but the more I've thought about it the more I've realized that it's simple but clever. In particular, the way the work is split between the tokenizer and the actual grammar simplifies both.
(Technically you could claim that Python's grammar is straightforward because the tokenizer takes care of all the hard bits, but I consider this ducking the issue.)
The actual grammar is typical for an Algol-style language. You have the big set of rules for expressions, and the big shunting yards of the basic and compound statements, and that's mostly it. Python gains some simplicity because you can pretty much define anything anywhere (you can embed a class definition in the middle of a function defined inside a function inside a class definition, if you really want to), so it doesn't have to put ordering and placement constraints into the grammar.
(Python also cleverly sets up the grammar so that indentation only shows
up in one rule, 'suite', the generic definition of a block of statements
such as the body of a while or a class definition.)
The tokenizer too is pretty normal, although this is harder to see since it's hand-coded and thus there is no simple-to-read set of production rules, just the rather dry lexical structure description (very few people read language descriptions for fun, whereas you can actually skim a flex input file and get a sense of anything unusual going on). It does have to track the indentation level to generate synthetic INDENT and DEDENT tokens when it changes, and do implicit line joining, but neither is too complicated.
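You can watch those synthetic tokens being emitted with the standard tokenize module; here a one-level indent produces an INDENT/DEDENT pair:

```python
import io
import tokenize

src = "if x:\n    y = 1\n"
toks = [tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(toks)  # 'INDENT' and 'DEDENT' both appear in the list
```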
I am not sure that I would have come up with this split to start with, if I was designing a similar language where indentation was significant, and pretty much all of the other designs I can think of would be more complicated to implement.