2010-07-13
Single-level list flattening in Python
I have been writing a bunch of Python code lately which involves turning
lists of lists into plain lists. As it happens Python does not have
a flatten() built-in, but people have come up with any number of
ingenious ways to do this, some of which I actually like.
First, a stupid non-way that is worth writing down for posterity:
reduce(lambda x,y: x.extend(y), somelists, [])
This version is a classic mistake (one I made today, in fact). Your
reward for trying this is the cryptic error "AttributeError: 'NoneType'
object has no attribute 'extend'". What this really means is that
you've forgotten that .extend() doesn't return anything (ie, it
returns None), and the function that reduce() uses must return the
accumulator variable. You can fix this by picking an equivalent list
operation that does return a result:
reduce(lambda x,y: x+y, somelists, [])
This is both ugly and inefficient, although the inefficiency shouldn't bother you for small lists-of-lists. It is also equivalent to the simpler and slightly faster version:
sum(somelists, [])
All addition-based flatteners have the drawback that they only work on actual lists. If you have a list of tuples, or just a list of some sequence type that is not a list, you lose; list addition works only on two lists.
The best solution is a rare good use of a multi-level list comprehension:
[item for sublist in somelists for item in sublist]
This will work on any sequence type and for any mixture of sequence types (or iterables); you can have an iterable that returns a mixture of lists, tuples, and other iterables and everything will work out and you will get a list at the end.
The multi-level list comprehension approach is both the fastest and the
most general, but it's kind of verbose and hard to follow if you're
not completely up on list comprehensions. I want to like the sum()
approach better, but it too is the kind of thing that makes me go 'oh,
how clever', which is not necessarily a good idea in general code. All
things considered, I'll probably use either the list comprehension
approach or just the plain for loop one in future code.
PS: this is the kind of entry that I write in an attempt to make it
all stick in my head. And in a demonstration that I repeat myself, the
example I used in MultilevelListComps was for list flattening (although
back in 2005 it appears that lists did not have .extend(), or at least
it slipped my mind).
(I mined here and here for this information.)
Sidebar: the boring plain approach
The most verbose but straightforward approach is of course the classic
for loop:
accum = []
for e in somelists:
accum.extend(e)
Just as with the list comprehension version, this works with any sequence types or iterable types; it doesn't require lists. On advantage of this approach is that anyone (your future self included) can easily recognize what the code is doing.
2010-07-02
A gotcha with Python's socket.htonl() and ntohl()
Here is something that I ran into the other day:
the socket module's
htonl() and ntohl() functions will return signed numbers under
some circumstances, not unsigned ones, which means that for some input
values they will return negative numbers.
(Some quick testing suggests that this happens on 32-bit x86 machines
but not 64-bit x86 Linux machines; all of the 32-bit machines I have
access to are running some version of Python 2.5. A quick test to see if
this is happening to you is checking what socket.htonl(255) returns.)
This is kind of annoying. The underlying C API is specifically documented as returning unsigned numbers and in many circumstances you really need them even in Python, which means that you need to force the results into unsigned yourself.
(This is probably an actual Python bug that thus might get fixed sometime.)
PS: to fix this problem, just mask the results with 0xffffffffL:
M32 = 0xffffffffL def htonl(n): return socket.htonl(n) & M32 def ntohl(n): return socket.ntohl(n) & M32
As a side effect, this explicitly forces the results to be Python longs instead of integers. This generally doesn't matter (although I know of one case where it does or did), and besides you don't really have a choice; if you want to represent arbitrary IPv4 addresses as unsigned integers on 32-bit Python machines, you have no choice but to use Python longs.