How recursively flattening a list raises a Python type question

February 26, 2017

Today I wound up reading Why it's hard for programmers to write a program to flatten a list? (via), where the quiz challenge put forward is to turn an input like [1,[2,3], [4, [5,6]]] into [1,2,3,4,5,6]. My immediate reaction was that I'd do this in Python rather than in any statically typed language I know, because all of them make the input type here hard to represent. But then I realized that doing this in Python raises another type-related question.

If we stick exactly to the specification (and directly implement it), the result is pretty simple and straightforward:

def flatten(inlst):
    olst = []
    for i in inlst:
        if isinstance(i, int):
            olst.append(i)
        elif isinstance(i, list):
            olst.extend(flatten(i))
        else:
            raise ValueError("invalid element in list")
    return olst

(You can optimize this by having a _flatten internal function that gets passed the output list, so you don't have to keep building lists and then merging them into other lists as you work down and then back up the recursion stack. Also, I'm explicitly opting to return an actual list instead of making this a (recursive) generator.)

However, this code is not very Pythonic because it is so very narrowly typed. We can relax it slightly by checking for isinstance(i, (int, float)), but even then most people would say that flatten() should definitely accept tuples in place of lists and probably even sets.

If we're thinking about being Pythonic and general, the obvious thing to do is check if the object is iterable. So we write some simple and general code:

def flatten2(inlst):
    olst = []
    for i in inlst:
        try:
            it = iter(i)
        except TypeError:
            it = None
        if it is None:
            olst.append(i)
        else:
            olst.extend(flatten2(i))
    return olst

This should flatten any type (or mixture of types) that contains elements, as denoted by the fact that it's iterable. It looks good and passes initial tests. Then some joker calls our code with flatten2(["abcdef",]) and suddenly we have a problem. Then another joker calls our code with flatten2([somedict,]) and files a bug that our code only flattens the keys of their dictionary, not the keys and values.

(As an exercise, can you predict in advance, without trying it, what our problem is with flatten2(["abcdef",]), and why it happens? I got this wrong when I was writing and testing this code in Python 3 and had to scratch my head for a bit before the penny dropped.)

The problem here is that 'is iterable' is not exactly what we want. Some things, such as strings, are iterable but should probably be treated as indivisible by flatten2(). Other things, such as dicts, are iterable but the default iteration result does not fully represent their contents. Really, not only is Python lacking a simple condition for what we want, it's arguably not clear just what we want to do if we're generalizing flatten() (and what making it 'Pythonic' really means).

One valid answer is that we will explicitly check for container types that are close enough to what we want, and otherwise mostly return things as-is. Here we would write a version of flatten() that looked like this:

def flatten3(inlst):
    olst = []
    for i in inlst:
        if isinstance(i, (list, tuple, set)):
            olst.extend(flatten3(i))
        elif isinstance(i, dict):
            raise ValueError("dict not valid in list")
        else:
            olst.append(i)
    return olst

We could treat dicts as single elements and just return them, but that is probably not what the caller intended. Still, this check feels dubious, which is a warning sign.

As a minimum, it would be nice to have a Python abstract type or trait that represented 'this is a container object and iterating it returns a full copy of its contents'; you could call this the property of being list-like. This would be true for lists, tuples, and sets, but false for dicts, which would give us a starting point. It would also be true for strings, but you can't win them all; when dealing with iterable things, we'll probably always have to special-case strings.

(I'd go so far as arguing that making strings iterable by default was a Python mistake. It's one of those neat features that winds up getting in the way in practice.)

I don't have an answer here, by the way. If I was in this situation I might either write and carefully document a version of flatten2() (specifying 'recursively flattens any iterable thing using its default iterator; this will probably not do what you want for dicts'), or go with some version of flatten3() that specifically restricted iteration to things that I felt were sufficiently list-like.

(I'd worry about missing some new popular type over time, though. Ten years ago I might not have put set in the list, and who knows what I'm missing today that's going to be popular in Python in the future. Queues? Trees? Efficient numerical arrays?)

Written on 26 February 2017.
« A single email message with quite a lot of different malware
In Python, strings are infinitely recursively iterable »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Feb 26 02:09:00 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.