How recursively flattening a list raises a Python type question

February 26, 2017

Today I wound up reading Why it's hard for programmers to write a program to flatten a list? (via), where the quiz challenge put forward is to turn an input like [1,[2,3], [4, [5,6]]] into [1,2,3,4,5,6]. My immediate reaction was that I'd do this in Python rather than in any statically typed language I know, because all of them make the input type here hard to represent. But then I realized that doing this in Python raises another type-related question.

If we stick exactly to the specification (and directly implement it), the result is pretty simple and straightforward:

def flatten(inlst):
    olst = []
    for i in inlst:
        if isinstance(i, int):
            olst.append(i)
        elif isinstance(i, list):
            olst.extend(flatten(i))
        else:
            raise ValueError("invalid element in list")
    return olst

(You can optimize this by having a _flatten internal function that gets passed the output list, so you don't have to keep building lists and then merging them into other lists as you work down and then back up the recursion stack. Also, I'm explicitly opting to return an actual list instead of making this a (recursive) generator.)

However, this code is not very Pythonic because it is so very narrowly typed. We can relax it slightly by checking for isinstance(i, (int, float)), but even then most people would say that flatten() should definitely accept tuples in place of lists and probably even sets.

If we're thinking about being Pythonic and general, the obvious thing to do is check if the object is iterable. So we write some simple and general code:

def flatten2(inlst):
    olst = []
    for i in inlst:
        try:
            it = iter(i)
        except TypeError:
            it = None
        if it is None:
            olst.append(i)
        else:
            olst.extend(flatten2(i))
    return olst

This should flatten any type (or mixture of types) that contains elements, as denoted by the fact that it's iterable. It looks good and passes initial tests. Then some joker calls our code with flatten2(["abcdef",]) and suddenly we have a problem. Then another joker calls our code with flatten2([somedict,]) and files a bug that our code only flattens the keys of their dictionary, not the keys and values.

(As an exercise, can you predict in advance, without trying it, what our problem is with flatten2(["abcdef",]), and why it happens? I got this wrong when I was writing and testing this code in Python 3 and had to scratch my head for a bit before the penny dropped.)

The problem here is that 'is iterable' is not exactly what we want. Some things, such as strings, are iterable but should probably be treated as indivisible by flatten2(). Other things, such as dicts, are iterable but the default iteration result does not fully represent their contents. Really, not only is Python lacking a simple condition for what we want, it's arguably not clear just what we want to do if we're generalizing flatten() (and what making it 'Pythonic' really means).

One valid answer is that we will explicitly check for container types that are close enough to what we want, and otherwise mostly return things as-is. Here we would write a version of flatten() that looked like this:

def flatten3(inlst):
    olst = []
    for i in inlst:
        if isinstance(i, (list, tuple, set)):
            olst.extend(flatten3(i))
        elif isinstance(i, dict):
            raise ValueError("dict not valid in list")
        else:
            olst.append(i)
    return olst

We could treat dicts as single elements and just return them, but that is probably not what the caller intended. Still, this check feels dubious, which is a warning sign.

As a minimum, it would be nice to have a Python abstract type or trait that represented 'this is a container object and iterating it returns a full copy of its contents'; you could call this the property of being list-like. This would be true for lists, tuples, and sets, but false for dicts, which would give us a starting point. It would also be true for strings, but you can't win them all; when dealing with iterable things, we'll probably always have to special-case strings.

(I'd go so far as arguing that making strings iterable by default was a Python mistake. It's one of those neat features that winds up getting in the way in practice.)

I don't have an answer here, by the way. If I was in this situation I might either write and carefully document a version of flatten2() (specifying 'recursively flattens any iterable thing using its default iterator; this will probably not do what you want for dicts'), or go with some version of flatten3() that specifically restricted iteration to things that I felt were sufficiently list-like.

(I'd worry about missing some new popular type over time, though. Ten years ago I might not have put set in the list, and who knows what I'm missing today that's going to be popular in Python in the future. Queues? Trees? Efficient numerical arrays?)


Comments on this page:

By yggdrasil at 2017-02-26 03:59:09:

collections.Sequence?

>>> isinstance({}, collections.Sequence)
False

I was rather surprised not seeing the collections module mentioned at all in the post.

By cks at 2017-02-26 15:19:07:

Sets aren't collections.Sequences (because they're not reversible) and I think we should flatten sets. I took a look at the stuff in collections.abc but couldn't find anything that seemed quite on, so I omitted a discussion of it (which may have been a mistake).

The other issue I see with using collections.abc classes for isinstance() checks is that I suspect a lot of user-written classes do not have all of the methods necessary to be considered a Collection or Sequence (or a Mapping, so one can detect and exclude that). This would give false negatives for classes that you do want to be flattenable.

If we ignore that, checking for things that are Collections but not Mappings is probably the closest we can come. We know mappings have key/value pairs, so we know it's at least not clear whether iterating them will give the full contents.

By Ewen McNeill at 2017-02-26 15:36:13:

Having run into this myself a few times recently (for basically the same reason: trying to normalise externally loaded data which may be a bare string or a list), I tend to agree that strings probably should not have been automatically iterable in the way that other sequences are. It's definitely useful to be able to iterate over the characters in a string -- but it feels like most of the time I end up doing it by accident and having to debug it, then code around it.

Maybe iteration over the elements of a string should require a str.items() or str.elem() or str.elements() which returns a very thin proxy to be the iterator?

Ewen

PS: I too ended up with isinstance(foo, str) and/or isinstance(foo, list) as a quick detection work around, after also quickly looking through collections to find the right supertype that matched list, etc, but not str. Fortunately due to the way the data was coming in I could be fairly sure it was either list or str.

By Ewen McNeill at 2017-02-26 21:08:55:

FTR, this question also gets asked on Stack Exchange fairly often, of which this question and this question seem to be some of the better answers. Among other things they suggest using basestring to check for string-like types... but as answer to the second question point out, that's non-portable over the Python 2/Python 3 chasm :-( (It was at about this "non-portable over Python 2/Python 3 stage that I gave up and hard coded the type checks, because I was trying to target Python 2.6 - 3.5 or at least Python 2.7 - 3.5, and could make reasonable assumptions about the few types I could expect.)

At a quick glance it appears few, if any of them, are prepared to handle a StringsRecursivelyIterable sidebar (self-containing container) case...

Ewen

From 172.93.51.212 at 2017-03-11 08:18:24:
   def flatten(*iterable):                                                                                                             
       """hopefully equivalent to python2's compiler.ast.flatten"""                                                                    
       # thanks http://stackoverflow.com/a/16176779                                                                                    
       import collections                                                                                                              
       from itertools import chain                                                                                                     
       for el in chain(*iterable):                                                                                                     
           if isinstance(el, collections.Iterable) and not isinstance(el, str):                                                        
               for subel in flatten(el):                                                                                               
                   yield subel                                                                                                         
           else:                                                                                                                       
               yield el
Written on 26 February 2017.
« A single email message with quite a lot of different malware
In Python, strings are infinitely recursively iterable »

Page tools: View Source, View Normal, Add Comment.
Search:
Atom Syndication: Recent Comments.

Last modified: Sun Feb 26 02:09:00 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.