Wandering Thoughts archives


Humanizing numbers in Python through a regexp substitution function

Recently I was looking at files that contained a bunch of sizes in bytes with very widely varying magnitudes, something like this:

file 10361909248
percpu 315360
inactive_file 8666644480
active_file 1695264768
slab_reclaimable 194324760
slab 194324760

(This is from Linux cgroup memory accounting.)

I find it hard to look at these numbers and have any feel for how big they are in absolute or relative terms, especially if I don't want to spend a lot of time thinking about it. It's much easier for me to read these numbers if they're humanized into things like '9.7G', '308.0K', and '185.3M'. To make these files more readable, I wrote a Python filter program to replace these raw byte counts with their humanized versions.

One reason I used Python for this filter is that it's my default choice for Unix text processing that requires more than sed or a light veneer of awk. Another reason is that I knew that Python's re module had a feature that made this filter very easy, which is that re.sub() can take a function as the replacement instead of a string.

Using a replacement function meant that I could write a simple function that took a match object that was guaranteed to be all decimal digits and turn it into a humanized number (in string form). Then the main look was just:

rxp = re.compile("\d+")
def process():
  for line in sys.stdin:
    l = line.strip()
    l = rxp.sub(humanize_number, l)

The regular expression substitution does all the work of splitting the line apart and reassembling it afterward. I only need to feed lines in and dump them out afterward.

(My regular expression here is a bit inefficient; I could make it skip all one, two, and three digit numbers, for example. That would also keep it from matching numbers in identifiers, eg if a file had a line like 'fred1 100000'. For my purposes I don't need to be more precise right now, but a production version might want to be more careful.)

Python's regular expression function substitution is a handy and powerful way to do certain sorts of very generalized text substitution in a low hassle manner. The one caution to it is that you probably don't want to use it in a performance sensitive situation, because it does require a Python function call and various other things for each substitution. The last time I looked, pure text substitutions ran much faster if you could use them. Here, not only is the situation not performance sensitive but there's no way out of running the Python code one way or another, because we can't do the work with just text substitution (at least not if we want powers of two humanized numbers, as I do).

Sidebar: The humanization function

I started out writing the obvious brute force if based version and then realized that I could get much simpler code by being a bit more clever. The end result is:

KB = 1024
MB = KB * 1024
GB = MB * 1024
TB = GB * 1024

seq = ((TB, 'T'), (GB, 'G'), (MB, 'M'), (KB, 'K'))

def humanize_number(mtch):
  n = int(mtch.group())
  for sz, ch in seq:
    if n >= sz:
      return '%.1f%s' % (n / sz, ch)
  return str(n)

The seq tuple needs to be ordered from the largest unit to the smallest, because we take the first unit that the input is equal to or larger than.

RegexpFunctionSubstitutionWin written at 21:42:00; Add Comment


What is our Python 2 endgame going to be?

Every so often I think about the issue of what our eventual Python 2 endgame is going to be at work. We're going to reach some sort of endgame situation sooner or later; for example, Ubuntu has already removed support for /usr/bin/python being Python 2, although you can still do it by hand. Someday they (and other people) may mandate that /usr/bin/python is Python 3, or remove Python 2 packages entirely, or both. What are we going to do when things reach that state?

There are two sides of this; what we're going to do about our own scripts that are still using Python 2, and what will happen with our users and their scripts. For our own scripts, they could could be rewritten to Python 3 or changed to use a different Python interpreter path in their #! line, including PyPy. Since we're in control of them and the timing of any use of an operating system without Python 2, we're at least not going to be blindsided. My tentative guess at our endgame for our own scripts is that we'd probably use PyPy, although we might opt to move them to Python 3 instead.

(There's very little chance that our remaining Python 2 scripts will all conveniently be obsolete by the time CPython 2 is disappearing from Ubuntu and other operating systems. Making them obsolete would probably take a completely rebuilt from scratch new infrastructure.)

For our users, there is both good news and bad news. The good news is that as a university department, we have a certain natural degree of turnover in user population; when someone graduates and leaves, they mostly stop caring about their Python 2 scripts they had here (or moves on to a different postdoc position, or any number of other things). The bad news is that we seem to have a reasonably significant current use of '/usr/bin/python' and we haven't even looked for people who are running '/usr/bin/python2' or some other alias. Some of that usage is probably automated (in cron jobs and the like), and some of it is probably from people who will be around for years to come. In addition, not all usage of Python 2 will be in regularly run scripts (that we can catch through mechanisms like Linux's auditing framework); some of it is probably in scripts that are only run once in a while.

Unless we get lucky and things are deferred for a significant amount of time, changing /usr/bin/python (to remove it or to be Python 3) or removing Python 2 seems likely to catch a number of our users out. We probably can't find all of them in advance, or get all of them to change things even if we do find them and notify them. Some number of them will probably have long-standing scripts blow up. To reduce problems here we should probably start moving now to discourage use of Python 2 (and identify people using it).

If it's possible, the least disruptive endgame would be to continue having /usr/bin/python and CPython 2 (in the usual places), even if we provide it ourselves. However, keeping the '/usr/bin/python' name working may hamper efforts to herd people away from Python 2; at some point in the endgame, we may want to remove it or let it become Python 3. While we can use PyPy 2 for our own scripts, it's not a drop in replacement for CPython and some programs definitely fail with PyPy when they'd work with CPython.

(Also, I'm not absolutely sure that PyPy will still have a Python 2 version in, say, ten years. Yes I am considering that far into the future.)

A more disruptive endgame would be Ubuntu insisting that /usr/bin/python be Python 3 and no longer supplying Python 2 at all. If we have relatively few people using an explicit '/usr/bin/python2', we might drop our official support for CPython 2 entirely. Hopefully Ubuntu would still supply a PyPy 2, so people would have some option other than migrating their scripts to Python 3.

A third endgame would be the 'excise the remnants' option. When Ubuntu drops Python 2 entirely, we would as well regardless of the remaining use; we wouldn't hand build CPython 2 ourselves or anything. We would handle our own scripts in some way, and other people would be left on their own, with at best us installing the Ubuntu version of PyPy 2 if one existed. This endgame is the most disruptive to people but in some way the most coherent and least work for us in the long run.

PS: Fedora forced /usr/bin/python to be Python 3 a while back, and honestly it's been a good thing overall for me. I had to change some scripts in a hurry, but after that it's nice that running 'python' gets me the version I want and so on. And it's a good way to push me to use Python 3 instead of Python 2.

ConsideringOurPython2Endgame written at 22:36:56; Add Comment

By day for June 2022: 17 18; before June; after June.

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.