Humanizing numbers in Python through a regexp substitution function

June 18, 2022

Recently I was looking at files that contained a bunch of sizes in bytes with very widely varying magnitudes, something like this:

file 10361909248
percpu 315360
inactive_file 8666644480
active_file 1695264768
slab_reclaimable 194324760
slab 194324760
[...]

(This is from Linux cgroup memory accounting.)

I find it hard to look at these numbers and have any feel for how big they are in absolute or relative terms, especially if I don't want to spend a lot of time thinking about it. It's much easier for me to read these numbers if they're humanized into things like '9.7G', '308.0K', and '185.3M'. To make these files more readable, I wrote a Python filter program to replace these raw byte counts with their humanized versions.

One reason I used Python for this filter is that it's my default choice for Unix text processing that requires more than sed or a light veneer of awk. Another reason is that I knew that Python's re module had a feature that made this filter very easy, which is that re.sub() can take a function as the replacement instead of a string.

Using a replacement function meant that I could write a simple function that took a match object that was guaranteed to be all decimal digits and turn it into a humanized number (in string form). Then the main look was just:

rxp = re.compile("\d+")
def process():
  for line in sys.stdin:
    l = line.strip()
    l = rxp.sub(humanize_number, l)
    print(l)

The regular expression substitution does all the work of splitting the line apart and reassembling it afterward. I only need to feed lines in and dump them out afterward.

(My regular expression here is a bit inefficient; I could make it skip all one, two, and three digit numbers, for example. That would also keep it from matching numbers in identifiers, eg if a file had a line like 'fred1 100000'. For my purposes I don't need to be more precise right now, but a production version might want to be more careful.)

Python's regular expression function substitution is a handy and powerful way to do certain sorts of very generalized text substitution in a low hassle manner. The one caution to it is that you probably don't want to use it in a performance sensitive situation, because it does require a Python function call and various other things for each substitution. The last time I looked, pure text substitutions ran much faster if you could use them. Here, not only is the situation not performance sensitive but there's no way out of running the Python code one way or another, because we can't do the work with just text substitution (at least not if we want powers of two humanized numbers, as I do).

Sidebar: The humanization function

I started out writing the obvious brute force if based version and then realized that I could get much simpler code by being a bit more clever. The end result is:

KB = 1024
MB = KB * 1024
GB = MB * 1024
TB = GB * 1024

seq = ((TB, 'T'), (GB, 'G'), (MB, 'M'), (KB, 'K'))

def humanize_number(mtch):
  n = int(mtch.group())
  for sz, ch in seq:
    if n >= sz:
      return '%.1f%s' % (n / sz, ch)
  return str(n)

The seq tuple needs to be ordered from the largest unit to the smallest, because we take the first unit that the input is equal to or larger than.

Written on 18 June 2022.
« What is our Python 2 endgame going to be?
What fast SSH bulk transfer speed (probably) looks like in mid-2022 »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Jun 18 21:42:00 2022
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.