Humanizing numbers in Python through a regexp substitution function
Recently I was looking at files that contained a bunch of sizes in bytes with very widely varying magnitudes, something like this:
    file 10361909248
    percpu 315360
    inactive_file 8666644480
    active_file 1695264768
    slab_reclaimable 194324760
    slab 194324760
    [...]
(This is from Linux cgroup memory accounting.)
I find it hard to look at these numbers and have any feel for how big they are in absolute or relative terms, especially if I don't want to spend a lot of time thinking about it. It's much easier for me to read these numbers if they're humanized into things like '9.7G', '308.0K', and '185.3M'. To make these files more readable, I wrote a Python filter program to replace these raw byte counts with their humanized versions.
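As a quick arithmetic check, here is a minimal sketch of where those humanized figures come from, assuming units that are powers of 1024 and one decimal place. The constant names are just illustrative.

```python
# Units as powers of two (1024-based).
KB = 1024
MB = KB * 1024
GB = MB * 1024

# Three of the raw byte counts from the sample, divided by a fitting unit.
print('%.1fG' % (10361909248 / GB))   # the 'file' value
print('%.1fK' % (315360 / KB))        # the 'percpu' value
print('%.1fM' % (194324760 / MB))     # the 'slab' value
```

This prints 9.7G, 308.0K, and 185.3M.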
One reason I used Python for this filter is that it's my default choice for Unix text processing that requires more than sed or a light veneer of awk. Another reason is that I knew that Python's re module had a feature that made this filter very easy, which is that re.sub() can take a function as the replacement instead of a string.
Using a replacement function meant that I could write a simple function that took a match object that was guaranteed to be all decimal digits and turned it into a humanized number (in string form). Then the main loop was just:
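    import re
    import sys

    # Raw string so that \d reaches the regexp engine unchanged.
    rxp = re.compile(r"\d+")

    def process():
        for line in sys.stdin:
            l = line.strip()
            # humanize_number is called once per match, with the match object.
            l = rxp.sub(humanize_number, l)
            print(l)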
The regular expression substitution does all the work of splitting the line apart and reassembling it afterward. I only need to feed lines in and dump them out afterward.
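To show the mechanism in isolation, here is a small self-contained sketch. The add_commas function is a made-up stand-in for the real humanizing function; it only inserts thousands separators, but re.sub() calls it the same way, once per match.

```python
import re

def add_commas(match):
    # match.group() is the run of digits that the pattern matched.
    return f"{int(match.group()):,}"

line = "file 10361909248 percpu 315360"
result = re.sub(r"\d+", add_commas, line)
print(result)
```

This prints 'file 10,361,909,248 percpu 315,360'. Everything outside the matches is carried through untouched.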
(My regular expression here is a bit inefficient; I could make it skip all one-, two-, and three-digit numbers, for example. That would also keep it from matching numbers in identifiers, e.g. if a file had a line like 'fred1 100000'. For my purposes I don't need to be more precise right now, but a production version might want to be more careful.)
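One possible shape for that more careful pattern, as a sketch rather than what my filter uses: require at least four digits and word boundaries on both sides. The mark function here is hypothetical and only brackets whatever gets matched so the difference is visible.

```python
import re

loose = re.compile(r"\d+")
# At least four digits, and not glued onto letters, digits, or underscores.
strict = re.compile(r"\b\d{4,}\b")

def mark(match):
    # Hypothetical replacement: wrap each matched number in angle brackets.
    return "<" + match.group() + ">"

line = "fred1 100000 42"
print(loose.sub(mark, line))
print(strict.sub(mark, line))
```

The loose pattern yields 'fred<1> <100000> <42>' and the strict one yields 'fred1 <100000> 42'. Note that \b treats an underscore as a word character, so digits directly after an underscore are also skipped.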
Python's regular expression function substitution is a handy and powerful way to do certain sorts of very generalized text substitution in a low-hassle manner. The one caution is that you probably don't want to use it in a performance-sensitive situation, because it requires a Python function call and various other things for each substitution. The last time I looked, pure text substitutions ran much faster if you could use them. Here, not only is the situation not performance sensitive, but there's no way out of running the Python code one way or another, because we can't do the work with just text substitution (at least not if we want power-of-two humanized numbers, as I do).
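If you want to measure the difference on your own machine, a rough sketch with timeit might look like this. The function names, sample line, and iteration count are arbitrary choices, and both replacements deliberately produce the same output so that only the call overhead differs.

```python
import re
import timeit

rxp = re.compile(r"\d+")
line = "file 10361909248 percpu 315360 inactive_file 8666644480"

def fixed(match):
    # Trivial replacement function; ignores the match and returns a constant.
    return "N"

def with_string():
    return rxp.sub("N", line)

def with_function():
    return rxp.sub(fixed, line)

if __name__ == "__main__":
    for fn in (with_string, with_function):
        elapsed = timeit.timeit(fn, number=100_000)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

The absolute numbers will depend on the Python version and the input, so treat the result as a relative comparison only.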
Sidebar: The humanization function
I started out writing the obvious brute force if-based version and then realized that I could get much simpler code by being a bit more clever. The end result is:
    KB = 1024
    MB = KB * 1024
    GB = MB * 1024
    TB = GB * 1024
    seq = ((TB, 'T'), (GB, 'G'), (MB, 'M'), (KB, 'K'))

    def humanize_number(mtch):
        n = int(mtch.group())
        for sz, ch in seq:
            if n >= sz:
                return '%.1f%s' % (n / sz, ch)
        return str(n)
The seq tuple needs to be ordered from the largest unit to the
smallest, because we take the first unit that the input is equal to or