Wandering Thoughts archives

2022-07-30

Python is my default choice for scripts that process text

Every so often I wind up writing something that needs to do more than can readily be handled in the Bourne shell, awk, or other basic Unix scripting tools. When this happens, the language I most often turn to is Python, and Python is especially my default choice when the work involves processing text in some way (or, often, generating text). For example, if I want to analyze the output of some command and generate Prometheus metrics from it, Python is often my choice. These days this means Python 3, even with its warts in handling non-Unicode input (which usually don't come up in this context).
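
(As a minimal sketch of this sort of program, here is something that turns command output into metrics in the Prometheus text format. The sample output, its line format, the metric names, and the to_metrics() function are all made up for illustration; a real script would get its input from running an actual command.)

```python
import re

# Hypothetical command output; a real script would get this from
# subprocess.run([...], capture_output=True, text=True).stdout
SAMPLE_OUTPUT = """\
pool0  ONLINE  12 errors
pool1  DEGRADED  3 errors
"""

# Assumed line format: <name> <state> <count> errors
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+) errors$")


def to_metrics(text):
    """Turn each recognized line into two Prometheus text-format samples."""
    out = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # quietly skip lines we don't recognize
        name, state, errors = m.groups()
        out.append('demo_pool_errors{pool="%s"} %d' % (name, int(errors)))
        # A bool formats as 0 or 1 with %d, which is what we want here.
        out.append('demo_pool_online{pool="%s"} %d' % (name, state == "ONLINE"))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(to_metrics(SAMPLE_OUTPUT), end="")
```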

(What a lot of these programs do could be summarized as string processing with logic.)

In theory there's no obvious reason that my language of choice couldn't be, say, Go. But in practice, Python has much less friction than something like Go while still having enough structure and capabilities to be better than a much more limited tool like awk. One part of this is Python's casualness about typing, especially typing in dicts. In Python, you can shove anything you want into a dict and it's completely routine to have dicts with heterogeneous values (usually your keys are homogeneous, e.g. all strings). This might be madness in a large program, but for small, quickly written things it's a great speedup.

(Some of the need for this can be lessened with dataclasses or attrs. And Python lets you scale up from basic dicts to those, or to basic classes used as little more than records, as you decide they make your code simpler.)
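
(A rough sketch of both ends of this, assuming a made-up input line of the form "name port tag tag ...". The field names, the Host class, and both parse functions are hypothetical.)

```python
from dataclasses import dataclass, field


# Quick version: a plain dict with string keys and values of whatever
# type happens to be convenient (str, int, list, bool).
def parse_host(line):
    name, port, *tags = line.split()
    return {"name": name, "port": int(port), "tags": tags, "up": True}


# Scaled-up version: the same fields, but named and declared in one place.
@dataclass
class Host:
    name: str
    port: int
    tags: list = field(default_factory=list)
    up: bool = True


def parse_host_dc(line):
    name, port, *tags = line.split()
    return Host(name, int(port), tags)


if __name__ == "__main__":
    print(parse_host("web1 443 prod east"))
    print(parse_host_dc("web1 443 prod east"))
```

The dict version is what tends to get written first; the dataclass version costs a few more lines but gives attribute access and a readable repr for free.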

Another area where Python reduces friction is in the lack of explicit error handling while still not hiding errors; exceptions ensure that while you may not deal with errors well, you will deal with them one way or another. Again this isn't necessarily what you want in a bigger, more structured program, but in the small it's quite handy to not have to ornament every 'int(...)' or whatever with some sort of error check.
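
(To illustrate, here is a small sketch with no error checking at all. The total_bytes() function and its "name count" input format are invented for the example. Bad input doesn't get silently ignored; it raises an exception that either gets caught somewhere or stops the program with a traceback.)

```python
def total_bytes(lines):
    """Sum the second whitespace-separated field of each line."""
    total = 0
    for line in lines:
        fields = line.split()
        # No explicit checks: a non-numeric field raises ValueError and
        # a line with too few fields raises IndexError.
        total += int(fields[1])
    return total


if __name__ == "__main__":
    print(total_bytes(["a 10", "b 32"]))
    try:
        total_bytes(["a 10", "b thirty-two"])
    except ValueError as e:
        print("bad input:", e)
```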

In general, Python is (surprisingly) good at pulling strings apart, shuffling them around, and putting them back together, while still staying structured enough to let me follow what the code does even when I come back to it later. Compact, low-ceremony inline string formatting is often quite useful (I use '%' because I'm old-fashioned).
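
(A small sketch of the pull-apart, shuffle, and reassemble pattern with '%' formatting. The "user@host:/path" input format and the reformat() function are hypothetical.)

```python
def reformat(line):
    """Turn 'user@host:/path' into left-aligned 'host user path' columns."""
    # str.partition() always returns three parts, so a missing separator
    # gives empty strings instead of an unpacking error.
    userhost, _, path = line.partition(":")
    user, _, host = userhost.partition("@")
    # %-10s and %-8s left-justify in fixed-width columns.
    return "%-10s %-8s %s" % (host, user, path)


if __name__ == "__main__":
    for entry in ["alice@apps0:/var/log", "bob@fs3:/h/281"]:
        print(reformat(entry))
```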

Python certainly isn't the only language that can be used in this way; Perl and Ruby are two other obvious examples, and more modern people would probably reach for JavaScript. But Python is the one that I've wound up latching on to and sticking with.

I do find it a bit amusing and ironic that despite all of the issues in Python 3 with Unicode and IO (and my gripes surrounding that), it's what I normally use for processing text. In theory, I risk explosions; in practice, it works because I'm in a UTF-8 capable environment with well-formed input (often just plain ASCII, which is the most common case for log files and command output).
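
(For the curious, a sketch of what the theoretical explosion looks like, using a made-up decode_line() helper. Plain ASCII and valid UTF-8 decode without trouble; a stray Latin-1 byte raises UnicodeDecodeError unless you ask for a more forgiving error handler such as 'replace'.)

```python
def decode_line(raw, strict=True):
    """Decode bytes as UTF-8, either strictly or with replacement."""
    errors = "strict" if strict else "replace"
    return raw.decode("utf-8", errors=errors)


if __name__ == "__main__":
    print(decode_line(b"plain ascii log line"))
    bad = b"caf\xe9"  # Latin-1 encoded, not valid UTF-8
    try:
        decode_line(bad)
    except UnicodeDecodeError as e:
        print("explosion:", e)
    # The invalid byte becomes U+FFFD instead of raising.
    print(decode_line(bad, strict=False))
```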

PythonForStringHandling written at 22:09:30;

