Wandering Thoughts archives

2024-03-20

When I reimplement one of my programs, I often wind up polishing it too

Today I discovered a weird limitation of some IP address lookup stuff on the Linux machines I use (a limitation that's apparently not universal). In response, I rewrote the little Python program I had previously been using for looking up IP addresses as a Go program, because I was relatively confident I could get Go to work (although it turned out I couldn't just use net.LookupAddr() and had to do something slightly more complicated). I could have made the Go program a basically straight port of the Python one, but as I was writing it, I couldn't resist smoothing off some of the rough edges and adding missing features (some of which the Python program could have had, and some of which would have been awkward to add).
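
(To make the job concrete: the core of it is a reverse DNS lookup with the results validated. A rough shell sketch of that idea, not the actual Python or Go code, with a made-up IPv4 address and assuming dig is available, looks something like this:

    ip=203.0.113.5
    for name in $(dig +short -x "$ip"); do
        # keep a name only if its forward (A) lookup maps back to the
        # original IP; IPv4 only, to keep the sketch short
        if dig +short "$name" A | grep -qx "$ip"; then
            echo "$name"
        fi
    done
)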

This isn't even the first time this particular program has been polished as part of re-doing it; it was one of the Python programs I added things to when I moved them to Python 3 and the argparse package. That was a lesser thing than the Go port and the polishing changes were smaller, but they were still there.

This 'reimplementation leads to polishing' thing is something I've experienced before. It seems that more often than not, if I'm re-doing something I'm going to make it better (or at least what I consider better), unless I'm specifically aiming for an essentially exact duplicate, just in a faster environment (which has happened once). It doesn't have to be a reimplementation in a different language, although that certainly helps; I've re-done Python programs and shell scripts and had it lead to polishing.

One trigger for polishing is writing new documentation and code comments. In a pattern that's probably familiar to many programmers, when I find myself about to document some limitation or code issue, I'll frequently get the urge to fix it instead. Or I'll write the documentation about the imperfection, have it quietly nibble at me, and then go back and fix the code so I can delete that bit of the documentation after all. But some of what drives this polishing is the sheer momentum of having the code open in my editor and already being in the middle of changing or writing it.

Why doesn't this happen when I write the program the first time? I think part of it is that I understand the problem and what I want to do better the second time around. When I'm putting together the initial quick utility, I have no experience with it and I don't necessarily know what's missing and what's awkward; I'm sort of building a 'minimum viable product' to deal with my immediate need (such as turning IP addresses into host names, with validation of the results). When I come back to re-do or re-implement some or all of the program, I know both the problem and my needs better.

ReimplementationPolish written at 23:10:44;

2024-03-08

A realization about shell pipeline steps on multi-core machines

Over on the Fediverse, I had a realization:

This is my face when I realize that on a big multi-core machine, I want to do 'sed ... | sed ... | sed ...' instead of the nominally more efficient 'sed -e ... -e ... -e ...' because sed is single-threaded and if I have several costly patterns, multiple seds will parallelize them across those multiple cores.

Even when writing on-the-fly shell pipelines, I've tended to reflexively use 'sed -e ... -e ...' when I had multiple separate sed transformations to do, instead of putting each transformation in its own 'sed' command. Similarly, I sometimes try to cleverly merge multi-command things into one command, although usually I don't try too hard. In a world where you have enough cores (well, CPUs), this isn't necessarily the right thing to do. Most commands are single-threaded and will use only one CPU, but every command in a pipeline can run on a different CPU. So splitting up a single giant 'sed' into several may reduce a single-core bottleneck and speed things up.

(Giving one sed multiple expressions is inherently single-threaded, because sed specifically promises that they're processed in order, and sometimes this matters.)
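
As a concrete illustration with made-up patterns, the two forms look like this; in the second, each costly substitution can wind up on its own CPU:

    # one sed process: all three substitutions share one CPU
    sed -e 's/foo/bar/g' -e 's/[0-9]\{1,\}/N/g' -e 's/  */ /g' <big.log >out

    # three sed processes: each substitution can get its own CPU
    sed 's/foo/bar/g' <big.log | sed 's/[0-9]\{1,\}/N/g' | sed 's/  */ /g' >out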

Whether this actually matters may vary a lot. In my case, it only made a trivial difference in the end, partly because only one of my sed patterns was CPU-intensive (but that pattern alone made sed use all the CPU it could get and made it the bottleneck in the entire pipeline). In some cases adding more commands may add more in overhead than it saves from parallelism. There are no universal answers.

One of my lessons learned from this is that if I'm on a machine with plenty of cores and doing a one-time thing, it probably isn't worth my while to carefully optimize how many processes are being run as I evolve the pipeline. I might as well jam in more pipeline steps whenever and wherever they're convenient. If it's easy to move one step closer to the goal with one more pipeline step, do it. Even if it doesn't help, it probably won't hurt very much.

Another lesson learned is that I might want to look for single-threaded choke points if I've got a long-running shell pipeline. These are generally relatively easy to spot; just run 'top' and look for what's using up all of one CPU (on Linux, this shows as 100% CPU time). Sometimes the choke point will be as easy to split as 'sed' was, and other times I may need to be more creative (for example, if zcat is hitting CPU limits, maybe pigz can help a bit).

(If I have the fast disk space, possibly un-compressing the files in place, in parallel, will work. This comes up in system administration work more than you'd think, since we often want to search and process log files and they're often stored compressed.)
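
As a rough sketch of both ideas (with made-up file names and patterns, and assuming pigz and GNU xargs are available):

    # pigz -dc is a drop-in replacement for zcat and may ease the bottleneck a bit
    pigz -dc access_log.*.gz | sed 's/foo/bar/g' | grep something

    # or un-compress the files in place, four at a time
    printf '%s\0' access_log.*.gz | xargs -0 -n 1 -P 4 gunzip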

ShellPipelineStepsAndCPUs written at 22:27:42;

