A realization about shell pipeline steps on multi-core machines

March 8, 2024

Over on the Fediverse, I had a realization:

This is my face when I realize that on a big multi-core machine, I want to do 'sed ... | sed ... | sed ...' instead of the nominally more efficient 'sed -e ... -e ... -e ...' because sed is single-threaded and if I have several costly patterns, multiple seds will parallelize them across those multiple cores.

Even when writing on-the-fly shell pipelines, I've tended to reflexively use 'sed -e ... -e ...' when I had multiple separate sed transformations to do, instead of putting each transformation in its own 'sed' command. Similarly, I sometimes try to cleverly merge multi-command things into one command, although usually I don't try too hard. In a world where you have enough cores (well, CPUs), this isn't necessarily the right thing to do. Most commands are single-threaded and will use only one CPU, but every command in a pipeline can run on a different CPU. So splitting up a single giant 'sed' into several may reduce a single-core bottleneck and speed things up.
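
As a minimal sketch (the file name and substitution patterns here are invented for illustration), the difference looks like this:

    # One sed with multiple expressions: everything runs in a single
    # process, and so on a single CPU.
    sed -e 's/old/new/g' -e 's/\(err\)\+/ERR/g' <big.log >out.log

    # One sed per transformation: each pipeline stage is its own
    # process and can run on its own CPU.
    sed 's/old/new/g' <big.log | sed 's/\(err\)\+/ERR/g' >out.log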

(Giving sed multiple expressions is an especially single-threaded approach, because sed specifically promises that the expressions are applied in order to each line, and sometimes this ordering matters.)
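
For example (the patterns are made up), these two orderings give different results on the same input:

    # 'a' becomes 'b', then that 'b' becomes 'c': prints 'c'.
    echo a | sed -e 's/a/b/' -e 's/b/c/'
    # 's/b/c/' matches nothing yet, then 'a' becomes 'b': prints 'b'.
    echo a | sed -e 's/b/c/' -e 's/a/b/'

For simple per-line substitutions like these, a pipeline of separate seds applies the transformations in the same order, so splitting the sed up doesn't change the result.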

Whether this actually matters may vary a lot. In my case it only made a trivial difference in the end, partly because only one of my sed patterns was CPU-intensive (although that pattern alone made sed use all the CPU it could get and made it the bottleneck in the entire pipeline). In some cases, adding more commands may cost more in overhead than it saves through parallelism. There are no universal answers.

One of my lessons learned from this is that if I'm on a machine with plenty of cores and doing a one-time thing, it probably isn't worth my while to carefully optimize how many processes I'm running as I evolve the pipeline. I might as well jam in more pipeline steps whenever and wherever they're convenient. If it's easy to move one step closer to the goal with one more pipeline step, do it. Even if it doesn't help, it probably won't hurt very much.

Another lesson learned is that I might want to look for single-threaded choke points if I've got a long-running shell pipeline. These are generally relatively easy to spot: just run 'top' and look for whatever is using up all of one CPU (on Linux, this shows as 100% CPU time). Sometimes the fix will be as easy as splitting things up, the way it was with 'sed', and other times I may need to be more creative (for example, if zcat is hitting CPU limits, maybe pigz can help a bit).
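
As an illustration (the file name and search pattern are invented, and this assumes pigz is installed):

    # zcat decompresses in a single thread:
    zcat big.log.gz | grep 'ERROR' | wc -l

    # pigz's decompression itself is still single-threaded, but it uses
    # separate threads for reading, writing, and check calculation,
    # which may help a bit:
    pigz -dc big.log.gz | grep 'ERROR' | wc -l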

(If I have fast disk space to spare, possibly un-compressing the files in place in parallel will work. This comes up in system administration work more than you'd think, since we often want to search and process log files and they're usually stored compressed.)
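
A sketch of that, assuming GNU findutils and GNU xargs for '-print0', '-0', and '-P' (the file name pattern here is invented):

    # Decompress up to four compressed logs at once; gunzip replaces
    # each file.log.gz with the uncompressed file.log in place.
    find . -maxdepth 1 -name '*.log.gz' -print0 | xargs -0 -P 4 -n 1 gunzip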
