Wandering Thoughts archives

2021-01-06

In modern email, it's easy for plaintext and HTML parts to drift apart

I recently read When The Text And Html Disagree (via, itself via), which is about an instance where an email message had an important disagreement between the plaintext part and the HTML part. In this case it was fortunately obvious that something was wrong, but I'm sure there have been less obvious instances.

I believe that one reason this drift happens comes down to that old aphorism that if you don't test it, it's broken. For email with alternate parts, the revised aphorism can be said as “if you don't see it, it's broken”. Modern email clients normally show you the HTML part to start with, and then most make a generally rational decision to make it at least hard to see the plaintext one. So when people look at test versions (or real versions) of such email messages, only the HTML part has to look good in order for the whole thing to seem fine. The unseen text part can quietly rot away, noticed only by unusual people like me who look at the plaintext version.

(You would think that mass email authoring environments would raise an alert if you only edit the HTML portion of a standing mixed-part email, but apparently not.)

I've seen this sort of thing for spam, but When The Text And Html Disagree makes a nice illustration that it's not just spam that suffers from the issue. In the end we probably shouldn't be too surprised about any of this, because keeping multiple things in synchronization is pretty much a hard problem all over. If you want it to work reliably you need to automate it, and automating this sort of update isn't easy.

(Keeping things in sync by hand is extra work, and sooner or later extra work doesn't get done or doesn't get done right. People forget, people make mistakes, people will get to it tomorrow because there's an urgent thing right now, and so on and so forth.)

PS: Given this, the most likely answer to the question in When The Text And Html Disagree is that if there's a disagreement and it's not clear, the HTML part is right and the plaintext one is wrong. It could be that you have a rare email where someone has updated the plaintext part but not the HTML part, but the odds are very good that it's the other way around. The exception to this is if you're in a very unusual environment where most people see the plaintext part instead of the HTML part.

spam/PlaintextAndHTMLDriftApart written at 22:34:37;

Link: ARM support in Linux distributions demystified

ARM support in Linux distributions demystified (via) is just what it says in the title. Since I only just recently learned about things like 'Aarch64' in the process of writing this entry, all of this was timely and useful. It definitely taught me things about ARM floating point and architectures that I didn't already know.

(The discussion on lobste.rs has some useful additional information about stuff.)

links/LinuxARMDemystified written at 11:08:37;

Unix shell pipelines have two usage patterns

I've seen a variety of recommendations for safer shell scripting that use Bash and set its 'pipefail' option (for example, this one from 2015). This is a good recommendation in one sense, but it exposes a conflict; this option works great for one usage pattern for pipes, and potentially terribly for another one.

To understand the problem, let's start with what Bash's pipefail does. To quote the Bash manual:

The exit status of a pipeline is the exit status of the last command in the pipeline, unless the pipefail option is enabled. If pipefail is enabled, the pipeline’s return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully. [...]

The reason to use pipefail is that if you don't, a command failing unexpectedly in the middle of a pipeline won't normally be detected by you, and won't abort your script if you used 'set -e'. You can go out of your way to carefully check everything with $PIPESTATUS, but that's a lot of extra work.

Unfortunately, this is where our old friend SIGPIPE comes into the picture. What SIGPIPE does in pipelines is force processes to exit if they write to a closed pipe. This happens if a later process in a pipeline doesn't consume all of its input, for example if you only want to process the first thousand lines of output of something:

generate --thing | sed 1000q | gronkulate

The sed exits after a thousand lines and closes the pipe that generate is writing to, generate gets SIGPIPE and by default dies, and suddenly its exit status is non-zero, which means that with pipefail the entire pipeline 'fails' (and with 'set -e', your script will normally exit).

(Under some circumstances, what happens can vary from run to run due to process scheduling. It can also depend on how much output early processes are producing compared to what later processes are filtering; if generate produces 1000 lines or less, sed will consume all of them.)

This leads to two shell pipeline usage patterns. In one usage pattern, all processes in the pipeline consume their entire input unless something goes wrong. Since all processes do this, no process should ever be writing to a closed pipe and SIGPIPE will never happen. In another usage pattern, at least one process will stop processing its input early; often such processes are in the pipeline specifically to stop at some point (as sed is in my example above). These pipelines will sometimes or always generate SIGPIPEs and have some processes exiting with non-zero statuses.

Of course, you can deal with this in an environment where you're using pipefail, even with 'set -e'. For instance, you can force one pipeline step to always exit successfully:

(generate --thing || true) | sed 1000q | gronkulate

However, you have to remember this issue and keep track of what commands can exit early, without reading all of their input. If you miss some, your reward is probably errors from your script. If you're lucky, they'll be regular errors; if you're unlucky, they'll be sporadic errors that happen when one command produces an unusually large amount of output or another command does its work unusually soon or fast.

(Also, it would be nice to only ignore SIGPIPE based failures, not other failures. If generate fails for other reasons, we'd like the whole pipeline to be seen as having failed.)

My informal sense is that the 'consume everything' pipeline pattern is far more common than the 'early exit' pipeline pattern, although I haven't attempted to inventory my scripts. It's certainly the natural pattern when you're filtering, transforming, and examining all of something (for example, to count or summarize it).

unix/ShellPipesTwoUsages written at 00:30:46;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.