Real (email) HTML can get a bit extreme

October 31, 2023

Over on the Fediverse, I noted a discovery I'd made recently:

It turns out that if you nest a couple hundred <div>s inside each other before you get to the actual content text, GNU Emacs shr (its simple HTML renderer) can't cope with the result and gives you no content.

Guess what some HTML-capable email clients do (possibly after the email is repeatedly mutated and resent).

I wonder if I dare report this as an Emacs shr bug. Reproduction is simple, at least.
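As a rough sketch of how simple the reproduction can be, the following Python generates a test document of this shape. The function name `nested_div_document`, the default text, and the depth of 300 used below are all picked for illustration; the post only says "a couple hundred".

```python
def nested_div_document(depth: int, text: str = "Hello, world.") -> str:
    """Return an HTML document whose only text sits inside `depth` nested <div>s.

    Hypothetical helper for building a reproduction case; the name and
    defaults are illustrative, not taken from any bug report.
    """
    opening = "<div>" * depth   # depth opening tags, one inside the next
    closing = "</div>" * depth  # the matching closing tags
    return f"<html><body>{opening}{text}{closing}</body></html>"
```

Writing the result of `nested_div_document(300)` to a file and opening that file in an shr-based viewer should show whether the text survives rendering. Lowering the depth until the text reappears gives an estimate of where the limit sits.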

It turns out that this is really a libxml2 safety feature, which leaves me with at least two lessons.

The first, obvious lesson is that HTML authoring environments can do weird and extreme things, especially when people repeatedly re-edit and modify something (which is what I believe happened in the case of this particular email). Not only did the authoring environment insert all of these hundreds of <div>s (perhaps bit by bit over time), but it didn't try to collapse them even though basically all of them were redundant. Since this was email there was no CSS involved to complicate the picture of what is and isn't redundant in your HTML structure, but then again the HTML editing component was probably inherited from a web context, where the number of nested divs might actually matter for CSS selectors.

The less obvious lesson is that HTML parsing environments can have their own limits on what sort of extreme HTML they'll accept, and it may not be obvious when you hit them. It especially may not be obvious if you're using the HTML parsing environment through some additional API layers, such as it being exposed through an Emacs Lisp function or a Python package. High quality HTML parsing and DOM building is enough work that people don't like reimplementing it themselves, especially in relatively slow languages. In theory, it's better for most people to rely on one of a few well proven, carefully developed, comprehensive libraries, even if you need an API bridge or two.
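One way to make such a limit visible is to measure nesting depth yourself before handing the HTML to the real parser. The sketch below uses the event-based `html.parser` module from the Python standard library, which only tokenizes and builds no tree. The names `DepthMeter`, `max_nesting_depth`, and `looks_too_deep` are hypothetical, and the default `limit` of 256 is an assumption for illustration rather than a figure from this entry; check what your actual parser enforces.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not add depth.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class DepthMeter(HTMLParser):
    """Track the deepest open-element nesting seen while tokenizing."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Clamp at zero so stray closing tags cannot push depth negative.
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def max_nesting_depth(html: str) -> int:
    """Return a rough estimate of the maximum element nesting depth."""
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


def looks_too_deep(html: str, limit: int = 256) -> bool:
    """Flag input nested deeper than `limit` (an assumed threshold)."""
    return max_nesting_depth(html) > limit
```

This is only an estimate. It does not apply HTML's implied-end-tag rules, so an unclosed `<p>` or `<li>` counts as still open and inflates the number. For a warning sign that the real parser may silently give up, erring on the high side is probably acceptable.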

I don't know what to do about the second lesson except bear it in mind if I'm ever parsing HTML in a context where it really matters and it won't be obvious if something has gone wrong. And the first lesson points me to using a well proven HTML parsing library, since such libraries are the most likely to cope with the weird HTML you can find out in the world.


Comments on this page:

It's too bad Mastodon itself, v3 and on, stopped supporting HTML at all. It's just another web application now and, lacking its own nitter, ends up worse than Twitter itself. There's basically no way to read a Mastodon thread as text in HTML. I tried appending .rss to the profile URL (https://mastodon.social/@cks.rss) but that did not contain the text either.

Written on 31 October 2023.

Last modified: Tue Oct 31 22:23:59 2023