Using pup to deal with Twitter's increasing demand for JavaScript

February 19, 2017

I tweeted:

.@erchiang 's pup tool just turned a gnarly HTML parsing hassle into a trivial shell one liner. Recommended. https://github.com/ericchiang/pup

I like pup so much right now that I want to explain why, and show you what pup let me do so easily.

I read Twitter through a moderately Rube Goldberg environment (to the extent that I read it at all these days). Choqok, my Linux client, doesn't currently support new Twitter features like long tweets and quoted tweets; the best it can do is give me a link to read the tweet on Twitter's website. Twitter itself is increasingly demanding that you have JavaScript on in order to make their site work, which I refuse to turn on for them. The latest irritation is a feature that Twitter calls 'cards'. Cards basically embed a preview of the contents of a link in the tweet; naturally they don't work without JavaScript, and naturally Twitter is turning an increasing number of completely ordinary links into cards, which means that I don't see them.

(This includes the Github link in my tweet about pup. Good work, Twitter.)

If you look at the raw HTML of a tweet, the actual link URL shows up in a number of places (well, the t.co shortened version of it, at least). To my surprise, one of them is in an actual <a> link in the tweet text itself; unfortunately, that link is deliberately hidden with CSS and I don't currently have a viable CSS modification tool in my browser that could take that out. If we want to extract this link out of the HTML, the easiest place is in a <div> that has the link mentioned as a data-card-url attribute:

<div class="js-macaw-cards-iframe-container initial-card-height card-type-summary"
[...]
data-card-url="https://t.co/LEqaB79Lbg"
[...]

All we have to do is go through the HTML, find that attribute, and extract its value. There are many ways to do this, some better than others; you might use curl, grep, and sed, or you might write a program in the language of your choice to fetch the URL and parse through the HTML with your language's HTML parsing tools.
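For illustration, here is roughly what the grep-and-sed route might look like. This is just a sketch: the HTML is a canned sample so the example is self-contained, and in real use you'd replace it with an actual fetch (e.g. 'curl -s "$URL"'):

```shell
# Sketch of the sed approach; the HTML here is a canned sample standing
# in for a real fetch of the tweet page.
html='<div class="js-macaw-cards-iframe-container initial-card-height card-type-summary"
data-card-url="https://t.co/LEqaB79Lbg"
>'
# Print only the value of the data-card-url attribute.
printf '%s\n' "$html" |
    sed -n 's/.*data-card-url="\([^"]*\)".*/\1/p'
# prints: https://t.co/LEqaB79Lbg
```

This works, but it's the fragile sort of text-level matching that breaks the moment the HTML is formatted slightly differently, which is exactly why a structure-aware tool is nicer.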

This is where Eric Chiang's pup tool comes in. Pup is essentially jq for HTML, which means that it can be inadequately described as a structured, HTML-parsing version of grep and sed (see also). With pup, this problem turns into a shell one-liner:

wcat "$URL" | pup 'div[data-card-url] attr{data-card-url}'

The real script that uses this is somewhat more than one line, because it actually gets the URL from my current X selection and then invokes Firefox on it through remote control.
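A fuller script along those lines might look something like the following sketch. The pup pipeline is the one from the post; the use of xsel for the X selection, the 'wcat' fetcher, and the plain firefox invocation are my assumptions about how the surrounding pieces could work, not the author's actual script:

```shell
#!/bin/sh
# Hypothetical sketch of the wrapper script. The pup pipeline is real;
# xsel, wcat, and the firefox call are assumed stand-ins for the
# selection-grabbing and remote-control parts.
open_card_url() {
    url=$(xsel -o)        # grab the current X primary selection
    card=$(wcat "$url" | pup 'div[data-card-url] attr{data-card-url}')
    # Fall back to the original URL if the page has no card <div>.
    firefox "${card:-$url}"
}
```

Bound to a key or a window-manager menu entry, something like open_card_url would open the card's real target directly instead of the JavaScript-only tweet page.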

I've had pup sitting around for a while, but this is the first time I've used it. Now that I've experienced how easy pup makes it to grab things out of HTML, I suspect it's not going to be the last time. In fact I have a hand-written HTML parsing program for a similar job that I could replace with a similar pup one-liner.

(I'm not going to do so right now because the program works fine now. But the next time I have to change it, I'll probably just switch over to using pup. It's a lot less annoying to evolve and modify a shell script than it is to keep fiddling with and rebuilding a program.)

PS: via this response to my tweet, I found out about jid, which is basically an interactive version of jq. I suspect that this is going to be handy in the future.

PPS: That the URL is actually in a real <a> link in the HTML does mean that I can turn off CSS entirely (via 'view page in no style', which I have as a gesture in FireGestures because I use it frequently). This isn't all that great, though, because a de-CSS'd tweet page has a lot of additional cruft on it that you have to scroll through to get to the actual tweet text. But at least it's an option.

Sidebar: Why I don't have CSS mangling in my Firefox

The short version is that both GreaseMonkey and Stylish leak memory on me. I would love to find an addon that doesn't leak memory and enables this kind of modification (here I'd like to strip a 'u-hidden' class from an <a href=...> link), but I haven't yet.

