Using pup to deal with Twitter's increasing demand for Javascript
I tweeted:
.@erchiang 's pup tool just turned a gnarly HTML parsing hassle into a trivial shell one liner. Recommended. https://github.com/ericchiang/pup
I like pup so much right now that I want to explain this and show you what pup let me do easily.
I read Twitter through a moderately Rube Goldberg environment (to the extent that I read it at all these days). Choqok, my Linux client, doesn't currently support new Twitter features like long tweets and quoted tweets; the best it can do is give me a link to read the tweet on Twitter's website. Twitter itself is increasingly demanding that you have JavaScript on in order to make their site work, which I refuse to turn on for them. The latest irritation is a feature that Twitter calls 'cards'. Cards basically embed a preview of the contents of a link in the tweet; naturally they don't work without JavaScript, and naturally Twitter is turning an increasing number of completely ordinary links into cards, which means that I don't see them.
(This includes the Github link in my tweet about pup. Good work, Twitter.)
If you look at the raw HTML of a tweet, the actual
link URL shows up in a number of places (well, the t.co shortened
version of it, at least). In a surprise to me, one of them is in an
actual <a> link in the Tweet text itself; unfortunately, that link
is deliberately hidden with CSS and I don't currently have a viable
CSS modification tool in my browser that could take that out. If we
want to extract this link out of the HTML, the easiest place is in
a <div> that has the link mentioned as a data-card-url
property:
<div class="js-macaw-cards-iframe-container initial-card-height card-type-summary" [...] data-card-url="https://t.co/LEqaB79Lbg" [...]
All we have to do is go through the HTML, find that property, and extract the property value. There are many ways to do this, some better than others; you might use curl, grep, and sed, or you might write a program in the language of your choice to fetch the URL and parse through the HTML with your language's HTML parsing tools.
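As a rough illustration of the grep and sed route, here is a minimal sketch. The function name extract_card_url is made up for this example, and it assumes that the attribute value is double-quoted and sits entirely on one line, which is exactly the kind of fragile assumption that makes this approach a hassle. It also relies on grep's -o option, which is widely available (GNU and BSD grep) but not required by POSIX.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the value of every
# data-card-url attribute, one per line.
# Assumes the value is double-quoted and contains no embedded quotes.
extract_card_url() {
    grep -o 'data-card-url="[^"]*"' |
        sed -e 's/^data-card-url="//' -e 's/"$//'
}

# Possible use, with curl doing the fetching ($URL is the tweet's URL):
#   curl -s "$URL" | extract_card_url
```

This works on the sample <div> above, but it knows nothing about HTML structure; it would just as happily match the same text inside a comment or a script.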
This is where Eric Chiang's pup tool comes in. Pup is essentially jq for HTML, which means that it can be inadequately described as a structured, HTML-parsing version of grep and sed (see also). With pup, this problem turns into a shell one-liner:
wcat "$URL" | pup 'div[data-card-url] attr{data-card-url}'
The real script that uses this is somewhat more than one line, because it actually gets the URL from my current X selection and then invokes Firefox on it through remote control.
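A sketch of what such a wrapper might look like follows. This is not the real script; the tool choices here are assumptions made for illustration: xclip to read the X primary selection, curl in place of wcat, and a plain 'firefox URL' invocation, which normally hands the URL to an already running Firefox. The function name open_card_link is likewise made up.

```shell
#!/bin/sh
# Sketch only: xclip, curl, and plain "firefox URL" are assumed stand-ins,
# not necessarily what the real script uses.
open_card_link() {
    # Take the tweet URL from the current X primary selection.
    tweet_url=$(xclip -o -selection primary) || return 1

    # Fetch the page and pull out the first card URL, using the same
    # pup selector as the one-liner.
    card_url=$(curl -s "$tweet_url" |
        pup 'div[data-card-url] attr{data-card-url}' | head -n 1)

    if [ -z "$card_url" ]; then
        echo "no card URL found in $tweet_url" >&2
        return 1
    fi

    # Hand the URL to Firefox; a running instance normally opens it itself.
    firefox "$card_url"
}

# To use: select a tweet URL with the mouse, then call open_card_link.
```

Taking only the first match with head is a deliberate simplification, on the assumption that a tweet page has one card of interest.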
I've had pup sitting around for a while, but this is the first time I've used it. Now that I've experienced how easy pup makes it to grab things out of HTML, I suspect it's not going to be the last time. In fact I have a hand-written HTML parsing program for a similar job that I could replace with a similar pup one-liner.
(I'm not going to do so right now because the program works fine now. But the next time I have to change it, I'll probably just switch over to using pup. It's a lot less annoying to evolve and modify a shell script than it is to keep fiddling with and rebuilding a program.)
PS: via this response to my tweet, I found out about jid, which is basically an interactive version of jq. I suspect that this is going to be handy in the future.
PPS: That the URL is actually in a real <a> link in the HTML does mean that I can turn off CSS entirely (via 'view page in no style', which I have as a gesture in FireGestures because I use it frequently). This isn't all that great, though, because a de-CSS'd Tweet page has a lot of additional cruft on it that you have to scroll through to get to the actual tweet text. But at least it's an option.
Sidebar: Why I don't have CSS mangling in my Firefox
The short version is that both GreaseMonkey and Stylish leak memory on me. I would love to find an addon that doesn't leak memory and enables this kind of modification (here I'd like to strip a 'u-hidden' class from an <a href=...> link), but I haven't yet.