== Using _pup_ to deal with Twitter's increasing demand for JavaScript

I [[tweeted https://twitter.com/thatcks/status/832826589947817984]]:

> .[[@erchiang https://twitter.com/erchiang]] 's pup tool just
> turned a gnarly HTML parsing hassle into a trivial shell one
> liner. Recommended. https://github.com/ericchiang/pup

I like _pup_ so much right now that I want to explain this and show you what _pup_ let me do easily.

I read Twitter through a moderately Rube Goldberg environment ([[to the extent that I read it at all these days https://twitter.com/thatcks/status/798211750688604160]]). [[Choqok http://choqok.gnufolks.org/]], my Linux client, doesn't currently support new Twitter features like long tweets and quoted tweets; the best it can do is give me a link to read the tweet on Twitter's website. Twitter itself increasingly demands that you have JavaScript on in order to make their site work, which I refuse to turn on for them. The latest irritation is a feature that Twitter calls 'cards'. Cards basically embed a preview of the contents of a link in the tweet; naturally they don't work without JavaScript, and naturally Twitter is turning an increasing number of completely ordinary links into cards, which means that I don't see them.

(This includes the GitHub link in [[my tweet about _pup_ https://twitter.com/thatcks/status/832826589947817984]]. Good work, Twitter.)

If you look at the raw HTML of [[a tweet https://twitter.com/thatcks/status/832826589947817984]], the actual link URL shows up in a number of places (well, the t.co shortened version of it, at least). In a surprise to me, one of them is in an actual link in the tweet text itself; unfortunately, that link is deliberately hidden with CSS and I don't currently have a viable CSS modification tool in my browser that could take that out.

If we want to extract this link out of the HTML, the easiest place is in a
<div> that has the link mentioned as a _data-card-url_ attribute:

.pn prewrap on

> <div [...]
>    data-card-url="https://t.co/LEqaB79Lbg"
>    [...]>

All we have to do is go through the HTML, find that attribute, and extract its value. There are many ways to do this, some better than others; you might use _curl_, _grep_, and _sed_, or you might write a program in [[the language of your choice https://golang.org/]] to fetch the URL and parse the HTML with your language's HTML parsing tools.

This is where Eric Chiang's [[_pup_ https://github.com/ericchiang/pup]] tool comes in. Pup is essentially [[jq https://stedolan.github.io/jq/]] for HTML, which means that it can be inadequately described as a structured, HTML-parsing version of _grep_ and _sed_ ([[see also https://twitter.com/simonw/status/832969595967328257]]). With _pup_, this problem turns into a shell one-liner:

> _[[wcat ../sysadmin/LittleScriptsV]] "$URL" |
>   pup 'div[data-card-url] attr{data-card-url}'_

The real script that uses this is somewhat more than one line, because it actually gets the URL from [[my current X selection ../unix/MyFirefoxRemoteControl]] and then invokes Firefox on it through [[remote control ../unix/WeirdFirefoxRemoteControl]].

I've had _pup_ sitting around for a while, but this is the first time I've used it. Now that I've experienced how easy _pup_ makes it to grab things out of HTML, I suspect it's not going to be the last time. In fact I have a hand-written HTML parsing program for a similar job that I could replace with a similar _pup_ one-liner.

(I'm not going to do so right now because the program works fine as it is. But the next time I have to change it, I'll probably just switch over to using _pup_. It's a lot less annoying to evolve and modify a shell script than it is to keep fiddling with and rebuilding a program.)

PS: via [[this response to my tweet https://twitter.com/kree10/status/833098975678967808]], I found out about [[jid https://github.com/simeji/jid]], which is basically an interactive version of [[jq]].
I suspect that this is going to be handy in the future.

PPS: That the URL is actually in a real link in the HTML does mean that I can turn off CSS entirely (via 'view page in no style', which I have as a gesture in [[FireGestures Firefox37Extensions]] because I use it frequently). This isn't all that great, though, because a de-CSS'd tweet page has a lot of additional cruft on it that you have to scroll through to get to the actual tweet text. But at least it's an option.

=== Sidebar: Why I don't have CSS mangling in my Firefox

The short version is that both GreaseMonkey and Stylish [[leak memory on me FirefoxAddonsMemoryLeaks]]. I would love to find an addon that doesn't leak memory and enables this kind of modification (here I'd like to strip a 'u-hidden' class from a link), but I haven't yet.
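
=== Sidebar: What a _grep_ and _sed_ version might look like

For comparison with the _pup_ one-liner, here is a minimal sketch of the _grep_ and _sed_ approach mentioned above. It is not taken from any real script: the _card_urls_ function name is made up for illustration, and it assumes a _grep_ that supports _-o_ (GNU and BSD _grep_ both do) and that the attribute value is always double-quoted and sits on one line.

```shell
# Hypothetical helper (not from the original script): read HTML on
# standard input and print the value of every data-card-url attribute.
# Assumes double-quoted values that do not span lines.
card_urls() {
    grep -o 'data-card-url="[^"]*"' |
        sed -e 's/^data-card-url="//' -e 's/"$//'
}

# Example use, with curl standing in for whatever fetches the page:
#   curl -s "$URL" | card_urls
```

Unlike the _pup_ selector, this matches the attribute on any element, and also in any plain text that happens to look like it, which is part of why a tool that actually parses the HTML is the more comfortable choice.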