In shell programming, I should be more willing to write custom tools

February 23, 2015

One of the very strong temptations in Unix shell programming is to use and abuse existing programs in order to get things done, rather than going to the hassle of writing your own custom tool to do just what you want. I don't want to say that this is wrong, exactly, but it does have its limits; in a variant of the general shell programming Turing tar pit, you can spend a lot of time banging your head against those limits or you can just write something that is specific to your problem and so does what you want. I have a bias against writing my own custom tools, for various reasons, but this bias is probably too strong.

All of that sounds really abstract, so let me get concrete about the case that sparked this thought. I have a shell script that decides what to do with URLs that I click on in my Twitter client, which is not as simple as 'hand them to my browser' for various reasons. As part of this script I want to reach through the HTTP redirections imposed by the various levels of URL shorteners that people use on Twitter.

If you want to get HTTP redirections on a generic Unix system with existing tools, the best way I know of to do this is to abuse curl along with some other things:

curl -siI "$URL" | grep -i '^location:' | awk '{print $2}' | tr -d '\r'

Put plainly, this is a hack. We aren't actually getting the redirection as such; we're getting curl to make a HEAD request whose reply should contain only headers, dumping those headers, and then trying to pick out the HTTP redirection header. We aren't verifying that we actually got an HTTP redirect status code, and I think that the server could do wacky things with the Location: header as well; we certainly aren't verifying that the server only gave us headers. Bits of this incantation evolved over time as I ran into its limitations; both the case-independent grep and the entire tr were later additions to cope with unusual servers. The final nail here is that curl on Fedora 21 has problems talking to CloudFlare HTTPS sites, and that affects some specialized URL shorteners I want to strip redirections from.

(You might think that servers will never include content bodies with HEAD replies, but from personal experience I can say that a very similar mistake is quite easy to make in a custom framework.)

The right solution here is to stop torturing curl and to get or write a specialized tool to do the job I want. This tool would specifically check that we got an HTTP redirection and then output only the target URL from the redirect. Any language with a modern HTTP framework should make this easy and fast to write; I'd probably use Go just because.
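
As a rough sketch of what I mean (the name 'redirtarget' and the details are invented here for illustration, not an actual tool I've written), the whole thing is only a few dozen lines of Go:

   package main

   // redirtarget: print the target of a URL's HTTP redirect, or fail
   // if the URL doesn't actually redirect. A minimal illustrative
   // sketch; a real version would want a timeout and so on.

   import (
           "fmt"
           "net/http"
           "os"
   )

   func main() {
           if len(os.Args) != 2 {
                   fmt.Fprintln(os.Stderr, "usage: redirtarget URL")
                   os.Exit(2)
           }
           req, err := http.NewRequest("HEAD", os.Args[1], nil)
           if err != nil {
                   fmt.Fprintln(os.Stderr, "redirtarget:", err)
                   os.Exit(1)
           }
           // Using the transport directly means redirects are not
           // followed, so we see the first response as-is.
           resp, err := http.DefaultTransport.RoundTrip(req)
           if err != nil {
                   fmt.Fprintln(os.Stderr, "redirtarget:", err)
                   os.Exit(1)
           }
           defer resp.Body.Close()
           // Insist on an actual HTTP redirect status and a Location header.
           if resp.StatusCode < 300 || resp.StatusCode > 399 {
                   fmt.Fprintf(os.Stderr, "redirtarget: no redirect (HTTP %d)\n", resp.StatusCode)
                   os.Exit(1)
           }
           loc := resp.Header.Get("Location")
           if loc == "" {
                   fmt.Fprintln(os.Stderr, "redirtarget: redirect without a Location header")
                   os.Exit(1)
           }
           fmt.Println(loc)
   }

From the script's point of view, the entire four-command pipeline then collapses to a single 'redirtarget "$URL"' invocation that either prints the target URL or fails.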

(In theory I could also use a more focused 'make HTTP request and extract specific header <X>' tool to do this job. I don't know if any exist.)

Why didn't I write a custom tool when I started, or at least when I started running into issues with curl? Because each time it seemed like less work to use existing tools and hack things up a bit instead of going all the way to writing my own. That's one of the temptations of a Turing tar pit; every step into the tar can feel small and oh so reasonable, and only at the end do you realize that you're well and truly mired.

(Yes, there are drawbacks to writing custom tools instead of bending standard ones to your needs. That's something for another entry, though.)

PS: writing custom tools that do exactly what you want and what your script needs also has the side effect of making your scripts clearer, because there is less code that's just there to wrap and manipulate the core standard tool. Three of the four commands in that 'get a redirection' pipeline are just there to fix up curl's output, after all.


Comments on this page:

By Albert at 2015-02-23 03:58:45:

Did you consider curl's -L option to have it follow redirects? With curl -IL you should be able to get to the final destination of the redirection chain, and then using -w you can extract headers, the status code, etc.

I think what you want is

   % curl -ILs -o /dev/null -w '%{url_effective}' http://dlvr.it/8dncqZ
   http://www.reddit.com/r/emacs/comments/2wh9o0/can_emacs_replace_evernote/?utm_source=dlvr.it&utm_medium=twitter

(BTW, which Twitter client do you use?)

   curl -siI "$URL"

If you do decide to do more curl torturing in the future, there's a slightly better way to ask it for headers. The problem is that -I sends a HEAD request, which servers may treat differently from a GET request (and, as you noticed, may send more than just headers). Instead we can just ask curl for headers:

   curl -s -D- -o/dev/null

I have this aliased to "headers" on my machine.

(I wish curl had a flag for "don't print interactive status lines, but do print connection errors". '-s' does too much.)

By cks at 2015-02-23 08:11:53:

In my situation I specifically don't want the system to automatically follow a chain of redirects until it stops, as I do special processing on a number of intermediate URLs for my own reasons.

The Twitter client I use is choqok, which has a number of features I find important for how I use Twitter (cf).

Ok, then

   curl -Is -o /dev/null -w '%{redirect_url}' http://dlvr.it/8dncqZ

(Also note --max-redirs in general.)
