The fun and charm of quoting URLs properly

April 10, 2006

The fun and charm of URL quoting is that you need to do it twice. Differently. That's because there are two different parties involved: browsers and web servers.

Strictly speaking, about the only thing that you have to quote for the browser is quote characters, because otherwise your nice <a href="..."> comes out very confusing. If you are being a good web standards monkey you need to quote at least ampersands (&'s) as well, because otherwise the browser may take them as entity references. The HTML 4.01 spec in section 5.3.2 recommends also quoting '>', just in case.

(In practice, no browser pays any attention to anything except a truly valid entity reference, because practically everyone except the obsessively standards compliant has unescaped &'s flying around.)
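As a rough illustration of the browser-side quoting, here is a minimal Python sketch that uses the standard library's html.escape(). The helper name href_attr and the sample URL are made up for the example; this is not how any particular application does it.

```python
import html

def href_attr(url):
    """Return url escaped for use inside a double-quoted HTML attribute."""
    # quote=True escapes " and ' in addition to &, < and >.
    return html.escape(url, quote=True)

link = '<a href="%s">search</a>' % href_attr('/search?q=a&lang=en')
print(link)  # <a href="/search?q=a&amp;lang=en">search</a>
```

Note that html.escape() goes a bit beyond the bare minimum and also escapes '<' and '>', which is in line with the spec's just-in-case advice.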

Web servers are startlingly liberal, so the only things you really have to quote are space characters (as either %20 or '+', depending on context) and the percent character itself. RFC 2396 has a couple of additional lists of characters that should also be quoted (in sections 2.4.3 and 2.2), such as quotes, and some web servers are picky about them.

(And if you are unlucky enough to deal with a joker who embedded URL component separator characters like '?' or '&' into his paths, you'll have to quote them too.)
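For the web server side, a small sketch with Python's urllib.parse shows the cases above. The sample paths are invented, and quote() escapes more than the strict minimum discussed here, so treat this as one reasonable way to do it rather than the only one.

```python
from urllib.parse import quote, quote_plus

# Path context: a space becomes %20 and a literal percent sign becomes %25.
path = quote('/docs/100% done.txt')

# Query-string context: quote_plus() writes a space as '+' instead.
query_value = quote_plus('two words')

# A quote character, which RFC 2396 lists among the excluded characters.
dquote = quote('"')

# Separator characters embedded in a path have to be escaped so that
# they are not read as the start of a query string or a new parameter.
awkward = quote('/a&b/what?.html')

print(path)         # /docs/100%25%20done.txt
print(query_value)  # two+words
print(dquote)       # %22
print(awkward)      # /a%26b/what%3F.html
```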

You quote things for the browser with entity encoding, so & turns into &amp;. You quote things for the web server with percent-encoded hex character values, so a quote turns into %22, which the browser ignores too. In theory a neurotic application like DWiki that gets handed a URL with a quote should encode it as &quot; so it survives the browser and gets passed as-is to the web server for the web server to puke on if desired; in practice, DWiki just encodes quotes in URLs straight to %22s.
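The difference between the two approaches can be sketched in Python. This is not DWiki's actual code; the sample URL and the choice of safe characters are assumptions for the example, and quote() here also percent-encodes the space, not just the quote characters.

```python
import html
from urllib.parse import quote

url = '/wiki/say "hi"?a=1&b=2'  # hypothetical URL containing quotes

# In theory: leave the quotes for the web server to deal with and
# only entity-encode the URL so that it survives the browser.
in_theory = html.escape(url, quote=True)

# In practice: percent-encode first, then entity-encode the result.
# safe= keeps the URL's own separators intact (this assumes url is
# already a complete path-plus-query string).
in_practice = html.escape(quote(url, safe='/?&='), quote=True)

print(in_theory)    # /wiki/say &quot;hi&quot;?a=1&amp;b=2
print(in_practice)  # /wiki/say%20%22hi%22?a=1&amp;b=2
```

The order matters in the second case: entity-encoding first and percent-encoding second would mangle the '&' and ';' of the entity references.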

Also in practice, many browsers will perform all of the necessary percent encoding for the web server themselves, turning spaces into %20 and so on, so you only need to worry about getting the URL to the browser intact. The one gotcha is that browsers often trim trailing spaces, which might be a necessary part of the URL. Doing more of the quoting yourself is also friendlier to simplistic HTML parsing applications.
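To illustrate the trailing-space gotcha, here is a tiny Python sketch (the path is made up). Percent-encoding the URL yourself means there is no literal trailing space left for anything to trim.

```python
from urllib.parse import quote

path = '/files/report '   # hypothetical path that really ends in a space
href = quote(path)

print(href)  # /files/report%20
```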

(This entry is brought to you by me getting curious about the technical requirements of this all during an online discussion with friends.)

Written on 10 April 2006.

Last modified: Mon Apr 10 01:49:00 2006