The fun and charm of quoting URLs properly

April 10, 2006

The fun and charm of URL quoting is that you need to do it twice. Differently. That's because there are two different entities involved: browsers and web servers.

Strictly speaking, about the only thing that you have to quote for the browser is quote characters, because otherwise your nice <a href="..."> comes out very confusing. If you are being a good web standards monkey you need to quote at least ampersands (&'s) as well, because otherwise the browser may take them as entity references. The HTML 4.01 spec in section 5.3.2 recommends also quoting '>', just in case.

(In practice, no browser pays any attention to anything except a truly valid entity reference, because practically everyone except the obsessively standards compliant has unescaped &'s flying around.)

Web servers are startlingly liberal, so the only things you really have to quote are space characters (as either %20 or '+', depending on context) and the percent character itself. RFC 2396 has an additional list or two of stuff that should also be quoted (in sections 2.4.3 and 2.2), like quotes, and some web servers are picky.

(And if you are unlucky enough to deal with a joker who embedded URL component separator characters like '?' or '&' into his paths, you'll have to quote them too.)

You quote things for the browser with entity encoding, so & turns into &amp;. You quote things for the web server with percent encoded hex character values, so a quote turns into %22 and the browser ignores it too. In theory a neurotic application like DWiki that gets handed a URL with a quote should encode it as &quot; so it survives the browser and gets passed as is to the web server for the web server to puke on if desired; in practice, DWiki just encodes quotes in URLs straight to %22s.
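The two passes can be sketched in Python with nothing but the standard library. This is an illustration of the general idea, not DWiki's actual code; the function name href_for and the example path are made up, and the query string is assumed to arrive already percent-encoded.

```python
from html import escape
from urllib.parse import quote


def href_for(path, query=""):
    """Return an <a> tag whose href has been quoted twice, once per audience."""
    # Pass 1, for the web server: percent-encode unsafe characters in the path.
    # quote() leaves '/' alone by default; '"' becomes %22 and ' ' becomes %20.
    url = quote(path)
    if query:
        # Assumption: the caller hands us an already percent-encoded query.
        url += "?" + query
    # Pass 2, for the browser: entity-encode for use inside href="...".
    return '<a href="%s">link</a>' % escape(url, quote=True)


print(href_for('/files/my "big" report.txt', "a=1&b=2"))
# <a href="/files/my%20%22big%22%20report.txt?a=1&amp;b=2">link</a>
```

The order matters: percent-encoding first means the entity pass only ever has to deal with '&', since the quotes are already gone by then.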

Also in practice, many browsers will perform all of the necessary percent encoding for the web server themselves, turning spaces into %20 and so on, and you only need to worry about getting it to the browser. The one gotcha is that browsers often trim trailing spaces, which might be a necessary part of the URL. Doing more quoting is friendlier to simplistic HTML parsing applications.

(This entry is brought to you by me getting curious about the technical requirements of this all during an online discussion with friends.)


Comments on this page:

By DanielMartin at 2006-04-10 22:52:31:

URL-quoting gets hairy and nasty, but this isn't the hairy and nasty bit. This only looks hairy and nasty because you're conflating two operations (URL -> text and plain text -> HTML attribute value). I thought that DWiki cleanly separated those two logical operations, but maybe not.

Note also that encoding > as &gt; doesn't actually come up when encoding URLs, because no proper URL has > in its textual representation. It does however come up in other html attributes, such as the alt="" bit for images.

Where URL quoting gets really hairy is when you have to deal with naive URL-normalization functions, such as the one found in wget as recently as two years ago, though it seems to be fixed now. That routine normalized URL-escaping by doing this:

  1. Take the URL, and split it at the first '?' character, if it has one.
  2. To each half, undo all % escapes, then redo the ones that are necessary.
  3. Put the halves back together with a '?' in between.

The problem with this is that it turns the URL http://snowplow.org/smml/?q=easter+in+uses+%26+sorrow+in+title into the URL http://snowplow.org/smml/?q=easter+in+uses+&+sorrow+in+title (which gets very different results).
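The three steps are easy to reproduce in Python. What follows is a sketch of the general approach, not wget's actual code; the function name naive_normalize and the choice of "necessary" characters (the safe= lists) are assumptions made for illustration.

```python
from urllib.parse import parse_qs, quote, unquote


def naive_normalize(url):
    """Undo every % escape in each half, then redo only the 'necessary' ones."""
    head, sep, tail = url.partition("?")
    # safe= lists the characters that are NOT re-escaped. '&', '=' and '+'
    # have to be safe in the query, or ordinary separators would be mangled,
    # and that is exactly why the decoded %26 never gets put back.
    head = quote(unquote(head), safe="/:")
    tail = quote(unquote(tail), safe="&=+")
    return head + sep + tail


original = "http://snowplow.org/smml/?q=easter+in+uses+%26+sorrow+in+title"
mangled = naive_normalize(original)
print(mangled)
# http://snowplow.org/smml/?q=easter+in+uses+&+sorrow+in+title

# The server now sees a different query: the value of q is cut short.
print(parse_qs(original.partition("?")[2]))
print(parse_qs(mangled.partition("?")[2]))
```

Once the %26 has been decoded there is no way to tell it apart from a real parameter separator, so no amount of careful re-encoding can recover the original.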

I believe that the impetus for fixing this error was finally that the java sdk download page used at one point URLs like that (i.e. with %26 sequences that it was bad to undo) as part of the downloading process, meaning that it couldn't be downloaded with wget.

This leads to the startling conclusion that given two strings A and B that are both valid textual representations of URLs, and given that string A can be turned into string B by turning a few of string A's characters into %-escaped sequences, you do not know whether A and B point to the same URL. Instead, you have to consider that some characters are special in certain portions of a URL, and that %-encoding removes this specialness. For example, in URLs whose schemes follow the common hierarchical format - i.e. http:, ftp:, and file: URLs - the character "/" is special before the "?", but not after. This means that replacing / with %2F on the left hand side of the "?" will change the URL, but doing the same on the right hand side won't. Presumably, different URL schemes could define totally arbitrary rules for what characters are special where. (Though see the URL/URI RFCs for the gory details)
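A small Python sketch of the "/" versus %2F point, using made-up example.org URLs and a hypothetical helper named path_segments. It assumes the usual discipline of splitting on the special character first and only decoding afterwards.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def path_segments(url):
    """Split the path on literal '/' first, and only then undo % escapes."""
    raw_path = urlsplit(url).path  # urlsplit() does not decode anything
    return [unquote(seg) for seg in raw_path.split("/")[1:]]


# Before the '?', "/" and "%2F" mean different things.
print(path_segments("http://example.org/a/b/c"))    # ['a', 'b', 'c']
print(path_segments("http://example.org/a%2Fb/c"))  # ['a/b', 'c']

# After the '?', "/" is not special, so both spellings decode the same way.
print(parse_qs("p=a/b") == parse_qs("p=a%2Fb"))     # True
```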

Finally, I'll note that the authors of most decent CGI frameworks allow clients to separate parameters with ; instead of & so that the resultant urls won't have to go through extra backflips to be embedded as html element attributes. (one of the HTML recommendations even urges people to do this where possible) However, since & is what browsers use to separate parameters, that's the only separator you can guarantee will work. This is certainly one area where the architects of HTTP screwed up; semicolon should have been the character browsers used to separate form elements.
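To make the "extra backflips" concrete, here is what entity-encoding does to each separator, in Python. The /search URLs are invented for the example, and whether a given server actually accepts ';' as a separator is something you would have to check for your own framework.

```python
from html import escape

amp_url = "/search?lang=en&page=2"
semi_url = "/search?lang=en;page=2"

# The '&' form has to be rewritten before it can go inside href="...".
print(escape(amp_url))   # /search?lang=en&amp;page=2
# The ';' form passes through the HTML quoting step untouched.
print(escape(semi_url))  # /search?lang=en;page=2
```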

By DanielMartin at 2006-04-10 23:00:33:

Oh, also if you're trying to quote some URL programmatically, you don't need to worry about %20 vs. + for spaces. Just always use %20 - that's always a space, whereas + is a space only in the value portion of name=value pairs following the "?" of an http URL submitted by a browser posting a "GET" method form.
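Python's standard library happens to expose both conventions side by side, which makes the difference easy to see. A quick illustration, with arbitrary sample strings:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Encoding: quote() always writes a space as %20; quote_plus() uses the
# form-data convention and writes it as '+'. Both escape a literal '+'.
print(quote("a b+c"))       # a%20b%2Bc
print(quote_plus("a b+c"))  # a+b%2Bc

# Decoding: %20 is a space either way, but '+' is a space only if the
# decoder has been told it is looking at form data.
print(unquote("a+b%20c"))       # a+b c
print(unquote_plus("a+b%20c"))  # a b c
```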

By cks at 2006-04-11 00:45:39:

Well, the URL -> plain text -> HTML thing is two distinct steps, not one; in my opinion, that's a good part of the complication, since they have different quoting schemes. When encoding down a URL you have to remember to do both passes, and what each requires.

(Indeed I think that the most common cause of '&' appearing unquoted in URLs in <a href="..."> is that people forget the need for the second sort of quoting.)

When writing a program that generates HTML automatically you have the additional complication of trying to figure out how quoted the URLs you're getting handed already are. For example, if you see a URL with 'http://foo/%20bar/', is that a properly quoted 'http://foo/ bar/', or is it something that needs the % to be escaped? I think most programs just rely on social conventions, mostly 'URLs are already almost completely properly escaped'. (This is what DWiki does.)
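The ambiguity is easy to show in Python. This only illustrates the two readings of the same string; the variable names are made up, and nothing here can tell you which reading is the right one.

```python
from urllib.parse import quote, unquote

handed_to_us = "http://foo/%20bar/"

# Reading 1: the string is raw, so the '%' itself needs escaping.
as_if_raw = quote(handed_to_us, safe="/:")
# Reading 2: the string is already quoted, so it is left alone.
as_if_quoted = handed_to_us

print(as_if_raw)              # http://foo/%2520bar/
# What the web server ends up seeing after it decodes each version:
print(unquote(as_if_raw))     # http://foo/%20bar/
print(unquote(as_if_quoted))  # http://foo/ bar/
```

Guessing wrong in the first direction gives you the classic double-encoding bug, where %2520 shows up in URLs.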

DWiki does a single-pass transformation of textual URL representation into HTML form that both entity-encodes '&' and percent-encodes quotes, spaces, and '>' (the latter two to be friendly). Probably DWiki could stand to have a more robust URL handling infrastructure, as right now putting exotic characters in filenames will cause a certain amount of heartburn.
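A single-pass transformation of this kind might look roughly like the following in Python. To be clear, this is a sketch written from the description above and is not DWiki's source; the names url_to_html and _URL_TO_HTML are hypothetical, and including the single quote alongside the double quote is an assumption.

```python
# Hypothetical replacement table, built from the prose description only.
_URL_TO_HTML = {
    "&": "&amp;",  # entity-encode for the browser
    '"': "%22",    # percent-encode so the href attribute cannot be broken
    "'": "%27",    # assumption: single quotes get the same treatment
    " ": "%20",    # friendly to simplistic HTML parsers
    ">": "%3E",
}


def url_to_html(url):
    """Turn a mostly-quoted URL into text that is safe inside href="..."."""
    # One pass over the characters; anything not in the table, including
    # existing % escapes, is trusted and passed through unchanged.
    return "".join(_URL_TO_HTML.get(ch, ch) for ch in url)


print(url_to_html('http://foo/a b/?x="1"&y=2'))
# http://foo/a%20b/?x=%221%22&amp;y=2
```

Leaving '%' alone is what makes this rely on the social convention that incoming URLs are already almost completely escaped.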

From 24.8.170.42 at 2006-04-15 21:04:54:

Using "+" rather than "%20" for space is a requirement of the application/x-www-form-urlencoded media type, which originates in the HTML specs and is designed for mapping HTML form field names and values into a vaguely readable but URL-safe format as ASCII bytes. It doesn't matter if you're transmitting via HTTP GET or POST or email.

Until the publication of XForms 1.0 (which is ultimately a component of the yet-unfinished XHTML 2.0), nobody ever bothered to say exactly how any non-ASCII characters in the form field names and values are represented in application/x-www-form-urlencoded data. Consequently, sender and receiver have to agree on the assumptions they make when creating and processing application/x-www-form-urlencoded strings. UTF-8 is what XForms calls for using as the basis of percent-encoding, but you'll still find a lot of servers in the wild that expose form data under the assumption that iso-8859-1 or some other platform default encoding had been used. It also depends on how the data is being exposed in whatever API you're using; if it's just being exposed as byte arrays rather than characters, then it may be a moot point.
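The need for sender and receiver to agree can be illustrated with Python's standard library. The field name and value are invented for the example; the point is only that the same form data produces different bytes under different assumed encodings, and decodes wrongly when the two sides disagree.

```python
from urllib.parse import parse_qs, urlencode

fields = {"q": "café au lait"}

# Same form data, two different assumptions on the sending side.
utf8_body = urlencode(fields)                        # q=caf%C3%A9+au+lait
latin1_body = urlencode(fields, encoding="latin-1")  # q=caf%E9+au+lait
print(utf8_body)
print(latin1_body)

# A receiver that assumes the matching encoding gets the text back.
print(parse_qs(latin1_body, encoding="latin-1"))  # {'q': ['café au lait']}
# A receiver that assumes the wrong one gets mojibake instead.
print(parse_qs(utf8_body, encoding="latin-1"))    # {'q': ['cafÃ© au lait']}
```

Note that urlencode() writes spaces as '+', following the form-data media type discussed above.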

Since this media type is oriented toward the representation of character data and has long been vaguely spec'd and interpreted, it is generally unsuitable for arbitrary binary data, which is why the multipart/form-data media type is preferred for 'file upload' types of submissions.

- Mike Brown

By cks at 2006-04-16 00:21:20:

As a pragmatic matter, I suspect that very few web applications treat '+' any differently from %20 in URL query parameters. Fixing it right in an HTML generating engine pretty much requires breaking down the URLs, unquoting all of their quoting, and then trying to put it back together again; in my opinion, down that road lies madness, or at least strange problems. So it's not a DWiki issue that keeps me up at night.

(I'd be somewhat surprised by a POST form processing environment that treated '+' differently than %20, too. But god knows peculiar programming abounds.)

Character set encoding in POSTed forms is an entirely separate issue, one not addressed in this entry. (My apologies if that isn't clear; I was talking about URLs in HTML and as they appear in GET, HEAD, POST, etc HTTP requests.)

(Strictly speaking an authoring environment should worry about character encoding in URLs too, but down that road lies IDNA and other issues I know not enough about.)

Written on 10 April 2006.

Last modified: Mon Apr 10 01:49:00 2006