Wandering Thoughts archives

2011-09-21

I've surrendered on utm_* query parameters in URLs

I've written before about the various extra utm_* query parameters that show up on a lot of URLs these days (they apparently come from a Google product called Urchin). Back then I was optimistic that a change in the Planet Sysadmin Twitter feed would make them go away. It turns out, well, no such luck. These days, all sorts of things 'helpfully' add these query parameters to URLs that are shared through the various parts of the social web; in fact, my strong impression is that it's hard to share links without this happening. I've continued to see a steady dribble of such requests every day and while many of them are from various Twitter-related robots, some of them show every sign of being from real human beings.

(It looks as if any time a URL is mentioned for the first time in anyone's Twitter stream some number of robots wake up and poke it, mostly with HEAD requests.)

I've disliked these query parameters ever since I saw them. They're a hack and fundamentally wrong and they only work because almost every web server in the world is terribly sloppy (per the original discussion). I stuck to my guns on this for a long time. But. Real people are out there innocently using stuff that feeds them URLs with these unsightly query strings and trying to see my content, and I was giving them error pages instead. I don't care about robots, but I do care about people (eventually).

So I've surrendered. DWiki now accepts URLs with utm_* query parameters, no matter how annoyed this makes me.

However this isn't a complete surrender, as I'm handling this the right way. If you use a URL with these query parameters, you don't get the page itself. Instead DWiki immediately returns a redirection to the page's proper URL (ie the one without all of the ugly parameters that exist to either track your activities or inflate the nominal influence of various traffic sources, depending on who you ask). This removes all of the utm_ ugliness from the URL that people actually see and ensures that various sorts of web crawlers get the canonical URL for the page instead of seeing duplicate content across several different URLs.
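As a rough illustration of the redirect approach described above (this is not DWiki's actual code, which isn't shown here), the following Python sketch strips utm_* parameters from a request URL and produces a redirect to the cleaned-up URL. The function names canonical_url and redirect_for, the example.org URL in the usage note, and the choice of a 301 status are all assumptions made for the sketch; the entry doesn't say which redirect status DWiki uses.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Return url with utm_* query parameters removed, or None if it has none.

    Hypothetical helper for illustration. Other query parameters are kept
    verbatim and in their original order. Parameter names are matched as
    written, so a percent-encoded 'utm_' prefix would not be caught.
    """
    parts = urlsplit(url)
    fields = [f for f in parts.query.split("&") if f]
    kept = [f for f in fields if not f.split("=", 1)[0].startswith("utm_")]
    if len(kept) == len(fields):
        # Nothing to strip; the caller should serve the page normally.
        return None
    return urlunsplit(parts._replace(query="&".join(kept)))


def redirect_for(url):
    """Return (status, headers) for a redirect, or None to serve the page.

    The 301 status is an assumption; any redirect status would get the
    utm_* parameters out of the URL that people end up seeing.
    """
    target = canonical_url(url)
    if target is None:
        return None
    return ("301 Moved Permanently", [("Location", target)])
```

With this sketch, a request for http://example.org/blog/Entry?utm_source=twitter&utm_medium=feed would be redirected to http://example.org/blog/Entry, while a request that carries no utm_* parameters is left alone.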

(This is somewhat related to Pinboard's war on Urchin. We have the same distaste for the whole thing, but Maciej writes much better than I do.)

UtmSurrender written at 01:53:57

2011-09-16

Setting the character encoding for HTML form input

Courtesy of reading HOWTO Use UTF-8 Throughout Your Web Stack, I recently rediscovered the form accept-charset attribute (I say 'rediscovered' because I clearly once knew about it since I mentioned it in passing back in 2007).

Setting an explicit accept-charset attribute on your forms solves (in theory) one of those niggling little HTML forms questions, that being 'what character encoding did the browser use for encoding this text the user submitted?' As spelled out in the HTML 4.01 Forms specification, browsers have to honor an explicit value. However, if one is omitted browsers are merely permitted (but not required) to take the character encoding for form submissions from the character encoding of the form's HTML page.

(I don't know why this wasn't mandatory behavior; maybe there were browsers that historically used their default character set or the like.)

According to various references, accept-charset is fully supported by browsers, with the small exception that if you try to use the charset 'ISO-8859-1' some versions of Internet Explorer will decide that you meant 'Windows-1252' instead. I haven't tested to see if an accept-charset that matches your page's charset will cause form submissions to have an explicit charset specified (cf), although I suspect that it doesn't for most browsers.

This isn't going to cause me to immediately update any of my HTML templates (either in DWiki or elsewhere); in practice they work today without specifying accept-charset, at least as far as I can tell. But when I write or update HTML in the future, I'm going to try to remember this and put accept-charset attributes on all of my HTML forms, just to make sure that I get what I'm expecting. It's a good practice, if nothing else, and someday it may save me some annoyance.
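A minimal sketch of what this could look like in a Python web application, assuming UTF-8 is the encoding you want: the form template carries an explicit accept-charset, and the server side decodes the submitted body strictly as that encoding so that a mismatch shows up as an error instead of as silently mangled text. The FORM_HTML template, the /comment action, the comment field name, and the decode_form helper are all hypothetical; only urllib.parse.parse_qs is a real standard library call.

```python
from urllib.parse import parse_qs

# Hypothetical form template; accept-charset asks the browser to encode
# the submitted text as UTF-8 regardless of the page's own encoding.
FORM_HTML = """<form method="post" action="/comment" accept-charset="UTF-8">
  <textarea name="comment"></textarea>
  <input type="submit" value="Post">
</form>"""


def decode_form(body, encoding="utf-8"):
    """Decode an application/x-www-form-urlencoded request body.

    body is the raw bytes of the POST. A urlencoded body should be pure
    ASCII with %XX escapes, so it is decoded as ASCII first and the
    escapes are then interpreted in the expected character encoding.
    errors="strict" makes a wrongly encoded submission raise
    UnicodeDecodeError instead of being quietly replaced.
    """
    text = body.decode("ascii")
    return parse_qs(text, keep_blank_values=True,
                    encoding=encoding, errors="strict")
```

A submission of caf%C3%A9 decodes cleanly as UTF-8, while caf%E9 (the same word as encoded in ISO-8859-1 or Windows-1252) raises an error, which is the sort of thing you would want to notice.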

(As before, what HTML generally calls 'charsets' are in fact character encodings, not character sets per se.)

(This is one of the entries that I write to get something to stick in my head. Clearly I didn't think accept-charset was all that important back in 2007, and I'm pretty sure I was wrong.)

FormCharsets written at 01:07:28



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.