Wandering Thoughts archives

2006-07-30

XHTML on the web is for masochists

Web design purists like to talk up XHTML at the moment, but as far as I can tell almost everyone who is trying to do XHTML today is a masochist (or ignorant).

First, Internet Explorer does not support XHTML. Not even IE7 will support XHTML, which means that for all practical purposes you cannot serve only XHTML to visitors; some of them need to get an HTML version instead.

The usual dodge is to serve the same XHTML document as XHTML to browsers that can handle it but as text/html to everyone else. The problem here is that XHTML and HTML have different rules in several areas; creating an XHTML page that will also render the same when treated as HTML requires painstaking and awkward contortions.

Changing the Content-Type of a URL on a request-by-request basis means that your web server needs to do some dynamic work on every request, even requests for what would otherwise be static files.

Since the Content-Type varies from browser to browser, I believe that you need to mark your pages as non-cacheable, to avoid having a web cache serve a cached copy with the wrong Content-Type to a browser that can't handle it.
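
To make the mechanics concrete, here's a rough sketch of that per-request work as a small Python WSGI app. This isn't how any particular site actually does it, and the file path and the exact Accept check are just illustrative:

    # Sketch only: pick a Content-Type per request based on the Accept
    # header, and keep caches from reusing the response for the wrong
    # sort of browser.
    def choose_content_type(environ):
        accept = environ.get('HTTP_ACCEPT', '')
        if 'application/xhtml+xml' in accept:
            return 'application/xhtml+xml'
        return 'text/html'

    def application(environ, start_response):
        # Even a 'static' page now needs this logic on every request.
        with open('/www/pages/example.xhtml', 'rb') as f:
            page = f.read()
        headers = [
            ('Content-Type', choose_content_type(environ) + '; charset=utf-8'),
            # Either tell caches that the answer depends on Accept...
            ('Vary', 'Accept'),
            # ...or, if you don't trust them to honour that, forbid caching.
            ('Cache-Control', 'no-cache'),
        ]
        start_response('200 OK', headers)
        return [page]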

And for all of this extra work, what you get is basically equivalent to writing HTML 4.01 strict; it's not as if XHTML gives you more layout power or is easier to write.

(Actually, most people are probably ignorant of these issues. This also explains the huge collection of web pages that claim to be valid XHTML but aren't, which would have catastrophic effects if browsers actually believed them, since with XML and XHTML a browser is supposed to refuse to do anything with a document that has errors in it.)
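
As a small illustration of how unforgiving that is, here is what Python's standard XML parser (standing in for a browser's strict XML mode) does with a page that has a single unclosed tag:

    # One unclosed <br> is enough: a strict XML parser refuses outright
    # instead of rendering what it can, the way tag-soup HTML parsing does.
    import xml.etree.ElementTree as ET

    bad_page = "<html><body><p>hello<br></body></html>"
    try:
        ET.fromstring(bad_page)
    except ET.ParseError as err:
        print("refused to parse:", err)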

Some further reading

XHTMLMasochism written at 13:25:40

2006-07-24

Walking away from Slashdot: a story of design

A while back I wrote about the two faces of RSS, in the process of which I held up Slashdot as an example of a site where I preferred the actual site to the syndication feed by a large margin, and explained why.

I have to change that, because Slashdot has lost me as a regular visitor to their website, and what Slashdot stuff I read nowadays is almost entirely through their RSS feed. It's for the traditional reason: a website redesign that actually injected 'design'.

Slashdot used to show me the article text (the most important thing) in my preferred font at my preferred text size. In the redesigned Slashdot, they don't; instead they commit the most common sin of setting the important text at a reduced size. They also force their text to be set in sans-serif (whatever that is in any particular browser) instead of my default font.

I can fix a too-small font size, but I have to keep fixing it every time. That's been enough to push me away, and since the Slashdot RSS feed is not really a good substitute, I read a bunch less Slashdot these days. (Some people would say that this is about time, or long since overdue; personally it makes me a bit sad.)

(I am not interested enough to do something with Firefox's GreaseMonkey. Possibly some user CSS stylesheet magic would do it too; perhaps this will be an incentive to learn about that particular obscure Firefox feature. But really, Slashdot has persuaded me not to care.)

Slashdot isn't by any means alone in this sort of stuff; people do this to their websites all the time. At one level I can say I have no idea why, but at another level I suspect I do: people feel that the browser defaults are bad. (Are they? I don't know.)

LeavingSlashdot written at 01:32:06

2006-07-15

A robot wish

I've come to realize that I have a two-part wish about web robots and spiders and so on. To wit:

I wish that all robots put something distinctive in their HTTP request headers (perhaps 'ROBOT' somewhere in the User-Agent string), and that there were a standard HTTP response code for 'request declined because robots should not crawl this resource'.

Part of the problem of dealing with (well-behaved) robots is that the only real robot signature is fetching robots.txt, and even that isn't a sure thing. You can look at User-Agent strings to recognize specific robots, but this doesn't scale and it's reactive, not proactive. (I say it doesn't scale because in the past 28 days, over 100 different robotic-looking User-Agent strings have fetched robots.txt here.)
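
Getting that sort of number out of an Apache-style access log takes only a few lines of Python; the log path and the 'combined' log format here are assumptions, and this counts over whatever period the log file covers rather than exactly 28 days:

    # Count the distinct User-Agent strings that have asked for /robots.txt.
    uas = set()
    with open('/var/log/apache2/access.log') as log:
        for line in log:
            parts = line.split('"')
            # combined log format: parts[1] is the request line,
            # parts[5] is the User-Agent string.
            if len(parts) >= 6 and parts[1].split()[1:2] == ['/robots.txt']:
                uas.add(parts[5])
    print(len(uas), "distinct User-Agents fetched robots.txt")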

Having a definite robot signature in each request would make all sorts of robot filtering much easier and more reliable (and we wouldn't have to depend on robots.txt to do it, which has problems). And with a specific error response for it, robots could unambiguously know what was going on and behave appropriately.

(You could also avoid having to give away information in robots.txt about exactly what you don't want robots indexing, which can sometimes be very interesting to nosy people.)
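
To sketch what the server side of this wish could look like (in Python WSGI, with both the 'ROBOT' marker and the status code being entirely made up, since no such convention or code actually exists):

    # Hypothetical: decline robot requests for pages we don't want crawled.
    def robots_allowed(path):
        # Whatever per-page policy the application wants; a toy rule here.
        return not path.startswith('/private/')

    def application(environ, start_response):
        ua = environ.get('HTTP_USER_AGENT', '')
        if 'ROBOT' in ua and not robots_allowed(environ.get('PATH_INFO', '/')):
            # The imaginary 'request declined because robots should not
            # crawl this resource' response; today you'd have to fake it
            # with something generic like a 403.
            start_response('430 Robots Not Wanted',
                           [('Content-Type', 'text/plain')])
            return [b'This resource is not for robots.\n']
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'Hello, human (or undeclared robot).\n']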

At the dawn of the robot era, it would have been pretty easy to introduce at least the per-request robot signature (an extended 'no robots please' status code might have been more challenging). Unfortunately it's too late by now. Still, if you're writing a new web spider I urge you to start a new movement and put 'ROBOT' somewhere in your User-Agent string.
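
On the spider's side this is a one-line affair; the spider name and URL below are of course made up:

    # A made-up spider declaring itself as a robot in its User-Agent.
    import urllib.request

    req = urllib.request.Request(
        'http://www.example.com/',
        headers={'User-Agent':
                 'ExampleSpider/1.0 (ROBOT; +http://www.example.com/spider.html)'})
    page = urllib.request.urlopen(req).read()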

(PS: I'm not suggesting that this mechanism should replace robots.txt; robots.txt is very useful for efficient bulk removals when they can be expressed within its limits. I'd like to have both available.)

ARobotWish written at 02:22:53

2006-07-11

Why nofollow is useful and important

No less a person than Google's Matt Cutts recently spoke up about herding Googlebot and more or less recommended using the noindex meta tag on pages instead of nofollow on links to them (on the grounds that it's more of a sure thing to mark pages noindex than to make sure that all links to them are marked nofollow).

I must respectfully disagree with this, because in one important respect meta noindex isn't good enough. The big thing that nofollow does that meta noindex can't is that it makes good web spiders not fetch the target page at all. Which means that you don't have to send it, and for dynamic pages you don't have to generate it.

(This is especially important for heavily dynamic websites that have a lot of automatically generated index pages of various sorts.)
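
To show the difference from the spider's side, here is a minimal sketch (not any real crawler's logic) of why nofollow is the cheap one for the server: a nofollow'd link never even gets requested, while a meta noindex page has to be generated, sent, fetched, and parsed before the spider discovers it should be discarded:

    # Sketch: collect the links on a page that a polite spider would fetch,
    # skipping any marked rel="nofollow" (so those pages are never requested,
    # and never have to be generated).
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.to_fetch = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'a' and 'href' in attrs:
                if 'nofollow' in (attrs.get('rel') or '').split():
                    return
                self.to_fetch.append(attrs['href'])

    parser = LinkCollector()
    parser.feed('<a href="/a" rel="nofollow">skip</a> <a href="/b">crawl</a>')
    print(parser.to_fetch)     # prints ['/b']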

I really don't want to be burning my CPU cycles to generate pages that web spiders will just throw away again; frankly, it's annoying as well as wasteful. This is a good part of why I am so twitchy about spiders respecting nofollow.

(In fact I care more about this than about helping Google reduce redundancy in their indexes, which is one reason why WanderingThoughts has lots of nofollow but no meta noindex. Plus, getting good indexing for a blog-oid thing is much harder than just sprinkling some noindex magic over bits.)

Sidebar: why not robots.txt?

In theory, robots.txt is supposed to be the way to tell web spiders to avoid URLs entirely. However, there are two problems with it in practice. First, the format itself is inadequate for anything except blocking entire directory hierarchies. Second, it's the wrong place; the only thing that really knows whether a page should be spidered is the thing generating the page.

UsefulNofollow written at 02:13:01



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.