Wandering Thoughts archives

2006-06-28

The problem with cool URLs

Back in 1998, Tim Berners-Lee wrote Cool URIs don't change, which is about how your URLs shouldn't change (and he gave advice on how to manage it). People have been nodding sagely ever since (I hope, since it's a good idea). But there's a problem.

The problem with cool URLs not changing is that it means that your URLs are forever. This means that you either have to get the URL right before you publish it for the first time (leading to taxonomy issues among other things), or you have to keep supporting the old crufty ugly URLs forever (even if they just give HTTP redirections to the new URLs).
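
(To make the second option concrete, here is a minimal sketch in Python of what keeping old URLs alive tends to look like; the WSGI shim and the particular old and new paths are invented for illustration, not anything this site actually runs. The real point is that the redirect table only ever grows.)

    # A hypothetical WSGI app that issues permanent redirects from old,
    # crufty URLs to their new homes; anything not in the table is a 404.
    OLD_TO_NEW = {
        "/writing/cool-urls.html": "/blog/web/CoolUrlProblem",
        "/urls/old-scheme/cool":   "/blog/web/CoolUrlProblem",
    }

    def redirect_app(environ, start_response):
        new = OLD_TO_NEW.get(environ.get("PATH_INFO", ""))
        if new is not None:
            start_response("301 Moved Permanently", [("Location", new)])
            return [b""]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]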

Most people aren't going to get their URLs right the first time around, because structuring information is not a small and simple matter (despite how it looks; ask a librarian about it sometime).

Supporting old URLs is a deadweight on your web environment; it's a kind of clutter. Clutter makes things harder to maintain and to keep track of. (Plus you have to actively avoid namespace collisions between new URL schemes and old URLs, which may constrain what sort of new schemes you can use.)

Ironically, you can argue that the best long term approach is more or less meaningless URLs, plus searching and navigation to let people find things. When a URL doesn't mean anything to start with, there's no temptation to change it because you've realized that the meaning is wrong.

CoolUrlProblem written at 03:32:54

2006-06-15

How to have your web spider irritate me intensely

It's very simple: put what should be in your User-Agent header into the Referer header instead. The next time I read my Referer logs, you're sure to provoke me into spasms of teeth-grinding irritation. I can only conclude that people pulling this stunt are trying to advertise through other people's public Referer logs.

(For bonus points, fetch my syndication feeds without any attempt at conditional GET.)

Today's offender is the 'Strategic Board Bot', run by strategicboard.com from the IP address 212.143.103.125 (a netvision.net.il IP address, but also where 'www.strategicboard.com' et al point). Since they aren't fetching our robots.txt either, they've earned an immediate listing in our kernel IP filters.

Strategic Board itself has no useful information in its WHOIS record and appears to be in the business of indexing and searching blogs (which makes their non-use of conditional GET all the more serious; anyone specifically pulling syndication feeds should be using it). Of course, they have no 'how to contact us about our robot' information that I can see in what poking at their web page I'm willing to do.
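
(Conditional GET is not much work, either. As a rough sketch, and assuming nothing about how Strategic Board's code actually works, a polite feed fetcher in Python only has to remember two headers from its previous fetch and send them back; the function and its arguments here are made up for illustration.)

    # A sketch of polite feed fetching with conditional GET: replay the
    # Last-Modified and ETag values saved from the previous successful fetch.
    import urllib.request
    import urllib.error

    def fetch_feed(url, last_modified=None, etag=None):
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        if etag:
            req.add_header("If-None-Match", etag)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # The feed is unchanged; there is nothing to download.
                return None, last_modified, etag
            raise
        return (resp.read(),
                resp.headers.get("Last-Modified"),
                resp.headers.get("ETag"))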

Strategic Board also wins extended bonus points because they didn't use to do this; they apparently just started yesterday. So they deliberately decided to 'advertise' by hijacking Referer and putting a mere 'SB' into their User-Agent string. (A couple of early requests had 'HTTP Remote File Test' as the User-Agent instead.)

HowToGetYourSpiderBannedIII written at 14:20:41

2006-06-13

A web validation aphorism

There's a famous folk-rule on Usenet and mailing lists to the effect that a spelling flame usually has at least one spelling mistake itself. There seems to be a related rule on the web, which I will put this way:

Websites that boast about validating often don't.

(The corollary is that a certain number of people agitating for valid web sites don't have valid web sites.)

Now, HTML validation is certainly picky (arguably more picky than spelling), but I'd like to think that people who care enough to stick a badge on their website care enough to run things through a validator. Apparently not, though. (And agitating for web standards while not following them is just ironic.)

One possible reason for the problem is that websites change, creating opportunities for validation errors to creep in. And certainly HTML is hard to write by hand; I suspect that few websites are created with tools that guarantee validation all the time.

(While WanderingThoughts usually validates, this is mostly because its HTML is automatically generated and thus any invalid bits are generally the sign of a programming error that I want to step on. And it only validates to HTML 4.01 transitional, a relatively loose standard to aim for.)

Sidebar: an extreme of valid HTML

Not only is valid HTML picky, it's tricky. For example, here is a rather extreme example of perfectly valid but quite twisted minimal HTML. It's startling how much markup you can leave out and still be legal.
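
(To give the flavor of this sort of thing: as far as I know, something like the following is a complete and valid HTML 4.01 document, since the html, head, body, and various closing tags are all optional in that DTD. This is a simpler illustration than the linked example.)

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
       "http://www.w3.org/TR/html4/strict.dtd">
    <title>A minimal page</title>
    <p>This is (probably) all you need.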

ValidationAphorism written at 03:08:21

2006-06-01

The cynical take on nofollow

People have been calling the nofollow tag a failure for a while, most recently in Blog Spam, and a 'nofollow' Post-Mortem, which hit the geek news today. Comment spam has not exactly diminished since nofollow's adoption, which is not a good result for something introduced with the title 'Preventing comment spam'. Calling it a failure is pretty easy.
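
(For concreteness, nofollow is a rel attribute value on individual links, something like the line below; it asks cooperating search engines not to give the link any ranking weight. The URL here is, of course, made up.)

    <a href="http://example.com/some-page" rel="nofollow">some link</a>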

The cynical take on nofollow is a bit different: nofollow is actually a clever way to improve search engine results. While it's not going to stop comment spam unless almost everyone adopts it (including neglected bulletin boards and so on), just getting the really hot people to use nofollow helps the search engines a lot, because it's links from the really hot people that matter the most.

Persuading a few hot bloggers is a lot easier than persuading the world, especially because those hot bloggers are also hot comment spam targets and are probably quite interested in anything that can help them out. It makes for a good symbiotic relationship: nofollow helps the hot bloggers by making it less useful to spam them, and thus hopefully reduces the amount of comment spam they get.

From the cynical view, nofollow has almost certainly been a roaring success.

Despite all of this, I still like nofollow. Not for the anti-comment-spam properties (if there are any in practice); indeed, I don't bother using nofollow on links in comments here. But when search engines actually respect nofollow, it's a very useful way of steering them away from pages I don't want them requesting or indexing.

(Disclaimer: the author is not necessarily a cynic.)

CynicalNofollow written at 02:36:05
