2006-01-25
Site navigation stuff goes on the right side
The correct place to put your site navigation is on the right side of your pages, not the left. (This assumes you have enough geegaws that they don't fit in a header or a footer, but then most people do.)
There are two pragmatic reasons why.
The most important is what happens if the page is too wide for the user's browser window: the browser displays as much of the left side of the page as it can fit. In this situation, which is more important for the user to see as much of as possible, your navigation or your content? Forcing people to scroll sideways to read what they came to your site for strikes me as a good way to irritate them; it certainly irritates me when I run into sites like this.
(Don't assume that everyone uses wide browser windows. I suspect that larger displays are actually going to lead to smaller browser windows, since full screen and nearly full screen windows become absurdly big on big displays.)
The other reason is that in the portion of the world that reads left to right, we're conditioned to look at the top left, or at least the left, to start reading pages. If the site navigation is on the left, the reader has to skip over it; if the site navigation is on the right, they can immediately home in on the actual content.
(Three column layouts are an entirely separate issue that I'm not going to go into now except to say that I almost never like them.)
2006-01-21
Please have stable ids for your feed entries
In both RSS and Atom syndication feeds, feed entries can have an identifier element (optional in the case of RSS, mandatory in Atom feeds). The entry ID is supposed to be permanent and stable, no matter what; things that process feeds use it to know what they've seen before and what they haven't.
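(To make the mechanics concrete, here is a small Python sketch of how an aggregator might track what it has already seen; the Entry type and the seen-ID set are made up for illustration, not any particular aggregator's real internals.)

    # Illustrative only: track entry IDs we've already seen and treat
    # anything with an unknown ID as a new post.
    from dataclasses import dataclass

    @dataclass
    class Entry:
        entry_id: str   # <guid> in RSS, <id> in Atom
        title: str

    def new_entries(entries, seen_ids):
        """Return only the entries whose IDs we haven't recorded before."""
        fresh = [e for e in entries if e.entry_id not in seen_ids]
        seen_ids.update(e.entry_id for e in fresh)
        return fresh

    seen = set()
    first = [Entry("http://oldname.example.com/journal/1.html", "A post")]
    print([e.title for e in new_entries(first, seen)])    # ['A post']

    # If the feed derives IDs from the journal's URL and that URL changes,
    # the very same post comes back with a new ID and looks brand new:
    second = [Entry("http://newname.example.com/journal/1.html", "A post")]
    print([e.title for e in new_entries(second, seen)])   # ['A post'] again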
This might seem like an unimportant picky thing, except that LiveJournal just inadvertently gave everyone reading Planet Debian a glaring example of why it's so important. (And Planet Debian is a default feed in liferea, so that may be a decent number of people.)
It goes like this:
- A number of the people aggregated at Planet Debian use LiveJournal.
- LiveJournal makes the RSS <guid> element the URL of the post, which includes the journal's URL. (Possibly they have to, if too many RSS readers assume that the <guid> is a URL.)
- Due to a security issue, LiveJournal recently changed the URL to everyone's journal.
- All the <guid> elements in people's entries promptly changed.
The result of all of this has been a flood of old posts washing over Planet Debian, bit by bit (LJ feeds only refresh when the user posts a new entry).
I'm sure this isn't deliberate; no one wanted this to happen. But it does make a handy demonstration of why changeable entry identifiers are a bad idea.
Unfortunately DWiki has this problem too, because its only concept of an object's identity is its path and thus its URL, which has caused occasional heartburn when I've been forced to rename entries. However, DWiki is operating under stricter constraints than most web sites; if you're storing any sort of metadata about pages or entries, you can also store some sort of permanent unique identifier.
(Heck, if you store entries in a database, you need a primary key anyways. Even if this is not easily representable in ASCII, it can be hashed down and ASCII-fied.)
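As a minimal sketch of that hashing idea, assuming you have some opaque primary key to start from (the tag: URI format here is just one reasonable choice, not anything DWiki actually does):

    # Hypothetical example: derive a permanent, ASCII-only entry ID from
    # whatever primary key the storage layer already has, independent of
    # the entry's current path or URL.
    import hashlib

    def stable_entry_id(primary_key: bytes, domain: str = "example.org") -> str:
        """Hash the primary key down and wrap it in a tag: URI."""
        digest = hashlib.sha1(primary_key).hexdigest()
        return "tag:%s,2006:%s" % (domain, digest)

    # The key can be arbitrary bytes; renaming the entry (and so changing
    # its URL) doesn't touch the key, so the ID never changes.
    print(stable_entry_id(b"\x00\x01row-4242"))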
2006-01-20
The limits of web spider tolerance
I have an important message for web spider operators: our generosity is not unlimited. In fact, it's probably running out.
There are a lot of web spiders out there, and many of these spiders don't seem to be offering anything for free to the public. When you crawl to build a private index, you're building a business in part off our resources, which you are using for free, and there is very little in this for us. To put it plainly, such spider operators are parasites that are counting on us to not really notice their spider-bites.
Like most websites, we've got a thick skin and large reserves of generosity. But it's not unlimited, and it's already worn out for some people. Moreover, I believe that being a parasite is not a good way to be viable in the long term (and it's certainly not a good way to make people like you).
If you are considering a parasitic spider business today, do the honest and simple thing: buy access to Alexa's data. (If you can't afford this, how on earth are you going to afford the infrastructure to do decent web crawling?)
If you believe you have a non-parasitic spider business, you'd better have a clear and compelling explanation of what's in it for us. What do we, or the general public, get out of letting you consume our resources?
(For a hair-raising list of web spiders and their apparent purposes, see Edith Frost's Spot the Bot entry.)
2006-01-14
Writing HTML considered harmful
The other day a weblog I read had a post (and its RSS feed) blow up because of invalid markup: an entry was quoting some text that had bare '<'s in it, nothing escaped them, and invalid HTML tags got generated and ate part of the entry. (The problem's now been fixed.)
There's nothing noteworthy about this, and that's the problem: people make this mistake all the time. HTML has a bunch of picky rules to keep track of; if you make people write HTML they will overlook something every so often, kaboom. The conclusion is obvious.
Not escaping a '<' is the most common error, so if you're going to make people write HTML please automatically escape all unrecognized HTML tags. This gives you a fighting chance of not mangling your user's text too badly the next time they paste in something with a '#include <stdio.h>' or whatever. (Please especially do this if you're already only accepting limited HTML markup, for example in comments.)
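Here's a rough Python sketch of that approach, with a made-up allow-list of recognized tags; a real comment system would differ in the details:

    # Rough sketch, not production code: keep a small allow-list of tags
    # and HTML-escape anything else that looks like a tag, plus all bare
    # '<' and '>' in ordinary text.
    import re

    ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "code", "pre", "p", "blockquote"}
    TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")

    def escape_text(s):
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

    def escape_unknown_tags(text):
        out, pos = [], 0
        for m in TAG_RE.finditer(text):
            out.append(escape_text(text[pos:m.start()]))
            if m.group(1).lower() in ALLOWED_TAGS:
                out.append(m.group(0))               # recognized tag: pass through
            else:
                out.append(escape_text(m.group(0)))  # unknown: neutralize it
            pos = m.end()
        out.append(escape_text(text[pos:]))
        return "".join(out)

    print(escape_unknown_tags("Put <b>#include <stdio.h></b> at the top."))
    # -> Put <b>#include &lt;stdio.h&gt;</b> at the top.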
The real solution is to use a markup language that's easier to write and avoids these errors. There are lots of choices; wikis have shown that people will happily write quite a lot in WikiText variants, for example. While these don't give you all of HTML's power, content text rarely needs more than the core markup, and in any case if you're editing through the web there's a limit on what you can write by hand and get right.
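As a toy illustration of why this holds up better (a made-up two-rule markup, not any real wiki dialect): because the raw text is escaped before any markup is applied, a stray '<' can never turn into a runaway tag.

    # Toy markup converter: escape everything first, then turn explicit
    # markup characters into the small set of tags we allow.
    import html
    import re

    def wikitext_to_html(text):
        escaped = html.escape(text)                             # every '<' becomes &lt;
        escaped = re.sub(r"\*(.+?)\*", r"<b>\1</b>", escaped)   # *bold*
        escaped = re.sub(r"_(.+?)_", r"<i>\1</i>", escaped)     # _italic_
        return "<p>" + escaped + "</p>"

    print(wikitext_to_html("Put *#include <stdio.h>* at the top."))
    # -> <p>Put <b>#include &lt;stdio.h&gt;</b> at the top.</p>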
You might say 'well, people shouldn't make that error' (or 'people should preview and notice the error and fix it'). Don't. When people make a mistake all the time, the error is not in the people, it's in the interface. (You can maintain otherwise, but you are trying to swim upstream against a very, very strong current.)
Sidebar: but what about accepting unrecognized tags?
Accepting and ignoring unrecognized HTML markup is a great thing for a browser, but it's almost always the wrong thing for a simple authoring environment. For the rare times that your users need to put weird new HTML tags in, have an override option. If you're worried about new HTML tags becoming common, just let people add new HTML tags to the list of known ones.
2006-01-03
In practice, there are multiple namespaces for URLs
In theory, the HTTP and URI/URL standards say that URLs are all in a single namespace, as opposed to GET, POST, etc all using different URL namespaces, where some URLs only exist for POST and some only exist for GET.
In practice, I believe that web traversal software should behave as if there were two URL namespaces on websites: one for GET and HEAD requests, and a completely independent one for POST requests. Crawling software should not issue 'cross-namespace' URL requests, because you simply can't assume that a URL that is valid in one can even be used in the other.
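As an illustrative sketch of the bookkeeping I mean (not how any real crawler is written): remember how each URL was discovered, and only fetch the ones that were actually seen in GET contexts.

    # Illustrative only: keep the two "namespaces" separate by recording
    # how each URL was discovered.
    class PoliteCrawler:
        def __init__(self):
            self.get_urls = set()    # plain links and GET form actions
            self.post_urls = set()   # URLs seen only as POST form actions

        def saw_link(self, url):
            self.get_urls.add(url)

        def saw_form(self, method, action_url):
            if method.upper() == "POST":
                self.post_urls.add(action_url)
            else:
                self.get_urls.add(action_url)

        def may_fetch(self, url):
            # Only GET a URL that actually appeared in a GET context; a URL
            # that has only ever shown up as a POST form action may not even
            # exist in the GET namespace.
            return url in self.get_urls

    crawler = PoliteCrawler()
    crawler.saw_link("http://example.org/some/page")
    crawler.saw_form("POST", "http://example.org/comment-submit")
    print(crawler.may_fetch("http://example.org/some/page"))       # True
    print(crawler.may_fetch("http://example.org/comment-submit"))  # False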
This isn't very hard for POST requests; not much software makes them, and there are lots of things that make sending useful POST requests off to URLs you've only seen in GET contexts difficult. (In theory you could try converting GET requests with parameters into POST form requests with the same parameters, but I suspect this will strike people as at least dangerous and questionable.)
Unfortunately I've seen at least one piece of software that went the other way, issuing GET requests for URLs that only appeared as the target of POST form actions. Since it tried this inside CSpace the requests went down in flames, because I'm cautious about anything involving POST (and I get grumpy when things 'rattle the doorknobs').
(The crawler in question was called SBIder, from sitesell.com, and this behavior is one reason it is now listed in our robots.txt.)