2007-03-29
Usability issues with blog URLs
It's surprising how many little things about web usability you discover when you write your own version of things like blog software; the whole experience can be very educational. Unfortunately, I've mostly discovered these issues by stubbing my toes on them.
For instance, consider the usability issues of weblog URLs. Take what seemed like an obvious initial idea: having an entry's category appear in its URL. It turns out that this is a really bad idea, and I can boil the reason why down to this:
Never put anything in an URL that you might want to change.
This is because cool URLs don't change. If an entry's category determines its URL, you can't later decide that the entry belongs in a different category and move it without a bunch of annoyance (if at all). This bit me once here, when I decided I really did need a unix category for entries about generic Unix stuff; before then I put such entries in sysadmin, where many of them still are.
(DWiki makes this worse because it's more or less forced to use an entry's URL as its Atom identifier, which has to remain absolutely stable or I spam people's syndication feeds with duplicate entries.)
You can argue that an entry's title can also change, and while that's true the logical conclusion of that thought is something like LiveJournal entry URLs, which are more or less meaningless digit strings. Putting some version of the title in the URL gives people (and sometimes search engines) clues about what they'll find at the URL, and in turn probably entices them to visit.
Another thing I've discovered is how nice it is to put dates in entry URLs. Entry dates are important because most blog content gets stale sooner or later; putting an entry's date in its URL gives people an immediate clue about how current the entry is, even before they visit the page.
(A blog entry's date is also unlikely to change.)
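To make that concrete, here's a minimal sketch of the sort of URL scheme this argues for, a date plus a slug derived from the title and no category at all. This isn't DWiki's actual code, just an illustration of the general idea:

```python
import datetime
import re

def entry_url(title, date):
    # Reduce the title to a URL-safe slug; the date supplies the stable,
    # immediately informative part of the path.
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
    return "/blog/%s/%s" % (date.strftime("%Y/%m/%d"), slug)

# e.g. '/blog/2007/03/29/usability-issues-with-blog-urls'
print(entry_url("Usability issues with blog URLs",
                datetime.date(2007, 3, 29)))
```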
2007-03-22
An irony of conditional GET for dynamic websites
Ironically, dynamic websites are the least served by HTTP conditional GET, because by the time they've worked out whether or not a page's content is the same as what you've already got, they may well have generated the entire content. As a result, the only thing conditional GET really gets many dynamic websites is bandwidth savings, which may go some way to explaining why many dynamic frameworks don't have very good support for it.
(The same is true for HTTP HEAD requests, generally even more so. Fortunately they're quite rare.)
I think that what really hurts here is templating languages. If you know the content structure going into page generation, you can have basic elements with precomputed ETag hash values and so on, but flexible templates mean that you don't know the content structure until you evaluate the template to some degree. And the more power the templating language has, the more evaluation you need to do to know the content structure.
(The easiest implementation of ETag is to get the ETag value by hashing the page's more or less full text, which means that you have to generate most of the page's actual text before you can compute the value to check against the conditional GET's If-None-Match. Static file webservers get around this partly by making the ETag value out of easily gotten things like the inode stat() information.)
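To illustrate the difference between the two approaches, here are rough sketches of each sort of ETag computation; these are illustrations of the general idea, not DWiki's or any real web server's actual code:

```python
import hashlib
import os

def etag_from_content(page_text):
    # The dynamic-site case: you only know the ETag after you've rendered
    # (most of) the page, which is exactly the problem described above.
    return '"%s"' % hashlib.md5(page_text.encode("utf-8")).hexdigest()

def etag_from_stat(path):
    # The static-file case: cheap stat() metadata stands in for the content,
    # in the style of inode/size/mtime ETags.
    st = os.stat(path)
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, int(st.st_mtime))
```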
It's still useful to generate the HTTP headers necessary for conditional GET in your dynamic framework, since it preserves your ability to someday slap in a reverse cache in front of your actual dynamic bits. And once you're generating the headers, you might as well do actual conditional GET too; if nothing else you're saving your users some bandwidth and thus making your site snappier.
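As a sketch of what that looks like in practice (a made-up WSGI-style handler, not DWiki's code, and one that ignores details like multiple values in If-None-Match), the conditional GET handling itself is only a few lines once you have an ETag value:

```python
def respond(environ, start_response, page_text, etag):
    if environ.get("HTTP_IF_NONE_MATCH") == etag:
        # Conditional GET hit: we already did the rendering work, but
        # at least we don't have to ship the body again.
        start_response("304 Not Modified", [("ETag", etag)])
        return [b""]
    # Always advertise the validator so that clients (or a future
    # reverse cache in front of us) can revalidate cheaply.
    start_response("200 OK",
                   [("Content-Type", "text/html; charset=utf-8"),
                    ("ETag", etag)])
    return [page_text.encode("utf-8")]
```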
2007-03-13
What I currently do to stop comment spam on WanderingThoughts
WanderingThoughts has been pretty free of successful comment spam attempts for a while, so I think it's about time to write up all of the various things I'm currently doing to stop comment spammers.
(I'm not worried about comment spammers reading this and working past my precautions, because I'm confident that comment spammers don't bother reading the blogs they spam.)
First off, I get a big leg up by being neither popular nor using common software. This basically reduces the comment spammers down to people automatically filling in any form that moves and people spamming completely by hand. Since I can never stop the latter sort of spammer, I only worry about the former sort.
My current precautions (a rough sketch of how several of these checks might look in code follows the list):
- I refuse comments entirely from web browsers that don't send a User-Agent: header or that send a User-Agent header that includes the string 'User-Agent:'. Technically I consider them robots, which are blocked from retrieving a variety of URLs, including the 'Add Comment' pages.
- the initial Add Comment page doesn't have a 'Post Comment' option in the comment form; you have to preview your comment before it shows up. I think that this is the right behavior to encourage in general, especially since I use nonstandard comment formatting.
(I got this idea from Sam Ruby, although my implementation is simpler than his.)
- to prevent spammers from fetching my comment form from one IP address and submitting from another, the comment form stamps itself with the IP address that you previewed from (more or less) and you can only post the comment from an IP address in the same /24.
(I could have required 'from the same IP address', but I decided that that was too dangerous in the face of proxies and the like.)
- to deal with spam software that fills in every text field it can find, there's an invisible honeypot field that is supposed to always be blank; if there's any value in the field, the comment won't post. For people with lynx and other browsers that don't deal with CSS, there's text next to the field that tells you to leave it blank.
(I got this idea from Ned Batchelder.)
- I refuse comment posts from IP addresses that are in the SBL or the XBL. At the moment I don't bother checking the SBL and the XBL for comment previews, mostly because I want to delay as few things as possible with DNS lookups.
- comments with control characters are refused; this is part anti-spam precaution and part something required in general. (DWiki doesn't allow control characters in actual pages either.)
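Pulled together, a rough sketch of how several of these checks might fit into one validation pass looks like the following. This isn't DWiki's actual code, and details like the form field names and the DNSBL zones queried are purely illustrative:

```python
import socket

BLOCKLISTS = ("sbl.spamhaus.org", "xbl.spamhaus.org")  # illustrative zones

def same_slash24(ip1, ip2):
    # The form is stamped with the previewing IP; posting is allowed from
    # anywhere in the same /24, not just the exact same address.
    return ip1.split(".")[:3] == ip2.split(".")[:3]

def listed_in_dnsbl(ip):
    # DNSBLs are queried by reversing the IP's octets and looking the
    # result up inside the blocklist zone; any answer means 'listed'.
    rev = ".".join(reversed(ip.split(".")))
    for zone in BLOCKLISTS:
        try:
            socket.gethostbyname("%s.%s" % (rev, zone))
            return True
        except socket.error:
            pass
    return False

def reject_comment(form, headers, post_ip):
    # Returns a reason string if the comment should be refused, else None.
    ua = headers.get("User-Agent", "")
    if not ua or "User-Agent:" in ua:
        return "bad or missing User-Agent"
    if form.get("honeypot", ""):
        return "honeypot field was filled in"
    if not same_slash24(form.get("preview-ip", ""), post_ip):
        return "posted from a different /24 than the preview"
    if listed_in_dnsbl(post_ip):
        return "posting IP is in a DNSBL"
    if any(ord(c) < 32 and c not in "\n\r\t"
           for c in form.get("comment", "")):
        return "comment contains control characters"
    return None
```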
Technically I also have a content blacklist, but it is quite out of date and thus pointless. I keep it around mostly to have the hooks in the rest of the code.
DWiki is deliberately written so that it has no general way to write files or otherwise record data locally. This means that I can't take various sorts of precautions that require storing local state, like rate-limiting IP addresses or blocking IP addresses that exhibit characteristic bad behaviors.
(Technically I could write code that assumes that caching is turned on and hijack it for various evil purposes, but I'm not going to go there. Plus, there are concurrency issues that the simple caching layer currently gets to ignore.)