A thought about Technorati

March 6, 2006

Technorati famously has some problems indexing blogs. I believe that a lot of these issues may come down to something simple: Technorati's real problem is that it predates the syndication feed revolution.

Post-syndication-revolution blog search engines (like Google Blogsearch, Feedster, IceRocket, and Bloglines) are actually syndication feed search engines. They operate by finding feeds and mining them for the entries (or at least for the URLs of the entries, if your feed has only partial text).
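To give a feel for how little work mining a feed takes, here is a minimal sketch in Python. It assumes an Atom 1.0 feed (not the Atom 0.3 that some blogs still serve) and uses only the standard library; the function name and the sample feed are made up for illustration, and I have no idea what any real search engine actually runs.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 namespace; an Atom 0.3 or RSS feed would need different names.
ATOM = "{http://www.w3.org/2005/Atom}"

def entries_from_atom(feed_xml):
    """Return (title, link) pairs for each entry in an Atom 1.0 feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = ""
        for candidate in entry.findall(ATOM + "link"):
            # A link with no rel attribute counts as rel="alternate".
            if candidate.get("rel", "alternate") == "alternate":
                link = candidate.get("href", "")
                break
        results.append((title.strip(), link))
    return results

# Hypothetical sample feed, invented for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>First post</title>
    <link href="http://example.org/blog/first"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link rel="alternate" href="http://example.org/blog/second"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    for title, link in entries_from_atom(SAMPLE_FEED):
        print(title, link)
```

The feed tells you outright where each entry starts, what it is called, and where it lives; there is nothing to guess.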

But before the syndication revolution there were no widespread syndication feeds to mine. Instead, you had to spider the blog web pages themselves and then reverse engineer the HTML to try to extract the blog entries.
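For contrast, here is a rough sketch of what the reverse engineering approach might look like. The heuristic is entirely my own invention for illustration (treat the first link inside any h2 or h3 heading as an entry permalink); it is not a description of what Technorati or anyone else does, and the sample page is made up.

```python
from html.parser import HTMLParser

class EntryGuesser(HTMLParser):
    """Guess blog entries: the first link inside each <h2> or <h3> heading."""

    HEADINGS = {"h2", "h3"}

    def __init__(self):
        super().__init__()
        self.entries = []         # (title, url) pairs found so far
        self._in_heading = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._href = None
            self._text = []
        elif tag == "a" and self._in_heading and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            title = "".join(self._text).strip()
            # Headings without a link are assumed not to be entries.
            if self._href and title:
                self.entries.append((title, self._href))

def guess_entries(html):
    """Return the (title, url) pairs the heuristic thinks are entries."""
    parser = EntryGuesser()
    parser.feed(html)
    parser.close()
    return parser.entries

# Hypothetical page, invented for this example.
SAMPLE_PAGE = """<html><body>
<h1>Example blog</h1>
<div class="entry"><h2><a href="/blog/first">First post</a></h2><p>Text.</p></div>
<div class="entry"><h2><a href="/blog/second">Second post</a></h2><p>Text.</p></div>
<div class="sidebar"><h3><a href="/links">Blogroll</a></h3></div>
</body></html>"""

if __name__ == "__main__":
    for title, url in guess_entries(SAMPLE_PAGE):
        print(title, url)
```

On this sample the heuristic finds both posts but also reports the sidebar's Blogroll heading as an entry, which is exactly the sort of mistake that needs per-site tweaking to avoid.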

This seems to be how Technorati operates; we can see a hint of this in their publishers help page, where they ask people to add special markup that will give their parser more clues. And as Chris Linfoot has noticed, they don't seem to pull feeds very much.

Technorati can hard-code handling for common blogging sites and blog packages, but in general this sort of heuristic reverse engineering is a hard task that needs continued tweaking. It's not surprising that it's prone to problems; perhaps it's more surprising that it works as well as it does without more help from bloggers.

Fundamentally I think that being a pre-syndication-era blog search engine is a significant handicap for Technorati. Syndication feeds are simply at least an order of magnitude easier to parse and work with than raw HTML pages; until Technorati does as much as possible with syndication feeds, and only falls back to parsing raw pages as a last resort, it's going to be working harder than competitors like IceRocket for fewer results and more problems.

Honesty compels me to admit that there's a certain amount of sour grapes in this, because Technorati is barely indexing WanderingThoughts at all. (Yes, I ping them like clockwork when I post. Yes, my HTML and my feed validate (or at least the feed validated back when the feed validator would still validate Atom 0.3 feeds).)

Written on 06 March 2006.

Last modified: Mon Mar 6 02:52:15 2006