Wandering Thoughts archives

2006-03-22

Google Desktop and conditional GET (part 2)

When I start grumpily thinking of ways to punish a program's bad behavior, that's about when it sinks in that I don't like it. Which means that Google Desktop has a problem.

Back in December I wrote about Google Desktop and conditional GET and concluded that I didn't have enough evidence to really know what the heck was going on. Well, I have several months more data and more people using Google Desktop against me, and I've reached a conclusion: Google Desktop doesn't do conditional GET.

Most of the time it doesn't even bother trying, never sending If-Modified-Since: and If-None-Match: headers despite making repeated requests for the same Atom feed. A few clients sent the headers, but never updated them when DWiki returned updated information. One client has been sending an I-M-S header of 'Sun, 19 Feb 2006 10:44:42 GMT' for the past month, all the while getting updates; interestingly, it did change the If-None-Match: header, but now hasn't updated it since February 25th.
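
(For contrast, here is a minimal sketch of what a well-behaved feed fetcher has to do: remember the Last-Modified and ETag values from its previous fetch and echo them back as If-Modified-Since: and If-None-Match:. It's written against the modern Python standard library, and the cache dictionary and feed URL handling are invented for illustration; a real client would persist the cache between polls.)

  import urllib.request
  import urllib.error

  def fetch_feed(url, cache):
      # Echo back the validators saved from the previous fetch, if any.
      headers = {}
      if 'last-modified' in cache:
          headers['If-Modified-Since'] = cache['last-modified']
      if 'etag' in cache:
          headers['If-None-Match'] = cache['etag']
      req = urllib.request.Request(url, headers=headers)
      try:
          resp = urllib.request.urlopen(req)
      except urllib.error.HTTPError as err:
          if err.code == 304:
              return None       # unchanged; nothing to download again
          raise
      # Save this response's validators for the next poll.
      if resp.headers.get('Last-Modified'):
          cache['last-modified'] = resp.headers['Last-Modified']
      if resp.headers.get('ETag'):
          cache['etag'] = resp.headers['ETag']
      return resp.read()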

Here's a table about the requests since February 21st:

Total requests | 304 result | 200 result | Even tried
          3180 |          5 |       3175 |        715

(The 'even tried' category is any request that had an I-M-S or I-N-M, no matter how crazy.)

These are not casual visitors; 19 different IP addresses requested at least one feed ten times or more. The most prolific one asked for one feed 597 times, and never included an I-M-S or I-N-M header.

Not supporting conditional GET when fetching syndication feeds is bad. Not supporting it in something you expect to be widely deployed is really bad. I really had hoped for more from Google.

(As always, the disclaimer: I like having readers and we have plenty of bandwidth. This is absolutely not a request for people using Google Desktop to stop reading WanderingThoughts.)

GoogleDesktopAndCondGetII written at 02:55:31; Add Comment

2006-03-17

The problem with LiveJournal

The problem with LiveJournal is that you can't stop partway through reading your friendslist; unless you have a better memory than me, once you start reading it you need to read all the way back to where you last stopped. The result is somewhat like Space Invaders; a stream of entries comes at you and you have to read them all or die. In turn, this makes reading LiveJournal friendslists not a casual activity; if I can't commit enough time to read all of the new entries that have built up, I have to stay away.

This isn't just LiveJournal's problem; it's the problem with all blogs. LiveJournal has it much worse because the LiveJournal friendslist is an aggregator, so you get lots of volume in one place.

Blogs have to use reverse chronological order mostly because the web is effectively stateless (technically you can use different URLs for different states, but very few visitors will change the URL they use to get to you). With a single URL and without state, you have to land visitors at some arbitrary point in a stream of entries; any of your choices are going to be crappy for someone.

(Blogs aren't alone in having this problem; consider webcomics, where not only may the latest comic not make sense without the previous one, it can even be a serious spoiler.)

One of the big wins of syndication readers is that they do have state, so they can keep track of unread things for me. This makes it possible to dip into a feed, read five or ten entries, and then stop; as a result I am far more up to date with Planet Debian than I am with my LiveJournal reading, despite being much more interested in the latter.
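
(The state involved is tiny. A rough sketch of the idea, with the entry structure and storage details invented for illustration: keep a set of already-seen entry identifiers per feed, and anything not in the set is unread.)

  def unread_entries(entries, seen_ids):
      # Each entry needs some stable identifier; Atom's <id> is ideal,
      # with the entry's link as a fallback.
      new = [e for e in entries if e.get('id', e.get('link')) not in seen_ids]
      seen_ids.update(e.get('id', e.get('link')) for e in new)
      return new

  # First visit: everything is unread.
  seen = set()
  unread_entries([{'id': 'a'}, {'id': 'b'}], seen)   # both entries come back
  # A later dip into the feed only shows what arrived since then.
  unread_entries([{'id': 'b'}, {'id': 'c'}], seen)   # just the 'c' entry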

Unfortunately, LiveJournal does not offer friendslists in syndication form. And I suppose that is the real problem with LiveJournal.

(Obligatory attribution darnit: the Space Invaders analogy is due to whoever called nn the Space Invaders of Usenet newsreaders back in the days of yore; at the time it had similar issues.)

LiveJournalProblem written at 02:22:47; Add Comment

2006-03-11

Web design trends that I don't understand (part 1)

Every so often, I run into web design trends that I simply don't understand. Since I've already grumbled about small font sizes, today's contestant is the mysterious case of the vanishing links.

Technically they haven't vanished. Instead they've carefully been given a colour that is almost exactly the same as the regular text colour. Sometimes even exactly the same as the regular text colour. Ironically, in this situation the 'click here' style of link text actually becomes good, because it gives me valuable clues that there's something there to click.

I really don't get this sort of design. If the links are unimportant, why not just omit them? If you think that having any link colours in your nice text is ugly, why not just put all links in a 'resources' section at the bottom? Are there designers who actually think that 'color: #336699' is easily noticed when in the middle of 'color: #333'? (And that colour set is one of the more moderate examples. I've seen far worse.)

If you don't want me to use the links, why are you putting them in? If you do want me to use your links, why are you hiding them?

Someone please explain this. I'm lost.

StrangeWebDesignI written at 02:55:13; Add Comment

2006-03-09

Making things simple for busy webmasters

It's always nice when people's software saves me from having to wonder if they're up to no good by handing out obvious signs of it. Take, for example, the spate of people whose web crawling software advertises itself by sending a User-Agent header whose entire value is:

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Evidently no one told them not to stutter. (There are a couple of variations in what they claim to be, but that one is the most common. Needless to say, no real User-Agent string (MSIE's included) has an extra 'User-Agent: ' on the front.)

The IP addresses that sourced these are scattered all over; a couple of them are (still) on the XBL, and a couple are in SPEWS.

(And I give bonus points to the person with the User-Agent string "W3C standards are important. Stop fucking obsessing over user-agent already.", which I stumbled over while scanning our logs today. I can certainly agree with the sentiment.)

Another good one is the stealth spider that sends a completely blank Referer: header, instead of omitting it; it stands out like a sore thumb in my log scans. This comes from all over, with 157 different IP addresses over the past 28 days or so, 50 of them currently listed in the XBL.
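
(If you want to scan for these two telltales yourself, here's a rough sketch of the idea against an Apache-style combined log. The regular expression, field layout, and 'access_log' path are assumptions about a typical setup, not our actual log scanner.)

  import re

  # combined log: host ident user [time] "request" status size "referer" "agent"
  logline = re.compile(
      r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "([^"]*)" "([^"]*)"')

  def suspicious(line):
      m = logline.match(line)
      if not m:
          return None
      host, referer, agent = m.groups()
      if agent.startswith('User-Agent:'):
          return (host, 'stuttering User-Agent')
      # An omitted Referer is logged as '-'; an empty string means the
      # header was sent but completely blank.
      if referer == '':
          return (host, 'blank Referer')
      return None

  with open('access_log') as fp:
      for line in fp:
          hit = suspicious(line)
          if hit:
              print(hit[0], hit[1])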

ObviousNogoodniks written at 16:42:15; Add Comment

2006-03-06

A thought about Technorati

Technorati famously has some problems indexing blogs. I believe that a lot of these issues may come down to something simple: Technorati's real problem is that it predates the syndication feed revolution.

Post-syndication-revolution blog search engines (like Google Blogsearch, Feedster, IceRocket, and Bloglines) are actually syndication feed search engines. They operate by finding feeds and mining them for the entries (or at least URLs to the entries, if your feed only has partial text).

But before the syndication revolution there were no widespread syndication feeds to mine. Instead, you had to spider the blog web pages themselves and then reverse engineer the HTML to try to extract the blog entries.

This seems to be how Technorati operates; we can see a hint of this in their publishers help page, where they ask people to add special markup that will give their parser more clues. And as Chris Linfoot has noticed, they don't seem to pull feeds very much.

Technorati can hard-code handling for common blogging sites and blog packages, but in general this sort of heuristic reverse engineering is a hard task that needs continued tweaking. It's not surprising that it's prone to problems; perhaps it's more surprising that it works as well as it does without more help from bloggers.

Fundamentally I think that being a pre-syndication-era blog search engine is a significant handicap for Technorati. Syndication feeds are simply at least an order of magnitude easier to parse and work with than raw HTML pages; until Technorati does as much as possible with syndication feeds, and only falls back to parsing raw pages as a last resort, it's going to be working harder than competitors like IceRocket for fewer results and more problems.
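
(To make the contrast concrete, here's a minimal sketch of the feed-mining side, using the third-party feedparser module; the caller supplies the feed URL. There is no equally small sketch for the HTML-scraping side, which is rather the point.)

  import feedparser   # third-party module

  def index_feed(url):
      parsed = feedparser.parse(url)
      # Every entry arrives already labelled with its title, link, and
      # (possibly partial) text; there is no per-site HTML template to
      # reverse engineer and no special markup needed from the blogger.
      return [(e.get('title'), e.get('link'), e.get('summary'))
              for e in parsed.entries]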

Honesty compels me to admit that there's a certain amount of sour grapes in this, because Technorati is barely indexing WanderingThoughts at all. (Yes, I ping them like clockwork when I post. Yes, my HTML and my feed validate (or at least the feed validated back when the feed validator would still validate Atom 0.3 feeds).)

TechnoratiProblem written at 02:52:15; Add Comment

2006-03-02

A robots.txt surprise

Because I don't really like banning MSNBot, MSN Search's web spider, I decided to drop our ban and see if its behavior had improved since last September. The process of doing this has led me to a little surprise about how MSNBot, at least, matches User-Agent lines in robots.txt.

From looking at our logs, I already knew that MSNBot was still visiting; it pulled robots.txt at least once a day. So all I needed to do was change robots.txt so that it wouldn't be banned.

Since I wanted to note down when I removed the ban, I just added a suffix to the User-Agent string, changing from banning 'msnbot' to banning 'msnbot-reenabled-2006-02-14'. To my surprise, nothing happened, so I changed it again, putting 'X-20060222-' on the front. Still nothing happened.

Finally, yesterday evening I changed 'msnbot' to 'mXsXnbXot'. Within 12 hours, MSNBot had started crawling pages here.

The MSNBot web page is rather non-specific about how MSNBot decides whether or not it's excluded; all of their examples certainly use just 'msnbot' as the User-Agent string. A prefix match made sense to me, since it doesn't hose people who put things like 'msnbot/1.0' in their robots.txt, but the rest was surprising.

It turns out that this is actually recommended behavior; the Standard for Robot Exclusion web page says:

The robot should be liberal in interpreting [the User-Agent] field. A case insensitive substring match of the name without version information is recommended.

I don't know how many robots follow this, but MSNBot evidently does. Good for them.
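
(In other words, matching behaves something like the following; this is my reconstruction from the observed behaviour, not MSNBot's actual code.)

  def useragent_applies(robot_name, robots_txt_value):
      # 'Liberal' matching: a case insensitive substring match of the
      # robot's name against the User-Agent value from robots.txt.
      return robot_name.lower() in robots_txt_value.lower()

  # Both of my renamed lines still contained 'msnbot', so they still matched:
  useragent_applies('msnbot', 'msnbot-reenabled-2006-02-14')             # True
  useragent_applies('msnbot', 'X-20060222-msnbot-reenabled-2006-02-14')  # True
  # Only mangling the name itself made the ban stop applying:
  useragent_applies('msnbot', 'mXsXnbXot')                               # False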

RobotsTxtSurprise written at 16:20:13; Add Comment

