Wandering Thoughts archives


Googlebot is both quite fast and very determined to crawl your pages

I recently added support to DWiki (the engine behind Wandering Thoughts) to let me more or less automatically generate 'topic' index pages, such as the one on my Prometheus entries. As you can see on that page, the presentation I'm using has links to entries and links to the index page for the days they were posted on. I'm not sure that the link to the day is particularly useful but I feel the page looks better that way, rather than just having a big list of entry titles, and this way you can see how old any particular entry is.

The first version of the code had a little bug that generated bad URLs for the targets of those day index page links. The code was only live for about two hours before I noticed and fixed it, and the topic pages didn't appear in the Atom syndication feed, just in the page sidebar (which admittedly appears on every page). Despite that short window, Googlebot crawled at least one of the topic pages and almost immediately began trying to crawl the bad day index page URLs, all of which generated 404s.

You can probably guess what happened next. Despite always getting 404s, Googlebot kept trying to crawl various of those URLs for about two weeks afterward. At this point I don't have complete logs, but in the logs that I do have it appears that Googlebot only tried to crawl each URL once; there were just a bunch of them. However, I know that its initial crawling attempts were more aggressive than the tail-off I see in the current logs, so I suspect that each URL was tried at least twice before Googlebot gave up.
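(If you want to check this sort of thing against your own logs, here is a minimal sketch of counting how often something claiming to be Googlebot got a 404 for each URL. It assumes the common Apache-style 'combined' log format, and it only matches on the User-Agent string, so it doesn't verify that the requests really came from Google. The function name and the regular expression are mine, invented for illustration; this is not code from DWiki.)

```python
import re
from collections import Counter

# Assumption: Apache/nginx "combined" log format, e.g.
#   IP - - [date] "GET /path HTTP/1.1" 404 512 "referer" "user-agent"
# Other log formats will need a different pattern.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_404_counts(lines):
    """Count 404 responses per request path for requests whose
    User-Agent mentions Googlebot. Lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue
        # User-Agent strings can be forged; this is a claim, not proof.
        if m.group("status") == "404" and "Googlebot" in m.group("agent"):
            counts[m.group("path")] += 1
    return counts
```

Passing an open log file to `googlebot_404_counts()` and looking at the result's `most_common()` shows which bad URLs were retried and how often. That is only as complete as the logs you still have.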

(I was initially going to speculate about various things that this might be a sign of, but after thinking about it more I've realized that there really is no way for me to have any good idea of what's going on. So many things could factor into Googlebot's crawling decisions, and I have no idea what is 'normal' for its behavior in general or its behavior on Wandering Thoughts specifically.)

PS: The good news is that Googlebot does appear to eventually give up on bad URLs, or at least bad URLs that have never been valid in the past. This is what you'd hope, but with Googlebot you never know.

web/GoogleCrawlingPersistence written at 23:15:31

