Why nofollow is useful and important

July 11, 2006

No less a person than Google's Matt Cutts recently spoke up about herding Googlebot and more or less recommended using the noindex meta tag on pages instead of nofollow on links to them (on the grounds that it's more of a sure thing to mark pages noindex than to make sure that all links to them are marked nofollow).

I must respectfully disagree with this, because in one important respect meta noindex isn't good enough. The big thing that nofollow does that meta noindex can't is make good web spiders not fetch the target page at all. Which means that you don't have to send it, and for dynamic pages, that you don't have to generate it.

(This is especially important for heavily dynamic websites that have a lot of automatically generated index pages of various sorts.)
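As a rough illustration, here is a minimal sketch (in Python, with made-up helper names and a made-up URL) of the two mechanisms as a page generator might emit them:

    # Sketch only; helper names and the URL are made up for illustration.
    def nofollow_link(url, text):
        # A well-behaved spider that respects rel="nofollow" never requests
        # this URL, so the target page never has to be generated or sent.
        return '<a href="%s" rel="nofollow">%s</a>' % (url, text)

    def noindex_head():
        # meta noindex lives in the target page itself, so the spider has
        # to fetch the page (and you have to generate it) before the
        # directive can have any effect.
        return '<meta name="robots" content="noindex">'

    print(nofollow_link("/blog/entry?writecomment", "Add a comment"))
    print(noindex_head())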

I really don't want to be burning my CPU cycles to generate pages that web spiders will just throw away again; frankly, it's annoying as well as wasteful. This is a good part of why I am so twitchy about spiders respecting nofollow.

(In fact I care more about this than about helping Google reduce redundancy in their indexes, which is one reason why WanderingThoughts has lots of nofollow but no meta noindex. Plus, getting good indexing for a blog-oid thing is much harder than just sprinkling some noindex magic over bits.)

Sidebar: why not robots.txt?

In theory, robots.txt is supposed to be the way to tell web spiders to avoid URLs entirely. However, there are two problems with it in practice. First, the format itself is inadequate for anything except blocking entire directory hierarchies. Second, it's the wrong place; the only thing that really knows whether a page should be spidered is the thing generating the page.
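To make the first problem concrete: a classic robots.txt Disallow line is nothing but a raw path prefix (the original format has no wildcards), so all it can express is 'nothing under here'. A rough sketch of the matching in Python, with made-up URLs:

    # Sketch of classic robots.txt matching: a Disallow rule is just a raw
    # path prefix, so it can only wall off whole URL hierarchies.
    def disallowed(path, rules):
        return any(path.startswith(rule) for rule in rules if rule)

    rules = ["/private/", "/tmp/"]
    print(disallowed("/private/report", rules))           # True: whole subtree blocked
    print(disallowed("/blog/entry?writecomment", rules))  # False: prefixes alone cannot
    # say "block the 'add comments' view of every page, wherever it lives"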


Comments on this page:

By DanielMartin at 2006-07-12 14:53:35:

I think that's a mischaracterization of Matt Cutts's remarks. Specifically, he says:

Bear in mind that if other pages link to a url, Googlebot may find the url through those other paths. If you can, I’d recommend using .htaccess or robots.txt (at a directory level) or meta tags (at a page level) to be safe. I’ve seen people try to sculpt Googlebot visits at the link level, and they always seem to forget and miss a few links.

So he's saying that "nofollow" is not the way to go if your goal is to keep certain pages out of Google's index. Also, the "to be safe" phrasing implies to me at least that he's advocating an additional measure, not necessarily a replacement.

And of course you failed to mention the main justification for "nofollow" - cases where the owner of the page with the link on it does not wish search engines to take note of the fact that there is a link to the target of the link. (i.e. denying comment spammers their googlejuice)

Although I can see how "nofollow" could appear to be useful for expensive dynamically generated pages, be aware that if you have pages other people can link to, someone somewhere will link to one of those pages from another site out of your control. (or they'll visit it while they have the google toolbar installed and reporting urls to google) Then, your use of "nofollow" to keep google's spider from even retrieving the page (and costing you the page generation time) does you no good. Requesting others to use "nofollow" every single time they link to certain of your pages strikes me as tilting at windmills.

The best approach, in my opinion, is to segregate dynamic pages from the rest of the site and use a robots.txt file.

By cks at 2006-07-12 16:01:19:

The best approach, in my opinion, is to segregate dynamic pages from the rest of the site and use a robots.txt file.

I don't believe this is feasible without huge pain in general. Consider WanderingThoughts, for example; I want spiders to index the actual content but I don't want them requesting Atom feeds at all, I'd like them to not fetch 'add comments' pages, and I'd rather like them not to index any of the range-based aggregated pages (because they change a lot).

(And this is a simplification of the real rules.)

Trying to corral each of those categories into an entirely separate directory hierarchy that I can wall off in robots.txt would lead to ugly URLs, because I would have to make the view of the page a top-level entity (/atom/blog/..., /range/10-20/blog/..., etc) instead of something that comes at the end.
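Roughly speaking (with made-up URL forms standing in for the real ones), the difference looks like this:

    # Sketch: which URL layout a prefix-only robots.txt rule can wall off.
    # The URL forms here are made up for illustration.
    current = [
        "/blog/2006/07/entry?atom",          # Atom feed view of an entry
        "/blog/2006/07/entry?writecomment",  # 'add comments' view
        "/blog/range/10-20/",                # range-based aggregate page
    ]
    restructured = [
        "/atom/blog/2006/07/entry",
        "/writecomment/blog/2006/07/entry",
        "/range/10-20/blog/",
    ]
    prefixes = ("/atom/", "/writecomment/", "/range/")
    for url in current + restructured:
        # Only the restructured layout gives each unwanted view a common
        # prefix that a Disallow line could block.
        print("%-35s blockable by prefix: %s" % (url, url.startswith(prefixes)))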

