A realization: on the modern web, everything gets visited

December 20, 2013

Once upon a time, a long time ago, you could have public web apps that exposed a few quite slow, heavyweight operations and expect to get away with it because users would only use them very occasionally. These might be things like specialized syndication feeds or looking up all resources with a particular label (tag, category, etc). You wouldn't want to be serving those URLs very often, but once in a while was okay and it wasn't worth the complexity of making even the stuff in the corner go fast.

Then the web spiders arrived. These days I automatically assume that any visible, linked-to URL will get found and crawled by spiders. It doesn't matter if I mark every link to it nofollow and annotate it with a content-type that should be a red flag of 'hands off, nothing interesting to you here'; at least some spiders will show up anyways. The result of this is that even things in the corner need to be fast because while humans may not use them very often, the spiders will. And there are a lot of spiders and a lot of spider traffic these days (I remember seeing a recent estimate that over half of web traffic was from spiders).
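
(As an illustration only, and not this site's actual code, here is a rough Python/WSGI sketch of the sort of 'hands off' signals I mean: a link marked rel="nofollow" plus a deliberately uninteresting content-type. All the names here are hypothetical; the point is just that spiders will fetch such URLs anyway.)

    # Illustrative sketch only; all names are hypothetical.
    def nofollow_link(url, text):
        # rel="nofollow" asks well-behaved spiders not to follow the link.
        return '<a href="%s" rel="nofollow">%s</a>' % (url, text)

    def slow_corner_app(environ, start_response):
        # A WSGI app for a slow corner URL.  Serving an Atom content-type
        # (instead of text/html) hints that the result is a machine feed,
        # not an ordinary page worth crawling.
        body = '<feed xmlns="http://www.w3.org/2005/Atom"></feed>'
        start_response('200 OK',
                       [('Content-Type', 'application/atom+xml')])
        return [body.encode('utf-8')]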

(Spiders probably won't visit your really slow corners any more than the rest of your site. But unlike humans they won't necessarily visit them any less. URLs are URLs. And if your slow corners are useful indexes to your content, spiders may actually visit them more. I certainly wouldn't be surprised to find out that modern web crawlers keep track of which pages provide the highest number of new links or links to changed content on an ongoing basis.)

One more or less corollary of this is that you (or at least I) probably want to plan for new URLs (ie, new features) to be efficient from the start. In the old days you had some degree of ramp-up time, where you could deploy an initial slow version, see it get used a bit, tweak it, and so on; these days, well, the spiders are going to be arriving pretty soon.

(I have very direct experience that it doesn't matter how obscure or limited your links are; if links exist in public pages, spiders will find them and begin crawling through them. And one single link to an island of content is enough to start an avalanche of crawling.)

PS: all of this only applies to public web apps and URLs, and so far only to GET URLs that are exposed through links in HTML or other content. Major spiders do not yet stuff random things into GET-based forms and submit them to see what happens.
