2005-10-29
The problem with If-Modified-Since as a timestamp
The If-Modified-Since header in HTTP requests is used as one way of
doing 'conditional GET' requests, where the web server can give you a
nice bandwidth-saving '304' response if the URL hasn't changed since
the version you already have. In theory the header has an arbitrary
timestamp and the server will 304 the request unless the page has
changed since that time.
In practice, the HTTP RFC strongly recommends that clients treat the
value as a magic cookie, just repeating the value the server last told
them. Unfortunately, not everyone does this. This is bad, because it
is very difficult for the server to use If-Modified-Since as a real
timestamp.
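As a concrete illustration of the 'magic cookie' approach, here is a minimal Python sketch of a well-behaved client (purely illustrative, not taken from any real feed reader): it stores the server's ETag and Last-Modified values as opaque strings and echoes them back verbatim, instead of inventing its own timestamp.

    import urllib.request, urllib.error

    def conditional_fetch(url, saved_etag=None, saved_last_modified=None):
        # Echo the server's own validator values back verbatim; never
        # synthesize an If-Modified-Since timestamp of your own.
        headers = {}
        if saved_etag:
            headers["If-None-Match"] = saved_etag
        if saved_last_modified:
            headers["If-Modified-Since"] = saved_last_modified
        req = urllib.request.Request(url, headers=headers)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Unchanged; keep using the saved copy and saved validators.
                return None, saved_etag, saved_last_modified
            raise
        body = resp.read()
        # Store these header values as-is for the next request.
        return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")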
To accept If-Modified-Since as a timestamp, your last modified time
has to go forward any time a change is made. Or, to put it another
way, you have to guarantee that your last modified times will always
go forward. This sounds nice in theory but is extraordinarily
difficult in practice, even for static web pages.
For example, here's a bunch of things you have to consider for static pages on Apache on Unix:
- did I rename the old 'page.bak' to 'page.html'?
- did I rename a higher-level directory to flip to another version of an entire section?
- did I change rules in a .htaccess file that affects the page?
- did I modify the web server's configuration files, ditto?
(Even if Apache made all these checks, it still couldn't guarantee that I hadn't just had to flip a backup server with a not yet fully up to date copy of the website into production when my main server exploded.)
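To make the rename case concrete, here is a small Python sketch (purely illustrative; the filenames are made up) of how flipping an old backup file into place moves a page's modification time backwards even though its contents just changed:

    import os, time, email.utils

    # Create an 'old' backup file and backdate its modification time a day.
    with open("page.bak", "w") as f:
        f.write("old version\n")
    old = time.time() - 86400
    os.utime("page.bak", (old, old))

    # Create the current page, edited just now.
    with open("page.html", "w") as f:
        f.write("new version\n")

    # Flip back to the old version; the rename keeps the backup's old mtime.
    os.rename("page.bak", "page.html")

    # The page's contents just changed, but its Last-Modified went backwards,
    # so a client holding an If-Modified-Since from the newer version could
    # get an erroneous 304.
    print(email.utils.formatdate(os.stat("page.html").st_mtime, usegmt=True))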
Dynamic web pages have it worse, in part for the reasons in Pitfalls in generating Last-Modified; they have more moving pieces and the moving pieces don't always come with change times attached.
Doing this right requires the application generating dynamic pages to find out any time anything affecting those pages changes. There are 'total control' applications like that out there, but not very many, as such total control has significant costs (for a start, everything has to go through the application).
DWiki is honest about not being able to make its own timestamps always
run forward, and thus requires strict If-Modified-Since matching as
a minimum. To do otherwise would run the risk of erroneous 304s,
which in my opinion are much more serious than some extra bandwidth
used.
(People with more constrained bandwidth may feel otherwise. But in
that case they should fix their clients to use ETag.)
2005-10-24
On banning web search engines
There are only a few popular search engines at any time. They are doing you a favour by indexing you, because you need their users far more than they need your content. You are doing everyone else a favour by letting them index you, since they need content (yours included) to attract users.
When people are doing you a favour, it's worthwhile to go out of your way to help them out.
When you are doing search engines the favour, it's their job to go out of their way to not be a burden, not yours. Given that we're already doing unpopular search engines a favour to start with, my patience for doing any more work to accommodate their special needs is about nil.
I still have a spirit of neighborliness and thus I'm inclined to let web spiders rove freely. But I'm also a pragmatist; when a web spider starts making us notice it, I want to know what we're getting out of it. Web spiders that want to keep crawling here had better have a decent answer.
(It is surprising how few web spiders have web pages that try to explain why I would want to let them crawl us.)
2005-10-21
How ETags and If-Modified-Since headers interact
Part of the fun of writing programs that deal with HTTP is decoding things like RFC 2616 to answer somewhat obscure questions about how various things interact. Today's case is the following question:
When can your web server generate a 304 'content not modified' response if it receives a request with both an
If-None-Match and an If-Modified-Since header?
If-None-Match and If-Modified-Since are HTTP request headers used
to implement 'conditional GET', a bandwidth saving technique that
avoids re-fetching unchanged pages (see
here
or here for
more discussion of this).
(ETag headers come into this because the server's ETag value is
what the client will use as its If-None-Match value in the
conditional GET request.)
The answer turns out to be in section 13.3.4 of RFC 2616. It is (de-RFC-ized):
You can only generate a 304 response if both headers match; the
If-None-Match matches the response's ETag and the If-Modified-Since header matches the Last-Modified.
In the case of If-Modified-Since and Last-Modified, servers may
require an exact match instead of merely Last-Modified being no later
than If-Modified-Since. As RFC 2616 notes in
14.25,
client authors should really just store the Last-Modified result as a
string and hork it up in their If-Modified-Since header.
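For what it's worth, here is my reading of the section 13.3.4 rule as a rough Python sketch (this is not DWiki's actual code, and it ignores wrinkles like 'If-None-Match: *' and comma-separated ETag lists); note the exact string comparison on If-Modified-Since:

    def can_send_304(request_headers, etag, last_modified):
        # Per RFC 2616 section 13.3.4: if the client sent both validators,
        # only send a 304 when *both* of them match.
        inm = request_headers.get("If-None-Match")
        ims = request_headers.get("If-Modified-Since")
        if inm is None and ims is None:
            return False                  # not a conditional request at all
        if inm is not None and inm != etag:
            return False                  # the ETag doesn't match
        if ims is not None and ims != last_modified:
            return False                  # strict, exact-string date match
        return True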
This came up when I threw debugging code into DWiki to see exactly what various people repeatedly pulling my Atom feed without getting bandwidth-efficient 304 responses were sending. One feed reader was sending both headers but making up their own If-Modified-Since instead of just repeating Last-Modified. (DWiki requires an exact match for technical reasons.)
(Whether by accident or by reading RFC 2616 carefully when I wrote the code and then forgetting it, DWiki does the right thing when both headers are present.)
MSNbot (still) has problems with binary files
Dating back to our first experiences with msnbot, the MSN Search web crawler, I've known that it was kind of crazy about repeatedly fetching large binary files. Since then, we have pointed this issue out to MSN Search people more than once and switched to using accurate Content-Types. Recently we had a week of MSNbot not refetching those large binaries, so it looked like MSNbot had finally been fixed.
So much for that. Since 7pm Wednesday night, MSNbot has fetched 3.1 gigabytes of various large, unchanging 'application/<definitely not text>' files from us. Highlights of the experience include MSNbot fetching the same 537 megabyte ISO image six times (once less than twenty minutes after the previous fetch).
It is clear that MSNbot simply does not deal correctly with binary files, things served with various 'application/<whatever>' content types. There are a few application/* content types that are appropriate to index (PDFs, for example), but for us MSNbot definitely goes far beyond that.
From things I've heard, it would not surprise me if MSNbot ignores the
content-type and just relies on a hard-coded list of URL extensions to
not crawl. (Presumably things like .exe and .zip are in there.)
This is completely brain-damaged, since extensions on URLs don't
necessarily have anything to do with their content-type. For example,
you can search high and low without finding a .html extension anywhere in DWiki.
(Yes, some web servers use the file extension as part of the process
to decide on what Content-Type: header to generate. This is an
internal implementation detail.)
I doubt we're the only site experiencing this issue. If you have large binary files on your site, I strongly urge you to check your server logs for similar behavior.
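If you want to do a quick check of your own logs, something like this Python sketch over an Apache 'combined' format access log will do; the log location, the crawler name, and the size threshold are all assumptions to adjust for your own setup:

    import re
    from collections import Counter

    LOG = "/var/log/apache2/access_log"   # adjust for your server
    CRAWLER = "msnbot"                    # substring to look for in the user-agent
    THRESHOLD = 10 * 1024 * 1024          # only count responses over 10 MB

    # combined format: ... "GET /url HTTP/1.0" status bytes "referer" "agent"
    line_re = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" \d+ (?P<size>\d+) '
                         r'"[^"]*" "(?P<agent>[^"]*)"')

    fetches = Counter()
    total_bytes = 0
    with open(LOG) as log:
        for line in log:
            m = line_re.search(line)
            if not m or CRAWLER not in m.group("agent").lower():
                continue
            size = int(m.group("size"))
            if size >= THRESHOLD:
                fetches[m.group("url")] += 1
                total_bytes += size

    for url, count in fetches.most_common(10):
        print(count, url)
    print("total bytes in large crawler fetches:", total_bytes)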
2005-10-19
Thoughts on Jakob Nielsen on weblog usability
I cannot resist pointing to Jakob Nielsen's latest column, Weblog Usability: The Top Ten Design Mistakes. If you're interested in usability issues, and anyone interested in making effective web sites had better be, his columns are usually worthwhile. (One of my embarrassments is that I haven't been reading him much lately; I used to care about this quite a lot.)
I think that Nielsen's weblog usability issues, like a fair amount of his writing, are aimed at what I would call 'commercial' weblogging; weblogging done more or less explicitly with selling yourself in mind. People blogging for other reasons should apply a certain amount of salt. For example, I know quite a few bloggers who consider it a feature that their weblog doesn't have a detailed author bio or any sort of author photo.
(For another perspective on this, see the recent case of prominent academic political science blogger Daniel Drezner failing to get tenure at the University of Chicago, such as this New York Sun article. At least one contributor to the Volokh Conspiracy legal blog explicitly blogs pseudonymously until he gets tenure, and I'm sure there are bloggers who have adopted online pseudonyms that are just less obvious.)
Reading Nielsen's list of ten issues makes me a little bit
rueful. WanderingThoughts is certainly failing on several that I think
are important, such as giving links good titles and not having a
list of my 'greatest hits' (or at least what I think are my most
interesting articles; Google tells me that my greatest hit is still
this article on a bad yum error message).
And most blogs could stand to improve their long-term navigation. (At
least I recently put in global index pages.)
However, I disagree about a lack of regular updates being a serious usability issue. In the old days before RSS, frequent updates were very useful (and merely regular ones a useful fallback position). Today, the growth of syndication feeds and feed readers makes this much less important. (Even Nielsen's Alertbox has had a low-tech syndication feed for years, in the form of a mailing list for announcing new columns.)
2005-10-15
Estimating search engine popularity
First question: why bother, apart from idle curiosity? For a start, I'm interested in knowing how worthwhile it is to do things that help a specific search engine.
Search engines care a lot about their popularity, with the result that you can't really trust anything they say about it. I'm not sure you can trust any other source of information, and to get solid data you probably have to pay money for it.
Besides, I don't care about global search engine popularity; I care about how popular the various search engines are with the sort of people who visit our website.
This has a simple answer: look at your Referer logs to see how many
people came from each search engine that you can identify. The fly in
the ointment is that this assumes you're equally high in the search
results on each search engine, since people tend to go more to early
search results.
To compensate for this, I look to see how we rank on each search engine for the various queries people use. However, it's not enough to just use popular queries, because there may be political factors involved in which search engine gets used. For example, the top search that brings people to WanderingThoughts is a search for a specific Linux error message; how likely are Linux people to use, eg, MSN Search? So I tend to use politically neutral search queries, for example spam-related searches.
This compensation is always going to be an imprecise process, so you are only going to get a rough ballpark estimate of what the popular search engines are. If necessary, you can look only at search queries where you have a roughly equal ranking on all of the search engines. (I don't know if anyone really knows how hits fall off as your search ranking drops.)
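The mechanical part of this is simple. Here is a rough Python sketch of the raw Referer tally over an Apache 'combined' format access log; the log path and the table of search engine hostnames are illustrative assumptions, not a complete list:

    import re
    from collections import Counter
    from urllib.parse import urlparse

    LOG = "/var/log/apache2/access_log"   # adjust for your server
    ENGINES = {                           # hostname substring -> label
        "google.": "Google",
        "search.msn.": "MSN Search",
        "search.yahoo.": "Yahoo",
    }

    # combined format: ... "GET /url HTTP/1.0" status bytes "referer" "agent"
    referer_re = re.compile(r'" \d+ [\d-]+ "(?P<referer>[^"]*)" ')

    counts = Counter()
    with open(LOG) as log:
        for line in log:
            m = referer_re.search(line)
            if not m:
                continue
            host = urlparse(m.group("referer")).netloc.lower()
            for substring, label in ENGINES.items():
                if substring in host:
                    counts[label] += 1
                    break

    for engine, hits in counts.most_common():
        print(engine, hits)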
Our results are that the only popular search engine is Google. MSN Search has only a smidgen of users (and our search rankings are about the same as in Google), and the others might as well not exist.
2005-10-11
Improve your web experience by turning Javascript off
One of the best things I've ever done to improve my web browsing experience is very simple: I turned Javascript off almost from its introduction.
Turning Javascript off has a host of benefits, because Javascript on web pages has always been primarily used either for evil or for flashy, distracting user interface elements. With it off I avoid all of that, including popups and links that hide where they're actually going to. Among other advantages, this makes it much less nerve-wracking to go to strange new websites.
I'm not a purist about this; if a website has content I want to see that needs Javascript, I'll turn it back on. This happens much less than you might think; most websites are not all that dependent on whatever Javascript they may have running around. With the PrefBar extension, enabling and disabling Javascript is a snap for Mozilla and Firefox users; a keystroke to bring the PrefBar toolbar up, a click of a tickbox, and a page refresh and you're done.
(People less obsessed than me with leaving as much space as possible for the web page text can leave the PrefBar toolbar up all the time.)
Even if you think you use lots of websites that require Javascript, install PrefBar and give it a try; you may be pleasantly surprised how little you really need Javascript after all. And even if it doesn't work out, you'll have a quick way to disable Javascript before you visit a website that you don't trust.