2005-12-21
How to get your web spider banned from here
To get your web spider banned here, do as many of the following as possible:
- make a lot of requests, so that we notice you. 1,500 over two days,
for example.
- make repeated rapid requests for the same unchanging pages, in
case the sheer volume didn't get our attention. Over the same two
days, fetch something that hasn't changed since June 18th eight
times and a bunch of other similar pages seven times each.
- to make sure we notice you, crawl through links marked nofollow. It's
advisory, so never mind that the major search engines don't do this and
people have come to expect nofollow can be used as a 'keep out' sign
that's more flexible than robots.txt.
- use an uninformative and generic user agent string, like
"Jakarta Commons-HttpClient/3.0-rc4".
- fetch our robots.txt file desultorily and only several days after you
start visiting.
- keep up the mystery of who you are by making all of your requests
from machines with no reverse DNS, like the machines that hit us
from 64.94.163.128/27.
- once we've identified your subnet as belonging to 'Meaningful
Machines', on no account have your own contact information in the
WHOIS data. I enjoy Googling to try to find the website of spiders
crawling us; it makes my life more exciting.
- once I have found out that meaningfulmachines.com is your domain, make sure that your website has no visible information on your spidering activities. For bonus points, try to have no real information at all.
- extra bonus points are awarded for generic contact addresses that look suspiciously like autoresponders, or at least like possible inputs to marketing email lists. (In this day and age, I don't email 'information@<anywhere>' to reach technical people.)
Since I have no desire to block everyone using the Jakarta Commons code and no strong belief that Meaningful Machines is paying much attention to our robots.txt anyways, their subnet now resides in our permanent kernel level IP blocks.
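(For contrast, the polite version of the robots.txt dance takes only a few lines of code. Here is a minimal sketch using Python's standard urllib.robotparser; all of the URLs and the user agent string are made up for illustration, not anything a real spider here uses.)
  import urllib.robotparser
  # A polite crawler fetches and honours robots.txt before anything else.
  # Every URL and the user agent string here are illustrative placeholders.
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()
  user_agent = "ExampleSpider/1.0 (+https://example.com/about-our-spider)"
  for url in ("https://example.com/", "https://example.com/some/page"):
      if rp.can_fetch(user_agent, url):
          print("allowed to fetch:", url)
      else:
          print("robots.txt says keep out of:", url)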
(PS: yes, I sent them email about this last week, to their domain contact address. I haven't received any reply, not that I really expected one.)
Some Googling suggests that I am not alone in having problems with
them; one poster on webmasterworld.com (which blocks direct links, so
I can't give you a URL) reported seeing 60,000 requests in an hour
(and no fetching of robots.txt) in late May 2005. You may want to
peruse your own logs for requests from the 64.94.163.128/27 subnet and
take appropriate action.
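If you want a quick way to do that, something along the lines of this minimal sketch will work with Python's standard ipaddress module. It assumes an Apache-style access log where the client IP is the first whitespace-separated field, and the log path is only an example; adjust both for your own setup.
  import ipaddress
  # Count requests from the subnet above in an Apache-style access log.
  # The log path and format assumption are illustrative, not universal.
  subnet = ipaddress.ip_network("64.94.163.128/27")
  hits = 0
  with open("/var/log/apache2/access.log") as logfile:
      for line in logfile:
          fields = line.split(None, 1)
          if not fields:
              continue
          try:
              if ipaddress.ip_address(fields[0]) in subnet:
                  hits += 1
          except ValueError:
              continue  # first field wasn't an IP address; skip the line
  print(hits, "requests from", subnet)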
2005-12-17
On the web, text colours are an all or nothing thing
Every so often I think about giving WanderingThoughts' breadcrumbs bar up at the top a background colour to make it stand out better (perhaps a nice yellow like Jakob Nielsen uses). But every time I've had that thought, I cringe at the amount of work involved and leave it alone.
It's a lot of work because on the web, there is no such thing as specifying just one text colour property. I can't just specify the background colour for the breadcrumbs; I would have to specify background, foreground, unvisited link colour, and visited link colour.
Since you have no way of knowing what the user's default colours are (at best you have a guess), you don't know whether or not your partial colour specification is clashing with their settings for the ones you didn't specify. A yellow background may look great for me, but what about the person who's set her browser up to look like jwz's livejournal, CRT green on black?
(For example, my browser background is not white but an off-white cream, RGB hex #fffff2. Among other things, this usually makes it immediately obvious when someone has specified a white background for only some parts of a web page.)
So it's not just picking a background colour for the breadcrumbs area of WanderingThoughts. I'd need a good set of four colours for it, and then I'd need a fifth colour for the global background colour to avoid a fruit salad effect for visitors with clashing browser colour schemes of their own. (Otherwise some bits of the page would be in their colours and some of mine, and there's no guarantee the two colour sets don't clash horribly.)
The lurking complication in picking colour sets is the various sorts of colour blindness, unless you are willing to write off perhaps 10% of your male visitors (see for example here or here). And if you try, you're probably going to do a worse job of it than your colour-blind visitors, who've likely already tuned their browser defaults to look nice for them.
(This is sort of a followup to ALittleDetailThatMatters, which pushed the whole issue of web page colours up in my mind.)
2005-12-15
Reddit versus Digg: a little detail that matters
Since reddit.com and digg.com started showing up on the geek radar, I've been checking them out. Both are about the same thing, roughly a 'just the links' version of Slashdot's 'news for nerds' approach, so I expected to like them about equally, or to like digg.com more, since it has link summaries (which are often the most useful bit of Slashdot for me).
To my surprise, I've been barely visiting digg.com, but have found myself dropping by reddit.com frequently; it just felt nicer to use. It's taken me some time to realize why, and it turns out to have come down to one little difference in their website design.
The difference: on reddit.com, I can see what links I've already read; on digg.com, I can't, because digg.com has decided to have unvisited links and visited links be the same colour.
The little extra work of thinking about whether I'd already read an interesting looking digg.com link turned out to be enough of a turnoff that I quietly tuned out. On reddit.com, my browser does the remembering, and my eyes automatically skip over the darker links. No fuss, no muss, continued reading.
Digg also pushes me away with a small font size for the link summary text, the thing I am most interested in reading, forcing me to enlarge it in Firefox in order to read it comfortably. (I've written about this before, and it's even Jakob Nielsen's leading design mistake of 2005.)
Update: and shame on me for not noticing that not differentiating visited and unvisited links is part of Jakob Nielsen's number two design mistake of 2005. And he discusses it in more detail in an older Alertbox here.
2005-12-07
Google Desktop and conditional GET
Some people using Google Desktop have been pulling my syndication feeds recently, which gives me the opportunity to see how well Google Desktop implements conditional GET. Unfortunately, the results are either mixed or unclear.
(At this point I have to repeat the disclaimer from my earlier entry about this: I like having readers and we have a lot of spare bandwidth. This is absolutely not a request for people using Google Desktop to stop reading my feed.)
Google Desktop appears to sometimes but not always send (valid) If-None-Match and If-Modified-Since headers; at the moment, 26 requests out of 72 over the past week. Just seven of those requests managed to get 304 'nothing changed' responses (from only three out of the five different IP addresses hitting me with Google Desktop).
There certainly seem to be times when Google Desktop failed to get 304 Not Modified responses that it should have been able to get, so I have to conclude that in at least some circumstances, Google Desktop's conditional GET support is broken. It's clearly not entirely broken, since sometimes it does manage to have everything work right.
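(For reference, the client side of conditional GET is not complicated. Here is a minimal sketch of a feed fetcher that does it with just Python's standard library; the feed URL is a placeholder, and a real client would persist the validators between runs instead of keeping them in variables.)
  import urllib.error
  import urllib.request
  # Conditional GET: remember the validators from the last response and
  # send them back on the next request. The feed URL is a placeholder.
  FEED_URL = "https://example.com/blog/atom.xml"
  etag = None
  last_modified = None
  def fetch_feed():
      global etag, last_modified
      req = urllib.request.Request(FEED_URL)
      if etag:
          req.add_header("If-None-Match", etag)
      if last_modified:
          req.add_header("If-Modified-Since", last_modified)
      try:
          with urllib.request.urlopen(req) as resp:
              etag = resp.headers.get("ETag")
              last_modified = resp.headers.get("Last-Modified")
              return resp.read()   # changed (or first fetch): new contents
      except urllib.error.HTTPError as err:
          if err.code == 304:
              return None          # 304 Not Modified: nothing to re-parse
          raise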
Interestingly, I see a pattern for a particular IP (that is unlikely to be shared) where Google Desktop made a first feed request with nothing, made a second feed request with INM and IMS about twelve hours later, found that the feed had changed, and then never sent INM and IMS again when it was re-fetching the feed. Hopefully Google Desktop is not taking the lack of a 304 response as an indication that the feed doesn't actually support conditional GET, and then not bothering to send the headers later on.
So it seems that Dare Obasanjo's experience from September, of a complete lack of support for conditional GET, is not the full story (or at least part of it has been fixed since then). Unfortunately, just what is going on is not clear, although it seems likely that there is at least some problem.
(There are also some alarming reports about other Google Desktop actions in the comments to this entry.)