2007-12-29
SNI doesn't work in practice
A while back I talked about why SSL and name-based virtual hosts don't get along, and a commentator pointed me at the 'Server Name Indication' (SNI) SSL extension to deal with this. The approach SNI takes is for the client to send the server name(s) it wants to talk to as part of the initial handshake; a SNI-aware server can use this to pick the right server certificate right away.
(The whole thing is described in RFC 4366 and summarized here.)
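To make the mechanics concrete, here is a minimal sketch of both sides of this using Python's ssl module (a sufficiently recent version of it); the hostnames and certificate file names are invented for illustration.

```python
import socket
import ssl

# Client side: the name we want goes into the ClientHello as the SNI value.
ctx = ssl.create_default_context()
with socket.create_connection(("www.example.org", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.example.org") as tls:
        print(tls.getpeercert()["subject"])

# Server side: inspect the SNI value during the handshake and switch to
# the SSL context (and thus the certificate) for that virtual host.
default_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
default_ctx.load_cert_chain("default.crt", "default.key")      # invented paths
other_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
other_ctx.load_cert_chain("othersite.crt", "othersite.key")    # invented paths

def pick_cert(ssl_obj, server_name, original_ctx):
    # server_name is what the client asked for via SNI.
    if server_name == "othersite.example.org":
        ssl_obj.context = other_ctx

default_ctx.sni_callback = pick_cert
```

The point is that the server learns the desired name before it has to commit to a certificate, which is exactly what name-based virtual hosting needs.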
The problem with SNI is that support for it is almost completely lacking in both servers and clients, and has been for several years. For servers, no unpatched release of either Apache or lighttpd supports it yet.
(I'm deliberately excluding mod_gnutls here, because I consider things like the lack of a Debian package to be a serious 'do not use' warning sign.)
For clients, the only widely used client that supports it is Firefox 2; IE 7 only supports it when run on Vista, and it's not supported in Mac OS X's WebKit (the base of Safari, among other OS X browsers). The situation with IE is especially problematic, as SNI support depends on the version of the Windows TLS library. If Microsoft is not going to update that on Windows XP, there are a huge number of people who are not going to have SNI available for years.
In short: SNI is a nice idea but not a practical one. I suspect that it's been caught in a chicken and egg scenario, where everyone who needed to do some of the work didn't see much demand for it because none of the other pieces were there either.
(There are at least three pieces: the web servers, the browsers, and the SSL crypto libraries.)
Sidebar: various resource links
- a SNI test site, so you can see if your environment supports it.
- Daniel Lange's summary of what supports SNI
- the apache ticket to add SNI
- the lighttpd ticket to add SNI
- a general resource page on the overall https and virtual hosts issue
2007-12-03
A comment spam precaution that didn't work out
Every now and then I try a comment spam precaution and it backfires on me. So let me amend my previous remarks: it turns out that refusing comments from people who are on the XBL is a bad idea.
It's a superficially attractive idea, which is why I implemented it way back when; the XBL is (theoretically) listing addresses of compromised machines and open proxies, and I have seen comment spam attempts from XBL-listed IP addresses. But the XBL itself contains warnings against this sort of usage, and in practice I don't think the XBL check ever did anything, because all the comment spam got dealt with by earlier precautions.
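(For the record, the mechanics of such a check are trivially simple, which is part of what makes it so tempting. Here's a minimal sketch in Python, not the actual code I pulled, assuming the usual Spamhaus XBL zone name; a DNSBL lookup is just a DNS query for the reversed octets of the IP prepended to the zone, and any answer means 'listed'.)

```python
import socket

def xbl_listed(ip, zone="xbl.spamhaus.org"):
    """Return True if an IPv4 address is listed in the given DNSBL zone."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        # Any A record answer means the address is listed.
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        # NXDOMAIN (or lookup failure): treat as not listed.
        return False

# e.g. xbl_listed("192.0.2.1") should be False; TEST-NET addresses aren't listed.
```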
Then today, the problem with this was unpleasantly illustrated when a would-be commentator here had their legitimate comment blocked because they had an XBL-listed dynamic IP address (a listing they'd likely inherited from a previous user of the address). Whoops, and clearly wrong.
(Worse yet, I didn't think the possibility of a misfire was high enough to warrant giving a clear error message. Which is stupid, all things considered; the kind of spammer that uses open proxies is not the kind that actually reads the web pages that they get back.)
All in all, a humbling mis-judgement. I've pulled the code until I can reform it (I think I still want to block any comment attempts from SBL-listed IP addresses, although I may be wrong about that too).
(And I apologize to the unknown person today who got hit by this, if they happen to still be reading.)
2007-12-01
My expectations for responsible spider behavior
My minimum technical requirements for real web spiders are deliberately quite black and white. But there are also a number of more fuzzy things that I expect from responsible web spiders. Bear in mind that these aren't hard and fast rules and I can't give precise numbers and so on.
(As before, this only applies to what I'm calling 'real' or 'legitimate' web spiders; I can't expect any particular behavior from malicious web spiders.)
Disclaimers in place, here's what I expect of responsible web spiders (there's a rough sketch of the rate-limiting mechanics after the list):
- check robots.txt frequently and adjust your behavior rapidly, say within no more than two days. (I do not care what infrastructure you require to do this; the fact that robots.txt updates have to propagate around six layers of your internal topology before reaching the crawler logic is your problem, not mine.)
- don't make requests more frequently than one every few seconds or so.
- more importantly, notice when the website is slowing down and slow down yourself. If the website's response speed is down, this is a very big clue that your spider should space out requests more.
- don't rapidly re-crawl things that haven't changed. It's reasonable to check a few times just to make sure that what looks like unchanging content really is, but after that spiders should slow down. If you spend months revisiting a page three times a week when it hasn't changed in years, I get peeved.
- URLs that get errors count as unchanged pages. Crawl them a few times to make sure that they stay errors, but after that you should immediately demote them to the bottom of your crawl rates.
- this goes triple if the error you are getting is a 403 error, because you are being told explicitly that this is content you are not allowed to see.
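As a concrete illustration of the rate-limiting and backoff side of these expectations, here is a minimal Python sketch of a polite fetch loop; the user agent name, delays, and thresholds are all invented for illustration, and a real crawler needs far more machinery than this.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

USER_AGENT = "examplebot/0.1"   # invented crawler name
BASE_DELAY = 5.0                # minimum seconds between requests
MAX_DELAY = 300.0               # cap on how far we back off

def polite_crawl(site, paths):
    # Read robots.txt up front; a real spider would also re-check it
    # at least every day or two and react promptly to changes.
    rp = urllib.robotparser.RobotFileParser(site + "/robots.txt")
    rp.read()

    delay = BASE_DELAY
    for path in paths:
        url = site + path
        if not rp.can_fetch(USER_AGENT, url):
            continue            # robots.txt says no; respect it

        start = time.monotonic()
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=30) as resp:
                resp.read()
            elapsed = time.monotonic() - start
            if elapsed > 2.0:
                # The site is responding slowly: space requests out more.
                delay = min(delay * 2, MAX_DELAY)
            else:
                # The site looks healthy: drift back toward the base rate.
                delay = max(BASE_DELAY, delay / 2)
        except urllib.error.URLError:
            # Errors (especially repeated 403s) mean this URL should drop
            # to the bottom of the crawl priorities, not get hammered.
            delay = min(delay * 2, MAX_DELAY)

        time.sleep(delay)
```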
Disclaimer: as before, I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.
(Suggestions of more are welcome; I'm probably missing some obvious ones.)