2006-05-21
My mistake with the Host: HTTP header
One of the nice things about writing a blog is getting to say 'oh, oops, I was a dumbass, let me fix that'. Today I have to own up to a big example of this.
Back at the start of WanderingThoughts, I wrote an entry where I complained in part:
In theory the absolute URL should include the port (unless it's the default). In practice, every program I've tried gleefully adds the port itself if it is a non-standard port and you're referring to the same hostname.
I was a moron.
The Host: header in HTTP requests includes the port when the port is
a non-standard one (and some programs throw it in even when you're on
port 80, as I found out later). My code looked more or less like:
newuri = "http://%s:%d" % (HostHeader, MyPort) + relUrl
When programs gave me real Host: headers, where HostHeader included
both hostname and port, I effectively doubled the port and things
naturally exploded. Had I printed the actual Host: header that
programs were handing DWiki I would have seen my mistake immediately,
but instead I was too confident that I knew what was going on and didn't
bother; I trusted my testing with hand-crafted HTTP requests, where I'd
gotten the Host: header wrong and so the result looked right.
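The fix amounts to checking whether the Host: header already carries a port before appending one. A minimal sketch (hypothetical names, not DWiki's actual code, and ignoring IPv6 address literals, which also contain colons):

```python
# Sketch: the Host: header may already be "hostname:port", so only
# append our own port when no port is present.
def reconstruct_uri(host_header, my_port, rel_url):
    if ":" not in host_header:
        host_header = "%s:%d" % (host_header, my_port)
    return "http://" + host_header + rel_url
```

With this, a Host: header of 'www.example.org:8080' no longer gets a second ':8080' glued onto it.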
I only found all this out months later when I was doing something else
with the Host: header that blew up because I didn't know to expect the
':port' on the end; that time I dumped debugging information, partly
because the failure was more mysterious.
My mistake is all the more embarrassing because, contrary to what I wrote in the original entry, the proper behavior is described in black and white in the HTTP 1.1 RFC's section on the Host header. I am not sure what RFCs I read at the time of the original entry, but evidently I didn't read the important one.
2006-05-15
PlanetLab hammers on robots.txt
The Planet Lab consortium is, to
quote its banner, 'an open platform for developing, deploying, and
accessing planetary-scale services'. Courtesy of a friend noticing, today's
planetary-scale service appears to be repeatedly requesting robots.txt
from people's webservers.
Here, they've made 523 requests (so far) from 323 different IP
addresses (PlanetLab nodes are hosted around the Internet, mostly at
universities; they usually have 'planetlab' or 'pl' or the like in their
hostnames). The first request arrived at 03:56:11 (Eastern) on May 14th,
and they're still rolling in. So far, they haven't requested anything
besides robots.txt.
All of the requests have had the User-Agent string:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6
This User-Agent string is a complete lie, which is one of the things that angers me about this situation. The minimum standard for acceptable web spider behavior is to clearly label yourself; pretending that you are an ordinary browser is an automatic sign of evil. If PlanetLab had a single netblock, it would now be in our port 80 IP filters.
Apparently the PlanetLab project responsible for this abuse is called umd_sidecar, and has already been reported to the PlanetLab administration by people who have had better luck navigating their search interfaces than I have. (It looks like the magic is to ask for advanced search and then specify that you want TCP as the protocol.)
2006-05-10
Another really stupid web spider
I have to take back what I said just the other day about having seen the worst stealth spider ever. The day after I wrote that entry, I saw a worse one.
Like the first one, they made two valid requests to start with and then followed them up with 65 bad ones, all in the span of 11 seconds. Also like the first one, all their requests were bad because they couldn't deal with absolute paths in <a href="...">. But they topped the first one because they lowercased all the URLs.
This right here is me clutching my head like a stunned monkey.
All 67 requests came from 69.56.135.218; according to theplanet.com, this is part of 69.56.135.216/29, assigned to 'PQC Service, LLC' of Wilmington, Delaware, zip code 19801. All of the machines have generic reverse DNS. (Some Googling suggests that the company runs porn sites.)
2006-05-08
A really stupid web spider
Today WanderingThoughts had a visit from the worst stealth spider that I've ever seen. Given the previous contestants this is a fairly tall order, but I'm confident I have a winner. The spider:
- made two requests for directories without the trailing slash, earning it redirections to the proper URLs.
- followed the redirections, making two valid requests.
- promptly made 95 bad requests by failing to treat <a href="..."> URLs with a leading slash properly.
I've seen spiders that didn't handle absolute path URLs before, but this is a new and spectacular level of failure. They failed to crawl a single page past their two start pages; all things considered I'm surprised that they even handled the initial redirections properly.
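Getting this right requires no cleverness at all; any language's URL library resolves absolute-path links directly. A sketch using Python's standard library (urllib.parse.urljoin in modern Python; it was urlparse.urljoin in 2006-era Python), with a hypothetical page URL:

```python
from urllib.parse import urljoin

# Hypothetical page the spider fetched.
base = "http://www.example.org/blog/entry"

# An href with a leading slash is an absolute path: it resolves
# against the host, not against the current directory.
absolute = urljoin(base, "/blog/other")   # http://www.example.org/blog/other

# A relative href resolves against the current directory.
relative = urljoin(base, "other")         # http://www.example.org/blog/other
```

A spider that instead pastes '/blog/other' onto the end of the current URL (or lowercases it first) gets exactly the sort of garbage requests described above.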
(They're a stealth spider because they claimed to be a variety of harmless Windows-based browsers. This is utterly false; first, the browsers would have gotten the requests right, and second, very few browsers make 99 requests in 14 seconds from 42 different IP addresses in the same subnet.)
The details
All 99 requests were made in the span of 14 seconds, from 42 different IP addresses between 66.90.95.207 and 66.90.95.254. WHOIS says that this is part of a /18 owned by fdcservers.net. Unfortunately, fdcservers.net does not have a working whois server and these IP addresses have no reverse DNS; the IPs answer on port 25, but only with a very generic identification.
There's some evidence from Google searches that
this is a botnet for some sort of spam, eg here. The
66.90.110.* IP range that this person reports also came by our server,
on May 4th. The requests show some traces of a similarly incompetent
spider, but they had the luck to hit an area of the site with mostly
relative links (and Apache generously fixed up some of their mistakes,
like the requests with '/../' in them).
(Nothing from 66.90.95. has hit here before today, at least for the past 28 days of logs that we have, and 66.90.110. only hit us the once on May 4th.)
2006-05-02
CSS and syndication (another CSS limitation)
In an earlier entry on Solaris patch exit codes I mentioned that its formatting might not look very nice in syndication feeds. And I was right; it did not look very good.
It didn't look good because it was styled with CSS, and you can't really put CSS in syndication feeds. CSS goes into the HTML <head> element, but that only appears in HTML pages; syndication feeds and entries are divorced from that sort of context. This is more or less implicit in RSS (like a lot of things), but is explicit in Atom 1.0; both HTML and XHTML entries must be markup that is valid inside a <div>.
(Technically you can wedge a certain amount into style="..."
attributes on elements, but you can't get everything. And it
significantly bulks things up. And apparently a number of feed readers
strip style attributes because they can contain various dangerous
things, like JavaScript.)
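As an illustration of the inlining approach (hypothetical markup, not DWiki's actual output): a page view can tag a table row with a class and let the <head> stylesheet do the work, while the feed version has to wedge an equivalent style attribute onto every row:

```python
# Sketch: render a table row for page views (CSS class, styled from
# the page's stylesheet) versus for syndication feeds (inline style,
# since feeds carry no <head> and thus no stylesheet).
def render_row(cells, for_feed=False):
    if for_feed:
        attr = ' style="border-bottom: thin solid #ccc"'
    else:
        attr = ' class="hrow"'
    tds = "".join("<td>%s</td>" % c for c in cells)
    return "<tr%s>%s</tr>" % (attr, tds)
```

This shows the bulking-up problem directly: the inline version repeats the full style on each row, where the class version pays for the rule once.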
You can substitute deprecated HTML features for some things, but not always. For example, in the earlier entry, I had a 'horizontal' table, one where the important thing was the relationships between the columns in the same row; in CSS it has light thin horizontal lines after each row. In syndication feeds I eventually decided that the best approximation (or at least the least annoying to me) was for there to be no borders at all.
Because of this limitation and my desire for my entries to be decently readable in feed readers, I've generally held back from making much use of CSS in WanderingThoughts. Sometimes it's very tempting, though; for example, I really like the CSS appearance of 'horizontal' tables. (I'm willing to live with them not looking good in lynx, for fuzzy reasons.)
Because WanderingThoughts generates HTML from higher level markup, I'm actually better off than many people; I could in theory render entirely different HTML for the Atom feeds (inlining styles and all that) than for regular page views. (The Atom feed already changes some stuff, for example to make all links into absolute URLs.)
(Credit where credit is due department: I got the idea for the 'horizontal' table style from how CJ Silverio's Snippy styles tables.)