Wandering Thoughts archives

2006-05-21

My mistake with the Host: HTTP header

One of the nice things about writing a blog is getting to say 'oh, oops, I was a dumbass, let me fix that'. Today I have to own up to a big example of this.

Back at the start of WanderingThoughts, I wrote an entry where I complained in part:

In theory the absolute URL should include the port (unless it's the default). In practice, every program I've tried gleefully adds the port itself if it is a non-standard port and you're referring to the same hostname.

I was a moron.

The Host: header in HTTP requests includes the port when the port is a non-standard one (and some programs throw it in even when you're on port 80, as I found out later). My code looked more or less like:

newuri = "http://%s:%d" % (HostHeader, MyPort) + relUrl

When programs gave me real Host: headers, where HostHeader included both hostname and port, I effectively doubled the port and things naturally exploded. Had I printed the actual Host: header that programs were handing DWiki, I would have seen my mistake immediately; instead I was too confident that I knew what was going on and didn't bother. I trusted my testing with hand-crafted HTTP requests, where I'd gotten the Host: header wrong and so the result looked right.
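The fix is to actually parse the Host: header instead of assuming it is a bare hostname. A minimal sketch of the idea (this is my illustration, not DWiki's actual code), using the standard library:

```python
from urllib.parse import urlsplit

def absolute_url(host_header, my_port, rel_url):
    """Build an absolute URL from a Host: header that may or may not
    already carry a ':port' suffix. (Hypothetical helper.)"""
    # urlsplit only separates host from port reliably when a scheme
    # is present, so prepend one before asking for hostname and port.
    parts = urlsplit("http://" + host_header)
    host = parts.hostname
    port = parts.port if parts.port is not None else my_port
    if port == 80:
        return "http://%s%s" % (host, rel_url)
    return "http://%s:%d%s" % (host, port, rel_url)
```

This gives the same answer whether or not the client included the port, which is exactly the property my original string formatting lacked.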

I only found all this out months later when I was doing something else with the Host: header that blew up because I didn't know to expect the ':port' on the end; that time I dumped debugging information, partly because the failure was more mysterious.

My mistake is all the more embarrassing because, contrary to what I wrote in the original entry, the proper behavior is described in black and white in the HTTP 1.1 RFC's section on the Host header. I am not sure what RFCs I read at the time of the original entry, but evidently I didn't read the important one.

HostMistake written at 21:04:59

2006-05-15

PlanetLab hammers on robots.txt

The PlanetLab consortium is, to quote its banner, 'an open platform for developing, deploying, and accessing planetary-scale services'. Courtesy of a friend noticing, today's planetary-scale service appears to be repeatedly requesting robots.txt from people's webservers.

Here, they've made 523 requests (so far) from 323 different IP addresses (PlanetLab nodes are hosted around the Internet, mostly at universities; they usually have 'planetlab' or 'pl' or the like in their hostnames). The first request arrived at 03:56:11 (Eastern) on May 14th, and they're still rolling in. So far, they haven't requested anything besides robots.txt.
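Numbers like these are easy to pull out of an access log. A quick sketch, assuming a Common Log Format log; the file name and exact field layout are assumptions, not necessarily what this site really uses:

```python
from collections import Counter

def robots_hits(logfile):
    """Count robots.txt fetches per client IP in a CLF access log."""
    hits = Counter()
    with open(logfile) as f:
        for line in f:
            fields = line.split()
            # CLF: ip ident user [date tz] "METHOD /url HTTP/1.x" status size
            if len(fields) > 6 and fields[6] == "/robots.txt":
                hits[fields[0]] += 1
    return hits

# hits = robots_hits("access_log")
# print(sum(hits.values()), "requests from", len(hits), "IPs")
```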

All of the requests have had the User-Agent string:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

This User-Agent string is a complete lie, which is one of the things that angers me about this situation. The minimum standard for acceptable web spider behavior is to clearly label yourself; pretending that you are an ordinary browser is an automatic sign of evil. If PlanetLab had a single netblock, it would now be in our port 80 IP filters.

Apparently the PlanetLab project responsible for this abuse is called umd_sidecar, and has already been reported to the PlanetLab administration by people who have had better luck navigating their search interfaces than I have. (It looks like the magic is to ask for advanced search and then specify that you want TCP as the protocol.)

PlanetLabGoesRobotic written at 01:18:50

2006-05-10

Another really stupid web spider

I have to take back what I said just the other day about the worst stealth spider I'd ever seen. The day after I wrote that entry, I saw a worse case.

Like the first one, they made two valid requests to start with and then followed them up with 65 bad ones, all in the span of 11 seconds. Also like the first one, all their requests were bad because they couldn't deal with absolute paths in <a href="...">. But they topped the first one because they lowercased all the URLs.

This right here is me clutching my head like a stunned monkey.

All 67 requests came from 69.56.135.218; according to theplanet.com, this is part of 69.56.135.216/29, assigned to 'PQC Service, LLC' of Wilmington, Delaware, zip code 19801. All of the machines have generic reverse DNS. (Some Googling suggests that the company runs porn sites.)

ReallyStupidSpiderII written at 16:17:40

2006-05-08

A really stupid web spider

Today WanderingThoughts had a visit from the worst stealth spider that I've ever seen. Given the previous contestants this is a fairly tall order, but I'm confident I have a winner. The spider:

  • made two requests for directories without the trailing slash, earning it redirections to the proper URLs.
  • followed the redirections, making two valid requests.
  • promptly made 95 bad requests by failing to treat <a href="..."> URLs with a leading slash properly.

I've seen spiders that didn't handle absolute path URLs before, but this is a new and spectacular level of failure. They failed to crawl a single page past their two start pages; all things considered I'm surprised that they even handled the initial redirections properly.
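For the record, resolving an absolute-path href is not hard; Python's standard library does it in one call. A small illustration (the example.org URLs are made up), showing what any real browser would have done:

```python
from urllib.parse import urljoin

# Resolving <a href="..."> URLs against the page they appear on.
base = "http://example.org/blog/2006/05/entry"

# A leading-slash ("absolute path") href replaces the whole path
# on the same host:
abs_path = urljoin(base, "/blog/")       # http://example.org/blog/

# A relative href is resolved against the page's directory:
rel_path = urljoin(base, "other-entry")  # http://example.org/blog/2006/05/other-entry
```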

(They're a stealth spider because they claimed to be a variety of harmless Windows-based browsers. This is utterly false; first, real browsers would have gotten the requests right, and second, very few browsers make 99 requests in 14 seconds from 42 different IP addresses in the same subnet.)

The details

All 99 requests were made in the span of 14 seconds, from 42 different IP addresses between 66.90.95.207 and 66.90.95.254. WHOIS says that this is part of a /18 owned by fdcservers.net. Unfortunately, fdcservers.net does not have a working whois server and these IP addresses have no reverse DNS; the IPs answer on port 25, but only with a very generic identification.

There's some evidence from Google searches that this is a botnet for some sort of spam, e.g. here. The 66.90.110.* IP range that this person reports also came by our server, on May 4th. The requests show some traces of a similarly incompetent spider, but they had the luck to hit an area of the site with mostly relative links (and Apache generously fixed up some of their mistakes, like the requests with '/../' in them).

(Nothing from 66.90.95. has hit here before today, at least for the past 28 days of logs that we have, and 66.90.110. only hit us the once on May 4th.)

ReallyStupidSpider written at 02:26:46

2006-05-02

CSS and syndication (another CSS limitation)

In an earlier entry on Solaris patch exit codes I mentioned that its formatting might not look very nice in syndication feeds. And I was right; it did not look very good.

It didn't look good because it was styled with CSS, and you can't really put CSS in syndication feeds. CSS goes into the HTML <head> element, but that only appears in HTML pages; syndication feeds and entries are divorced from that sort of context. This is more or less implicit in RSS (like a lot of things), but is explicit in Atom 1.0; both HTML and XHTML entries must be markup that is valid inside a <div>.

(Technically you can wedge a certain amount into style="..." attributes on elements, but you can't get everything. And it significantly bulks things up. And apparently a number of feed readers strip style attributes because they can contain various dangerous things, like JavaScript.)
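One mechanical way to do the style= wedging when generating feed HTML is to map a handful of CSS classes to equivalent inline declarations. A sketch of the idea; the class names and rules here are invented for illustration, not DWiki's real stylesheet:

```python
import re

# Map a few (hypothetical) CSS classes to equivalent inline styles.
# In feed output the <head> stylesheet is unavailable, so the rules
# have to ride along on each element.
INLINE_STYLES = {
    "htable": "border-collapse: collapse",
    "hrow": "border-bottom: thin solid #ccc",
}

def inline_classes(html):
    """Rewrite class="..." attributes into style="..." where we
    have a known translation; leave unknown classes alone."""
    def repl(match):
        style = INLINE_STYLES.get(match.group(1))
        if style is None:
            return match.group(0)
        return 'style="%s"' % style
    return re.sub(r'class="([^"]+)"', repl, html)
```

This only covers simple per-element rules; anything involving selectors with context (descendant rules, pseudo-classes) has no inline equivalent, which is part of why the approach can't get you everything.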

You can substitute deprecated HTML features for some things, but not always. For example, in the earlier entry, I had a 'horizontal' table, one where the important thing was the relationships between the columns in the same row; in CSS it has light thin horizontal lines after each row. In syndication feeds I eventually decided that the best approximation (or at least the least annoying to me) was for there to be no borders at all.

Because of this limitation and my desire for my entries to be decently readable in feed readers, I've generally held back from making much use of CSS in WanderingThoughts. Sometimes it's very tempting, though; for example, I really like the CSS appearance of 'horizontal' tables. (I'm willing to live with them not looking good in lynx, for fuzzy reasons.)

Because WanderingThoughts generates HTML from higher level markup, I'm actually better off than many people; I could in theory render entirely different HTML for the Atom feeds (inlining styles and all that) than for regular page views. (The Atom feed already changes some stuff, for example to make all links into absolute URLs.)

(Credit where credit is due department: I got the idea for the 'horizontal' table style from how CJ Silverio's Snippy styles tables.)

CSSAndSyndication written at 02:12:31

