Wandering Thoughts archives

2013-05-31

The mystery of POSTs with a zero Content-Length

One of the joys of running web software that is rather paranoid is getting to see all sorts of weird things that float around the web, generally run by spammers and other people who are up to no good. Today's oddity could be called 'the case of the zero-length POST' and is just what it sounds like: POST requests that have a Content-Length of 0 bytes.

(Or at least they have a Content-Length of '0' after Apache gets through passing the request to DWiki. It's possible that Apache is silently sanitizing some bizarre C-L header value to 0.)

I don't know the full headers for these POST requests but because of the code flow inside DWiki, I know that they claim to be form submissions (I check Content-Type before looking at Content-Length). It's possible that the requests have some other header that is supposed to preempt Content-Length. It's also possible that this software is submitting empty POST form requests to see what happens or because this evades security precautions in some applications.
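
As a rough illustration of that order of checks, here's a minimal WSGI-style sketch. It is not DWiki's actual code; the status codes, messages, and responses are invented for the example, and the only point is that Content-Type gets inspected before Content-Length ever does.

    def handle_post(environ, start_response):
        # Only requests that claim to be form submissions get any further;
        # Content-Type is checked before Content-Length is ever looked at.
        ctype = environ.get("CONTENT_TYPE", "")
        if not ctype.startswith("application/x-www-form-urlencoded"):
            start_response("415 Unsupported Media Type",
                           [("Content-Type", "text/plain")])
            return [b"unexpected Content-Type\n"]
        # By the time we see it, Apache has already turned whatever was
        # sent into this Content-Length; the mystery requests arrive as 0.
        clen = int(environ.get("CONTENT_LENGTH") or 0)
        if clen == 0:
            start_response("400 Bad Request",
                           [("Content-Type", "text/plain")])
            return [b"empty POST body\n"]
        body = environ["wsgi.input"].read(clen)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [("accepted %d bytes of form data\n" % clen).encode("ascii")]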

Based purely on the claimed User-Agent values I can say that this software is up to no good, since some of the time it claims to be Googlebot (or at least some of the requests claim to be from Googlebot, since I suppose I shouldn't assume that there's just one piece of software that's doing this). All of the requests I've pulled out of the logs seem to be HTTP/1.1 requests and are generally for regular URLs. The software involved also seems to almost always lower-case the URLs it's using, which doesn't work very well here.

(Looking at User-Agent suggests there may be two different programs involved, one of which claims to be Googlebot and one of which doesn't send a User-Agent at all. Only the Googlebot-faker seems to lowercase its URLs; the other program mostly POSTs to Wandering Thoughts' main page but occasionally POSTs to other, correctly-cased URLs. The second program seems to be the more active one.)

I don't have any answers to this particular mystery and in fact, now that I've looked into it, it's more mysterious than before. Sometimes that's how it goes on the web these days.

Sidebar: volume and source details

These requests aren't happening in high volume but are generally happening several times a day from various different IPs. In the last ten days there have been at least 150 instances from 50 different IPs; the five most prolific IPs made 22, 13, 11, 8, and 6 requests respectively. I haven't tried to run down the origin of all of the IPs, but China shows up a lot in the top-N list. One IP is currently in the SBL, in SBL181621 (a /24 listing due to blackhat SEO spammer hosting).
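
(For what it's worth, the counting here is straightforward. The following is a rough Python sketch of the kind of log crunching involved, assuming Apache's 'combined' log format and assuming the input has already been filtered down to just the suspect requests; the request's Content-Length doesn't appear in a standard log line, so in practice I identify these requests through DWiki itself.)

    import collections
    import re
    import sys

    # Apache 'combined' format: IP, identd, user, [date], "request",
    # status, size, "referer", "user-agent".
    logline = re.compile(r'^(\S+) \S+ \S+ \[[^]]+\] '
                         r'"(\S+) (\S+)[^"]*" \d+ \S+ "([^"]*)" "([^"]*)"')

    by_ip = collections.Counter()
    for line in sys.stdin:
        m = logline.match(line)
        if not m:
            continue
        ip, method, url, referer, agent = m.groups()
        if method != "POST":
            continue
        by_ip[ip] += 1

    print("%d requests from %d different IPs" %
          (sum(by_ip.values()), len(by_ip)))
    for ip, count in by_ip.most_common(5):
        print("%5d  %s" % (count, ip))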

ZeroLengthPOSTs written at 22:47:37

2013-05-24

My issue with infinite scrolling web pages: the lack of a stopping point

Lately, 'infinite scrolling' has become a popular web page design technique. Infinite scrolling pages add additional content as you scroll down; keep scrolling and they'll keep adding, basically forever. While I can see the attraction of infinite scrolling, for me one of its big features is also my biggest issue with it.

One big appeal of infinite scrolling is that the user never has to interrupt their actions to go on. There is no scrolling to the bottom of one page, following a 'next page' link, and then starting again; instead, the same action of scrolling down is used both for moving through current content and 'navigating' to new content. And this is exactly the problem. When there is a distinct 'go to next' action to do, the bottom of a page (or wherever it is) is a natural place for me to consider whether I want to stop or go on. In an infinite scrolling page there is no such natural stopping point.

Another way to put this is that paged navigation creates natural chunks for me to consume content in, while infinite scrolling simply points a massive firehose at me. I find it easier to consume content in chunks; the paradoxical result of feeding me a navigation-free firehose is that I actually browse less of it for various reasons. Sometimes when there's no end it also feels like there's no real point.

(This thought was sparked by Flickr's recent redesign, which added infinite scrolling in a number of new places and thereby discouraged me from scrolling much in general.)

(I don't think that this is just me being a curmudgeon about change, but I could be wrong. And I'm handwaving away many possible implementation issues and concerns with infinite scrolling. People who want to can find lots of other reasons to dislike it. I doubt it's going away for any number of reasons, including that I suspect it creates much easier navigation on touch interfaces.)

InfiniteScrollingIssue written at 02:01:43

2013-05-23

Why web robots sending Referer headers is wrong

I've written before on my view that web robots of all sorts should never send a Referer header. In those entries I mostly said 'don't do that' without giving a solid philosophical argument about why, so today I feel like changing that.

(Not that a philosophical argument actually matters. Proper behavior on the web is defined by social convention, ie by what lots of other people do and expect, not by arguing with people over what makes sense. Whether or not you agree with a social convention, you break it at your peril, and today robots not sending Referer headers is a well-established social convention that I will ban you for violating. And anyway, the people who should read this never will.)

There are two philosophical reasons why it's wrong for robots to send Referer headers. The first is inherent in what the Referer header means, namely 'I just followed a link from page <X>'. This is a description of human behavior but not really of robot behavior; almost no web robot actually traverses the web in that way, finding links and immediately following them. If you crawl web pages, accumulate links, and then some time later crawl those links, you are not 'following a link' in any conventional sense. Worse, what happens if you discover the same link through multiple source documents? Which document gets 'credit' and appears in Referer?

(Yes, yes, this is not quite the spec definition, which kind of permits the 'I found it here' meaning that robots sometimes use. It is instead the practical definition of the header, as defined by how most everything behaves.)

So, you say, you don't care; you want to use Referer as a kind of 'this is what links to you' field for servers. I can summarize a bunch of problems here by saying that the Referer field is a terrible way to communicate this information to web operators, fundamentally because you are trying to use a side effect of HTTP requests to pass on what may be a huge amount of information. If you actually want to be useful you should make this information available on your own web site where people can see and fetch it in bulk.

Finally, the brutal truth is that 'who links to me' is far less interesting than 'who is sending human traffic to me (right now)'. By far the most valuable part of Referer is information on where real (human) visitors are coming from, to the extent that it's possible to find this out. Being read by people is the ultimate purpose of most web pages, which makes the places that are the source of traffic and active links something of decided interest to us. And this sort of human behavior has very little to do with either robot behavior or what potential links exist out there in the world. Mingling either your robot's actions or a 'helpful' attempt to tell us about the latter into this information is not doing us any favours; rather the contrary, in fact (this is one large reason that I react angrily to robots sending Referer).

(There is also the inconvenient fact that once you're operating a decent sized site you're not likely to really care about who links to you because there will be far too many links out there, most of them in increasingly obscure and unimportant places. The links you do care about are exactly the links that send you significant traffic.)

WhyNoRefererForRobots written at 00:25:17

2013-05-21

Diffbot's bad Referer header

Today a web spider called 'Diffbot' (run by diffbot.com) made a whole bunch of requests here, all of which failed. They failed because, just as it has repeatedly done in the past, it made them all with a Referer header of 'http://news.google.com/', and this behavior long ago led me to ban it entirely from here.

There are a number of things wrong with this header. The first is that, to steal from the old Trix commercials, 'silly robot, the Referer header is for humans'. I've written about this before at some length, and doing it here is generally a good way to get your spider banned.

(I have a philosophical ramble about why this is the correct view, but it's going in another entry.)

The second is that, of course, this Referer value is a flaming lie in two different ways. Diffbot in no way, shape, or form traveled from news.google.com to the whole collection of URLs here that it attempted to crawl with that Referer header, and on top of that, news.google.com does not link to here at all. Diffbot made up the header from whole cloth. I react very badly to web spiders that lie to me at the best of times (even if they aren't spraying junk over my referer logs).
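
(To make this concrete, here is a minimal Python sketch of the sort of check that would catch such requests. It's purely an illustration, not my actual ban mechanism; the function name and the crude robot-detection heuristic are made up for the example.)

    ROBOT_MARKERS = ("bot", "spider", "crawl")

    def is_lying_spider(environ):
        # Given a WSGI/CGI-style environment, decide whether this request
        # is a spider whose Referer behavior warrants refusing it.
        referer = environ.get("HTTP_REFERER", "")
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if not referer:
            return False
        # A robot that announces itself in User-Agent yet still sends a
        # Referer header gets refused on general principles.
        if any(marker in agent for marker in ROBOT_MARKERS):
            return True
        # So does anything claiming it was sent here by news.google.com,
        # which does not link to here at all.
        if referer.startswith("http://news.google.com/"):
            return True
        return False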

Diffbot and its operators may or may not be legitimate, or at least honest about what they're doing; I have no particular opinions on that. But they are unquestionably operating a web spider that routinely lies. I have no idea why and really, I don't care; I was doing them a favour by letting them crawl me and I can and will withdraw that favour if they irritate me.

(See also my technical requirements for web spiders and my standards for responsible spider behavior.)

(No, I haven't mailed Diffbot's operators about this behavior. Are you kidding? I'm neither crazy nor stupid. On today's Internet, mailing people about issues is for people that you actually trust.)

DiffbotBadReferer written at 23:20:49
