2013-05-31
The mystery of POSTs with a zero Content-Length
One of the joys of running web software that is rather paranoid is
getting to see all sorts of weird things that float around the web,
generally run by spammers and other people who are up to no good.
Today's oddity could be called 'the case of the zero-length POST'
and is just what it sounds like: POST requests that have a
Content-Length of 0 bytes.
(Or at least they have a Content-Length of '0' after Apache gets
through passing the request to DWiki. It's possible that Apache is
silently sanitizing some bizarre C-L header value to 0.)
I don't know the full headers for these POST requests but because
of the code flow inside DWiki, I know that they claim to be form
submissions (I check Content-Type before looking at Content-Length).
It's possible that the requests have some other header that is supposed
to preempt Content-Length. It's also possible that this software is
submitting empty POST form requests to see what happens or because
this evades security precautions in some applications.
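To make the code-flow point above concrete, here is a minimal sketch of
checking Content-Type before Content-Length for a WSGI POST request.
This is not DWiki's actual code; the function name and details are
invented for illustration.

    def looks_like_form_post(environ):
        # Only POST requests can be form submissions at all.
        if environ.get('REQUEST_METHOD') != 'POST':
            return False
        # Look at the claimed Content-Type first, before ever touching
        # Content-Length; this is the ordering described above.
        ctype = environ.get('CONTENT_TYPE', '')
        if not ctype.startswith('application/x-www-form-urlencoded'):
            return False
        # Only now does Content-Length matter. The mystery requests get
        # this far and then turn out to claim a length of 0.
        try:
            return int(environ.get('CONTENT_LENGTH', '0')) > 0
        except ValueError:
            return False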
Based purely on the claimed User-Agent values I can say that this
software is up to no good, since some of the time it claims to be
Googlebot (or at least some of the requests claim to be from Googlebot,
since I suppose I shouldn't assume that there's just one piece of
software that's doing this). All of the requests I've pulled out of
the logs seem to be HTTP/1.1 requests and generally are for regular
URLs. The software involved also almost always lower-cases the
URLs it's using, which doesn't work very well here.
(Looking at User-Agent suggests there may be two different programs
involved, one of which claims to be Googlebot and one of which
doesn't send a User-Agent at all. Only the Googlebot-faker seems
to lowercase its URLs; the other program mostly POSTs to
Wandering Thoughts' main page but occasionally POSTs to
other, correctly-cased URLs. The second program seems to be the
more active one.)
I don't have any answers to this particular mystery; in fact, now that I've looked into it, it's more mysterious than before. Sometimes that's how it goes on the web these days.
Sidebar: volume and source details
These requests aren't happening in high volume but are generally happening several times a day from various IPs. In the last ten days there have been at least 150 instances from 50 different IPs; the five most prolific IPs made 22, 13, 11, 8, and 6 requests respectively. I haven't tried to run down the origin of all of the IPs, but China shows up a lot in the top-N list. One IP is currently in the SBL, in SBL181621 (a /24 listing due to blackhat SEO spammer hosting).
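(For what it's worth, the per-IP counts above come from a simple tally of POST requests in the Apache access logs. A rough sketch of the sort of thing involved, assuming the standard 'combined' log format; the file name and details are illustrative, not my actual script.)

    import collections
    import re

    # Matches the client IP, request method, and URL at the start of a
    # standard Apache 'combined' format log line.
    logline = re.compile(r'^(\S+) \S+ \S+ \[[^]]+\] "(\S+) (\S+)')

    counts = collections.Counter()
    with open("access_log") as fp:
        for line in fp:
            m = logline.match(line)
            if m and m.group(2) == "POST":
                counts[m.group(1)] += 1

    # The five most prolific POSTing IPs.
    for ip, n in counts.most_common(5):
        print(n, ip)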
2013-05-24
My issue with infinite scrolling web pages: the lack of a stopping point
Lately, 'infinite scrolling' web pages have become a popular design technique. These are pages that add additional content as you scroll down; keep scrolling and they'll keep adding, basically forever. While I can see the attraction of infinite scrolling, for me one of its big features is also my biggest issue with it.
One big appeal of infinite scrolling is that the user never has to interrupt their actions to go on. There is no scrolling to the bottom of one page, following a 'next page' link, and then starting again; instead, the same action of scrolling down is used both for moving through current content and 'navigating' to new content. And this is exactly the problem. When there is a distinct 'go to next' action to do, the bottom of a page (or wherever it is) is a natural place for me to consider whether I want to stop or go on. In an infinite scrolling page there is no such natural stopping point.
Another way to put this is that paged navigation creates natural chunks for me to consume content in, while infinite scrolling simply points a massive firehose at me. I find it easier to consume content in chunks; the paradoxical result of feeding me a navigation-free firehose is that I actually browse less of it for various reasons. Sometimes when there's no end it also feels like there's no real point.
(This thought was sparked by Flickr's recent redesign, which added infinite scrolling in a number of new places and thereby discouraged me from scrolling much in general.)
(I don't think that this is just me being a curmudgeon about change, but I could be wrong. And I'm handwaving away many possible implementation issues and concerns with infinite scrolling. People who want to can find lots of other reasons to dislike it. I doubt it's going away for any number of reasons, including that I suspect it creates much easier navigation on touch interfaces.)
2013-05-23
Why web robots sending Referer headers is wrong
I've written before on my view that web robots of all sorts should
never send a Referer header. In those entries I mostly said 'don't do
that' without giving a solid philosophical argument about why, so today
I feel like changing that.
(Not that a philosophical argument actually matters. Proper behavior
on the web is defined by social convention, i.e. by what lots of other
people do and expect, not by arguing with people over what makes
sense. Whether or not you agree with a social convention, you break it
at your peril, and today robots not sending Referer headers is a
well-established social convention that I will ban you for violating.
And anyway, the people who should read this never will.)
There are two philosophical reasons why it's wrong for robots to
send Referer headers. The first is inherent in what the Referer
header means, namely 'I just followed a link from page <X>'. This is a
description of human behavior but not really of robot behavior; almost
no web robot actually traverses the web in that way, finding links and
immediately following them. If you crawl web pages, accumulate links,
and then some time later crawl those links, you are not 'following a
link' in any conventional sense. Worse, what happens if you discover
the same link through multiple source documents? Which document gets
'credit' and appears in Referer?
(Yes, yes, this is not quite the spec definition, which kind of permits the 'I found it here' meaning that robots sometimes use. It is instead the practical definition of the header, as defined by how most everything behaves.)
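To make that concrete, here is a hypothetical sketch of how a typical
crawler actually behaves: it accumulates links, possibly seen on
several different pages, and fetches them later. At fetch time there
is no single page it 'just followed a link from' and thus nothing
sensible to put in Referer. None of the URLs or names here belong to
any real crawler.

    import urllib.request

    # Hypothetical crawler state: each discovered URL, mapped to the
    # set of pages it was seen linked from.
    discovered = {
        "http://example.com/": {
            "http://example.org/index.html",
            "http://example.net/links.html",
        },
    }

    for url, seen_on in discovered.items():
        # Which page in seen_on would 'deserve' to be the Referer?
        # None of them, really; the polite choice is to not send the
        # header at all.
        req = urllib.request.Request(url, headers={
            "User-Agent": "example-crawler/0.1 (+http://example.com/about)",
        })
        with urllib.request.urlopen(req) as resp:
            body = resp.read()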
So, you say, you don't care; you want to use Referer as a kind of
'this is what links to you' field for servers. I can summarize a bunch
of problems here by saying that the Referer field is a terrible way
to communicate this information to web operators, fundamentally because
you are trying to use a side effect of HTTP requests to pass on what may
be a huge amount of information. If you actually want to be useful you
should make this information available on your own web site where people
can see and fetch it in bulk.
Finally, the brutal truth is that 'who links to me' is far less
interesting than 'who is sending human traffic to me (right now)'. By
far the most valuable part of Referer is information on where real
(human) visitors are coming from, to the extent that it's possible
to find this out. Being read by people is
the ultimate purpose of most web pages, which makes the places that are
the source of traffic and active links of decided interest to
us. And this sort of human behavior has very little to do with either
robot behavior or what potential links exist out there in the world.
Mingling either your robot's actions or a 'helpful' attempt to tell us
about the latter into this information is not doing us any favours;
rather the contrary, in fact (this is one large reason that I react
angrily to robots sending Referer).
(There is also the inconvenient fact that once you're operating a decent-sized site, you're not likely to really care about who links to you because there will be far too many links out there, most of them in increasingly obscure and unimportant places. The links you do care about are exactly the links that send you significant traffic.)
2013-05-21
Diffbot's bad Referer header
Today a web spider called 'Diffbot' (run by diffbot.com) made a whole
bunch of requests here, all of which failed. They failed because, just
as it has repeatedly done in the past, it made them all with a Referer
header of 'http://news.google.com/' and this behavior long ago led me
to ban it entirely from here.
There are a number of things wrong with this header. The first is that,
to steal from the old Trix commercials, 'silly robot, the Referer
header is for humans'. I've written about this before at some length, and doing it here is generally a good way to get
your spider banned.
(I have a philosophical ramble about why this is the correct view, but it's going in another entry.)
The second is that, of course, this Referer value is a flaming lie
in two different ways. Diffbot in no way, shape, or form traveled from
news.google.com to the whole collection of URLs here that it attempted
to crawl with that Referer header, and on top of that, news.google.com
does not link to here at all. Diffbot made up the header from whole
cloth. I react very badly to web spiders that lie to me at the best of
times (even if they aren't spraying junk over my referer logs).
Diffbot and its operators may or may not be legitimate, or at least honest about what they're doing; I have no particular opinions on that. But they are unquestionably operating a web spider that routinely lies. I have no idea why and really, I don't care; I was doing them a favour by letting them crawl me and I can and will withdraw that favour if they irritate me.
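(For the curious, a ban like this doesn't have to be anything sophisticated. Here is a minimal sketch of one way to do it at the WSGI layer; it's illustrative only, with a made-up agent list, and is not how Wandering Thoughts actually implements its bans.)

    # Reject requests from banned spiders before they reach the real
    # application. The substring list here is just an example.
    BANNED_AGENT_SUBSTRINGS = ("Diffbot",)

    def ban_middleware(app):
        def wrapped(environ, start_response):
            agent = environ.get('HTTP_USER_AGENT', '')
            if any(bad in agent for bad in BANNED_AGENT_SUBSTRINGS):
                start_response('403 Forbidden',
                               [('Content-Type', 'text/plain')])
                return [b"You have been banned from here.\n"]
            return app(environ, start_response)
        return wrapped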
(See also my technical requirements for web spiders and my standards for responsible spider behavior.)
(No, I haven't mailed Diffbot's operators about this behavior. Are you kidding? I'm neither crazy nor stupid. On today's Internet, mailing people about issues is for people that you actually trust.)