How to get your syndication feed fetcher at least temporarily banned here

June 24, 2013

In the spirit of a previous series, here's how to get me to at least temporarily ban a syndication feed fetcher that appears potentially legitimate. This is not something that I like to do, because it potentially cuts off people who actually want to read Wandering Thoughts, but this case is so bad and so questionable that I'm doing it, at least for now.

So here's the procedure:

  • Make a lot of requests for the same feed. For example, request the main feed here once every ten minutes like clockwork (despite the fact that it doesn't change anywhere near that often).

  • Don't use any form of conditional GET, so you fetch the full feed every time (for contrast, there's a sketch of a polite fetcher's request after this list).

  • Don't support gzip encoding, so you fetch nearly half a megabyte every ten minutes.

  • Insert bogus Cookie headers into the request. In this case the feed fetcher appears to be leaking cookies set by other sites into requests to here, including some badly formed cookies that cause the standard Python cookie parser to throw errors (which get logged by DWiki, which is why I noticed all of this in the first place).

  • Don't have any meaningful reverse DNS and have a User-Agent: header of:
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2; Feeder.co) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31

    This is not a proper User-Agent for an automated feed fetcher. A proper User-Agent clearly identifies the responsible organization and makes it clear that the request is coming from a robotic agent. This is instead an almost complete imitation of a real web browser's User-Agent, with only an inconspicuous 'Feeder.co' to perhaps identify the actual responsible party (there really is a 'feeder.co' and they appear to do feed fetching).

  • Of course the Feeder.co website exposes almost no contact information and especially doesn't have a 'contact us here if our feed fetcher is doing something odd' page.
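
(To make the contrast concrete, here's a rough Python sketch of what the request side of a well-behaved feed fetcher looks like: conditional GET, gzip support, an honest User-Agent, and no stray cookies. The feed URL and User-Agent string are made-up placeholders, not anyone's real fetcher.)

    # A minimal sketch of a polite feed fetcher's request side; the feed URL
    # and User-Agent below are hypothetical placeholders.
    import gzip
    import urllib.error
    import urllib.request

    FEED_URL = "https://example.org/blog/?atom"
    USER_AGENT = "ExampleFeedFetcher/1.0 (+https://example.org/about-our-fetcher)"

    def fetch_feed(url, etag=None, last_modified=None):
        """Fetch a feed, reusing the ETag/Last-Modified from the previous fetch."""
        headers = {
            "User-Agent": USER_AGENT,    # identifies the operator and the robot
            "Accept-Encoding": "gzip",   # don't transfer the full feed uncompressed
        }
        # Conditional GET: if the feed hasn't changed, the server can answer
        # with a tiny 304 instead of the full feed body.
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        req = urllib.request.Request(url, headers=headers)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Unchanged since last time; nothing to download or parse.
                return None, etag, last_modified
            raise
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        # Save these so the next poll can be a conditional GET.
        return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")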

Under normal circumstances I would continue to allow this feed fetcher to pull my feed and send the people running it email about the problems, in the hopes that they'd fix them. But the User-Agent here smells very much like what spammers do, and with everything else going on I have no idea whether Feeder.co is even responsible for this or whether someone is abusing their vaguely good name. Certainly I don't feel like trusting them with any of my email addresses; even at best they are running a significantly bad feed fetcher and have made a number of extremely questionable decisions in operating it. It doesn't help that some of their program's bugs are drastically polluting my logs (due to the complaints about the malformed cookies).
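
(For the curious, the log noise comes from Python's standard cookie parser. The following is not DWiki's actual code, just a small illustration of that parser and the CookieError exception involved; how loudly load() itself objects to malformed input has varied between Python versions, so the bad-name case is shown directly.)

    # A small illustration (not DWiki's code) of the standard library cookie
    # parser and the exception it raises on illegal input.
    from http.cookies import CookieError, SimpleCookie

    c = SimpleCookie()
    c.load("session=abc123; theme=dark")       # well-formed cookies parse fine
    print({k: v.value for k, v in c.items()})  # {'session': 'abc123', 'theme': 'dark'}

    try:
        # Cookie names may only use a restricted character set; anything else
        # is rejected with CookieError, which a web app then has to log somewhere.
        c["bad,key"] = "oops"
    except CookieError as e:
        print("cookie parse error:", e)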

(If you do not support conditional GET, you have absolutely no business polling feeds at a rate anywhere near once every ten minutes. Never. Ever.)

(It's not just that the spammers have thoroughly poisoned the well for reaching out to random people on the Internet that you don't have any real knowledge of. It's also that telling people that their software has serious problems is sometimes an excellent way of sparking a great deal of drama (with a capital D). Especially if they are a commercial company.)

PS: I may reluctantly change my opinions here in a few days. I really don't like cutting Wandering Thoughts readers off, even if they are using a service with major problems.

(I've considered redirecting these requests to a very small Atom feed with a single entry that just says 'this feed fetcher is broken and not getting actual content, please switch software or report this to the operators', but that would require creating such a feed somehow. I suppose it wouldn't be too hard. Right now the feed requests are just getting 403 responses (and they are still coming in every ten minutes, which is another failure).)
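
(For what it's worth, such a feed really would be tiny. Here's a rough sketch of the kind of single-entry Atom feed I mean; the URLs and ids are placeholders, and the fixed timestamps are there so the warning entry stays stable across fetches.)

    # A sketch of a minimal one-entry Atom warning feed; all URLs and ids
    # below are placeholders, not Wandering Thoughts' real ones.
    WARNING_FEED = """<?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Wandering Thoughts</title>
      <id>https://example.org/blog/feed-warning</id>
      <link href="https://example.org/blog/"/>
      <updated>2013-06-24T00:00:00Z</updated>
      <entry>
        <title>This feed fetcher is broken</title>
        <id>https://example.org/blog/feed-warning#entry</id>
        <link href="https://example.org/blog/"/>
        <updated>2013-06-24T00:00:00Z</updated>
        <summary>This feed fetcher is broken and not getting actual content;
    please switch software or report this to the operators.</summary>
      </entry>
    </feed>
    """

    if __name__ == "__main__":
        print(WARNING_FEED)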


Comments on this page:

From 138.246.87.57 at 2013-06-24 08:23:26:

Perhaps "429 Too Many Requests" would be more semantic (or "509 Bandwidth Limit Exceeded").

By cks at 2013-06-24 11:34:04:

I'm doing the blocking with Apache access controls and I'm not sure if they give me any control over the actual error code returned. (And my energy level to explore this is low, all things considered, especially since I don't expect the feed fetcher to be paying any attention to the actual error code.)

By Myself at 2015-10-25 18:36:23:

For the record, if a website did that (as in replacing an RSS/etc feed with anything else), I would look into whatever they were saying, remove the feed, and never go there again.

Why? It's simple: there are a number of things I trust websites to be sensible about by default. One of those is "you're open". Once you start selectively changing things for people, even if it's probably benign, you've taken a first step down a very slippery slope.

By cks at 2015-10-26 02:05:18:

I disagree (of course). Websites are no more obliged to provide service to badly behaved feed fetchers than they are to badly behaved web spiders. And there are certainly both badly behaved feed fetchers and malicious ones (and ones that are somewhere in the middle). I wrote about some aspects of this in the context of web spiders here and here, but much of that applies to feed fetchers as well.

These days, many feed fetchers are operated by individual readers, and I'm willing to go out of my way to support them even when they fumble some aspect of feed fetching. But a certain number of fetchers are operated by (large) entities for uncertain purposes. When they are operated irresponsibly and in a way that makes it hard to tell whether or not they are malicious, I get annoyed at them. It's 2015. The Internet, the web, web crawling, and feed fetching are not new things at all. There is a vast reserve of knowledge and plenty of examples out there on how to do this right. If they cannot be bothered to do their research and pay attention, well, that's not a good sign.

(I do not block clearly well intentioned feed fetching services that fumble some aspect of things, like leaking other people's cookies into requests to me. The only time I'm likely to block such people is if they are doing something so bad that it's hammering the server. It is people who are less clearly good who are at steadily increasing risk of blocks.)

By cks at 2015-10-26 02:17:41:

Out of curiosity I took a look at my logs and of course they're still making attempts, more than two years after being banned. The User-Agent: has updated to:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2; Feeder.co) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36

Their fetch attempts appear to have slowed to a mere once an hour, repeated endlessly.

However, they've added a new brokenness to their fetching. Feed fetching here uses URLs that end with a '?atom' query string; note the lack of any '=value'. Parameter-less query strings are legal, although somewhat unusual. The requests I'm getting have the query string '?atom=', i.e. their software has converted the parameter-less query string into one with an empty-valued parameter. This simply doesn't work and is never going to work; it gets rejected by generic code.

(They are not the only people who make this mistake.)
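
(As a concrete illustration, and nothing like DWiki's actual code: the raw query strings simply differ, so anything that looks for the literal 'atom' query string will not match 'atom=', and round-tripping the query through a key/value representation is exactly the sort of thing that adds the '='.)

    # Illustration only: a hypothetical server-side check for the
    # parameter-less '?atom' form versus what a careless client produces.
    from urllib.parse import urlencode, urlsplit

    def wants_atom_feed(url):
        # Match the literal, parameter-less 'atom' query string.
        return urlsplit(url).query == "atom"

    print(wants_atom_feed("https://example.org/blog/?atom"))   # True
    print(wants_atom_feed("https://example.org/blog/?atom="))  # False: 'atom=' != 'atom'

    # Rebuilding the query string from a key/value mapping silently adds the '='.
    print(urlencode({"atom": ""}))                             # -> 'atom='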
