Some things I mean when I talk about 'forged HTTP referers'

March 7, 2018

One of the most reliable and often the fastest ways to get me to block people from Wandering Thoughts is to do something that causes my logs to become noisy or useless. One of those things is persistently making requests with inaccurate Referer headers, because I look at my Referer logs on a regular basis. When I talk about this, I'll often use the term 'forged' here, as in 'forged referers' or 'referer-forging web spider'.

(I've been grumpy about this for a long time.)

I have casually used the term 'inaccurate' up there, as well as the strong term 'forged'. But given that the Referer header is informational, explicitly comes with no guarantees, and is fully under the control of the client, what does that really mean? As I use it, I tend to have one of three different meanings in mind.

First, let's say what an accurate referer header is: it's when the referer header value is an honest and accurate representation of what happened. Namely, a human being was on the URL in the Referer header and clicked on a link that sent them to my page, or on the site if you only put the site in the Referer. A blank Referer header is always acceptable, as are at least some Referer headers that aren't URLs if they honestly represent what a human did to wind up on my page.

An inaccurate Referer in the broad sense is any Referer that isn't accurate. There are at least two ways for it to be inaccurate (even if it is a human action). The lesser inaccuracy is if the source URL contains a link to my page but doesn't actually represent how the human wound up on my page; it's just a (random) plausible value. Such referers are inaccurate now but could be accurate in other circumstances. The greater inaccuracy is if the source URL doesn't even link to my page, so it would never be possible for the Referer to be accurate. Completely bogus referers are usually more irritating than semi-bogus referers, although this is partly a taste issue (both are irritating, honestly, but one shows you're at least trying).

(I'd like better terms for these two sorts of referers; 'bogus' and 'plausible' are the best I've come up with so far.)

As noted, I will generally call both of these cases 'forged', not just 'inaccurate'. Due to my view that Referer is a human-only header, I use 'forged' for basically all referers that are provided by web spiders and the like. I can imagine circumstances when I'd call Referer headers sent by a robot merely 'inaccurate', but they'd be pretty far out and I don't think I've ever run into them.
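
(As an illustration only, here is a minimal Python sketch of how these labels fit together. It is not anything I actually run. The function name, the KNOWN_LINKERS table, the example URLs, and the from_robot flag are all hypothetical, and deciding whether a client is a robot or which pages really link to you is assumed to happen elsewhere. Note that 'plausible' here only means the Referer could be accurate; the header alone can't tell you whether a human really followed that link.)

```python
from urllib.parse import urlsplit

# Hypothetical data: for each of our paths, the external pages known to link to it.
KNOWN_LINKERS = {
    "/blog/example-entry": {"https://example.org/links.html"},
}


def classify_referer(referer, target_path, from_robot, known_linkers=KNOWN_LINKERS):
    """Return a rough label for one request's Referer header."""
    if not referer:
        # A blank Referer is always acceptable.
        return "blank"
    if from_robot:
        # Basically any Referer supplied by a web spider counts as forged.
        return "forged"
    parts = urlsplit(referer)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        # Not a web URL; it may still honestly describe what a human did.
        return "unknown"
    linkers = known_linkers.get(target_path, set())
    if referer in linkers:
        # The source page does link here, so this could be accurate.
        return "plausible"
    linker_sites = {urlsplit(url).netloc for url in linkers}
    if parts.path in ("", "/") and not parts.query and parts.netloc in linker_sites:
        # Only the site was sent, and that site has a page that links here.
        return "plausible"
    # The source doesn't link here, so it could never be accurate.
    return "bogus"
```

(For example, classify_referer("https://example.org/links.html", "/blog/example-entry", from_robot=False) comes out as 'plausible', while the same Referer with from_robot=True comes out as 'forged'.)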

The third case and the strongest sense of 'forged' for me is when the Referer header has clearly been selected because the web spider is up to no good. One form of this is Referer spamming (which seems to have died out these days, thankfully). Another form is when whatever is behind the requests looks like it's deliberately picking Referer values to try to evade any security precautions that might be there. A third form is when your software uses the Referer field to advertise yourself in some way, instead of leaving this to the User-Agent field (which has happened, although I don't think I've seen it recently).

(Checking for appropriate Referer values is a weak security precaution that's easy to bypass and not necessarily a good idea, but like most weak security precautions it does have the virtue of making it pretty clear when people are deliberately trying to get around it.)

PS: Similar things apply when I talk about other 'forged' fields, especially User-Agent. Roughly speaking, I'll definitely call your U-A forged if you aren't human and it misleads about what you are. If you're a real human operating a real browser, I consider it your right to use whatever U-A you want to, including completely misleading ones. Since I'm human and inconsistent, I may still call it 'forged' in casual conversation for convenience.


Comments on this page:

So if a spider uses Referer to denote which page led it to the one it’s requesting – i.e. it uses Referer for the header’s specified purpose, but without human involvement – that would count as both “forged” and “inaccurate”?

By cks at 2018-03-08 14:51:00:

For the general case of web spiders, I would call this 'forged' in my usual broad sense and then I'd probably wind up hedging my use of 'inaccurate' or just sticking with 'forged' alone. A web spider doing this with good intentions is one of the areas that my use of these terms probably doesn't fit very gracefully, at least with how people are generally going to take them.

(My terminology is strongly biased by my belief that Referer is pretty much only a header for human-driven browsing to use. As a result I consider basically all use of it by spiders as bad in some way; the only question is what to call it. Since spiders (ab)using it makes me grumpy, I tend to wind up using grumpy terminology.)

My terminology is strongly biased by my belief that Referer is pretty much only a header for human-driven browsing to use.

But what brings you to this belief? It seems to me that this article is necessitated by the fact that you hold that belief but haven’t explained why. Probably if it was clear why you thought this, then the reason for your terminological choices would be largely obvious.

Which I suppose is another way of saying that I feel like you left out the interesting part. 😊

By cks at 2018-03-08 21:10:54:

Sadly, I've fallen into my bad habit of being too indirect in my linking and link titles here. I have an older entry on Why web robots sending Referer headers is wrong and linked to it in this entry, but didn't clearly indicate that that was what the link embedded in my text was about.

(I have a general issue with this when I write entries, and I should probably write an entry about it. The short version is that I think I treat too many links as footnotes when some of them are more than that. It's always a temptation because it means I can just hook the link to some handy text that's already there instead of figuring out how to throw in a title or a parenthetical aside or the like.)

By Jukka at 2018-03-10 12:18:05:

A side remark: if the assumption that Referer reflects human behavior holds, then it would be better for privacy reasons if the whole field would be deprecated...

Written on 07 March 2018.


Last modified: Wed Mar 7 23:30:55 2018