Wandering Thoughts archives

2013-10-02

What your User-Agent header should include and why

I wound up having a discussion about this in the context of a feed reader and it caused me to have a realization or two, so I've decided to write up my views on this. All of this is mostly from the perspective of a website operator; there are other ones.

There are three different cases: when you are writing a user agent, when you are writing a web robot, and when you are writing a web robot library (which will be used by possibly many web robot operators). The easiest case is when you're writing a client that will be directly used by real people. Here your User-Agent should identify the software by name and by a URL to your project site and give a general version number. It should not identify the user, either directly by name or indirectly by including additional client fingerprint information such as the platform it's running on. As a side note, your project site should include enough information to convince a suspicious website operator that it is a real client that gets used by real people.

(Some people will object to the version number but I think it's important to include because it lets me either tell people to upgrade because the upgrade fixes a problem or tell you that your latest code has some problem. If you leave the version number out all I can possibly report to your project is 'some version of your software does this bad thing'.)
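As a rough illustration of the client case, here is a minimal Python sketch that builds such a header value. The function name, the example project name, and the example URL are all made up for illustration, and the `Name/Version (+URL)` layout is just one common convention rather than anything required. Note what is deliberately absent: nothing about the user or the platform goes into the string.

```python
def client_user_agent(name: str, version: str, project_url: str) -> str:
    """Build a User-Agent value for software that real people run directly.

    Hypothetical helper: it includes only the software name, a general
    version number, and the project URL, and nothing about the user or
    the machine the client runs on.
    """
    for label, value in (("name", name), ("version", version),
                         ("project_url", project_url)):
        if not value or not value.strip():
            raise ValueError(f"{label} must not be empty")
    # Product token first, then the project URL in a parenthesised comment.
    return f"{name}/{version} (+{project_url})"


# Example values are placeholders, not a real project.
print(client_user_agent("ExampleFeedReader", "2.3",
                        "https://example.org/feedreader"))
```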

This is completely different for web robots. For web robots the User-Agent header must contain a clear identification of both your robot and of who is responsible for its operation, i.e. the URL of a web page describing who you are, what you do, and so on. There should be readable English on the page and a method of contacting you privately (such as email or a contact form). It is vaguely customary to include a version number but as a website operator I don't care in the least; you might as well always use '/1.0' if you feel a version number is required.

Including this information in your User-Agent is to your benefit because it encourages website operators to investigate and perhaps report a crawling problem instead of blocking you out of hand (either by User-Agent or by source IPs, or perhaps both). I have much harsher reactions to anonymous robots than I do to ones that are willing to identify themselves. Note that if you're a company running software from your servers that is poking my websites, you're a robot operator. At one level I don't care exactly why you're running the software or how many users it is helping; I still expect it to identify the specific party responsible for it. Fail to do this and I reach for the block tools.

(And yes, this very much applies to feed reader aggregator sites.)
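For the robot case, a sketch of what this might look like with Python's standard `urllib.request` follows. The robot name and the operator page URL are placeholders I've invented; the important part is that the URL points at a page about whoever operates the robot, not at the software's project page. The sketch only builds the request and does not send it.

```python
import urllib.request

# Placeholder identification; substitute your own robot name and a page
# that says who you are, what the robot does, and how to contact you.
ROBOT_NAME = "ExampleCrawler"
OPERATOR_INFO_URL = "https://crawler.example.com/about"


def robot_request(url: str) -> urllib.request.Request:
    """Return a Request whose User-Agent names the robot and its operator."""
    # The version is a fixed '/1.0'; only the identification matters here.
    user_agent = f"{ROBOT_NAME}/1.0 (+{OPERATOR_INFO_URL})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = robot_request("https://example.org/feed.atom")
# urllib stores header names capitalised as 'User-agent'.
print(req.get_header("User-agent"))
```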

If you're writing a web robot library you need to somehow force its users to add such a clear identification of themselves into the User-Agent (although including your library's project URL is nice, it is not an identification of the responsible party for the robot that is hitting my site). I'd put this into the library's configuration as a mandatory field or make it an optional setting but with the default value of something like 'UNCONFIGURED, BLOCK THIS ROBOT'. Note that if you supply 'sensible' default values, many of your library's users will never change them.
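One way a robot library could do this is sketched below in Python. Everything here (`RobotIdentity`, `build_user_agent`, the `examplebotlib/0.1` library token) is hypothetical naming for illustration, not any real library's API. The identity fields are mandatory when you create one, and if the library user supplies no identity at all, the User-Agent falls back to the deliberately unattractive default.

```python
from dataclasses import dataclass
from typing import Optional

# Deliberately ugly default so that unconfigured robots stand out in logs.
UNCONFIGURED = "UNCONFIGURED, BLOCK THIS ROBOT"


@dataclass(frozen=True)
class RobotIdentity:
    """Who runs this robot (hypothetical configuration object)."""
    robot_name: str
    operator_url: str  # page describing the operator, with contact details

    def __post_init__(self):
        if not self.robot_name.strip():
            raise ValueError("robot_name must not be empty")
        if not self.operator_url.startswith(("http://", "https://")):
            raise ValueError("operator_url must be an http(s) URL")


def build_user_agent(identity: Optional[RobotIdentity] = None,
                     library: str = "examplebotlib/0.1") -> str:
    """Build the User-Agent; without an identity, return the bad default."""
    if identity is None:
        return UNCONFIGURED
    # The library token is a nice extra, but the operator's identification
    # is the part that website operators actually need.
    return f"{identity.robot_name}/1.0 (+{identity.operator_url}) {library}"


print(build_user_agent())
print(build_user_agent(RobotIdentity("MyBot", "https://ops.example.net/bot")))
```

The design choice is to offer no plausible-looking default at all, so a library user who skips configuration gets blocked quickly instead of silently crawling under the library's name.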

(If you're writing a web library for use by real clients I wouldn't bother having any default User-Agent or putting your library's identification in. Just provide an API for supplying the user agent information and document what's a good idea to put in there. Make using the API mandatory because otherwise people won't. Putting your library information in as well is okay and potentially useful, but your library information alone in the User-Agent is completely useless to website operators because it tells us nothing about what is visiting.)
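A sketch of a mandatory API for a client library, again in Python with invented names (`FeedFetcher` is hypothetical): the `user_agent` argument is keyword-only and has no default, so code using the library cannot construct a fetcher without deciding what to send.

```python
import urllib.request


class FeedFetcher:
    """Hypothetical client-library class with no default User-Agent."""

    def __init__(self, *, user_agent: str):
        # No default value: leaving out user_agent raises TypeError.
        if not user_agent or not user_agent.strip():
            raise ValueError("user_agent is required; identify the client "
                             "software, for example 'Name/Version (+URL)'")
        self._user_agent = user_agent

    def build_request(self, url: str) -> urllib.request.Request:
        """Return a Request carrying the caller-supplied User-Agent."""
        return urllib.request.Request(
            url, headers={"User-Agent": self._user_agent})


fetcher = FeedFetcher(
    user_agent="ExampleFeedReader/2.3 (+https://example.org/feedreader)")
req = fetcher.build_request("https://example.org/feed.atom")
print(req.get_header("User-agent"))
```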

web/UserAgentContentsView written at 01:07:28


