What your User-Agent header should include and why
I wound up having a discussion about this in the context of a feed reader, and it caused me to have a realization or two, so I've decided to write up my views. All of this is mostly from the perspective of a website operator; there are other perspectives.
There are three different cases: when you are writing a user agent,
when you are writing a web robot, and when you are writing a web robot
library (which may be used by many different web robot operators). The
easiest case is when you're writing a client that will be directly used
by real people. Here your User-Agent should identify the software by
name, include a URL for your project site, and give a general version
number. It should not identify the user, either directly by name or
indirectly by including additional client fingerprint information such
as the platform it's running on. As a side note, your project site
should include enough information to convince a suspicious website
operator that it is a real client that gets used by real people.
(Some people will object to the version number but I think it's important to include because it lets me either tell people to upgrade because the upgrade fixes a problem or tell you that your latest code has some problem. If you leave the version number out all I can possibly report to your project is 'some version of your software does this bad thing'.)
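As a concrete illustration (the client name and URL here are made up),
a feed reader's User-Agent might look something like:

    User-Agent: SnappyReader/1.4 (+https://example.org/snappyreader)

The '+' in front of the URL is a loose convention meaning 'see this
page for more information about this software'.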
This is completely different for web robots. For web robots, the
User-Agent header must contain a clear identification of both your
robot and of who is responsible for its operation, i.e. the URL of a web
page describing who you are, what you do, and so on. There should be
page describing who you are, what you do, and so on. There should be
readable English on the page and a method of contacting you privately
(such as email or a contact form). It is vaguely customary to include
a version number but as a website operator I don't care in the least;
you might as well always use '/1.0' if you feel a version number is
required.
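For example (again with a made-up name and URL), a reasonable robot
User-Agent would be:

    User-Agent: ExampleCrawler/1.0 (+https://example.org/crawler.html)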
Including this information in your User-Agent
is to your benefit
because it encourages website operators to investigate your crawler and
perhaps report problems to you instead of blocking you out of hand (either by user-agent or by source IPs, or perhaps
both). I have much harsher reactions to anonymous robots than I do to
ones that are willing to identify themselves. Note that if you're a
company running software from your servers that is poking my websites,
you're a robot operator. At one level I don't care exactly why you're
running the software or how many users it is helping; I still expect it
to identify the specific party responsible for it. Fail to do this
and I reach for the block tools.
(And yes, this very much applies to feed reader aggregator sites.)
If you're writing a web robot library you need to somehow force its
users to add such a clear identification of themselves into the
User-Agent
(although including your library's project URL is nice, it
is not an identification of the responsible party for the robot that
is hitting my site). I'd put this into the library's configuration as
a mandatory field or make it an optional setting but with the default
value of something like 'UNCONFIGURED, BLOCK THIS ROBOT'. Note that if
you supply 'sensible' default values, many of your library's users will
never change them.
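As a minimal sketch of the mandatory-field approach (the library name,
class, and parameter names here are all invented for illustration, not
taken from any real crawling library):

    # Hypothetical robot library; 'Crawler' and its API are illustrative.
    import urllib.request

    LIBRARY_ID = "hypothetical-crawlib/1.0"

    class Crawler:
        def __init__(self, user_agent=None):
            # Refuse to run unless the operator has identified themselves.
            # An alternative is to default user_agent to something like
            # 'UNCONFIGURED, BLOCK THIS ROBOT' so that lazy operators get
            # blocked instead of passing as anonymous traffic.
            if not user_agent:
                raise ValueError(
                    "user_agent is mandatory: identify your robot and give "
                    "a URL describing who you are and how to contact you")
            # The library's own identification supplements the operator's
            # identification; it can't replace it.
            self.user_agent = "%s %s" % (user_agent, LIBRARY_ID)

        def fetch(self, url):
            req = urllib.request.Request(
                url, headers={"User-Agent": self.user_agent})
            return urllib.request.urlopen(req)

An operator would then have to write something like
Crawler(user_agent="ExampleCrawler/1.0 (+https://example.org/crawler.html)"),
which is exactly the identification I want to see as a website operator.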
(If you're writing a web library for use by real clients I wouldn't
bother having any default User-Agent
or putting your library's
identification in. Just provide an API for supplying the user agent
information and document what's a good idea to put in there. Make using
the API mandatory, because otherwise people won't use it. Including your
library information as well is okay and potentially useful, but your library
information alone in the User-Agent
is completely useless to website
operators because it tells us nothing about what is visiting.)
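One possible shape for that API, as a hedged sketch (this wraps the
real Python 'requests' library, but the function and the
'hypothetical-httplib' tag are made up for illustration):

    # Hypothetical constructor that makes identification mandatory.
    import requests

    def new_session(user_agent):
        if not user_agent:
            raise ValueError(
                "user_agent is required: identify your client software "
                "by name, version, and project URL")
        sess = requests.Session()
        # The caller's identification comes first; the library's own
        # tag merely supplements it.
        sess.headers["User-Agent"] = user_agent + " hypothetical-httplib/0.1"
        return sess

The important design choice is that there is no way to get a working
session without saying who you are; a built-in default would defeat
the point.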