2017-02-17
robots.txt is a hint and a social contract between sites and web spiders
I recently read the Archive Team's "Robots.txt is a suicide note" (via), which
strongly advocates removing your robots.txt. As it happens, I have a somewhat
different view (including about the claim that sites don't crash under load
any more; we have students who beg to differ).
The simple way to put it is that the things I add to robots.txt
are hints to web spiders. Some of the time they are a hint that
crawling the particular URL hierarchy will not be successful anyways,
for example because the hierarchy requires authentication that the
robot doesn't have. We have inward facing websites
with sections that provide web-based services to local users, and
for that matter we have a webmail system. You can try to crawl those
URLs all day, but you're not getting anywhere and you never will.
Some of the time my robots.txt
entries are a hint that if you
crawl this anyways and I notice, I will use server settings to block
your robot from the entire site, including content that I was letting
you crawl before then. Presumably you would like to crawl some of
the content instead of none of it, but if you feel otherwise, well,
crawl away. The same is true of signals like Crawl-Delay; you can
decide to ignore these, but if you do our next line of defense is
blocking you entirely. And we will.
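As a concrete illustration, a robots.txt along these lines expresses both kinds of hints (the paths here are invented placeholders, not our actual layout):

```
# Hypothetical example; the paths are placeholders.
User-agent: *
# Authenticated, inward-facing areas: crawling these will never succeed.
Disallow: /webmail/
Disallow: /internal-services/
# Areas we don't want crawled; ignore this and we block you site-wide.
Disallow: /private-archives/
# Non-standard but widely honored: wait this many seconds between requests.
Crawl-delay: 10
```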
(There are other sorts of hints, and for complex URL structures some of the hints of all sorts are delivered through nofollow. Beyond not irritating me, there are good operational reasons to pay attention to this.)
This points to the larger scale view of what robots.txt
is, which
is a social contract between sites and web spiders. Sites say
'respect these limits and we will (probably) not block you further'.
As a direct consequence of this, robots.txt
is also one method
to see whether a web spider is polite and well behaved or whether
it is rude and nasty. A well behaved web spider respects robots.txt;
a nasty one does not. Any web spider that is crawling URLs that are
blocked in a long-standing robots.txt
is not a nice spider, and you can immediately proceed to whatever
stronger measures you feel like using against such things (up to
and including firewall IP address range bans, if you want).
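The polite-spider side of this check is easy to get right; here is a minimal sketch using Python's standard urllib.robotparser, with an inline robots.txt and made-up URLs (a real spider would fetch the file with set_url() and read() instead):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed inline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /webmail/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite spider checks every URL before fetching it and honors the delay.
print(rp.can_fetch("ExampleBot", "https://example.org/webmail/inbox"))  # False
print(rp.can_fetch("ExampleBot", "https://example.org/blog/entry"))     # True
print(rp.crawl_delay("ExampleBot"))                                     # 10
```

A spider fetching URLs that can_fetch() would have refused is, by definition, ignoring the contract.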
By the way, it is a feature that robots self-identify when matching
robots.txt. An honest and polite web spider is in a better position to
know what it is than a site that has to look at the User-Agent and
other indicators, especially because people do dangerous things with
their user-agent strings. If I ban a bad robot via server settings
and you claim to be sort of like that bad robot for some reason,
I'm probably banning you too as a side effect, and I'm unlikely
to care if that's a misfire; by and large it's your problem.
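In Apache 2.4 terms, such a server-side ban might look like the following sketch (the "BadBot" pattern and the use of Location are my invention for illustration). Because SetEnvIfNoCase matches a substring of the User-Agent, anything that claims to be "sort of like" the banned robot matches too:

```
# Hypothetical Apache 2.4 fragment: any User-Agent that merely
# *contains* "BadBot" is tagged, so look-alikes are banned as well.
SetEnvIfNoCase User-Agent "BadBot" banned_robot
<Location "/">
    <RequireAll>
        Require all granted
        Require not env banned_robot
    </RequireAll>
</Location>
```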
(With all of this said, the Archive Team has a completely sensible
reason for ignoring robots.txt
and I broadly support them doing
so. They will run into various sorts of problems from time to time
as a result of this, but they know what they're doing so I'm sure
they can sort the problems out.)
Sometimes, firmware updates can be a good thing to do
There are probably places that routinely apply firmware updates to every piece of hardware they have. Oh, sure, with a delay and in stages (rushing into new firmware is foolish), but it's always in the schedule. We are not such a place. We have a long history of trying to do as few firmware updates as possible, for the usual reason; usually we don't even consider it unless we can identify a specific issue we're having that new firmware (theoretically) fixes. And if we're having hardware problems, 'update the firmware in the hope that it will fix things' is usually last on our list of troubleshooting steps; we tacitly consider it down around the level of 'maybe rebooting will fix things'.
I mentioned the other day that we've inherited a 16-drive machine with a 3ware controller card. As far as we know, this machine worked fine for the previous owners in a hardware (controller) RAID-6 configuration across all the drives, but we've had real problems getting it stable for us in a JBOD configuration (we much prefer to use software RAID; among other things, we already know how to monitor and manage that with Ubuntu tools). We had system lockups, problems installing Ubuntu, and under load such as trying to scan a 14-disk RAID-6 array, the system would periodically report errors such as:
sd 2:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
(This isn't even for a disk in the RAID-6 array; sd 2:0:0:0 is one of the mirrored system disks.)
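The kind of eyeballing I was doing can be scripted; here is a minimal sketch (the regex is tailored to the 3ware timeout message above, and count_card_resets is a name I made up) that tallies card resets per SCSI device from a saved kernel log:

```python
import re

# Matches 3ware driver timeout lines like the one quoted above, e.g.:
#   sd 2:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
RESET_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): WARNING: \(0x06:0x002C\): "
    r"Command \(0x[0-9a-fA-F]+\) timed out, resetting card\."
)

def count_card_resets(log_text):
    """Return a dict mapping SCSI device IDs to how often the card reset."""
    counts = {}
    for match in RESET_RE.finditer(log_text):
        dev = match.group(1)
        counts[dev] = counts.get(dev, 0) + 1
    return counts

sample = (
    "sd 2:0:0:0: WARNING: (0x06:0x002C): "
    "Command (0x2a) timed out, resetting card.\n"
)
print(count_card_resets(sample))  # {'2:0:0:0': 1}
```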
Some Internet searches turned up people saying 'upgrade the firmware'. That felt like a stab in the dark to me, especially if the system had been working okay for the previous owners, but I was getting annoyed with the hardware and the latest firmware release notes did talk about some other things we might want (like support for disks over 2 TB). So I figured out how to do a firmware update and applied the 'latest' firmware (which for our controller dates from 2012).
(Unsurprisingly the controller's original firmware was significantly out of date.)
I can't say that the firmware update has definitely fixed our problems with the controller, but the omens are good so far. I've been hammering on the system for more than 12 hours without a single problem report or hiccup, which is far better than it ever managed before, and some things that had been problems before seem to work fine now.
All of this goes to show that sometimes my reflexive caution about firmware updates is misplaced. I don't think I'm ready to apply all available firmware updates before something goes into production, not even long-standing ones, but I'm certainly more ready than before to consider them even in cases where there's no clear reason to do so. Perhaps I should treat firmware updates as a reasonably early troubleshooting step when I'm dealing with otherwise mysterious failures.