Wandering Thoughts archives


robots.txt is a hint and a social contract between sites and web spiders

I recently read the Archive Team's Robots.txt is a suicide note (via), which strongly advocates removing your robots.txt. As it happens, I have a somewhat different view (including about its claim that sites don't crash under load any more; we have students here who beg to differ).

The simple way to put it is that the things I add to robots.txt are hints to web spiders. Some of the time they are a hint that crawling the particular URL hierarchy will not be successful anyways, for example because the hierarchy requires authentication that the robot doesn't have. We have inward facing websites with sections that provide web-based services to local users, and for that matter we have a webmail system. You can try to crawl those URLs all day, but you're not getting anywhere and you never will.

Some of the time my robots.txt entries are a hint that if you crawl this anyways and I notice, I will use server settings to block your robot from the entire site, including content that I was letting you crawl before then. Presumably you would like to crawl some of the content instead of none of it, but if you feel otherwise, well, crawl away. The same is true of signals like Crawl-Delay; you can decide to ignore these, but if you do our next line of defense is blocking you entirely. And we will.
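To make this concrete, here is a sketch of a robots.txt expressing both kinds of hints; the paths and the delay value are invented for illustration:

```
# These hierarchies require authentication a robot doesn't have;
# crawling them will never get anywhere.
User-agent: *
Disallow: /webmail/
Disallow: /internal/

# Nonstandard but widely recognized: please pause between requests.
# Robots that ignore this risk being blocked from the site entirely.
Crawl-delay: 10
```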

(There are other sorts of hints, and for complex URL structures some of the hints of all sorts are delivered through nofollow. Beyond not irritating me, there are good operational reasons to pay attention to this.)
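For reference, nofollow hints ride along in the HTML itself rather than in robots.txt, either per-link or page-wide; the URL here is invented:

```
<!-- Per-link hint: don't follow this particular link -->
<a href="/calendar?month=next" rel="nofollow">next month</a>

<!-- Page-wide hint, placed in the <head>: follow no links on this page -->
<meta name="robots" content="nofollow">
```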

This points to the larger scale view of what robots.txt is, which is a social contract between sites and web spiders. Sites say 'respect these limits and we will (probably) not block you further'. As a direct consequence of this, robots.txt is also one method to see whether a web spider is polite and well behaved or whether it is rude and nasty. A well behaved web spider respects robots.txt; a nasty one does not. Any web spider that is crawling URLs that are blocked in a long-standing robots.txt is not a nice spider, and you can immediately proceed to whatever stronger measures you feel like using against such things (up to and including firewall IP address range bans, if you want).
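One way to see the spider's side of this contract is Python's standard-library robots.txt parser, which is what a polite crawler consults before fetching anything. A minimal sketch, with a made-up robots.txt and bot name:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration; a real spider would
# fetch and parse https://example.com/robots.txt instead.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite spider checks before every fetch and honors the answer.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("ExampleBot"))                                    # 10
```

A spider that fetches URLs where can_fetch() returns False is, by this contract, fair game for stronger measures.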

By the way, it is a feature that robots self-identify when matching robots.txt. An honest and polite web spider is in a better position to know what it is than a site that has to guess from the User-Agent and other indicators, especially because people do dangerous things with their user-agent strings. If I ban a bad robot via server settings and you claim to be sort of like that bad robot for some reason, I'm probably banning you too as a side effect, and I'm unlikely to care if that's a misfire; by and large it's your problem.

(With all of this said, the Archive Team has a completely sensible reason for ignoring robots.txt and I broadly support them doing so. They will run into various sorts of problems from time to time as a result of this, but they know what they're doing so I'm sure they can sort the problems out.)

web/RobotsTxtHintAndSocialContract written at 23:16:33

Sometimes, firmware updates can be a good thing to do

There are probably places that routinely apply firmware updates to every piece of hardware they have. Oh, sure, with a delay and in stages (rushing into new firmware is foolish), but it's always in the schedule. We are not such a place. We have a long history of trying to do as few firmware updates as possible, for the usual reason; usually we don't even consider it unless we can identify a specific issue we're having that new firmware (theoretically) fixes. And if we're having hardware problems, 'update the firmware in the hope that it will fix things' is usually last on our list of troubleshooting steps; we tacitly consider it down around the level of 'maybe rebooting will fix things'.

I mentioned the other day that we've inherited a 16-drive machine with a 3ware controller card. As far as we know, this machine worked fine for the previous owners in a hardware (controller) RAID-6 configuration across all the drives, but we've had real problems getting it stable for us in a JBOD configuration (we much prefer to use software RAID; among other things, we already know how to monitor and manage that with Ubuntu tools). We had system lockups, problems installing Ubuntu, and under load such as trying to scan a 14-disk RAID-6 array, the system would periodically report errors such as:

sd 2:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

(This isn't even for a disk in the RAID-6 array; sd 2:0:0:0 is one of the mirrored system disks.)

Some Internet searches turned up people saying 'upgrade the firmware'. That felt like a stab in the dark to me, especially if the system had been working okay for the previous owners, but I was getting annoyed with the hardware and the latest firmware release notes did talk about some other things we might want (like support for disks over 2 TB). So I figured out how to do a firmware update and applied the 'latest' firmware (which for our controller dates from 2012).

(Unsurprisingly the controller's original firmware was significantly out of date.)

I can't say that the firmware update has definitely fixed our problems with the controller, but the omens are good so far. I've been hammering on the system for more than 12 hours without a single problem report or hiccup, which is far better than it ever managed before, and some things that had been problems before seem to work fine now.

All of this goes to show that sometimes my reflexive caution about firmware updates is misplaced. I don't think I'm ready to apply all available firmware updates before something goes into production, not even long-standing ones, but I'm certainly now more ready to consider them than I was before, even in cases where there's no clear reason to do so. Perhaps I should be willing to treat firmware updates as a reasonably early troubleshooting step when I'm dealing with otherwise mysterious failures.

sysadmin/FirmwareUpdatesCanBeGood written at 01:27:00
