My expectations for responsible spider behavior
My minimum technical requirements for real web spiders are deliberately quite black and white. But there are also a number of more fuzzy things that I expect from responsible web spiders. Bear in mind that these aren't hard and fast rules and I can't give precise numbers and so on.
(As before, this only applies to what I'm calling 'real' or 'legitimate' web spiders; I can't expect any particular behavior from malicious web spiders.)
Disclaimers in place, here's what I expect of responsible web spiders:
- re-fetch robots.txt frequently and adjust your behavior rapidly, say within no more than two days.
(I do not care what infrastructure you require to do this; the fact that robots.txt updates have to propagate around six layers of your internal topology before reaching the crawler logic is your problem, not mine.)
- don't make requests more frequently than one every few seconds or so.
- more importantly, notice when the website is slowing down and slow down yourself. If the website's response speed is down, this is a very big clue that your spider should space out requests more.
- don't rapidly re-crawl things that haven't changed. It's reasonable to check a few times just to make sure that what looks like unchanging content really is, but after that spiders should slow down. If you spend months revisiting a page three times a week when it hasn't changed in years, I get peeved.
- URLs that get errors count as unchanged pages. Crawl them a few times to make sure that they stay errors, but after that you should immediately demote them to the bottom of your crawl rates.
- this goes triple if the error you are getting is a 403 error, because you are being told explicitly that this is content you are not allowed to see.
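The 'slow down when the site slows down' advice can be sketched in code. This is purely illustrative; the class name, the smoothing factor, and the 2x threshold are all my own assumptions, not anything a real spider is required to use:

```python
class PoliteSpacer:
    """Track server response times and widen the gap between requests
    when the site appears to be slowing down (an assumed policy, not
    a standard one)."""

    def __init__(self, base_delay=3.0):
        self.base_delay = base_delay   # seconds between requests when healthy
        self.baseline = None           # smoothed 'normal' response time

    def record(self, response_time):
        # Keep an exponentially smoothed baseline of how fast
        # the site usually answers.
        if self.baseline is None:
            self.baseline = response_time
        else:
            self.baseline = 0.9 * self.baseline + 0.1 * response_time

    def delay(self, last_response_time):
        # If the last response was much slower than the baseline,
        # scale the pause between requests up proportionally.
        if self.baseline and last_response_time > 2 * self.baseline:
            return self.base_delay * (last_response_time / self.baseline)
        return self.base_delay
```

A crawler would call record() after each fetch and sleep for delay() seconds before the next one; the exact numbers matter much less than the shape of the behavior.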
Disclaimer: as before, I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.
(Suggestions of more are welcome; I'm probably missing some obvious ones.)
BitTorrent's protocol is not designed to hide
Every so often, I will hear someone say that Bram Cohen clearly wrote BitTorrent to facilitate piracy (despite any of his claims to the contrary) because it was deliberately designed to frustrate attempts to monitor its traffic. This claim irritates me partly because it is clearly wrong, almost blatantly so.
(Disclaimer: I am talking here about classic BitTorrent, as it was before ISPs started whacking things with hammers and people started reacting.)
There are two important things in a BitTorrent transfer: the peers, the collection of machines exchanging pieces of the file, and the tracker, a machine that tells peers (and would be peers) about each other. Your client joins the swarm by registering itself with the tracker, asks the tracker for a list of IP addresses of other peers, and then talks to them directly to exchange pieces of the file; every so often it sends a status update to the tracker.
(This is classic BitTorrent, where torrents had only a single tracker. Since this made the tracker a single point of failure, people soon extended the .torrent metainfo file format to allow for multiple trackers, and these days there are 'trackerless' versions of the protocol.)
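The announce exchange above is just an HTTP GET with a handful of query parameters, which is part of why it is so visible on the wire. Here is a sketch of building one; the parameter names are from the classic protocol, while the function name and defaults are mine:

```python
from urllib.parse import urlencode

def announce_url(tracker_url, info_hash, peer_id, port, left, event="started"):
    """Build the HTTP GET a client sends to register with the tracker
    and ask for a list of peers (classic single-tracker protocol)."""
    params = {
        "info_hash": info_hash,   # 20-byte SHA1 of the torrent's info dict
        "peer_id": peer_id,       # 20-byte client-chosen identifier
        "port": port,             # where this peer listens for other peers
        "uploaded": 0,            # running totals, updated on later announces
        "downloaded": 0,
        "left": left,             # bytes still needed; 0 means 'seeding'
        "event": event,           # started / stopped / completed
    }
    return tracker_url + "?" + urlencode(params)
```

The tracker's reply is a bencoded dictionary containing (among other things) the list of peer IP addresses and ports.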
The peer to peer protocol is distinct and easily identified and decoded, and it often uses a relatively narrow range of destination ports (TCP 6881 and up). While the peer to tracker protocol is HTTP, the contents of the requests and replies are quite distinct and should easily be identified by any competent traffic inspection system.
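To show just how identifiable the peer protocol is: every peer connection opens with a fixed handshake, a length byte of 19 followed by the literal string 'BitTorrent protocol'. A traffic classifier needs nothing more sophisticated than this (the function name is mine):

```python
def looks_like_bittorrent_handshake(data: bytes) -> bool:
    """Check whether the first bytes of a TCP stream are the
    BitTorrent peer-wire handshake: the byte 19 followed by
    the literal protocol name string."""
    return (len(data) >= 20
            and data[0] == 19
            and data[1:20] == b"BitTorrent protocol")
```

A protocol designed to evade monitoring would not announce itself in cleartext in the first twenty bytes of every connection.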
Sometimes people say that BitTorrent is hiding things in one of two ways: it limits the amount of information you can find out about peers, and it limits the amount of information you can find out about a random torrent that some people are exchanging. Both are somewhat misleading charges.
While there is no direct way to get a list of all of the peers in a swarm, you can get relatively close by joining the swarm and then repeatedly asking the tracker for peers. The tracker does have a limit of how many peers it will give out at once, but this is self defense; consider what would happen to its bandwidth if a few badly coded or greedy clients joined a popular swarm and started asking for a list of a few thousand peers. (The tracker also doesn't try to keep track of what peers it's already told you about, so you get a random subset each time.)
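The 'join and keep asking' approach to enumerating a swarm is simple enough to sketch. Everything here is hypothetical scaffolding: `ask_tracker` stands in for whatever does one announce and returns (ip, port) pairs, and the round count and pause are arbitrary:

```python
import time

def enumerate_swarm(ask_tracker, rounds=50, pause=60):
    """Approximate the full peer list by repeatedly asking the
    tracker, which hands back a random subset each time, and
    unioning the results."""
    seen = set()
    for _ in range(rounds):
        seen.update(ask_tracker())
        time.sleep(pause)   # don't hammer the tracker; that's rude too
    return seen
```

Because each reply is a random subset, the union converges on the full peer list fairly quickly for swarms of ordinary size.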
While it's true that you can't find out the names of the files being transferred in the torrent, this is because the protocols identify torrents using the SHA1 hash of the torrent meta-information instead of passing around the (much larger) meta-information itself.
(However, the protocol has enough information that a passive eavesdropper can reassemble a complete copy of the data in the correct order.)
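Concretely, the identifier that peers and trackers pass around is the SHA1 of the bencoded 'info' dictionary from the .torrent file. The bencoder below is a minimal sketch covering only the types .torrent files use, just to show what actually gets hashed:

```python
import hashlib

def bencode(value):
    """Minimal bencoder: integers, byte strings, lists, and
    dictionaries, which is all .torrent files contain."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # dictionary keys must be byte strings and sorted
        items = sorted(value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % (value,))

def info_hash(info_dict):
    """The 20-byte identifier the protocols pass around: the SHA1
    of the bencoded 'info' dictionary."""
    return hashlib.sha1(bencode(info_dict)).digest()
```

Twenty bytes is a far cheaper thing to put in every request than the full meta-information, which is the point: it's an efficiency decision, not a concealment one.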
Not worrying about distributing the meta-information itself makes BitTorrent different from many other P2P protocols, but it also simplifies its job tremendously. Much like web servers worry about serving pages and leave indexing to search engines, BitTorrent concentrates on efficiently distributing a specific blob of data to peers and leaves the rest of the job to someone else. Among other things, this makes it more flexible.
Hopefully all this has demonstrated how absurd it is to claim that BitTorrent was deliberately designed to hide things. About the only thing it could do to be more obvious (without using more bandwidth or trying to require objectionable non-technical things of trackers) would be to have a registered port for trackers instead of using HTTP.
Sidebar: why requiring metainfo availability is bad
You could try to get around the SHA1 hash issue by requiring that trackers always have the metainfo file for each torrent they serve and be willing to give it out. The problem is that this sets you up for an inevitable clash with private and access-restricted torrents. If trackers must give out metainfo files for their torrents to random third parties, then you cannot have a genuinely private torrent; if you can have private torrents, there is no guarantee that trackers will give nosy third parties metainfo files any more, and you might as well not pretend.
In addition, this complicates trackers significantly, because now they are required to implement a relatively full HTTP server environment and use it to serve files. A standards-compliant HTTP/1.0 server is not trivial, and let's not even think about HTTP/1.1.
(Trackers often do display informational pages, but this is not required. You can implement a perfectly conformant tracker that only answers the announce URL and only handles a very limited subset of HTTP.)