How to make a rather obnoxiously bad web spider the easy way
On Twitter, I recently said:
So @SemanticVisions appears to be operating a rather obnoxiously bad web crawler, even by the low standards of web crawlers. I guess I have the topic of today's techblog entry.
This specific web spider attracted my attention in the usual way, which is that it made a lot of requests from a single IP address and so appeared in the logs as by far the largest single source of traffic on that day. Between 6:45 am local time and 9:40 am local time, it made over 17,000 requests; 4,000 of those at the end got 403s, which gives you some idea of its behavior.
However, mere volume was not enough for this web spider. It
elevated itself with a behavior I have never seen before. Instead
of issuing a single GET request for each URL it was interested in,
it seems to have always issued the following:
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:04 -0500] "GET /~cks/space/<A-PAGE> HTTP/1.1" [...]
In other words, in immediate succession (sometimes in the same
second, sometimes crossing a second boundary as here) it issued two
HEAD requests and then a GET request, all for the same URL. For a
few URLs, it came back and did the whole sequence all over again a
short time later for good measure.
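
As a rough illustration of how this HEAD, HEAD, GET pattern could be picked out of a log, here is a minimal Python sketch. It assumes the lines have already been filtered down to a single client IP and that each line contains a quoted request string like the excerpt above; the function names and the sample lines (with made-up paths) are hypothetical, not taken from the real logs.

```python
import re

# Matches the quoted request part of a log line, e.g. "HEAD /some/path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')


def parse_requests(lines):
    """Yield (method, path) for each line that contains a request string."""
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            yield m.group("method"), m.group("path")


def find_head_head_get(requests):
    """Return paths that were requested as HEAD, HEAD, GET back to back."""
    reqs = list(requests)
    hits = []
    i = 0
    while i + 2 < len(reqs):
        (m1, p1), (m2, p2), (m3, p3) = reqs[i:i + 3]
        if (m1, m2, m3) == ("HEAD", "HEAD", "GET") and p1 == p2 == p3:
            hits.append(p1)
            i += 3  # skip past the whole triple
        else:
            i += 1
    return hits


# Hypothetical sample lines for one client; the paths are invented.
sample = [
    '[11/Nov/2019:06:54:03 -0500] "HEAD /example/page HTTP/1.1"',
    '[11/Nov/2019:06:54:03 -0500] "HEAD /example/page HTTP/1.1"',
    '[11/Nov/2019:06:54:04 -0500] "GET /example/page HTTP/1.1"',
    '[11/Nov/2019:06:54:05 -0500] "GET /example/other HTTP/1.1"',
]
triples = find_head_head_get(parse_requests(sample))
```

A real log would interleave requests from many clients, so grouping by IP address first (not shown) would be needed before this check means anything.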
In the modern web, issuing
HEAD requests without really good
reasons is very obnoxious behavior. Dynamically generated web pages
usually can't come up with the reply to a
HEAD request short of
generating the entire page and throwing away the body. Sometimes
this is literally how the framework handles it. Issuing a HEAD and
then immediately issuing a GET makes the dynamic page generator
generate the page for you twice; adding an extra HEAD request is
just the icing on the noxious cake.
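
To make the cost concrete, here is a small Python sketch of a WSGI application that answers HEAD the way described above: it renders the full page and then discards the body. This is an illustration under assumptions, not how any particular framework (or this site) is implemented; render_page, app, and renders_for are hypothetical names, and the "expensive" page generation is just a counter.

```python
render_count = 0  # how many times the (pretend) expensive page build has run


def render_page(path):
    """Stand-in for an expensive dynamic page generator (hypothetical)."""
    global render_count
    render_count += 1
    return ("<html><body>Page for %s</body></html>" % path).encode("utf-8")


def app(environ, start_response):
    """Minimal WSGI app: HEAD is handled by rendering, then dropping the body."""
    body = render_page(environ.get("PATH_INFO", "/"))
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Content-Length is only known once the page has been generated.
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    if environ["REQUEST_METHOD"] == "HEAD":
        return [b""]  # the rendered page is thrown away
    return [body]


def renders_for(methods, path="/some/page"):
    """Run the given request methods through app and count page renders."""
    before = render_count
    for method in methods:
        environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
        app(environ, lambda status, headers: None)
    return render_count - before


spider_cost = renders_for(["HEAD", "HEAD", "GET"])
polite_cost = renders_for(["GET"])
```

Under these assumptions the spider's three-request sequence costs three page renders where a single GET would have cost one.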
Of course this web spider was bad in all of the usual ways. It
crawled through links it was told not to use,
it had no rate limiting and was willing to make multiple requests
a second, and it had a User-Agent header that didn't include any
URL to explain the web spider, although at least it didn't
ask me to email someone. To be specific, here is the User-Agent
header it provided:
Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)
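
For illustration, a check for "does this User-Agent carry an explanatory URL" could look something like the following Python sketch. The regular expression and the info_url name are assumptions of mine, and the second User-Agent string is an invented example of a crawler that does include a URL.

```python
import re

# Looks for an http:// or https:// URL, stopping at whitespace, ';' or ')'.
URL_RE = re.compile(r"https?://[^\s;)]+")


def info_url(user_agent):
    """Return the first URL found in a User-Agent string, or None."""
    m = URL_RE.search(user_agent)
    return m.group(0) if m else None


spider_ua = ("Mozilla/5.0 (X11; compatible; "
             "semantic-visions.com crawler; HTTPClient 3.1)")
# Hypothetical crawler User-Agent that does include an explanatory URL.
example_ua = "ExampleBot/1.0 (+https://example.com/bot.html)"

spider_url = info_url(spider_ua)
example_url = info_url(example_ua)
```

A bare domain name, as in this spider's header, does not count as a URL here; that is a deliberate (and arguable) choice in the sketch.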
All of the traffic came from the IP address 220.127.116.11, which is a Hetzner IP address and currently resolves to a generic 'clients.your-server.de' name. As I write this, the IP address is listed on the CBL and thus appears in Spamhaus XBL and Zen.
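
As a sketch of how such a listing is usually checked: DNS blocklists are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone, then looking the resulting name up in DNS. The Python below only builds that query name; the dnsbl_query_name helper is a hypothetical name, and the actual DNS lookup is left out because the answer depends on the resolver used and on the list's current contents.

```python
import ipaddress


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name conventionally used to check an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return "%s.%s" % (reversed_octets, zone)


query = dnsbl_query_name("220.127.116.11")
```

By the usual convention, an address record in 127.0.0.0/8 for that name means the address is listed, and a "no such name" answer means it is not.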
(The CBL lookup for it says that it was detected and listed 17 times in the past 28 days, the most recent one being at Tue Nov 12 06:45:00 2019 UTC or so. It also claims a cause of listing, but I don't really believe the CBL's one for this IP; I suspect that this web spider stumbled over the CBL's sinkhole web server somewhere and proceeded to get out its little hammer, just as it did here.)
PS: Of course even if it was not hammering madly on web servers, this web spider would probably still be a parasite.