Wandering Thoughts archives


How to make a rather obnoxiously bad web spider the easy way

On Twitter, I recently said:

So @SemanticVisions appears to be operating a rather obnoxiously bad web crawler, even by the low standards of web crawlers. I guess I have the topic of today's techblog entry.

This specific web spider attracted my attention in the usual way, which is that it made a lot of requests from a single IP address and so appeared in the logs as by far the largest single source of traffic on that day. Between 6:45 am local time and 9:40 am local time, it made over 17,000 requests; 4,000 of those at the end got 403s, which gives you some idea of its behavior.

However, mere volume was not enough for this web spider. Instead it elevated itself with a novel behavior I have never seen before. Rather than issuing a single GET request for each URL it was interested in, it seems to have always issued the following three requests:

[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:04 -0500] "GET /~cks/space/<A-PAGE> HTTP/1.1" [...]

In other words, in immediate succession (sometimes in the same second, sometimes crossing a second boundary as here) it issued two HEAD requests and then a GET request, all for the same URL. For a few URLs, it came back and did the whole sequence all over again a short time later for good measure.
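If you want to spot this pattern in your own logs, a short Python sketch will do it; the log format and the regular expression here are assumptions based on the excerpt above, so adjust them for your actual access log:

```python
import re
from collections import deque

# Pull the method and URL out of a combined-log-style request line.
# The exact line format is an assumption from the excerpt above.
REQ_RE = re.compile(r'"(HEAD|GET) (\S+) HTTP/[\d.]+"')

def find_head_head_get(lines):
    """Yield URLs requested as HEAD, HEAD, GET in immediate succession."""
    window = deque(maxlen=3)
    for line in lines:
        m = REQ_RE.search(line)
        if not m:
            continue
        window.append(m.groups())
        if len(window) == 3:
            (m1, u1), (m2, u2), (m3, u3) = window
            if (m1, m2, m3) == ("HEAD", "HEAD", "GET") and u1 == u2 == u3:
                yield u3
```

Feeding it the three log lines above would yield the page's URL once.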

In the modern web, issuing HEAD requests without really good reasons is very obnoxious behavior. Dynamically generated web pages usually can't come up with the reply to a HEAD request short of generating the entire page and throwing away the body. Sometimes this is literally how the framework handles it (via). Issuing a HEAD and then immediately issuing a GET is making the dynamic page generator generate the page for you twice; adding an extra HEAD request is just the icing on the noxious cake.
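The cost is easy to see in, say, WSGI terms: to answer a HEAD with an accurate Content-Length, a naive dynamic application has to do all of its page-generation work and then throw the body away. A minimal sketch (the application and its render counter are entirely hypothetical, not how this site actually works):

```python
calls = {"renders": 0}

def expensive_render():
    # Stand-in for real dynamic page generation.
    calls["renders"] += 1
    return b"<html>...the whole page...</html>"

def app(environ, start_response):
    # The app can't produce an accurate Content-Length for a HEAD
    # without generating the page, so it does the full work regardless.
    body = expensive_render()
    start_response("200 OK", [("Content-Length", str(len(body)))])
    if environ["REQUEST_METHOD"] == "HEAD":
        return [b""]  # the body is discarded after being built
    return [body]

# Two HEADs followed by a GET cost three full page renders.
for method in ("HEAD", "HEAD", "GET"):
    app({"REQUEST_METHOD": method}, lambda status, headers: None)
```

So the spider's HEAD, HEAD, GET sequence triples the work for every page it fetches.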

Of course this web spider was bad in all of the usual ways. It crawled through links it was told not to use, it had no rate limiting and was willing to make multiple requests a second, and it had a User-Agent header that didn't include any URL to explain about the web spider, although at least it didn't ask me to email someone. To be specific, here is the User-Agent header it provided:

Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)

All of the traffic came from a single IP address, which is a Hetzner IP address and currently resolves to a generic 'clients.your-server.de' name. As I write this, the IP address is listed on the CBL and thus appears in Spamhaus XBL and Zen.

(The CBL lookup for it says that it was detected and listed 17 times in the past 28 days, the most recent time being at Tue Nov 12 06:45:00 2019 UTC or so. It also claims a cause of listing, but I don't really believe the CBL's one for this IP; I suspect that this web spider stumbled over the CBL's sinkhole web server somewhere and proceeded to get out its little hammer, just as it did against here.)

PS: Of course even if it was not hammering madly on web servers, this web spider would probably still be a parasite.

web/WebSpiderRepeatedHEADs written at 22:41:50

My mistake in forgetting how Apache .htaccess files are checked

Every so often I get to have a valuable learning experience about some aspect of configuring and operating Apache. Yesterday I got to re-learn that Apache .htaccess files are checked and evaluated in multiple steps, not strictly top to bottom, directive by directive. This means that certain directives can block some later directives while other later directives still work, depending on what sort of directives they are.

(This is the same as the main Apache configuration file, but it's easy to lose sight of this for various reasons, including that Apache has a complicated evaluation order.)

This sounds abstract, so let me tell you the practical story. Wandering Thoughts sits behind an Apache .htaccess file, which originally was for rewriting the directory hierarchy to a CGI-BIN but then grew to also be used for blocking various sorts of significantly undesirable things. I also have some Apache redirects to fix a few terrible mistakes in URLs that I accidentally made.

(All of this place is indeed run through a CGI-BIN in a complicated setup.)

Over time, my .htaccess grew bigger and bigger as I added new rules, almost always at the bottom of the file. Things like bad web spiders are mostly recognized and blocked through Apache rewriting rules, but I've also got a bunch of 'Deny from ...' rules because that's the easy way to block IP addresses and IP ranges.
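For illustration, the two styles of block look something like this (the addresses and the User-Agent pattern here are made up, and the access directives are the Apache 2.2 'Deny from' style the file uses):

```apache
# Easy IP-based blocks: just list addresses and ranges.
Order allow,deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24

# A rewrite-based block, here on a bad User-Agent (pattern made up).
RewriteCond %{HTTP_USER_AGENT} badcrawler [NC]
RewriteRule ^ - [F]
```

The important difference, as I was about to rediscover, is that these two kinds of directives are evaluated in separate passes, not interleaved top to bottom.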

Recently I discovered that a new rewrite-based block that I had added wasn't working. At first I thought I had some aspect of the syntax wrong, but in the process of testing I discovered that some other (rewrite-based) blocks also weren't working, although some definitely were. Specifically, early blocks in my .htaccess were working but not later ones. So I started testing block rules from top to bottom, reading through the file in the process, and came to a line in the middle:

RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]

This is my main CGI-BIN rewrite rule, which matches everything. So of course no rewrite-based rules after it were working because the rewriting process never got to them.

You might ask why I didn't notice this earlier. Part of the answer is that not everything in my .htaccess after this line failed to take effect. I had both 'Deny from ...' and 'RedirectMatch' rules after this line, and all of those were working fine; it was only the rewrite-based rules that were failing. So every so often I had the reassuring experience of adding a new block and looking at the access logs to see it immediately rejecting an active bad source of traffic or the like.

(My fix was to move my general rewrite rule to the bottom of the file and then put a big comment above it, so that hopefully I don't accidentally start adding blocking rules below it again in the future.)
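The fixed ordering matters because a RewriteRule that matches with [PT] ends the rewriting pass, so every rewrite-based block has to come before the catch-all. Roughly (the blocking pattern is made up, the catch-all is the real rule from above):

```apache
# Rewrite-based blocks must all go first.
RewriteCond %{HTTP_USER_AGENT} badcrawler [NC]
RewriteRule ^ - [F]

# BIG COMMENT: the catch-all CGI-BIN rewrite stays last. Any
# rewrite-based block added after it will never be reached,
# although 'Deny from' and RedirectMatch rules still would be.
RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]
```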

PS: It looks like for a while the only blocks I added below my CGI-BIN rewrite rule were 'Deny from' blocks. Then at some point I blocked a bad source by both IP address and then its (bogus) HTTP referer in a rewrite rule, and at that point the gun was pointed at my foot.

web/HtaccessOrderingMistake written at 01:07:37
