How to make a rather obnoxiously bad web spider the easy way
On Twitter, I recently said:
So @SemanticVisions appears to be operating a rather obnoxiously bad web crawler, even by the low standards of web crawlers. I guess I have the topic of today's techblog entry.
This specific web spider attracted my attention in the usual way, which is that it made a lot of requests from a single IP address and so appeared in the logs as by far the largest single source of traffic on that day. Between 6:45 am local time and 9:40 am local time, it made over 17,000 requests; 4,000 of those at the end got 403s, which gives you some idea of its behavior.
However, mere volume was not enough for this web spider; it
elevated itself with a novel behavior I have never seen before.
Instead of issuing a single GET request for each URL it was
interested in, it seems to have always issued the following
sequence of requests:
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:04 -0500] "GET /~cks/space/<A-PAGE> HTTP/1.1" [...]
In other words, in immediate succession (sometimes in the same
second, sometimes crossing a second boundary as here) it issued two
HEAD requests and then a GET request, all for the same URL. For a
few URLs, it came back and did the whole sequence all over again a
short time later for good measure.
In the modern web, issuing HEAD requests without a really good
reason is very obnoxious behavior. Dynamically generated web pages
usually can't come up with the reply to a HEAD request short of
generating the entire page and throwing away the body. Sometimes
this is literally how the framework handles it (via). Issuing a
HEAD and then immediately issuing a GET makes the dynamic page
generator generate the page for you twice; adding a second HEAD
request on top is just the icing on the noxious cake.
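To make the cost concrete, here's a toy sketch (plain Python, not any specific framework's actual code) of why a dynamic page generator usually pays the full rendering price for a HEAD request:

```python
# Toy sketch: a dynamic page handler that, like many frameworks,
# can only answer a HEAD request by rendering the whole page and
# then discarding the body.

def render_page(path):
    # Stand-in for an expensive render: database queries, templating, etc.
    return "<html><body>Page for {}</body></html>".format(path)

def handle_request(method, path):
    body = render_page(path)  # the full rendering cost is paid either way
    headers = {"Content-Length": str(len(body))}
    if method == "HEAD":
        return headers, ""    # body generated, then thrown away
    return headers, body
```

Under this model, the spider's HEAD + HEAD + GET sequence renders the page three times in order to actually receive it once.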
Of course this web spider was bad in all of the usual ways. It
crawled through links it was told not to use, it had no rate
limiting and was willing to make multiple requests a second, and
it had a User-Agent header that didn't include any URL explaining
the web spider, although at least it didn't ask me to email someone.
To be specific, here is the User-Agent header it provided:
Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)
All of the traffic came from the IP address 126.96.36.199, which is a Hetzner IP address and currently resolves to a generic 'clients.your-server.de' name. As I write this, the IP address is listed on the CBL and thus appears in the Spamhaus XBL and Zen lists.
(The CBL lookup for it says that it was detected and listed 17 times in the past 28 days, the most recent time being at Tue Nov 12 06:45:00 2019 UTC or so. It also claims a cause of listing, but I don't really believe the CBL's stated cause for this IP; I suspect that this web spider stumbled over the CBL's sinkhole web server somewhere and proceeded to get out its little hammer, just as it did here.)
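For the curious, the DNSBL lookup itself is mechanical; here's a small sketch of how the query name is built (this is the standard DNSBL convention, with Spamhaus ZEN's public zone as the example):

```python
# Build the DNS name you'd query to check an IP against a DNSBL:
# the IP's octets are reversed and prefixed to the zone name.
def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    return ".".join(reversed(ip.split("."))) + "." + zone

# You would then resolve this name; an A record in 127.0.0.0/8
# means the IP is listed, while NXDOMAIN means it is not.
print(dnsbl_query_name("126.96.36.199"))
# 199.36.96.126.zen.spamhaus.org
```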
PS: Of course even if it was not hammering madly on web servers, this web spider would probably still be a parasite.
My mistake in forgetting how Apache
.htaccess files are checked
Every so often I get to have a valuable learning experience about
some aspect of configuring and operating Apache. Yesterday I got
to re-learn that Apache
.htaccess files are checked and evaluated
in multiple steps, not strictly top to bottom, directive by directive.
This means that certain directives can block some later directives
while other later directives still work, depending on what sort of
directives they are.
(This is the same as the main Apache configuration file, but it's easy to lose sight of this for various reasons, including that Apache has a complicated evaluation order.)
This sounds abstract, so let me tell you the practical story.
Wandering Thoughts sits behind an Apache .htaccess file,
which originally existed to rewrite the directory hierarchy to a
CGI-BIN but then grew to also be used for blocking
various sorts of significantly undesirable things. I also have some Apache redirects to
fix a few terrible mistakes in URLs that I accidentally made.
Over time, my
.htaccess grew bigger and bigger as I added new
rules, almost always at the bottom of the file (more or less).
Things like bad web spiders are mostly recognized and blocked through
Apache rewriting rules, but I've also got a bunch of 'Deny from
...' rules because that's the easy way to block IP addresses and
IP address ranges.
Recently I discovered that a new rewrite-based block that I had added
wasn't working. At first I thought I had some aspect of the syntax
wrong, but in the process of testing I discovered that some other
(rewrite-based) blocks also weren't working, although some definitely
were. Specifically, early blocks in my
.htaccess were working but
not later ones. So I started testing block rules from top to bottom,
reading through the file in the process, and came to this line in
the middle of the file:
RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]
This is my main CGI-BIN rewrite rule, which matches everything. So of course no rewrite-based rules after it were working because the rewriting process never got to them.
You might ask why I didn't notice this earlier. Part of the answer
is that not everything in my
.htaccess after this line failed to
take effect. I had both 'Deny from ...' and 'Redirect' rules
after this line, and all of those were working fine; it was
only the rewrite-based rules that were failing. So every so often
I had the reassuring experience of adding a new block and looking
at the access logs to see it immediately rejecting an active bad
source of traffic or the like.
(My fix was to move my general rewrite rule to the bottom and then put in a big comment about it, so that hopefully I don't accidentally start adding blocking rules below it again in the future.)
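To make the ordering concrete, here is a sketch of the shape the fixed .htaccess has (the patterns, addresses, and paths are made up for illustration, not my actual rules):

```apache
# Rewrite-based blocks must come before the catch-all rewrite,
# because mod_rewrite stops processing at the first rule that
# terminates the request.
RewriteCond %{HTTP_USER_AGENT} "badspider" [NC]
RewriteRule ^ - [F]

# 'Deny from' is evaluated by a different module (mod_access /
# mod_authz_host), so it keeps working no matter where it sits
# relative to the rewrite rules.
Deny from 192.0.2.0/24

# The general CGI-BIN rewrite matches everything, so it goes last.
# Do not add rewrite-based blocks below this line; they will
# never be reached.
RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]
```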
PS: It looks like for a while the only blocks I added below my
CGI-BIN rewrite rule were '
Deny from' blocks. Then at some point
I blocked a bad source both by its IP address and by its (bogus)
HTTP referer in a rewrite rule, and at that point the gun was
pointed at my foot.
Using personal ruleset recipes in uMatrix in Firefox
But it turns out that we can do better. uMatrix has a thing called ruleset recipes, which are, to quote its documentation:
Ruleset recipes ("recipes" thereafter) are community-contributed rulesets suggested to you when you visit a given web site. Recipes are specific, they contain only the necessary rules to un-break a specific web site or behavior.
There is no community contributed recipe for Medium that I know of, but we can write our own and hook it into uMatrix, provided that we have a website somewhere. Once added to uMatrix, we can enable it temporarily with a couple of clicks and then dump all of its temporary additions later.
First we need to create a text file with the rulesets we want and the necessary rules in them. For my Medium rules, what we need looks like this:
$ cat recipes_cks_en.txt
! uMatrix: Ruleset recipes 1.0
! Title: Chris's additional rulesets for English websites
! Maintainer: Chris Siebenmann

Medium no account
    medium.com *
        _ 1st-party script
        _ gist.github.com script
Next we need to put this on a website somewhere. Generally this should be an HTTPS website that you trust, for safety. Having done this, we next need to add our recipes URL to uMatrix. This is done by going to the uMatrix dashboard, going to the Assets tab, and then down at the bottom of the 'Ruleset recipes' section you will see an 'Import...' option. Enable it, enter the URL of your recipes, and click 'Apply changes'. There, you're done; your new recipes are now available through uMatrix's regular interface for them, described in the ruleset recipes wiki page.
(You can also see the built-in recipes in the Assets tab, or look at them on GitHub. This will give you an idea of what you can do in your own recipes.)
PS: I haven't tried to contribute my Medium recipe because I have no idea if it's complete or truly good enough. It works for me for the things that I care about, more or less, but I don't care very much about having all of Medium's various peculiarities working correctly (or correctly being blocked).
The appeal of text templating systems for generating HTML
One of the things that some people love in web frameworks and other people hate is HTML page generation that's based around some form of evaluation of text-based templates. In systems that I'm familiar with, both Django and Go have such a templating system. From some perspectives, such systems aren't ideal; for example, as I mentioned in my entry on XHTML's implications for page generation, truly text based templating systems can't easily enforce strict correctness in the results. My view is that text templating systems have a deep appeal for good reasons; fundamentally, they're a good match for both our tools and how we often think about and write HTML.
There are two ways to view an HTML document or document fragment. One of them is that it is a bunch of text with some greater or lesser degree of markup; another is that it is a tree of nodes (we can call this the DOM-based view). The tree-based view is how the browser will actually deal with our HTML, but for people it has two connected problems. First, we mostly lack the tools to deal with trees of nodes, while we have plenty of tools for working with text (I think it's telling that common browser web development tools generally show us the page's DOM tree in a textual representation). Second, we mostly don't create documents as trees or think of them that way; instead we create and view them as some mixture of running text (which we add markup to) and structural blocks (possibly nested ones). The actual tree structure is an emergent property of putting together the marked up text and structural blocks in sequence.
All of this makes a text based templating system very natural. It works well with our text based tools and matches how we write and view HTML. For running text, we can read or write the whole thing at once and mostly skip over the markup. In fact we can write the text first, then go back to add the markup it needs to make it come out right in HTML. It's easy to change our mind if we decide that some bit should or shouldn't be emphasised, or link to something, or whatever; it needs no structural rearrangement, just some markup added, deleted, or changed.
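As a small illustration (using Python's standard library string templating as a stand-in for Django's or Go's template systems), the template is just text with holes in it, and nothing in the machinery knows or cares that the output is supposed to be HTML:

```python
from string import Template

# A text template: HTML is just text with substitution holes.
page = Template("<p>Written by <b>$author</b> on $date.</p>")
html = page.substitute(author="cks", date="2019-11-12")
print(html)  # <p>Written by <b>cks</b> on 2019-11-12.</p>

# Nothing stops a template whose output isn't well-formed HTML;
# the templating system happily produces the mismatched tags.
broken = Template("<p>$text</b>").substitute(text="oops")
```

Adding or removing emphasis in the template is just editing characters in a string, exactly like editing the running text around it.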
This is not true with current tools for dealing with trees. They are not universal in the way that text is, and they necessarily force structural rearrangement if you decide to change the markup because often changing the markup changes the tree structure. If you add markup, you must split a tree node and often create sub-nodes; if you delete markup, you must merge sub-nodes back in. And if your editing environment insists that the tree is always correct, you often get extra hassles as you write and periodic interruptions of your flow to rearrange your tree. Perhaps someday all of these issues may be fixed, but they aren't today; the tools are just not up to the level and the universality of editors and other things that deal with text.
(All of this should sound very familiar from attempts to make programming languages that aren't represented in text but are instead always dealt with as some form of parse trees.)
One of XHTML's practical problems was its implications for web page generation
I recently ran across The evolution of the web, and a eulogy for XHTML2, which has a much more positive view of XHTML(2) than I do; my view is not positive at all. In the ensuing discussion on lobste.rs I realized a new aspect of the practical problems with XHTML, which is the page creation side.
(My usual XHTML objections focus on the web user side of things, where XHTML's nominal requirement for draconian error handling (any XHTML errors would cause browsers to show you nothing of the page) clashed badly with practical usability, especially as people demonstrably mostly didn't write correct XHTML. A web full of error pages is not a good web.)
Because the consequences of invalid XHTML are so severe, XHTML and the W3C were essentially demanding that everyone change how they created web pages so that they only created valid XHTML. For individually created web pages, ones authored by people (and thus in moderate volume), this is theoretically not a huge problem; people can be pushed to run XHTML validators before they publish, or use XHTML aware editing environments that don't let them make mistakes in the first place.
It is a huge problem for dynamically generated web pages, though, or more exactly for the software that does it. Put simply, text templating is not compatible with XHTML in practice (partly because there are a lot of ways to go wrong in XHTML). At scale, the only safe way to always end up with valid XHTML is to use a page generation API that simply doesn't allow you to do anything other than create valid XHTML. Almost no one generating dynamic pages uses or used such an API, which meant that switching to XHTML would have meant modifying their software at some level.
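For contrast, here is a sketch of the tree-building style of API, using Python's standard xml.etree.ElementTree as a stand-in for such a page generation interface; because you build nodes rather than splice text, every element is closed on serialization and the output is always well-formed:

```python
import xml.etree.ElementTree as ET

# Build the fragment as a tree of nodes instead of as text.
p = ET.Element("p")
p.text = "Written by "
b = ET.SubElement(p, "b")   # there is no way to create an unclosed <b>
b.text = "cks"
b.tail = " today."

print(ET.tostring(p, encoding="unicode"))
# <p>Written by <b>cks</b> today.</p>
```

The price is visible even in this tiny example: the running text is scattered across .text and .tail attributes instead of being readable in one piece.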
(A page generation system that throws an error when you generate an invalid XHTML page isn't good enough. From Amazon's perspective, it doesn't matter whether it was the user's browser or their page rendering system that caused a product page to not display; either is bad.)
Since XHTML got web sites nothing in practice, no one of any size was ever likely to do this. And even by the late 00s, more and more web sites were using more and more automatically generated pages. Even today a very large number of automatically generated pages are done through text templating systems, which are and remain very popular in things like (server side) web frameworks.
(I maintain that there are very good reasons for this, but that's for another entry.)
An interesting little glitch in how Firefox sometimes handles updates to addons
Every so often I run into a bug where the implementation shows through, or at least it looks like it does. Today's is in Firefox. On Twitter, I said:
It's pretty clear that the Firefox developers don't both leave their Firefox sessions running all the time and use addons that update frequently. I could file a bug, but bleah.
There's an additional condition for this glitch that I forgot to put in my first tweet, which is that you almost certainly need to have addons set to not auto-update.
When you have addons set to not auto-update, Firefox's Add-ons page
will have a tab for 'Available Updates'. For your convenience, the
icon and text for the tab has a count, and if you go to the tab you
can see the addons with pending updates and get an option to update
each of them. The glitch comes about if a particular addon accumulates
more than one pending update before you update it. If it does, the
tab's count will never go to zero and disappear until you restart
Firefox, even if there are no pending updates for addons left any
more.
(Sometimes this happens if you just let Firefox sit for long enough, for example if it's running over a long weekend on your work desktop; sometimes this happens if there's one update that Firefox has auto-detected and then you ask Firefox to 'Check for updates' and it detects a second update to the addon.)
My guess as to how this glitch came about is that the implementation counts detected updates, not addons that have at least one pending update. Every time Firefox detects a pending update, it increases the count, and every time it applies an update it decreases it again. But the problem here is that Firefox only ever updates to the most recent version for an addon even if it has accumulated several new versions, which means that if an addon has multiple updates, the count gets incremented more than it gets decremented. Restarting Firefox causes it to redo everything from scratch, at which point it notices at most one pending update per addon (the most recent update) and the count is correct (for a while).
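This guessed mechanism is easy to model in a few lines (to be clear, this is a toy sketch of my speculation, not Firefox's actual code):

```python
# Toy model of the guessed counting bug: the badge counts detected
# updates, but applying updates only installs each addon's newest
# version, so the count is decremented only once per addon.

class UpdateBadge:
    def __init__(self):
        self.count = 0
        self.pending = {}  # addon name -> list of detected versions

    def detect_update(self, addon, version):
        self.pending.setdefault(addon, []).append(version)
        self.count += 1            # incremented once per detected update

    def apply_updates(self):
        for addon, versions in self.pending.items():
            newest = max(versions)  # only the most recent version installs
            self.count -= 1         # decremented once per addon
        self.pending.clear()

badge = UpdateBadge()
badge.detect_update("uMatrix", "1.4.0")
badge.detect_update("uMatrix", "1.4.1")  # a second pending update
badge.apply_updates()
print(badge.count)  # 1: the badge count never drops back to zero
```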
(In my case I've decided to use the development versions of uBlock Origin and uMatrix as a very small way of helping out in their development. I've never noticed any new glitches or bugs, but maybe someday I'll contribute.)
My little irritation with Firefox's current handling of 'Do-Not-Track'
The Do-Not-Track (DNT) proposed HTTP feature was either a noble or
naive attempt by various people to get websites not to track you if
you asked them not to.
It worked about as well as you'd expect, which
is to say not at all in practice. Allegedly, for a long time having
your browser send a
DNT header made it easier to fingerprint you
because so few people did it that you stood out all the more.
(This may no longer be the case, for reasons we're about to see.)
For a long time, Firefox provided a setting to send or not send a
DNT header with requests. Although I already used a variety of
Firefox addons and settings to stop being tracked, I turned this
setting on basically as a gesture to websites to tell them they
had no excuse. I didn't worry about this making me easier to
fingerprint, because even without DNT my particular combination
of User-Agent and other browser attributes was generally very
close to unique (as measured by eg the EFF's Panopticlick).
Recently, two things happened here. The first is that Firefox changed its Do-Not-Track behavior when they added tracking protection as part of their content blocking. After this was added, your two choices with DNT are either sending it all the time or sending it only when Firefox is blocking tracking; there is no option to have Firefox block tracking but not send a DNT header. At one level this makes perfect sense, but at another level it runs into the second issue, which is that I found some websites that behave differently in an inconvenient way if DNT is set. Specifically, Medium will block certain embedded content in Medium articles (both on its own site and on sites that just publish with Medium, which is a lot of them), as covered (currently) in Medium's Do Not Track Policy. For me, clicking through often doesn't work very well, so I would like it if Medium didn't do this.
Although it pains me, what I should probably do is turn off Firefox's own tracking protections to whatever degree is required to not trigger this Medium behavior. I'm already relying on uBlock Origin for my anti-tracking protection, so the built in stuff in Firefox is just a backup and may not be doing anything for me in general. Of course, this assumes that I've correctly understood what is going on here with Medium in the first place, because it's always possible that something else about my environment is triggering their 'DNT' stuff (for example, perhaps uBlock Origin is blocking something).
(I was going to be confident about what was going on, but then I started trying to verify that my Firefox was or wasn't sending a DNT header under various circumstances. Now I'm a lot less sure.)
Googlebot is both quite fast and very determined to crawl your pages
I recently added support to DWiki (the engine behind Wandering Thoughts) to let me more or less automatically generate 'topic' index pages, such as the one on my Prometheus entries. As you can see on that page, the presentation I'm using has links to entries and links to the index page for the days they were posted on. I'm not sure that the link to the day is particularly useful but I feel the page looks better that way, rather than just having a big list of entry titles, and this way you can see how old any particular entry is.
The first version of the code had a little bug that generated bad URLs for the target of those day index page links. The code was only live for about two hours before I noticed and fixed it, and the topic pages didn't appear in the Atom syndication feed, just in the page sidebar (which admittedly appears on every page). Despite that short time being live, in that time Googlebot crawled at least one of the topic pages and almost immediately began trying to crawl the bad day index page URLs, all of which generated 404s.
You can probably guess what happened next. Despite always getting 404s, Googlebot continued trying to crawl various of those URLs for about two weeks afterward. At this point I don't have complete logs, but for the logs that I do have it appears that Googlebot only tried to crawl each URL once; there just were a bunch of them. However, I know that its initial crawling attempts were more aggressive than the tail-off I have in the current logs, so I suspect that each URL was tried at least twice before Googlebot gave up.
(I was initially going to speculate about various things that this might be a sign of, but after thinking about it more I've realized that there really is no way for me to have any good idea of what's going on. So many things could factor into Googlebot's crawling decisions, and I have no idea what is 'normal' for its behavior in general or its behavior on Wandering Thoughts specifically.)
PS: The good news is that Googlebot does appear to eventually give up on bad URLs, or at least bad URLs that have never been valid in the past. This is what you'd hope, but with Googlebot you never know.
The wikitext problem with new HTML elements such as <details>
I recently wrote about my interest in HTML5's
<details> element. One of the obvious potential places to use
it (once it becomes well supported) is here on Wandering Thoughts;
not only is it the leading place where I create web content, but I
also love parenthetical asides (perhaps a little too much) and
<details> would be one way to make some of them less obtrusive.
Except that there is a little problem in the way, which is that
Wandering Thoughts isn't written in straight HTML but instead
in a wikitext dialect.
When you have a wiki, or in general any non-HTML document text that
is rendered down to HTML, using new HTML elements is necessarily a
two step process. First, you have to figure out what you're going
to sensibly use them for, which is the step everyone has to do.
But then you have a second step of figuring out how to represent
this new HTML element in your non-HTML document text, ideally in a
non-hacky way that reflects the resulting HTML structure and
requirements (for example, that
<details> is an inline 'flow'
element, not a block element, which actually surprised me when
I looked it up just now).
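For reference, the HTML we'd ultimately want to generate looks something like this (standard HTML5, with made-up content):

```html
<p>This entry has an aside that you can expand on demand:</p>

<details>
  <summary>A parenthetical aside</summary>
  The optional content, with <a href="/~cks/somewhere">links</a> and
  <em>markup</em>, shown only when the reader activates the summary.
</details>
```

Whatever wikitext syntax is chosen has to be able to express all of this: the summary, the arbitrary marked-up content, and where the element begins and ends.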
Some text markup languages allow you to insert arbitrary HTML, which
works but is a very blunt hammer; you're basically going to be
writing a mix of the markup language and HTML. There probably are
markup languages that have extra features to improve this, such as
letting you tell them something about the nesting rules and so on
for the new HTML elements you're using. My wikitext dialect deliberately has no HTML escapes at all, so I'd
have to add some sort of syntax for
<details> (or any other new
HTML element) before I could use it.
(Life is made somewhat simpler because
<details> is a flow element,
so it doesn't need any new wikitext block syntax and block parsing.
Life is made more difficult because you're going to want to be able
to put a lot of content with a lot of markup, links, and so on
inside <details>, which means that certain simplistic approaches
aren't good answers in the way they are for simpler cases.)
At a sufficiently high level, this is a general tradeoff between having a single general purpose syntax as HTML does (okay, it has a few) and having a bunch of specialized syntaxes. The specialized syntaxes of wikitext have various advantages (for instance, it's a lot faster and easier for me to write this entry in DWikiText than it would be in HTML), but they also lack the easy, straightforward extensibility of the general purpose syntax. If you have a different syntax for everything, adding a new thing needs a new syntax. With HTML, you just need a name (and the semantics).
('Syntax' is probably not quite the right word here.)
My interest in and disappointment about HTML5's new <details> element
Because I checked out from paying attention to HTML's evolution
years ago, it took me until very recently to hear about the new
<details> element from HTML5. Put simply and bluntly, it's the first
new HTML element I've heard of that actually sounds interesting to
me. The reason for this is straightforward; it solves a problem
that previously needed JavaScript, the general issue of having some
optional information on a web page that you can reveal or hide.
(That's the surface reason. The deeper reason is that it's the only new HTML5 tag that I've heard of that has actual browser UI behavior associated with it, instead of just semantic meaning.)
Now that I've heard of it, I've started to notice people using it
(and I've also started to assume that if I click on the browser UI
associated with it, something will actually happen; unfortunately
Firefox's current rendering doesn't make it obvious). And when I
look around, there are all sorts of things that I might use
<details> for, both here on Wandering Thoughts and elsewhere, because
optional or additional information is everywhere if you look for it.
(Here on Wandering Thoughts, one form of 'optional'
information is comments on blog entries. Currently these live behind
a link that you have to click and that loads a separate page, but
<details> would let them be inline in the page and revealed more
readily. Of course there are various sorts of tradeoffs on that.)
I was all set to make this a very enthusiastic entry, but then I
actually looked at the browser compatibility matrix from MDN
and discovered that there is a little problem;
<details> is not
currently supported in Microsoft Edge (or IE). Edge may not be
as popular as it used to be, but I'm not interested in cutting off
its users from any of my content (and we can't do that at work).
Given that Edge doesn't support it yet and that IE is out there,
it will probably be years before I can assume that <details> just
works. Since the 'just works' bit is what makes it attractive to me,
I sadly don't think I'm going to be using it any time soon. Oh well.
(HTML5 has also added a number of important input types; I consider these separate from new elements, partly because I had already somewhat heard about them.)