Wandering Thoughts

2019-11-13

How to make a rather obnoxiously bad web spider the easy way

On Twitter, I recently said:

So @SemanticVisions appears to be operating a rather obnoxiously bad web crawler, even by the low standards of web crawlers. I guess I have the topic of today's techblog entry.

This specific web spider attracted my attention in the usual way, which is that it made a lot of requests from a single IP address and so appeared in the logs as by far the largest single source of traffic on that day. Between 6:45 am local time and 9:40 am local time, it made over 17,000 requests; 4,000 of those at the end got 403s, which gives you some idea of its behavior.

However, mere volume was not enough for this web spider. It elevated itself with a behavior I had never seen before: instead of issuing a single GET request for each URL it was interested in, it seems to have always issued the following three requests:

[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:03 -0500] "HEAD /~cks/space/<A-PAGE> HTTP/1.1" [...]
[11/Nov/2019:06:54:04 -0500] "GET /~cks/space/<A-PAGE> HTTP/1.1" [...]

In other words, in immediate succession (sometimes in the same second, sometimes crossing a second boundary as here) it issued two HEAD requests and then a GET request, all for the same URL. For a few URLs, it came back and did the whole sequence all over again a short time later for good measure.

In the modern web, issuing HEAD requests without really good reasons is very obnoxious behavior. Dynamically generated web pages usually can't come up with the reply to a HEAD request short of generating the entire page and throwing away the body. Sometimes this is literally how the framework handles it (via). Issuing a HEAD and then immediately issuing a GET is making the dynamic page generator generate the page for you twice; adding an extra HEAD request is just the icing on the noxious cake.
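
To make this concrete, here is a minimal sketch of the problem, written as a bare WSGI application in Python rather than any particular framework (the rendering function is a made-up stand-in for real page generation): the application has to do all of the work of generating the page before it can answer a HEAD request, and then it simply throws the result away.

def render_entire_page(path):
    # Stand-in for the real, expensive dynamic page generation.
    return "<html><body>An expensively generated page for %s</body></html>" % path

def application(environ, start_response):
    # The entire page has to be rendered even to know its Content-Length.
    page = render_entire_page(environ["PATH_INFO"]).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                              ("Content-Length", str(len(page)))])
    if environ["REQUEST_METHOD"] == "HEAD":
        return [b""]    # all of the rendering work, none of the output
    return [page]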

Of course this web spider was bad in all of the usual ways. It crawled through links it was told not to use, it had no rate limiting and was willing to make multiple requests a second, and it had a User-Agent header that didn't include any URL explaining the web spider, although at least it didn't ask me to email someone. To be specific, here is the User-Agent header it provided:

Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)

All of the traffic came from the IP address 144.76.198.133, which is a Hetzner IP address and currently resolves to a generic 'clients.your-server.de' name. As I write this, the IP address is listed on the CBL and thus appears in Spamhaus XBL and Zen.

(The CBL lookup for it says that it was detected and listed 17 times in the past 28 days, the most recent time being at Tue Nov 12 06:45:00 2019 UTC or so. It also claims a cause of listing, but I don't really believe the CBL's stated cause for this IP; I suspect that this web spider stumbled over the CBL's sinkhole web server somewhere and proceeded to get out its little hammer, just as it did against here.)

PS: Of course even if it was not hammering madly on web servers, this web spider would probably still be a parasite.

WebSpiderRepeatedHEADs written at 22:41:50

My mistake in forgetting how Apache .htaccess files are checked

Every so often I get to have a valuable learning experience about some aspect of configuring and operating Apache. Yesterday I got to re-learn that Apache .htaccess files are checked and evaluated in multiple steps, not strictly top to bottom, directive by directive. This means that certain directives can block some later directives while other later directives still work, depending on what sort of directives they are.

(This is the same as the main Apache configuration file, but it's easy to lose sight of this for various reasons, including that Apache has a complicated evaluation order.)

This sounds abstract, so let me tell you the practical story. Wandering Thoughts sits behind an Apache .htaccess file, which was originally there to rewrite the directory hierarchy to a CGI-BIN but then grew to also block various sorts of significantly undesirable things. I also have some Apache redirects to fix a few terrible mistakes in URLs that I accidentally made.

(All of this place is indeed run through a CGI-BIN in a complicated setup.)

Over time, my .htaccess grew bigger and bigger as I added new rules, almost always at or near the bottom of the file. Things like bad web spiders are mostly recognized and blocked through Apache rewriting rules, but I've also got a bunch of 'Deny from ...' rules because that's the easy way to block IP addresses and IP ranges.

Recently I discovered that a new rewrite-based block that I had added wasn't working. At first I thought I had some aspect of the syntax wrong, but in the process of testing I discovered that some other (rewrite-based) blocks also weren't working, although some definitely were. Specifically, early blocks in my .htaccess were working but not later ones. So I started testing block rules from top to bottom, reading through the file in the process, and came to a line in the middle:

RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]

This is my main CGI-BIN rewrite rule, which matches everything. So of course no rewrite-based rules after it were working because the rewriting process never got to them.

You might ask why I didn't notice this earlier. Part of the answer is that not everything in my .htaccess after this line failed to take effect. I had both 'Deny from ...' and 'RedirectMatch' rules after this line, and all of those were working fine; it was only the rewrite-based rules that were failing. So every so often I had the reassuring experience of adding a new block and looking at the access logs to see it immediately rejecting an active bad source of traffic or the like.

(My fix was to move my general rewrite rule to the bottom and then put in a big comment about it, so that hopefully I don't accidentally start adding blocking rules below it again in the future.)
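
As an illustration of the ordering that matters (the patterns and addresses here are made up, not my real rules): 'Deny from' and 'RedirectMatch' aren't part of rewriting, so they work wherever they sit in the file, but any rewrite-based block has to come before the catch-all rewrite or it will never be reached.

# Rewrite-based blocks must come before the catch-all CGI rewrite.
RewriteCond %{HTTP_REFERER} bad-referer\.example\.com [NC]
RewriteRule ^ - [F]

# These are handled outside of rewriting, so their position doesn't matter.
Deny from 192.0.2.0/24
RedirectMatch 301 ^/~cks/oldname/(.*)$ /~cks/space/$1

# The general CGI rewrite matches everything; no rewrite-based rule
# after this point will ever take effect. Keep it last.
RewriteRule ^(.*)?$ /path/to/cwiki/$1 [PT]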

PS: It looks like for a while the only blocks I added below my CGI-BIN rewrite rule were 'Deny from' blocks. Then at some point I blocked a bad source by both IP address and then its (bogus) HTTP referer in a rewrite rule, and at that point the gun was pointed at my foot.

HtaccessOrderingMistake written at 01:07:37

2019-11-02

Using personal ruleset recipes in uMatrix in Firefox

I generally don't run Javascript on websites, and these days I use uMatrix to control that. uMatrix requires more fiddling than controlling Javascript with uBlock Origin, but I like its fine-grained control of various things (including cookies) and how it can improve my web experience. One of the ways I've started doing that is by exploiting uMatrix's ability to let you define personal ruleset recipes.

Suppose, not hypothetically, that you periodically read technical articles on Medium. These articles frequently use images and often inline snippets of code from Github gists and the like, and unfortunately both of these only render if you turn on enough Javascript (and not just Medium's Javascript). Also unfortunately, Medium's own Javascript does enough annoying things that I don't want to leave it on all of the time; I only want to turn it on when I really need it for an article. I can certainly do this by hand, but it involves an annoying amount of clicking on things and refreshing the page.

But it turns out that we can do better. uMatrix has a thing called ruleset recipes, which are, to quote it:

Ruleset recipes ("recipes" thereafter) are community-contributed rulesets suggested to you when you visit a given web site. Recipes are specific, they contain only the necessary rules to un-break a specific web site or behavior.

There is no community contributed recipe for Medium that I know of, but we can write our own and hook it into uMatrix, provided that we have a website somewhere. Once added to uMatrix, we can enable it temporarily with a couple of clicks and then dump all of its temporary additions later.

First we need to create a text file with the rulesets we want and the necessary rules in them. For my Medium rules, what we need looks like this:

$ cat recipes_cks_en.txt
! uMatrix: Ruleset recipes 1.0
! Title: Chris's additional rulesets for English websites
! Maintainer: Chris Siebenmann
!

Medium no account
   medium.com *
      _ 1st-party script
      _ gist.github.com script

Next we need to put this on a website somewhere. Generally this should be a HTTPS website that you trust, for safety. Having done this we next need to add our recipes URL to uMatrix. This is done by going to the uMatrix dashboard, going to the Assets tab, and then down at the bottom of the 'Ruleset recipes' section you will see an 'Import...' option. Enable it, enter the URL of your recipes, and click 'Apply changes'. There, you're done; your new recipes are now available through uMatrix's regular interface for them, described in the ruleset recipes wiki page.

(You can also see the built in recipes in the Assets tab, or look at them on Github. This will give you an idea of what you can do in your own recipes.)

PS: I haven't tried to contribute my Medium recipe because I have no idea if it's complete or truly good enough. It works for me for the things that I care about, more or less, but I don't care very much about having all of Medium's various peculiarities working correctly (or correctly being blocked).

UMatrixPersonalRulesets written at 23:11:11

2019-11-01

The appeal of text templating systems for generating HTML

One of the things that some people love in web frameworks and other people hate is HTML page generation that's based around some form of evaluation of text-based templates. In systems that I'm familiar with, both Django and Go have such a templating system. From some perspectives, such systems aren't ideal; for example, as I mentioned in my entry on XHTML's implications for page generation, truly text based templating systems can't easily enforce strict correctness in the results. My view is that text templating systems have a deep appeal for good reasons; fundamentally, they're a good match for both our tools and how we often think about and write HTML.

There are two ways to view a HTML document or document fragment. One of them is that it is a bunch of text with some greater or lesser degree of markup; another is that it is a tree of nodes (we can call this the DOM based view). The tree based view is how the browser will actually deal with our HTML, but for people it has two connected problems. First, we mostly lack the tools to deal with trees of nodes, while we have plenty of tools for working with text (I think it's telling that common browser web development tools generally show us the page's DOM tree in a textual representation). Second, we mostly don't create documents as trees or think of them that way; instead we create and view them as some mixture of running text (which we add markup to) and structural blocks (possibly nested ones). The actual tree structure is an emergent property of putting together the marked up text and structural blocks in sequence.

All of this makes a text based templating system very natural. It works well with our text based tools and matches how we write and view HTML. For running text, we can read or write the whole thing at once and mostly skip over the markup. In fact we can write the text first, then go back to add the markup it needs to make it come out right in HTML. It's easy to change our mind if we decide that some bit should or shouldn't be emphasised, or link to something, or whatever; it needs no structural rearrangement, just some markup added, deleted, or changed.
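
As a small illustration of this (a sketch of my own in plain Python, not how DWiki or any particular framework actually does it), a text template treats HTML as nothing more than text with holes in it, which is exactly why it fits this way of writing and also why it can't enforce well-formed results:

from string import Template

# The template is just text; the markup in it is whatever we typed.
entry = Template("<h2>$title</h2>\n<p>$body</p>\n")

# Adding or removing emphasis is a text edit, not a tree operation, and
# nothing here will notice that the <em> never gets closed.
print(entry.substitute(title="Text templating",
                       body="It reads as running text with some <em>markup."))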

This is not true with current tools for dealing with trees. They are not universal in the way that text is, and they necessarily force structural rearrangement if you decide to change the markup because often changing the markup changes the tree structure. If you add markup, you must split a tree node and often create sub-nodes; if you delete markup, you must merge sub-nodes back in. And if your editing environment insists that the tree is always correct, you often get extra hassles as you write and periodic interruptions of your flow to rearrange your tree. Perhaps someday all of these issues may be fixed, but they aren't today; the tools are just not up to the level and the universality of editors and other things that deal with text.

(All of this should sound very familiar from attempts to make programming languages that aren't represented in text but are instead always dealt with as some form of parse trees.)

HTMLTextTemplateAppeal written at 23:53:11

2019-10-28

One of XHTML's practical problems was its implications for web page generation

I recently ran across The evolution of the web, and a eulogy for XHTML2, which has a much more positive view of XHTML(2) than I do; my view is not positive at all. In the ensuing discussion on lobste.rs I realized a new aspect of the practical problems with XHTML, which is the page creation side.

(My usual XHTML objections focus on the web user side of things, where XHTML's nominal requirement for draconian error handling (any XHTML errors would cause browsers to show you nothing of the page) clashed badly with practical usability, especially as people demonstrably mostly didn't write correct XHTML. A web full of error pages is not a good web.)

Because the consequences of invalid XHTML are so severe, XHTML and the W3C were essentially demanding that everyone change how they created web pages so that they only created valid XHTML. For individually created web pages, ones authored by people (and thus in moderate volume), this is theoretically not a huge problem; people can be pushed to run XHTML validators before they publish, or use XHTML aware editing environments that don't let them make mistakes in the first place.

It is a huge problem for dynamically generated web pages, though, or more exactly for the software that does it. Put simply, text templating is not compatible with XHTML in practice (partly because there are a lot of ways to go wrong in XHTML). At scale, the only safe way to always end up with valid XHTML is to use a page generation API that simply doesn't allow you to do anything other than create valid XHTML. Almost no one generating dynamic pages uses or used such an API, which meant that switching to XHTML would have required modifying their software at some level.

(A page generation system that throws an error when you generate an invalid XHTML page isn't good enough. From Amazon's perspective, it doesn't matter whether it was the user's browser or their page rendering system that caused a product page to not display; either is bad.)
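
To show the shape of the alternative, here is a sketch of my own (using Python's standard xml.etree.ElementTree purely as an illustration) of a page generation approach where malformed markup simply can't happen, because every piece of the page goes through tree-building calls; full XHTML validity takes more than this, but you can't produce unclosed or misnested tags:

import xml.etree.ElementTree as ET

html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
body = ET.SubElement(html, "body")
p = ET.SubElement(body, "p")
p.text = "Every element is built through the API, so tags "
em = ET.SubElement(p, "em")
em.text = "cannot"
em.tail = " be left unclosed or misnested."

# Serializing the tree always yields well-formed markup.
print(ET.tostring(html, encoding="unicode"))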

Since XHTML got web sites nothing in practice, no one of any size was ever likely to do this. And even by the late 00s, more and more web sites were using more and more automatically generated pages. Even today a very large number of automatically generated pages are produced through text templating systems, which remain very popular in things like (server side) web frameworks.

(I maintain that there are very good reasons for this, but that's for another entry.)

XHTMLAndPageGeneration written at 21:31:32

2019-10-27

An interesting little glitch in how Firefox sometimes handles updates to addons

Every so often I run into a bug where the implementation shows through, or at least it looks like it does. Today's is in Firefox. On Twitter, I said:

It's pretty clear that the Firefox developers don't both leave their Firefox sessions running all the time and use addons that update frequently. I could file a bug, but bleah.

There's an additional condition for this glitch that I forgot to put in my first tweet, which is that you almost certainly need to have addons set to not auto-update.

When you have addons set to not auto-update, about:addons can have a tab for 'Available Updates'. For your convenience, the icon and text for the tab have a count, and if you go to the tab you can see the addons with pending updates and get an option to update each of them. The glitch comes about if a particular addon accumulates more than one pending update before you update it. If it does, the tab's count will never go to zero and disappear until you restart Firefox, even if there are no addon updates left pending.

(Sometimes this happens if you just let Firefox sit for long enough, for example if it's running over a long weekend on your work desktop; sometimes this happens if there's one update that Firefox has auto-detected and then you ask Firefox to 'Check for updates' and it detects a second update to the addon.)

My guess as to how this glitch came about is that the implementation counts detected updates, not addons that have at least one pending update. Every time Firefox detects a pending update, it increases the count, and every time it applies an update it decreases it again. But the problem here is that Firefox only ever updates to the most recent version for an addon even if it has accumulated several new versions, which means that if an addon has multiple updates, the count gets incremented more than it gets decremented. Restarting Firefox causes it to redo everything from scratch, at which point it notices at most one pending update per addon (the most recent update) and the count is correct (for a while).
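
Schematically, the mismatch looks something like the following sketch (this is purely my guess at the logic, written out in Python; it is not Firefox's actual code, and the addon and update names are made up):

# Purely schematic guess at the logic; not Firefox's actual code.
pending_count = 0
newest_pending = {}        # addon -> newest detected update

def detect_update(addon, version):
    global pending_count
    newest_pending[addon] = version
    pending_count += 1     # incremented once per *detected* update

def apply_update(addon):
    global pending_count
    del newest_pending[addon]   # only the newest version gets installed...
    pending_count -= 1          # ...so the count only comes down by one

detect_update("some dev addon", "first update")
detect_update("some dev addon", "second update")
apply_update("some dev addon")
print(pending_count)       # 1, even though nothing is actually pending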

(In my case I've decided to use the development versions of uBlock Origin and uMatrix as a very small way of helping out in their development. I've never noticed any new glitches or bugs, but maybe someday I'll contribute.)

FirefoxAddonsUpdateGlitch written at 22:19:53

2019-10-18

My little irritation with Firefox's current handling of 'Do-Not-Track'

The proposed Do-Not-Track HTTP feature was either a noble or a naive attempt by various people to get websites not to track you if you asked them not to. It worked about as well as you'd expect, which is to say not at all in practice. Allegedly, for a long time having your browser send a DNT header made it easier to fingerprint you, because so few people did it that you stood out all the more.

(This may no longer be the case, for reasons we're about to see.)
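
For reference, the header itself is about as small as HTTP headers get; a request from a browser with the setting turned on just carries one extra line, along these lines (the URL and host here are only an example):

GET /some/article HTTP/1.1
Host: medium.com
DNT: 1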

For a long time, Firefox provided a setting to send or not send a DNT header with requests. Although I already used a variety of Firefox addons and settings to stop being tracked, I turned this setting on basically as a gesture to websites to tell them they had no excuse. I didn't worry about this making me easier to fingerprint, because even without DNT my particular combination of User-Agent and other browser attributes was generally very close to unique (as measured by eg the EFF's Panopticlick).

Recently, two things happened here. The first is that Firefox changed its Do-Not-Track behavior when they added tracking protection as part of their content blocking. After this was added, your two choices with DNT are either sending it all the time or sending it if you have Firefox block tracking; there is no option to have Firefox block tracking but not send a DNT header. At one level this makes perfect sense, but at another level it runs into the second issue, which is that I found some websites that behave differently in an inconvenient way if DNT is set. Specifically, Medium will block certain embedded content in Medium articles (both on its own site and on the many sites that just publish with Medium), as covered (currently) in Medium's Do Not Track Policy. For me, clicking through often doesn't work very well, so I would like it if Medium didn't do this.

Although it pains me, what I should probably do is turn off Firefox's own tracking protections to whatever degree is required to not trigger this Medium behavior. I'm already relying on uBlock Origin for my anti-tracking protection, so the built in stuff in Firefox is just a backup and may not be doing anything for me in general. Of course, this assumes that I've correctly understood what is going on here with Medium in the first place, because it's always possible that something else about my environment is triggering their 'DNT' stuff (for example, perhaps uBlock Origin is blocking something).

(I was going to be confident about what was going on, but then I started trying to verify that my Firefox was or wasn't sending a DNT header under various circumstances. Now I'm a lot less sure.)

FirefoxDNTIrritation written at 22:43:49

2019-10-14

Googlebot is both quite fast and very determined to crawl your pages

I recently added support to DWiki (the engine behind Wandering Thoughts) to let me more or less automatically generate 'topic' index pages, such as the one on my Prometheus entries. As you can see on that page, the presentation I'm using has links to entries and links to the index page for the days they were posted on. I'm not sure that the link to the day is particularly useful but I feel the page looks better that way, rather than just having a big list of entry titles, and this way you can see how old any particular entry is.

The first version of the code had a little bug that generated bad URLs for the target of those day index page links. The code was only live for about two hours before I noticed and fixed it, and the topic pages didn't appear in the Atom syndication feed, just in the page sidebar (which admittedly appears on every page). Despite that short window, Googlebot crawled at least one of the topic pages in that time and almost immediately began trying to crawl the bad day index page URLs, all of which generated 404s.

You can probably guess what happened next. Despite always getting 404s, Googlebot continued trying to crawl various of those URLs for about two weeks afterward. At this point I don't have complete logs, but for the logs that I do have it appears that Googlebot only tried to crawl each URL once; there just were a bunch of them. However, I know that its initial crawling attempts were more aggressive than the tail-off I have in the current logs, so I suspect that each URL was tried at least twice before Googlebot gave up.

(I was initially going to speculate about various things that this might be a sign of, but after thinking about it more I've realized that there really is no way for me to have any good idea of what's going on. So many things could factor into Googlebot's crawling decisions, and I have no idea what is 'normal' for its behavior in general or its behavior on Wandering Thoughts specifically.)

PS: The good news is that Googlebot does appear to eventually give up on bad URLs, or at least bad URLs that have never been valid in the past. This is what you'd hope, but with Googlebot you never know.

GoogleCrawlingPersistence written at 23:15:31

2019-10-05

The wikitext problem with new HTML elements such as <details>

I recently wrote about my interest in HTML5's <details> element. One of the obvious potential places to use <details> (when it becomes well supported) is here on Wandering Thoughts; not only is it the leading place where I create web content, but I also love parenthetical asides (perhaps a little too much) and <details> would be one way to make some of them less obtrusive. Except that there is a little problem in the way, which is that Wandering Thoughts isn't written in straight HTML but instead in a wikitext dialect.

When you have a wiki, or in general any non-HTML document text that is rendered down to HTML, using new HTML elements is necessarily a two step process. First, you have to figure out what you're going to sensibly use them for, which is the step everyone has to do. But then you have a second step of figuring out how to represent this new HTML element in your non-HTML document text, ideally in a non-hacky way that reflects the resulting HTML structure and requirements (for example, that <details> is an inline 'flow' element, not a block element, which actually surprised me when I looked it up just now).

Some text markup languages allow you to insert arbitrary HTML, which works but is a very blunt hammer; you're basically going to be writing a mix of the markup language and HTML. There probably are markup languages that have extra features to improve this, such as letting you tell them something about the nesting rules and so on for the new HTML elements you're using. My wikitext dialect deliberately has no HTML escapes at all, so I'd have to add some sort of syntax for <details> (or any other new HTML element) before I could use it.

(Life is made somewhat simpler because <details> is a flow element, so it doesn't need any new wikitext block syntax and block parsing. Life is made more difficult because you're going to want to be able to put a lot of content with a lot of markup, links, and so on inside the <details>, which means that certain simplistic approaches aren't good answers in the way they are for, for example, <ABBR>.)

At a sufficiently high level, this is a general tradeoff between having a single general purpose syntax as HTML does (okay, it has a few) and having a bunch of specialized syntaxes. The specialized syntaxes of wikitext have various advantages (for instance, it's a lot faster and easier for me to write this entry in DWikiText than it would be in HTML), but they also lack the easy, straightforward extensibility of the general purpose syntax. If you have a different syntax for everything, adding a new thing needs a new syntax. With HTML, you just need a name (and the semantics).

('Syntax' is probably not quite the right word here.)

HTMLDetailsWikiProblem written at 18:44:04

2019-10-01

My interest in and disappointment about HTML5's new <details> element

Because I checked out from paying attention to HTML's evolution years ago, it took me until very recently to hear about the new <details> element from HTML5. Put simply and bluntly, it's the first new HTML element I've heard of that actually sounds interesting to me. The reason for this is straightforward; it solves a problem that previously might have taken Javascript or at least complex CSS, namely the general issue of having some optional information on a web page that you can reveal or hide.

(That's the surface reason. The deeper reason is that it's the only new HTML5 tag that I've heard of that has actual browser UI behavior associated with it, instead of just semantic meaning.)
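
A minimal example (my own, not taken from anywhere in particular) looks like this; the browser shows the summary line with a disclosure widget and only reveals the rest when the reader asks for it, with no Javascript or CSS involved:

<details>
  <summary>Optional details</summary>
  The additional information, hidden until the reader clicks to reveal it.
</details>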

Now that I've heard of it, I've started to notice people using it (and I've also started to assume that if I click on the browser UI associated with it, something will actually happen; unfortunately Firefox's current rendering doesn't make it obvious). And when I look around, there are all sorts of things that I might use <details> for, both here on Wandering Thoughts and elsewhere, because optional or additional information is everywhere if you look for it.

(Here on Wandering Thoughts, one form of 'optional' information is comments on blog entries. Currently these live behind a link that you have to click and that loads a separate page, but <details> would let them be inline in the page and revealed more readily. Of course there are various sorts of tradeoffs on that.)

I was all set to make this a very enthusiastic entry, but then I actually looked at the browser compatibility matrix from MDN and discovered that there is a little problem; <details> is not currently supported in Microsoft Edge (or IE). Edge may not be as popular as it used to be, but I'm not interested in cutting off its users from any of my content (and we can't do that at work). This can be fixed with a Javascript polyfill, but that would require adding Javascript and I'm not that interested.

Given that Edge doesn't support it yet and that IE is out there, it will probably be years before I can assume that <details> just works. Since the 'just works' bit is what makes it attractive to me, I sadly don't think I'm going to be using it any time soon. Oh well.

(HTML5 has also added a number of important input types; I consider these separate from new elements, partly because I had already somewhat heard about them.)

HTMLDetailsNotYet written at 23:24:06
