Wandering Thoughts archives

2018-04-30

Microsoft's Bingbot crawler is on a relative rampage here

For some time, people in various places have been reporting that Microsoft Bing's web crawler is hammering them; for example, Discourse has throttled Bingbot (via). It turns out that Wandering Thoughts is no exception, so I thought I'd generate some numbers on what I'm seeing.

Over the past 11 days (including today), Bingbot has made 40998 requests, amounting to 18% of all requests. In that time it's asked for only 14958 different URLs. Obviously many pages have been requested multiple times, including pages with no changes; the most popular unchanging page was requested almost 600 times. Quite a lot of unchanging pages have been requested several times over this interval (which isn't surprising, since most pages here change only very rarely).
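Counts like these can be pulled out of a web server access log by tallying requests per user agent. What follows is only a minimal sketch of that idea, not the script used for the numbers above; it assumes the server writes Apache-style combined log format, and the names `LOG_LINE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "METHOD url PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<url>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, agent_substring="bingbot"):
    """Tally all requests, plus per-URL hits for one crawler's user agent."""
    total = 0
    url_hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that don't look like log entries
        total += 1
        if agent_substring in m.group("agent").lower():
            url_hits[m.group("url")] += 1
    crawler_requests = sum(url_hits.values())
    return {
        "total_requests": total,
        "crawler_requests": crawler_requests,
        "distinct_urls": len(url_hits),
        "share": crawler_requests / total if total else 0.0,
        "most_repeated": url_hits.most_common(1),
    }
```

Feeding it an open log file (for example `summarize(open("access.log"))`, where the file name is a placeholder) gives the total request count, the crawler's request count and share, how many distinct URLs it asked for, and its most re-requested URL.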

Over this time, Bingbot is the single largest source by user-agent (and the second-place source is claimed by a bot that is completely banned; after that come some syndication feed fetchers). For scale, Googlebot has only made 2800 requests over the past 11 days.

Traffic fluctuates from day to day but there is clearly a steady volume. Traffic for the last 11 days is, going backward from today, 5154 requests, then 2394, 2664, 3855, 1540, 2021, 3265, 7575, 2516, 3592, and finally 6432 requests.

As far as bytes transferred go, Bingbot came in at 119.8 Mbytes over those 11 days. Per day volume is 14.9 Mbytes, then 6.9, 7.3, 11.5, 4.6, 5.8, 8.8, 22.9, 6.7, 10.8, and finally 19.4 Mbytes. On the one hand, the total Bingbot volume by bytes is only 1.5% of my total traffic. On the other hand, syndication feed fetches are about 94% of my volume and if you ignore them and look only at the volume from regular web pages, Bingbot jumps up to 26.9% of the total bytes.
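The jump from 1.5% to roughly a quarter of the bytes is just a change of denominator, and the repeat-visit rate falls out of the request and URL counts. Here is a small sketch of the arithmetic using only the rounded figures quoted above; the helper name `share_excluding` is made up for illustration.

```python
# Figures quoted in the text above.
bingbot_requests = 40998
bingbot_distinct_urls = 14958
bingbot_share_of_bytes = 0.015  # "only 1.5% of my total traffic"
feed_share_of_bytes = 0.94      # "about 94% of my volume" (rounded)


def share_excluding(part_share, excluded_share):
    """Return part's share of what remains once excluded_share is set aside."""
    return part_share / (1.0 - excluded_share)


requests_per_url = bingbot_requests / bingbot_distinct_urls
non_feed_share = share_excluding(bingbot_share_of_bytes, feed_share_of_bytes)

print(f"average requests per distinct URL: {requests_per_url:.2f}")
print(f"Bingbot share of non-feed bytes: {non_feed_share:.1%}")
```

With these rounded inputs the non-feed share comes out at about 25% rather than 26.9%; the quoted figure was presumably computed from unrounded byte counts (a feed share nearer 94.4% would produce it). The average works out to roughly 2.7 requests per distinct URL.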

I think that all of this crawling is excessive. It's one thing to want current information; it's another thing to be hammering unchanging pages over and over again. Google has worked out how to get current information with far fewer repeat visits to fewer pages (in part by pulling my syndication feed, presumably using it to drive further crawling). The difference between Google and Bing is especially striking considering that far more people seem to come to Wandering Thoughts from Google searches than come from Bing ones.

(Of course, people coming from Bing could be hiding their Referers far more than people coming from Google do, but I'm not sure I consider that very likely.)

I'm not going to ban Bing(bot), but I certainly do wish I had a useful way to answer their requests very, very slowly, both to discourage them from visiting so much and to push them to be smarter about what they do visit.

BingbotOutOfControl written at 00:06:52

2018-04-27

Some notes on Firefox's current media autoplay settings

I am quite violently against videos ever auto-playing in my regular browser, under basically any circumstances ever (including ones like the videos that Twitter uses for those GIFs that people put in their tweets). I hate it with audio, I hate it without audio, I hate it on the web page I'm currently reading, I hate it on the web page in another tab. I just hate it.

I've traditionally used some number of extensions to control this behavior on prominent offenders like YouTube (in addition to setting various things to 'ask before activating'). When I wrote about my switch to Firefox Quantum, I said that I was experimenting with just turning off Firefox's media.autoplay.enabled preference and it seemed to work. I had to later downgrade that assessment and tinker with additional preferences, and I have finally dug into the code to look at things in more depth. So here are some notes.

First, there appear to be at least two circumstances when a video will request autoplaying. The usual case is when it's embedded in a web page and either its <video> element asks for it or some JavaScript runs that tries to start it. The second case is when at least some .mp4 files are directly loaded as standalone URLs (either in the current page or in a new tab), for example because someone just directly linked to one and you clicked on the link (yes, some people do this). As far as I can tell Firefox follows the same 'can it autoplay' checks for both cases.

The actual Firefox code that implements the autoplay policy checks is pretty short and sort of clear; it's the obvious function in AutoplayPolicy.cpp. As far as I can follow the various bits, it goes like this:

  • if media.autoplay.enabled is true (the default case), the autoplay is immediately allowed. If it's false, we don't reject it immediately; instead, we continue on to make further checks and may still allow autoplay. As a result, the preference is misnamed (likely for historical reasons) and should really be called something like media.autoplay.always_allow.

    (There is currently no Firefox preference that totally and unconditionally disables autoplay under all circumstances.)

  • Pages with WebRTC camera or microphone permissions are allowed to autoplay, presumably so that your video conferencing site works smoothly.

  • if media.autoplay.enabled.user-gestures-needed is false (the default), whether autoplay is allowed or forbidden is then based on whether the video element is 'blessed' or, I think, whether the web page is the currently focused web page that's handling user input (ie, it's not hidden off in a tab or something). As far as blessing goes, the code comments for this say:

    True if user has called load(), seek() or element has started playing before. It's only use for checking autoplay policy[.]

  • if media.autoplay.enabled.user-gestures-needed is true, Firefox checks to see if the video will be playing without sound. If it will be silent, the video is allowed to autoplay, even if it is not in the current tab and you haven't activated it in any way.

    If the video has audio, it's allowed to autoplay if and only if the web page has been activated by what comments call 'specific user gestures', which I think means you clicking something on the web page or typing at it.

This means the behavior of silent videos is different based on whether or not you have m.a.e.user-gestures-needed set. If it's the default false, a silent video in another tab will not autoplay. If you've set it to true to get more control in general, you paradoxically get less control of silent videos; they'll always autoplay, even when they've been opened in another tab that you haven't switched over to yet.

(My current fix for this is to comment out the audio checking portion of that code in my own personal Firefox build, so that silent videos get no special allowances. A slightly better one might be to immediately deny autoplay if EventStateManager::IsHandlingUserInput() is false, then check audio volume; if I'm understanding the code right, this would allow silent video to autoplay only on the current page. Since I don't want videos to ever autoplay, I'm fine with my fix and I may someday try making the entire function just immediately return false.)
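The decision flow described in the list above can be modeled in a few lines. This is a sketch of my reading of the policy as written up here, not a translation of the actual code in AutoplayPolicy.cpp; the `Video` fields and the `may_autoplay` function are hypothetical names, and the two keyword arguments stand in for the media.autoplay.enabled and media.autoplay.enabled.user-gestures-needed preferences.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # All fields are illustrative stand-ins for state Firefox tracks itself.
    blessed: bool = False                   # load()/seek() called or played before
    silent: bool = False                    # would play without sound
    page_has_webrtc_permission: bool = False
    page_handling_user_input: bool = False  # the focused page taking user input
    page_activated_by_gesture: bool = False # user clicked on or typed at the page


def may_autoplay(video, autoplay_enabled=True, user_gestures_needed=False):
    """Model of the autoplay checks as described in the text above."""
    if autoplay_enabled:
        return True  # the preference acts as "always allow"
    if video.page_has_webrtc_permission:
        return True  # camera/microphone pages are let through
    if not user_gestures_needed:
        return video.blessed or video.page_handling_user_input
    if video.silent:
        return True  # silent videos are always allowed on this path
    return video.page_activated_by_gesture


# A silent, never-touched video in a background tab.
background_silent = Video(silent=True)
print(may_autoplay(background_silent, autoplay_enabled=False,
                   user_gestures_needed=False))
print(may_autoplay(background_silent, autoplay_enabled=False,
                   user_gestures_needed=True))
```

The two calls at the end show the paradox with silent videos: with the gestures preference left false the background video is refused, and with it set to true the same video is allowed.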

Turning off media.autoplay.enabled does cause a certain amount of glitches for me on YouTube, but so far nothing insurmountable; people have reported more problems with other sites (here is one explanation). The Mozilla people are apparently actively working on this area, per bug 1231886, which has quite a number of useful and informative comments (eg), and bug 1420389.

(As far as other video sites go, generally I don't have uMatrix set up to allow them to work in the first place so I just turn to my alternate browser. I only have YouTube set up in my main Firefox because I wind up on it so often and it's relatively easy.)

FirefoxMediaAutoplaySettings written at 01:13:10

2018-04-05

Switching over to Firefox Quantum was relatively painless

As you might have guessed from my very weak excuse in a recent entry, I've been increasingly tempted to switch my primary browser over to Firefox Quantum (from Firefox 56). Not because I knew I had to do it sometime (although that was true), but because I genuinely wanted to be running Quantum; the more I used it in various secondary environments, the more I was okay with it, and I have a tropism towards the new and shiny. Today I gave in to that temptation and switched over both at work and at home. The short summary is that it went reasonably painlessly.

There are things that aren't as good as Firefox 56; the most glaring is that there are any number of annoying places where gestures don't work any more, such as a new blank tab or the error page you get when a network connection times out (I'm used to gesturing up-down to cause a refresh in order to retry the connection). I'm also having the usual issues when Firefox's GUI moves controls that I'm extremely used to (I expect 'refresh' to be at the right side of the URL box, for example). But these are reasonably minor and tolerable (and I'll probably get used to the UI switch in time).

(Perhaps someday Mozilla will figure out a way of letting people very selectively grant more permissions to certain addons, so we can have gestures in more places.)

I don't know if I'm imagining things or not, but Firefox Quantum at least feels faster and more responsive than Firefox 56 did. Of course this is what Mozilla said people would experience, but I browse in an atypical environment that isn't bogged down by all of that JavaScript so I wasn't sure how much of the Quantum speedups would apply to me. Since some of the Firefox improvements are in things like processing CSS, I'm willing to believe that I'm seeing something real here.

(There's also that Firefox Quantum is inherently multiprocess now, whereas I was running Firefox 56 in single-process mode because not all of my addons were e10s compatible.)

While I'm glad that I finally made the switch, I'm also glad that I took so long to make it. Getting to the point where this switch was relatively painless took a bunch of experimentation, testing, research, and a certain amount of hacking. I've also benefited from all of the work that other people have done to develop and improve new Firefox Quantum addons, and the improvements in the WebExtensions API itself that have happened since Firefox 57.

(I've been building Firefox Nightly and trying out things in it for months now, and more recently I've switched various other Firefox instances and used them, starting when I accidentally let my Fedora laptop force-upgrade Firefox despite me nominally having held the Fedora package at Firefox 56.)

PS: I'll admit that I knew I was going to have to do this before too long, as uBlock Origin will be dropping support for its legacy version in early May.

PPS: The one difference from my set of Quantum addons is that I'm experimenting with just turning off media.autoplay.enabled and not installing a 'disable autoplay on Youtube' addon. This seems to work so far.

FirefoxQuantumSwitch written at 01:16:41

2018-04-02

I've retired my filtering HTTP proxy

I've been using a filtering HTTP proxy for a very long time; when I last looked into it, it appeared that I'd been using one for almost as long as they've existed. A couple of years ago, I wrote that it was time for me to upgrade the proxy I was using, because it had last been updated in 1998 and was stuck supporting only HTTP/1.0 and IPv4. In my usual way of not doing anything about pending issues as long as nothing explodes, I did nothing about it from that mid-2016 entry until very recently. When I did start to think about it this January, I decided to take a different course entirely, and I've now retired my filtering HTTP proxy and rely purely on in-browser protections.

Two things pushed me into realizing that this was the only sensible position. The first was realizing that any useful filter on the modern Internet was (and is) going to require frequent updates to filter rules. You can do this with a filtering proxy, but you need to find one that uses trustworthy external filtering rules, imports them regularly, and so on. This can be done, in theory, but I don't think anyone is doing it in practice as a canned thing today, and I believe that all of the good filtering rulesets are designed for in-browser usage these days (for the obvious reason that this is by far the biggest pool of users).

The second is the rapid increase in HTTPS. Back in mid 2016 I saw plenty of HTTP usage living on for a great deal of time to come, but that seems like a much less certain bet today for various reasons. HTTPS usage is certainly way up and there's no filtering HTTP proxy in existence that I would even think about allowing to do HTTPS interception. Browsers have a hard enough time doing HTTPS securely, and they have far more people working to make everything work well and safely than proxy authors ever will. If I want to do filtering for HTTPS traffic, and I do, I have to rely on my browser addons to do it. As more and more sites move to HTTPS, I'm going to have to rely on my browser addons more and more for protection.

In summary, any proxy I used would clearly only be a secondary backup for the real protection of my addons (since it wouldn't protect me from HTTPS and probably wouldn't have rules as good as my addons do). Once I realized all of this, I decided to simplify my life by not using any sort of filtering HTTP proxy, and back at the end of January I turned my old faithful Junkbuster daemon off and de-configured it from my primary Firefox. I don't think I've noticed any particular difference in my browsing, which is not much of a surprise since its filtering rules were probably last updated 20 years ago, like the rest of my Junkbuster install.

(It was throwing away HTTP cookies, but I have other solutions for that now.)

More broadly, it seems clear that the future and even present of filtering is inside the browser, primarily (for now) in browser addons. Filtering proxies are yesterday's technology, used before browsers could do this sort of thing natively. Browser addons are where all the development effort is going, which is why filtering proxy software sees less and less frequent updates (Privoxy was last updated in 2016, for example).

I expected to feel a little sad about this simply because I've run a filtering proxy for so long, but if anything I wound up feeling relieved. Junkbuster's various limitations are things I inflicted on myself voluntarily in exchange for its benefits, but I'm unsentimental about being able to do better now. Still, thanks, little program; I suspect you vastly outlived what your authors expected of you.

(I guess I am just a tiny bit sentimental about it.)

NoMoreProxy written at 00:32:58



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.