2023-01-25
You should back up the settings for your Firefox addons periodically
Today I had some unexpected excitement with two of my core Firefox addons in my core browser, where either or both of uBlock Origin and uMatrix apparently stopped working and stopped everything else from working along with them, since they're on the critical path for getting web pages. I eventually got everything working again, but I wound up needing to remove and then reinstall uBlock Origin and uMatrix from scratch, both of which have complex configurations. This is where I discovered that my most recent backups of those settings were from 2020 (and from a different machine). Oops.
(I have full filesystem backups, but as far as I know you can't easily extract an addon's settings from a Firefox profile directory, so I would have had to completely restore my entire Firefox profile.)
Many of my Firefox addons have some sort of configuration settings, and yours probably do too (if you use addons). uMatrix and uBlock Origin have a collection of filtering settings, Foxy Gestures has my gesture customizations, Stylus has a bunch of styles, Cookie AutoDelete knows which cookies I don't want to delete, and so on. All of these would be annoying or painful to have to recreate from scratch, and all of these addons offer a way to back up ('export') and restore ('import') their settings. I've done that before (although not for all of my addons), but up until now I've only been doing it very sporadically, as in once every few years (even though my settings for some extensions change much more often than that).
That's why I say back up your Firefox addon settings every so often. You never know when you may need to remove and then re-install an addon, and you can even do it accidentally (for reasons out of the scope of this entry, I once accidentally removed uBlock Origin). It'll also make it much less painful if you ever have to completely redo your Firefox profile. And you can also use your backups to set up a new instance of Firefox elsewhere, for example on a different machine. Unfortunately you'll have to do this by hand and addons don't have a consistent process for how to do it, which for most people (me included) does get in the way of doing it regularly.
(My initial fediverse post about backups was just about uBlock Origin and uMatrix, which are the addons where things change the most frequently, but the more I thought about it the more I realized it applied to most of my other addons too, especially Stylus.)
PS: Since I did the Internet searches, see this answer, this answer, and this question and answer for information on where Firefox stores an addon's data (including your settings). The short version of all of those answers is that you probably don't want to try doing this unless you're really desperate to get the data out, although it's technically possible.
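(As an illustration of what's involved, here's a rough sketch in Python of finding the places in a profile directory where addon data apparently lives, going by those answers. The directory names it looks for are assumptions taken from those answers rather than any official interface, and it stops well short of the hard part, which is getting your settings back out of the IndexedDB SQLite files.)

    # A rough sketch of poking through a Firefox profile for addon data,
    # based on the answers linked above. The directory names are assumptions
    # from those answers, not an official interface, and may change between
    # Firefox versions; actually extracting settings isn't attempted here.
    import pathlib
    import sys

    if len(sys.argv) != 2:
        sys.exit("usage: scan-profile.py /path/to/firefox/profile")
    profile = pathlib.Path(sys.argv[1])

    # Older-style per-addon data, one directory per addon ID (per those answers).
    for p in sorted((profile / "browser-extension-data").glob("*")):
        print("extension data:", p)

    # IndexedDB-backed addon storage, kept under moz-extension origins.
    for origin in sorted((profile / "storage" / "default").glob("moz-extension+++*")):
        for db in origin.glob("idb/*.sqlite"):
            print("indexeddb storage:", db)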
2023-01-21
How Prometheus makes good use of the HTTP Accept: header
Over on the Fediverse, Simon Willison asked if the HTTP Accept: header was a good idea, which he later narrowed down to APIs and HTML content, excluding media (video, images, etc). I realized that I knew of a good example for APIs, which is how Prometheus metrics exporters use Accept to determine what format they'll report metrics in (although it turns out that I was a bit wrong in my Fediverse post).
Prometheus metrics exporters are queried ('scraped') by Prometheus and respond with metrics in some format. Historically there has been more than one format, as sort of covered in Exposition Formats; currently there's two text ones (Prometheus native and OpenMetrics) and one binary one (with some variations). The text based formats are easy to generate and serve by pretty much anything, while the binary format is necessary for some new things (and may have been seen as more efficient in the past). A normal metrics exporter (a 'client' in a lot of Prometheus jargon) that supports more than one format will choose which format to reply with based on the query's HTTP Accept header, defaulting to the text based format.
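To make this concrete, here is a minimal sketch (in Python, and not taken from any real exporter's code) of an exporter choosing its reply format from the Accept header. The two content types are the real ones for the text formats; the port, the sample metric, and the simple substring check are toy stand-ins.

    # A minimal sketch of picking the reply format from the Accept: header.
    # The content types are the real ones for the two text formats; the
    # metric and the selection logic are deliberately simplified.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PROM_TEXT = "text/plain; version=0.0.4; charset=utf-8"
    OPENMETRICS = "application/openmetrics-text; version=1.0.0; charset=utf-8"

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            accept = self.headers.get("Accept", "")
            # Use OpenMetrics only if the scraper explicitly says it can take
            # it; otherwise fall back to the classic Prometheus text format.
            if "application/openmetrics-text" in accept:
                ctype = OPENMETRICS
                body = "# TYPE demo_up gauge\ndemo_up 1\n# EOF\n"
            else:
                ctype = PROM_TEXT
                body = "# TYPE demo_up gauge\ndemo_up 1\n"
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    if __name__ == "__main__":
        HTTPServer(("", 9100), MetricsHandler).serve_forever()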
Supporting multiple metrics formats at one URL has a number of advantages, especially since everything can produce and consume one of the text formats. People setting up Prometheus servers and clients don't have to care about what format each of them supports in order to set the scrape URL, as they would if the format was in the URL (eg, '/metrics/promtext' instead of '/metrics'). Prometheus and other scrapers don't have to make multiple requests in order to discover the best format they want to use, the way they would have to if the starting URL simply returned an index of format options. And the format used is ultimately under the control of the client more than the server, so a metrics exporter can shift between formats if it needs to (for example if you reconfigure it to report something only supported in one format).
(Currently the wire formats can be found listed and described a bit in common/expfmt/expfmt.go. A Prometheus server may scrape hundreds or thousands of targets every fifteen to thirty seconds, so extra HTTP requests could hurt.)
I suspect that Prometheus isn't the only HTTP based API using the Accept header to transparently choose the best format option for sending data, or to allow seamless upgrades of the supported formats over time. As a system administrator who doesn't want to have to work out, configure, and update format specific endpoint URLs by hand, I fully support this.
(In practice the result of forcing system administrators to set up format specific URLs by hand is probably that the format used for transport is basically fixed once configured, even if both sides are later upgraded to support a better option. This is especially the case if different clients are updated at different times.)
As a side note, this only really works in a pull model instead of a push one. If you push, it's more difficult to ask the other end what (shared) format it would like you to send. A pull model such as Prometheus's provides a natural way to negotiate this, since the puller sends what formats they can accept and then the data source can pick the one it wants out of that.
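Here's the matching sketch of the puller's side, where the scraper advertises what it can parse and then sees what it got back. The exact Accept value and its q-values are illustrative rather than Prometheus's literal header, and the URL is the toy exporter from the earlier sketch.

    # The puller's half of the negotiation: advertise what you can parse and
    # let the data source pick. The Accept value here is illustrative, not
    # Prometheus's literal header, and the URL is the toy exporter above.
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:9100/metrics",
        headers={
            "Accept": "application/openmetrics-text;version=1.0.0,"
                      "text/plain;version=0.0.4;q=0.5,*/*;q=0.1",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print("got format:", resp.headers.get("Content-Type"))
        print(resp.read().decode("utf-8"), end="")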
2023-01-17
An aggressive, stealthy web spider operating from Microsoft IP space
For a few days, I've been noticing some anomalies in some metrics surrounding Wandering Thoughts, but nothing stood out as particularly wrong and my usual habit of looking at the top IP addresses requesting URLs from here didn't turn up anything. Then today I randomly wound up looking at the user-agents of things making requests here and found something unpleasant under the rock I'd just turned over:
Today I discovered that there appears to be a large scale stealth web crawler operating out of Microsoft IP space with the forged user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15', which I believe is a legitimate UA. Current status: working out how to block this in Apache .htaccess.
By the time I noticed it today, this spider had made somewhere over 25,000 requests in somewhat over twelve hours, or at least it had with that specific user agent (with all of the volume, it's hard to tell if it also used other ones). It made these requests from over 5,800 different IPs; over 600 of these IPs are on the SBL CSS and one of them is SBL 545445 (a /32 phish server). All of these IP addresses are in various networks in Microsoft's AS 8075, and of course none of them have reverse DNS. As you can tell from the significant number of IPs, most IPs made only a few requests and even the most active ones made no more than 20 (today, by the time I cut them off). This is a volume level that will fly under the radar of anyone's per-IP ratelimiting.
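For what it's worth, the kind of check that turned this up can be approximated with a little log crunching, as in the following sketch; it assumes the standard Apache combined log format and takes the log file name as an argument, which are assumptions about your setup rather than a description of mine.

    # A rough approximation of the check: requests and distinct IPs per
    # user-agent, assuming the Apache combined log format.
    import collections
    import re
    import sys

    # combined format: ip ident user [date] "request" status bytes "referer" "ua"
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    reqs = collections.Counter()
    ips = collections.defaultdict(set)
    with open(sys.argv[1]) as logf:
        for line in logf:
            m = LINE_RE.match(line)
            if not m:
                continue
            ip, ua = m.groups()
            reqs[ua] += 1
            ips[ua].add(ip)

    for ua, count in reqs.most_common(10):
        print(f"{count:7d} requests  {len(ips[ua]):5d} IPs  {ua}")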
(Another person reported a similar experience including the low volume per IP. Also, I assume that there is some Microsoft cloud feature for changing your outgoing IP all the time that this spider is exploiting, as opposed to the spider operator having that many virtual machines churning away in Microsoft's cloud.)
This spider seems to have only shown up about five or six days ago. Before then this user agent had no particular prominence in my logs, but in the past couple of days it's gone up to almost 50,000 requests a day. At that request volume, most of it is spidering or re-spidering uselessly duplicated content; Wandering Thoughts doesn't have that many unique pages.
This user agent is for Safari 15.1, which was released more than a year ago (apparently October 27th, 2021, or maybe a few days before), and as such is rather out of date by now. Safari on macOS is up to Safari 16, and Safari 15 was (eventually) updated to 15.6.1. I don't know why this spider picked such an out of date user agent to forge, but it's convenient; any actual person still running Safari 15.1 needs to update it anyway to pick up security fixes.
(For the moment, the best I could do with my eccentric setup here was to block anyone using the user agent. Blocking by IP address range is annoying, seeing as today's lot of IP addresses are spread over 20 /16s.)
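(For illustration, a user agent block in .htaccess can look something like the following sketch; it assumes mod_rewrite is available and isn't exactly what I wound up using here.)

    # Refuse requests whose User-Agent matches the forged Safari 15.1 string;
    # a sketch assuming mod_rewrite is enabled for .htaccess use.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} "Intel Mac OS X 10_15_7.*Version/15\.1 Safari/605\.1\.15"
    RewriteRule ^ - [F]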
Sidebar: On the forging of user agents
On the Fediverse, I was asked if it wasn't the case that all user-agent strings were forged in some sense, since these days they're mostly a statement of compatibility. My off the cuff answer encapsulates something that I want to repeat here:
There is a widespread de facto standard that spiders, crawlers, and other automated agents must report themselves in their user-agent instead of pretending to be browsers.
To put it one way, humans may impersonate each other, but machines do not get to impersonate humans. Machines who try to are immediately assumed to be up to no good, with ample historical reasons to make such an assumption.
(See also my views on what your User-Agent header should include and why.)
The other thing about this is that compatibility is a matter for browsers, not spiders. If your spider claims to be 'compatible' with Googlebot, what you're really asking for is any special treatment people give Googlebot.
(Sometimes this backfires, if people are refusing things to Googlebot.)
2023-01-12
A browser tweak for system administrators doing (web) network debugging
As a system administrator (and sometimes an ordinary user of the web), I periodically find myself trying to work out why I or people around here can't connect to some website or, sometimes, a portion of the website doesn't work. It turns out that there's a tweak you can make to Firefox and Chrome (and probably other browsers) that makes this somewhat easier to troubleshoot.
(We once had an incident where Google Cloud Platform stopped talking to some of our IPs. Some websites host only a portion of their assets or web application in GCP, so people could load a website's front pages (hosted outside of GCP) but trying to go further or do things in the web app would fail (when it touched GCP and GCP said 'nope'). Even figuring out what was going on took some people here rather a lot of work.)
Modern web browsers have a 'Web Developer Tools' environment that includes a Network tab that will tell you about the requests the current page is doing. By default the information the Network tab presents is focused on the interests of web developers and so lacks some information that system administrators find very helpful. However, you can customize it, and in particular you can make it also show the (HTTP) protocol being used and the remote IP, which are extremely useful for people like me.
To do this, call up Web Developer Tools with, for example, Ctrl+Shift+I. Switch to the Network tab if you're not already on it, and make a request so that the tab displays some data and you have the column headers. Right click on the column headers and turn on 'Protocol' and 'Remote IP'. Turning on 'Scheme' is optional (it will probably mostly be 'https'), but it will let you spot websocket connections if you want to check or verify that you have one. Knowing the HTTP protocol is important these days because HTTP/3 is an entirely different transport and may run into firewall issues that HTTP/2 and HTTP/1.1 don't.
(This isn't relevant if you've turned HTTP/3 off in your browser, but then your users probably don't have it turned off and you may need to emulate their setup.)
In an ideal world there would be a way to get your browser to tell you about all currently open or in-flight network connections, both the low level details of where you're connecting to (and how) and the high level details of what protocol the browser is trying to speak over the connection, what web request it's trying to satisfy, and so on. Firefox has about:networking, but generally this gives only low-level details in a useful form. Chrome can capture and export network logs through chrome://net-export, there's also chrome://net-internals (but it didn't do much for me), and maybe there are other things lurking in chrome://chrome-urls.
(In Firefox, for example, I can see that Firefox is holding open a HTTPS connection to an AWS IP and periodically doing stuff with it, and tcpdump confirms this, but about:networking won't tell me what host name this is for or what web page it's associated with. This is probably some Mozilla internal service, but finding out that it might be 'push.services.mozilla.com' took an absurd amount of work.)
(All of this was sparked by an issue I incorrectly blamed on HTTP/3, which led me to the Cloudflare blog on how to test HTTP/3, which taught me about the Web Developer Tools trick for the protocol.)
PS: Firefox will at least let you get a global view of (new) network activity that it knows about, via the Network tab of the "Browser Toolbox" (Ctrl+Shift+Alt+I). You want to pick 'Multiprocess (slower)'. I believe this will also let you temporarily disable the cache globally, across all windows and tabs.