Browsers can't feasibly stop web pages from talking to private (local) IP addresses
[...] The major browsers I've tested — Safari, Chrome, Firefox — all allow web pages to send requests not only to localhost but also to any IP address on your Local Area Network! Can you believe that? I'm both astonished and horrified.
(Johnson mostly means things with private IP addresses, which is the only sense of 'on your local and private network' that can be usefully determined.)
This is a tempting and natural viewpoint, but unfortunately this can't be done in practice without breaking things. To understand this, I'll outline a series of approaches and then explain why they fail or cause problems.
To start with, a browser can't refuse to connect to private IP addresses unless the URL was typed in the URL bar because there are plenty of organizations that use private IP addresses for their internal web sites. Their websites may link to each other, load resources from each other, put each other in iframes, and in general do anything you don't want an outside website to do to your local network, and it is far too late to tell everyone that they can't do this all of a sudden.
It's not sufficient for a browser to just block access by explicit IP address, to stop web pages from poking URLs like 'http://192.168.10.10/...'. If you control a domain name, you can give hosts in that domain arbitrary IP addresses, including private IP addresses and 127.0.0.1. Some DNS resolvers will screen these out except for 'internal' domains where you've pre-approved them, but a browser can't assume that it's always going to be behind such a DNS resolver.
(Nor can the browser implement such a resolver itself, because it doesn't know what the valid internal domains even are.)
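To make the problem concrete, here's a minimal Python sketch (my own illustration, not any browser's actual code) of why such a filter has to run on the addresses DNS actually returns, not on the URL text:

```python
import ipaddress
import socket

def is_private_ip(ip_text):
    # 'is_private' covers the RFC 1918 ranges plus link-local space;
    # checking loopback as well catches 127.0.0.1 explicitly.
    addr = ipaddress.ip_address(ip_text)
    return addr.is_private or addr.is_loopback

# Blocking literal private IPs written into URLs is the easy part...
print(is_private_ip("192.168.10.10"))  # True
print(is_private_ip("127.0.0.1"))      # True
print(is_private_ip("8.8.8.8"))        # False

def resolves_to_private(hostname):
    # ...but a hostname like 'nasty.evil.com' can simply have an A record
    # of 192.168.10.10, so the only meaningful check is on the resolved
    # addresses (requires live DNS, so not exercised here).
    return any(
        is_private_ip(info[4][0])
        for info in socket.getaddrinfo(hostname, 80)
    )
```

The point of the sketch is that the URL itself carries no information about where it will land; only resolution tells you that, and the browser can't trust the resolver to have filtered anything.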
To avoid this sort of DNS injection, let's say that the browser will only accept private IP addresses if they're the result of looking up hosts in top level domains that don't actually exist. If the browser looks up 'nasty.evil.com' and gets a private IP address, it's discarded; the browser only accepts it if it comes from 'good.nosuchtld'. Unfortunately for this idea, various organizations like to put their internal web sites into private subdomains under their normal domain name, like '<host>.corp.us.com' or '<host>.internal.whoever.net'. Among other reasons to do this, this avoids problems when your private top level domain turns into a real top level domain.
So let's use a security zone model. The browser will divide websites and URLs into 'inside' and 'outside' zones, based on what IP address the URL is loaded from (something that the browser necessarily knows at the time it fetches the contents). An 'inside' page or resource may refer to outside things and include outside links, but an outside page or resource cannot do this with inside resources; going outside is a one-way gate. This looks like it will keep internal organizational websites on private IP addresses working, no matter what DNS names they use. (Let's generously assume that the browser manages to get all of this right and there are no tricky cases that slip by.)
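As a sketch of the one-way gate (again my own illustrative code, not anything a browser ships), the zone policy boils down to something like:

```python
import ipaddress

def zone(ip_text):
    # Classify by the IP address the content was actually fetched from,
    # which the browser necessarily knows at fetch time.
    ip = ipaddress.ip_address(ip_text)
    return "inside" if (ip.is_private or ip.is_loopback) else "outside"

def fetch_allowed(page_ip, resource_ip):
    # Inside pages may reference anything; outside pages may not pull in
    # inside resources. Going outside is a one-way gate.
    return not (zone(page_ip) == "outside" and zone(resource_ip) == "inside")

print(fetch_allowed("10.0.0.5", "93.184.216.34"))  # inside -> outside: True
print(fetch_allowed("93.184.216.34", "10.0.0.5"))  # outside -> inside: False
```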
Unfortunately this isn't sufficient to keep places like us working. We have a 'split horizon' DNS setup, where the same DNS name resolves to different IP addresses depending on whether you're inside or outside our network perimeter, and we also have a number of public websites that actually live in private IP address space but that are NAT'd to public IPs by our external firewall. These websites are publicly accessible, get linked to by outside things, and may even have their resources loaded by outside public websites, but if you're inside our network perimeter and you look up their name, you get a private IP address and you have to use this IP address to talk to them. This is exactly an 'outside' host referring to an 'inside' resource, which would be blocked by the security zone model.
If browsers were starting from scratch today, there would probably be a lot of things done differently (hopefully more securely). But they aren't, and so we're pretty much stuck with this situation.
Straightforward web applications are now very likely to be stable in browsers
In response to my entry on how our goals for our web application are to not have to touch it, Ross Hartshorn left a comment noting:
Hi! Nice post, and I sympathize. However, I can't help thinking that, for web apps in particular, it is risky to have the idea of software you don't have to touch anymore (except for security updates). The browsers which are used to access it also change. [...]
I don't think these are one-off changes, I think it's part of a general trend. If it's software that runs on your computer, you can just leave it be. If it's a web app, a big part of it is running on someone else's computer, using their web browser (a piece of software you don't control). You will need to update it from time to time. [...]
This is definitely true in a general, abstract sense, and in the past it has been true in a concrete sense, in that some old web applications could break over time due to the evolution of browsers. However, this hasn't really been an issue for simple web applications (ones just based around straight HTML forms), and these days I think that even straightforward web applications are going to be stable over browser evolution.
The reality of the web is that there is a huge overhang of old straightforward HTML, and there has been for some time; in fact, for a long time now, at any given point in time most of the HTML in existence is 'old' to some degree. Browsers go to great effort to not break this HTML, for the obvious reason, and so any web application built around basic HTML, basic forms, and the like has been stable (in browsers) for a long time now. The same is true for basic CSS, which has long since stopped being in flux and full of quirks. If you stick to HTML and CSS that is at least, say, five years old, everything just works. And you can do a great deal with that level of HTML and CSS.
(One exhibit for this browser stability is DWiki, the very software behind this blog, which has HTML and CSS that mostly fossilized more than a decade ago. This includes the HTML form for leaving comments.)
(This is certainly our experience with our web application.)
Another way to put this is that the web has always had some stable core, and this stable core has steadily expanded over time. For some time now, that stable core has been big enough to build straightforward web applications. It's extremely unlikely that future browsers will roll back very much of this stable core, if anything; it would be very disruptive and unpopular.
(You don't have to build straightforward web applications using the stable core; you can make your life as complicated as you want to. But you're probably not going to do that if you want an app that you can stop paying much attention to.)
Firefox and my views on the tradeoffs of using DNS over HTTPS
For those who have not heard, Mozilla is (still) planning to have Firefox support and likely default to resolving DNS names through DNS over HTTPS using Cloudflare's DoH server (see eg this news article). The alternate, more scary way of putting this is that Mozilla is planning to send all of your DNS lookups (well, for web browsing) to Cloudflare, instead of your own ISP or your own DNS server. People have mixed feelings about Cloudflare, and beyond that issue and the issue of privacy from Cloudflare itself, there is the fact that Cloudflare is a US company, subject to demands by the US government, and the Cloudflare DoH server you wind up using may not be located in your country and thus not covered by laws and regulations that your ISP's DNS service is possibly subject to (such as Europe's GDPR).
Combining this with the fact that today, your large ISP is one of your threats creates a bunch of unhappy tradeoffs for Mozilla in deploying DNS over HTTPS in Firefox. On the one hand, some or many people are being intruded on today with ISP surveillance and even ISP tampering with DNS results, and these people will have their lives improved by switching to DoH from a trustworthy provider. On the other hand, some people will be exposed to additional risks they did not already have by a switch to DoH with Cloudflare, and even for people who were already being intruded on by their ISP, the risks are different.
Pragmatically, it seems likely that turning on DoH by default in Firefox will improve the situation with DNS snooping for many people. Mozilla has a contract with Cloudflare about DNS privacy, which is more than you have with your ISP (for typical people), and Cloudflare's POPs are widely distributed around the world and so are probably in most people's countries (making them at least partially subject to your laws and regulations). I suspect that Mozilla will be making this argument both internally and externally as the rollout approaches, along with 'you can opt out if you want to'.
However, some number of people are not having their DNS queries snooped today, and even when people are having them intruded on, that intrusion is spread widely across the ISP industry world wide instead of concentrated in one single place (Cloudflare). The currently un-snooped definitely have their situation made worse by having their DNS queries sent to Cloudflare, even if the risk of something bad happening is probably low. As for the distributed definite snooping versus centralized possible snooping argument, I don't have any answer. They're both bad, and we don't and can't know whether or not the latter will happen.
I don't pretend to know what Mozilla should do here. I'm not even sure there is a right answer. None of the choices make me happy, nor does the thought that DoH to Cloudflare by default is probably the pragmatically least harmful option overall, the choice that does the most good for the most people even though it harms some people.
To put it another way, I don't think there's any choice that Mozilla can make here that doesn't harm some people through either action or inaction.
(This sort of elaborates on some tweets of mine.)
Feed readers and their interpretation of the Atom 'title' element
My entry yesterday had the title of The HTML <pre> element doesn't do very much, which as you'll notice has an HTML element named in plain text in the title. In the wake of posting the entry, I had a couple of people tell me that their feed reader didn't render the title of my entry correctly, generally silently omitting the '<pre>' (there was a comment on the entry and a report on Twitter). Ironically, this is also what happened in Liferea, my usual feed reader, although that is a known Liferea issue. However, other feed readers display it correctly, such as The Old Reader (on their website) and Newsblur (in the iOS client).
(I read my feed in a surprising variety of syndication feed readers, for various reasons.)
As far as I can tell, my Atom feed is correct. The raw text of my Atom feed for the Atom <title> element is:
<title type="html">The HTML &amp;lt;pre> element doesn't do very much</title>
The Atom RFC (RFC 4287) says about this case: If the value of "type" is "html", the content of the Text construct MUST NOT contain child elements and SHOULD be suitable for handling as HTML. Any markup within MUST be escaped; for example, "<br>" as "&lt;br>".
The plain text '<pre>' in my title is encoded in the feed as '&amp;lt;pre>'. Decoded once from Atom-encoded text to HTML, this gives us '&lt;pre>', which is not HTML markup but an encoded plain-text '<pre>' with the starting '<' escaped (as it is rendered repeatedly in the raw HTML of this entry and yesterday's).
(My Atom syndication feed generation encodes '>' to '&gt;' in an excess of caution; as we see from the RFC, it is not strictly required.)
Despite that, many syndication feed readers appear to be doing something wrong. I was going to say that I could imagine several options, but after thinking about it more, I can't really. I know that Liferea's issue apparently at least starts with decoding the 'type="html"' title twice instead of once, but I'm not sure if it then decides to try to strip markup from the result (which would strip out the '<pre>' that the excess decoding has materialized) or if it passes the result to something that renders HTML and so silently swallows the un-closed <pre>. I can imagine a syndication feed reader that correctly decodes the <title> once, but then passes it to a display widget that expects encoded HTML instead of straight HTML. An alternative is that the display widget only accepts plain text and the feed reader made a mistake in the process of trying to transform HTML to plain text, decoding entities before removing HTML tags instead of the other way around.
(Decoding things more times than you should can be a hard mistake to spot. Often the extra decoding has no effect on most text.)
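As an illustration of how easily this goes wrong, here's a Python sketch (using a crude regexp tag-stripper purely for demonstration) of the correct single decode versus the double decode plus markup stripping that produces the symptom people reported:

```python
import html
import re

# The HTML fragment a feed reader holds after XML parsing has decoded
# the <title type="html"> element once.
fragment = "The HTML &lt;pre> element doesn't do very much"

# Correct: treat the fragment as HTML. '&lt;' is a literal '<', so the
# displayed title keeps its '<pre>'.
correct = html.unescape(fragment)
print(correct)  # The HTML <pre> element doesn't do very much

# Buggy: decode the entities a second time *before* stripping what looks
# like markup. The materialized '<pre>' now looks like a real tag and is
# silently swallowed.
def strip_tags(s):
    return re.sub(r"<[^>]+>", "", s)  # crude, for illustration only

buggy = strip_tags(html.unescape(fragment))
print(buggy)  # The HTML  element doesn't do very much
```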
Since some syndication feed readers get it right and some get it wrong, I'm not sure there's anything I can do to fix this in my feed. I've used an awkward workaround in the title of this entry so that it will be clear even in feed readers, but otherwise I'm probably going to keep on using HTML element names and other awkward things in my titles every so often.
(My titles even contain markup from time to time, which is valid in Atom feeds but which gives various syndication feed readers some degree of heartburn. Usually the markup is setting things in 'monospace', eg here, although every once in a while it includes links.)
The HTML <pre> element doesn't do very much
These days I don't do too much with HTML, so every so often I wind up in a situation where I have to reach back and reconstruct things that once were entirely well known to me. Today, I wound up talking with someone about the <pre> element and what you could and couldn't safely put in it, and it took some time to remember most of the details.
The simple version is that <pre> doesn't escape markup, it only changes formatting, although many simple examples you'll see only use it on plain text so it's not immediately clear. Although it would be nice if <pre> was a general container that you could pour almost arbitrary text into and have it escaped, it's not. If you're writing HTML by hand and you have something to put into a <pre>, you need to escape any markup and HTML entities (much like a <textarea>, although even more so). Alternately, you can actually use this to write <pre> blocks that contain markup, for example links or text emphasis (you might deliberately use bold inside a <pre> to denote generic placeholders that the reader fills in with their specifics).
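For instance, here's a sketch in Python of the escaping you'd want before dropping arbitrary text into a <pre> (the command line is a made-up example; html.escape handles the '<', '>', and '&' cases):

```python
import html

# A shell command line with characters that HTML will misinterpret.
command = "sed 's/a/b/' <input.txt >output.txt && grep '<pre>' page.html"

# Escape markup-significant characters before wrapping in <pre>.
# quote=False leaves quotes alone; they're harmless in element content.
block = "<pre>{}</pre>".format(html.escape(command, quote=False))
print(block)
# <pre>sed 's/a/b/' &lt;input.txt &gt;output.txt &amp;&amp; grep '&lt;pre&gt;' page.html</pre>
```

Skipping this step means gambling that the browser's error recovery treats your '<' runs as plain text rather than as a mangled tag.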
As with <textarea>, it's easy to overlook this for straightforward cases and to get away without doing any text escaping, especially in modern browsers. A lot of the command lines or code or whatever that we often put into <pre> don't contain things that can be mistaken for HTML markup or HTML entities, and modern browsers will often silently re-interpret things as plain text for you if they aren't validly formatted entities or markup.
I myself have written and altered any number of <pre> blocks over the past few years without ever thinking about it, and I'm sure that some of them included '<' or '>' and perhaps '&' (for example, as part of Unix command lines).
(The MDN page on <pre> includes an example with unescaped < and >. If you play around with similar cases, you'll probably find that what is rendered intact and what is considered to be an unrecognized HTML element that is silently swallowed is quite sensitive to details of formatting and what is included within the '< ... >' run of raw text. Browsers clearly have a lot of heuristics here, some of which have been captured in HTML5's description of tag open state. In HTML5, anything other than an ASCII alpha after the '<' makes it a non-element (in any context, not just in a <pre>).)
I don't know how browser interpretation of various oddities in <pre> content is affected by the declared or assumed HTML DOCTYPE or HTML version the browser assumes, but I wouldn't count on all of them behaving the same outside, perhaps, of HTML5 mode (which at least has specific rules for this). Of course if you're producing HTML with tools instead of writing it by hand, the tools should take care of this for you. That's the only reason that Wandering Thoughts has whatever HTML correctness it does; my DWikiText to HTML rendering code takes care of it all for me, <pre> blocks included.
I'll start with my toot, slightly shorn of context:
Every so often I wind up viewing a version of the web that isn't filtered by uBlock Origin and my 'allow basically no JS' settings (in my default browser) and oh ow ow ow.
(But 'allow no JS' is basically the crazy person setting and it's only tolerable because I keep a second browser just for JS-required sites. Which throws away all my cookies & stuff every time it shuts down, because my trust is very low once JS is in the picture)
Having two browsers is reasonably easy (provided that you're willing to use both Chrome and Firefox; these days I instead have two instances of Firefox). Arranging to move URLs and links easily back and forth between them is probably beyond most people in most desktop environments. I'm the kind of person who writes scripts and runs a custom window manager environment, so I can blithely describe this as 'not too much work (for me)'.
(You can always select a link in one browser and do 'Copy link location', then start the other browser and paste it into the URL bar. But this is not a fast and fluid approach.)
Firefox versus Chrome (my 2019 view)
On Twitter, I said:
I continue to believe that Firefox is your best browser option, despite the addons screwup. Mozilla at least tries to be good (and usually is), while Chrome is straight up one tentacle of the giant, privacy invading, advertising company giant vampire squid of Google.
I'm sure there are plenty of good, passionate, well-intended people who work on Chrome, and they care a lot about privacy, user choice, and so on. But existing within the giant vampire squid of Google drastically constrains and distorts what outcomes they can possibly obtain.
Mozilla is absolutely not perfect; they have committed technical screwups, made decisions in the aftermath of that that I feel are wrong, and especially they've made trust-betraying policy decisions, which are the worst problem because they infect everything. But fundamentally, Mozilla is trying to be good and I do believe that it still has a general organizational culture that supports that.
Chrome and the people behind it absolutely can do good, especially when they take advantage of their position as a very popular browser to drive beneficial changes. That Chrome is strongly committed to Certificate Transparency is one big reason that it's moving forward, for example, and I have hopes that their recently announced future cookie changes will be a net positive. But Chrome is a mask that Google wears, and regardless of what Google says, it's not interested in either privacy or user choice that threatens its business models. Every so often, this shows through Chrome development in an obvious way, but I have to assume that for everything we see, there are hundreds of less visible decisions and influences that we don't. And then there's Google's corporate tactics (alternate).
Much as in my choice of phones and tablets, I know which side of this I come down on when the dust settles. And I'm sticking with that side, even if there are some drawbacks and some screwups every so often, and some things that make me unhappy.
(At one point I thought that the potential for greater scrutiny of Google's activities with Chrome might restrain Google sufficiently in practice. I can no longer believe this, partly because of what got me to walk away from Chrome. Unless the PR and legal environment gets much harsher for Google, I don't think this is going to be any real restraint; Google will just assume that it can get away with whatever it wants to do, and mostly it will be right.)
Some weird and dubious syndication feed fetching from SBL-listed IPs
For reasons beyond the scope of this entry (partly 'because I could'), I've recently been checking to see if any of the IPs that visit Wandering Thoughts are on the Spamhaus SBL. As a preemptive note, using the SBL to block web access is not necessarily a good idea, as I've found out in the past; it's specifically focused on email, not any other sorts of abuse. However, perhaps you don't want to accept web traffic from networks that Spamhaus has identified as belonging to spammers, and Spamhaus also has the Don't Route Or Peer list (which is included in the SBL), of outright extremely bad networks.
When I started looking, I wasn't particularly surprised to find a fair number of IPs on Spamhaus CSS; in practice, the CSS seems to include a fair number of compromised IPs and doesn't necessarily expire them rapidly. However, I also found a surprising number of IPs listed in other Spamhaus records, almost always for network blocks; from today (so far), I had IPs from SBL443160 (a /22), SBL287739 (a /20 for a ROKSO-listed spammer), and especially SBL201196, which is a /19 on an extended version of Spamhaus's DROP list. These are all pretty much dedicated spam operations, not things that have been compromised or neglected, and as such I feel that they're worth blacklisting entirely.
Then I looked at what the particular IPs from these SBL listings were doing here on Wandering Thoughts, and something really peculiar started emerging. Almost all of the IPs were just fetching my syndication feed, using something that claims to be "rss2email/3.9 (https://github.com/wking/rss2email)" in its User-Agent. Most of them are making a single fetch request a day (often only one in several days), and on top of that I noticed that they often got a HTTP 304 'Not Modified' reply. Further investigation has shown that this is a real and proper 'Not Modified', based on these requests having an If-None-Match header with the syndication feed's current ETag value (since this is a cryptographic hash, they definitely fetched the feed before). Given that these IPs are each only requesting my feed once every several days (at most), their having the correct ETag value means that the people behind this are fetching my feed from multiple IPs across multiple networks and merging the results.
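For reference, the conditional fetch these clients are making looks roughly like this (a sketch; the feed URL is a placeholder and real rss2email's internals may well differ):

```python
import urllib.request

FEED_URL = "https://example.com/feed.atom"  # placeholder, not a real feed URL

def make_conditional_request(url, etag=None):
    # A client that remembers the ETag from its last successful fetch sends
    # it back in If-None-Match; if the feed is unchanged, the server answers
    # with a bodyless 304 Not Modified instead of the full feed.
    headers = {"User-Agent": "rss2email/3.9 (https://github.com/wking/rss2email)"}
    req = urllib.request.Request(url, headers=headers)
    if etag is not None:
        req.add_header("If-None-Match", etag)
    return req

req = make_conditional_request(FEED_URL, etag='"deadbeef"')
# urllib normalizes stored header names to 'If-none-match' capitalization.
print(req.get_header("If-none-match"))  # "deadbeef"
```

Since the ETag is derived from the feed content, a client presenting the current value must have successfully fetched the feed at some point, from some IP; that's what makes the shared-ETag pattern across unrelated networks so telling.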
(I haven't looked deeply at the activity of the much more numerous SBL CSS listed IPs, but in spot checks some IPs appear to be entirely legitimate real browsers from real people, people who just have the misfortune to have or have inherited a CSS-listed IP.)
Before I started looking, I would have expected the activity from these bad network blocks to be comment spam attempts (which is part of what has attracted my attention to SBL-listed networks in the past). Instead I can't see any real traces of that; in fact, in the past ten days only one SBL listed IP has come close to trying to leave a comment here, and that was a CSS listing. Instead they seem to be harvesting my syndication feed, for an unknown purpose, and this harvesting appears to be done by some group that is active across multiple and otherwise unrelated bad network blocks.
(Since SBL listings are about email spammers, the obvious speculation here is that these people are scanning syndication feeds to find email addresses for spam purposes. This is definitely a thing in general, so it's possible.)
As a side note, this rss2email User-Agent is actually pretty common here (and right now it's the latest release of the actual project). Only a small fraction of the IPs using it are on the SBL; most of them are real, legitimate feed fetchers. Although I do have a surprisingly large number of IPs using rss2email that only fetched my syndication feed once today and still got a 304 Not Modified (which, in some cases, definitely means that they fetched it earlier from some other IP). Some of those one time fetchers turn out to have been doing this sporadically for some time. It's possible that these SBL-hosted fetchers are actually using rss2email, and now that I think about it I can see a reason why. If you already have an infrastructure for harvesting email addresses from email messages and want to extend it to syndication feeds, turning syndication feeds into email is one obvious and simple approach.
(I think the real moral here is to not turn over rocks because, as usual, disturbing things can be found there.)
The appeal of using plain HTML pages
Once upon a time our local support site was a wiki, for all of the reasons that people make support sites and other things into wikis. Then using a wiki blew up in our faces. You might reasonably expect that we replaced it with a more modern CMS, or perhaps a static site generator of some sort (using either HTML or Markdown for content and some suitable theme for uniform styling). After all, it's a number of interlinked pages that need a consistent style and consistent navigation, which is theoretically a natural fit for any of those.
In practice, we did none of those; instead, our current support site is that most basic thing, a bunch of static .html files sitting in a filesystem (and a static file of CSS). When we need to, we edit the files with vi, and there's no deployment or rebuild process.

(If we don't want to edit the live version, we make a copy of the .html file to a scratch name and edit the copy, then move it back into place when done.)
This isn't a solution that works for everyone. But for us at our modest scale, it's been really very simple to work with. We all already know how to edit files and how to write basic HTML, so there's been nothing to learn or to remember about managing or updating the support site (well, you have to remember where its files are, but that's pretty straightforward). Static HTML files require no maintenance to keep a wiki or a CMS or a generator program going; they just sit there until you change them again. And everything can handle them.
I'm normally someone who's attracted to ideas like writing in a markup language instead of raw HTML and having some kind of templated, dynamic system (whether it's a wiki, a CMS, or a themed static site generator), as you can tell from Wandering Thoughts and DWiki itself. I still think that they make sense at large scale. But at small scale, if I was doing a handful of HTML pages today, it would be quite tempting to skip all of the complexity and just write the .html files directly.

(I'd use a standard HTML layout and structure for all the files, with CSS to match.)
(This thought is sort of sparked by a question by Pete Zaitcev over on the Fediverse, and then reflecting on our experiences maintaining our support site since we converted it to HTML. In practice I'm probably more likely to update the site now than I was when it was a wiki.)
Private browsing mode versus a browser set to keep nothing on exit
These days, apparently a steadily increasing variety of websites are refusing to let you visit their site if you're in private browsing or incognito mode. These websites are advertising that their business model is invading your privacy (not that that's news), but what I find interesting is that these sites don't react when I visit them in a Firefox that has a custom history setting of 'clear history when Firefox closes'. As far as I can tell this still purges cookies and other website traces as effectively as private browsing mode does, and it has the side benefit for me that Firefox is willing to remember website logins.
(I discovered this difference between the two modes in the aftermath of moving away from Chrome.)
So, this is where I say that everyone should do this instead of using private browsing mode? No, not at all. To be bluntly honest, my solution is barely usable for me, never mind someone who isn't completely familiar with Firefox profiles and capable of wiring up a complex environment that makes it relatively easy to open a URL in a particular profile. Unfortunately Firefox profiles are not particularly usable, so much so that Firefox had to invent an entire additional concept (container tabs) in order to get a reasonably approachable version.
(Plus, of course, Private Browsing/Incognito is effectively a special purpose profile. It's so successful in large part because browsers have worked hard to make it extremely accessible.)
Firefox stores and tracks cookies (and presumably local storage) on a per-container basis, for obvious reasons, but apparently doesn't have per-container settings for how long they last or when they get purged. Your browsing history is global; history entries are not tagged with what container they're from. Mozilla's Firefox Multi-Account Containers addon looks like it makes containers more flexible and usable, but I don't think it changes how cookies work here, unfortunately; if you keep cookies in general, you keep them for all containers.
I don't think you can see what container a given cookie comes from through Firefox's normal Preferences stuff, but you can with addons like Cookie Quick Manager. Interestingly, it turns out that Cookie AutoDelete can be set to be container aware, with different rules for different containers. Although I haven't tried to do this, I suspect that you could set CAD so that your 'default' container (ie your normal Firefox session) kept cookies but you had another container that always threw them away, and then set Multi-Account Containers so that selected annoying websites always opened in that special 'CAD throws away all cookies' container.
(As covered in the Cookie AutoDelete wiki, CAD can't selectively remove Firefox localstorage for a site in only some containers; it's all or nothing. If you've set up a pseudo-private mode container for some websites, you probably don't care about this. It may even be a feature that any localstorage they snuck onto you in another container gets thrown away.)