2025-01-03
The programmable web browser was and is inevitable
In a comment on my entry about how the modern web is why web browsers don't have nice things, superkuh wrote in part:
In the past it was seen as crazy to open every executable file someone might send you over the internet (be it email, ftp, web, or whatever). But sometime in the 2010s it became not only acceptable, but standard practice to automatically run every executable sent to you by any random endpoint on the internet.
For 'every executable' you should read 'every piece of JavaScript', which is executable code that is run by your browser as a free and relatively unlimited service provided to every web page you visit. The dominant thing restraining the executables that web pages send you is the limited APIs that browsers provide, which is why they provide such limited APIs. This comment sparked a chain of thoughts that led to a thesis.
I believe that the programmable web browser was (and is) inevitable. I don't mean this just in the narrow sense that if it hadn't been JavaScript it would have been Flash or Java applets or Lua or WASM or some other relatively general purpose language that the browser wound up providing. Instead, I mean it in a broad and general sense, because 'programmability' of the browser is driven by a general and real problem.
For almost as long as the web has existed, people have wanted to create web pages that had relatively complex features and interactions. They had excellent reasons for this; they wanted drop-down or fold-out menus to save screen space so that they could maximize the amount of space given to important stuff instead of navigation, and they wanted to interactively validate form contents before submission for fast feedback to the people filling them in, and so on. At the same time, browser developers didn't want to (and couldn't) program every single specific complex feature that web page authors wanted, complete with bespoke HTML markup for it and so on. To enable as many of these complex features as possible with as little work on their part as possible, browser developers created primitives that could be assembled together to create more sophisticated features, interactions, layouts, and so on.
When you have a collection of primitives that people are expected to use to create their specific features, interactions, and so on, you have a programming language and a programming environment. It doesn't really matter if this programming language is entirely declarative (and isn't necessarily Turing complete), as in the case of CSS; people have to program the web browser to get what they want.
So my view is that we were always going to wind up with at least one programming language in our web browsers, because a programming language is the meeting point between what web page authors want to have and what browser developers want to provide. The only question was (and is) how good of a programming language (or languages) we were going to get. Or perhaps an additional question was whether the people designing the 'programming language' were going to realize that they were doing so, or if they were going to create one through an accretion of features.
(My view is that CSS absolutely is a programming language in this sense, in that you must design and 'program' it in order to achieve the effects you want, especially if you want sophisticated ones like drop down menus. Modern CSS has thankfully moved beyond the days when I called it an assembly language.)
(This elaborates on a Fediverse post.)
2025-01-01
The modern web is why web browsers don't have "nice things" (platform APIs)
Every so often I read something that says or suggests that the big combined browser and platform vendors (Google, Apple, and to a lesser extent Microsoft) have deliberately limited their browser's access to platform APIs that would put "progressive web applications" on par with native applications. While I don't necessarily want to say that these vendors are without sin, in my view this vastly misses the core reason web browsers have limited and slow moving access to platform APIs. To put it simply, it's because of what the modern web has turned into, namely "a hive of scum and villainy" to sort of quote a famous movie.
Any API the browser exposes to web pages is guaranteed to be used by bad actors, and this has been true for a long time. Bad actors will use these APIs to track people, to (try to) compromise their systems, to spy on them, or basically for anything that can make money or gain information. Many years ago I said this was why native applications weren't doomed and basically nothing has changed since then. In particular, browsers are no better at designing APIs that can't be abused or blocking web pages that abuse these APIs, and they probably never will be.
(One of the problems is the usual one in security; there are a lot more attackers than there are browser developers designing APIs, and the attackers only have to find one oversight or vulnerability. In effect attackers are endlessly ingenious while browser API designers have finite time they can spend if they want to ship anything.)
The result of this is that announcements of new browser APIs are greeted not with joy but with dread, because in practice they will mostly be yet another privacy exposure and threat vector (Chrome will often ship these APIs anyway because, as its actions demonstrate, Google mostly doesn't care). Certainly there are some web sites and in-browser applications that will use them well, but generally they'll be vastly outnumbered by attackers that are exploiting these APIs. Browser vendors (even Google with Chrome) are well aware of these issues, which is part of why they create and ship so few APIs and often don't give them very much power.
(Even native APIs are increasingly restricted, especially on mobile devices, because there are similar issues on those. Every operating system vendor is more and more conscious of security issues and the exposures that are created for malicious applications.)
You might be tempted to say that the answer is forcing web pages to ask for permission to use these APIs. This is a terrible idea for at least two reasons. The first reason is alert (or question) fatigue; at a certain point this becomes overwhelming and people stop paying attention. The second reason is that people generally want to use websites that they're visiting, and if faced with a choice between denying a permission and being unable to use the website or granting the permission and being able to use the website, they will take the second choice a lot of the time.
(We can see both issues in effect in mobile applications, which have similar permissions requests and create similar permissions fatigue. And mobile applications ask for permissions far less often than web pages would, because most people visit a lot more web pages than they install applications.)
2024-12-19
Short lived TLS certificates and graceful rollover in web servers
One of the bits of recent TLS news is that Let's Encrypt is going to start offering 6-day TLS certificates. One of the things that strikes me about this is that various software, web servers included, may finally be motivated to handle changed TLS certificates in a better way than is common today, because TLS certificates will be changing much more frequently.
A lot of programs that use TLS certificates, web servers included, have historically not actually 'handled' changing TLS certificates as such, and many still don't. Instead, they loaded TLS certificates on startup, and to change the TLS certificates you either restarted them entirely or notified them to 'reload' everything. A general restart or reload often has side effects; ongoing connections (for things like WebSockets) might get closed, requests might be abruptly ended, and during a restart some requests would get 'connection refused' results. Beyond this, even a reload is traditionally a global thing, where more or less your entire configuration is updated, not just TLS certificates. If an error or a significant change has snuck into the web server configuration but been latent without a restart or a reload, your TLS certificate rollover is about to activate it.
(This also applies to changes you're in the middle of doing. At the moment, TLS certificate renewal is so infrequent that you can basically ignore the possibility that it will be triggered while you're doing some other work on your web server configuration. In an environment where TLS certificates roll over every few days and your TLS certificate renewal automation may well run every few hours, this is perhaps not so unlikely any more.)
My impression is that web servers have generally handled TLS certificates this way because it was the easiest option. They didn't have automatic hot reloading of TLS certificates any more than they had automatic hot reloading of anything else, nor did they have fine grained manual reloading of specific elements of the configuration. The operation people wanted almost all of the time was either 'restart the server' or at least 'reload all of the configuration', and it happened infrequently enough that you mostly didn't worry about the side effects of this.
(If you were running a web server environment that was big enough to care, you built or at least ran special software to gracefully put redundant web servers in and out of service. Such software might support on-the-fly switching of TLS certificates without interruptions.)
In my view, automatic hot reloading of TLS certificates isn't ideal; since TLS certificates for web servers typically involve multiple files, there are some tricky issues involved. Instead, what I hope web servers add is specific on-command reloading of some or all TLS certificates, in the same way that many DNS servers can be told to reload a specific DNS zone. This would allow TLS certificate rollovers to make only narrow, tightly scoped changes on web servers, hopefully with little or no interruption to their activities.
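To make that concrete, here is a minimal sketch (in TypeScript, using Node's standard https module) of what on-command reloading of just the TLS material could look like. The SIGHUP trigger and the file paths are assumptions for the example, not how any particular web server actually does or should do it.

```typescript
// Sketch: reload only the TLS certificate and key on demand (here, on SIGHUP),
// without restarting the server or re-reading any other configuration.
import * as fs from "fs";
import * as https from "https";

const CERT_PATH = "/etc/tls/example.crt"; // hypothetical paths
const KEY_PATH = "/etc/tls/example.key";

function loadTlsMaterial() {
  return {
    cert: fs.readFileSync(CERT_PATH),
    key: fs.readFileSync(KEY_PATH),
  };
}

const server = https.createServer(loadTlsMaterial(), (req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("hello\n");
});

// On SIGHUP, swap in the new certificate and key for future TLS handshakes.
// Existing connections keep running and nothing else about the server changes.
process.on("SIGHUP", () => {
  try {
    server.setSecureContext(loadTlsMaterial());
  } catch (err) {
    console.error("certificate reload failed, keeping the old certificate:", err);
  }
});

server.listen(8443);
```

The point of the sketch is that only the certificate and key change; a latent error elsewhere in the configuration stays latent instead of being activated by the rollover.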
2024-12-18
Browser feed reader addons don't seem to do very well on caching
Over on the Fediverse, I said something disappointing:
Browser addons for syndication feed fetching appear to be the absolute worst for frequent feed fetching and ignoring everything the server says about this. They'll ignore Cache-Control hints for Atom syndication feeds, ignore HTTP 429 errors, ignore any retry timing in said headers (not surprising), and keep trying every few minutes. I am sorely disappointed.
(Or at least I assume this is from addons, based on the user-agent strings.)
It's dangerous to assume too much from HTTP user agent strings in this day and age, but many of the user-agent strings that exhibit this behavior are plausible browser ones, often for current versions of the browser, and they often come from what appear to be 'end user' IP addresses, instead of things like cloud server IPs. Firefox is the dominant browser represented in these user-agent strings, although Chrome and Safari also show up; however, there are lots of possible explanations for this, including that perhaps RSS addons are more popular in Firefox than in other browsers.
(If I was an energetic person like rachelbythebay I would try out a bunch of feed reader addons for Firefox to try to identify the flawed ones. I'm not that energetic.)
You'd certainly hope that browser feed reader addons would benefit from general browser cache management and so on, but apparently not very much. Some addons appear to be at least managing conditional requests even if they don't respect feed fetching timing information exposed in Cache-Control headers, but other sources don't even manage that, and will hammer Wandering Thoughts with unconditional GET requests every few minutes. I don't think any of them react to HTTP 429 responses, or at least if they do it's drowned out by all of the ones that clearly don't (some of them have been getting 429s for an extended length of time and are still showing up every few minutes).
I don't know to what extent this is simply coding decisions in the addons and to what extent it's that browser APIs don't necessarily make it easy to do the right thing. However, it appears that the modern fetch() API defaults to respecting 'Cache-Control: max-age=...' information, although perhaps addon authors are forcing either the no-cache or the reload cache mode. If I understand things right, the no-cache mode would create the constant flood of conditional GET requests, while the 'reload' mode would create the constant unconditional GET requests.
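To illustrate the difference, here is a sketch of the fetch() cache modes in question; this is not code from any actual addon, and the feed URL is made up.

```typescript
const feedUrl = "https://example.org/blog/atom.xml"; // hypothetical feed URL

async function pollFeed() {
  // Default cache mode: the browser's HTTP cache is used, so a copy that is
  // still fresh (within Cache-Control: max-age) is reused without any network request.
  await fetch(feedUrl);

  // 'no-cache': always revalidate with the server, which shows up in server
  // logs as a conditional GET (If-None-Match / If-Modified-Since) on every poll.
  await fetch(feedUrl, { cache: "no-cache" });

  // 'reload': bypass the cache entirely, which shows up as an unconditional
  // GET on every poll regardless of what cache headers the server sent before.
  await fetch(feedUrl, { cache: "reload" });
}
```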
(I don't know if there's any browser API support that more or less automatically handles a Retry-After header value on HTTP 429 errors, or if addons would have to do that entirely themselves (which means that they most likely don't).)
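If an addon did want to handle HTTP 429s and Retry-After itself, the basic logic is short. Here is a hedged sketch that only handles the plain-seconds form of Retry-After and makes up its own default polling interval.

```typescript
const DEFAULT_INTERVAL_MS = 60 * 60 * 1000; // assume hourly polling by default

// Returns how long to wait (in milliseconds) before the next fetch attempt.
async function pollOnce(feedUrl: string): Promise<number> {
  const resp = await fetch(feedUrl, { cache: "no-cache" });
  if (resp.status === 429) {
    // Respect Retry-After if it's a plain number of seconds; the HTTP-date
    // form is ignored here to keep the sketch short.
    const retryAfter = Number(resp.headers.get("Retry-After"));
    if (Number.isFinite(retryAfter) && retryAfter > 0) {
      return retryAfter * 1000;
    }
    return DEFAULT_INTERVAL_MS * 2; // no usable header: back off anyway
  }
  return DEFAULT_INTERVAL_MS;
}
```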
PS: It's possible to do this correctly even with very basic tools, such as curl, as covered in MacKenzie's Fetching RSS Feeds Respectfully With curl (via the author emailing me, which was great; it's a nice piece of work).
2024-11-20
Thinking about how to tame the interaction of conditional GET and caching
Due to how I do caching, Wandering Thoughts has a long-standing weird HTTP behavioral quirk where a non-conditional GET for a syndication feed here can get a different answer than a conditional GET. One (technical) way to explain this issue is that the cache validity interval for non-conditional GETs is longer than the cache validity interval for conditional GETs. In theory this could be the complete explanation of the issue, but in practice there's another part to it, which is that DWiki doesn't automatically insert responses into the cache on a cache miss.
(The cache is normally only filled for responses that were slow to generate, either due to load or because they're expensive. Otherwise I would rather dynamically generate the latest version of something and not clutter up cache space.)
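To make the quirk concrete, here is a sketch of the decision logic in the abstract. This is not DWiki's actual code, just the shape of the behavior, and the interval values are made up for the example.

```typescript
// Sketch: the same cache entry is considered valid for longer when answering
// an unconditional GET than when answering a conditional GET, and a cache
// miss does not automatically put the freshly generated feed into the cache.
const CONDITIONAL_VALIDITY_MS = 5 * 60 * 1000;     // made-up numbers
const UNCONDITIONAL_VALIDITY_MS = 60 * 60 * 1000;

interface CacheEntry {
  body: string;
  storedAt: number; // milliseconds since the epoch
}

function serveFeed(
  isConditional: boolean,
  cached: CacheEntry | undefined,
  generateFeed: () => string,
): string {
  const validFor = isConditional ? CONDITIONAL_VALIDITY_MS : UNCONDITIONAL_VALIDITY_MS;
  if (cached !== undefined && Date.now() - cached.storedAt < validFor) {
    return cached.body; // same cache entry, different freshness rules
  }
  // Note: the freshly generated feed is not written back into the cache here;
  // that only happens separately, when generation was slow or expensive.
  return generateFeed();
}
```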
There are various paths that I could take, but which ones I want to take depends on what my goals are, and I'm actually not entirely certain about that. If my goal is to serve responses to unconditional GETs that are as fresh as possible but come from cache for as long as possible, what I should probably do is make conditional GETs update the cache when the cached version of the feed exists and would still have been served to an unconditional GET. I've already paid the cost to dynamically generate the feed, so I might as well serve it to unconditional GET requests. However, in my current cache architecture this would have the side effect of causing conditional GETs to get that newly updated cached copy for the conditional GET cache validity period, instead of generating the very latest feed dynamically (which is what happens today).
(A sleazy approach would be to backdate the newly updated cache entry by the conditional GET validity interval. My current code architecture doesn't allow for that, so I can avoid the temptation.)
On the other hand, the entire reason I have a different (and longer) cache validity interval for unconditional GET requests is that in some sense I want to punish them. It's a deliberate feature that unconditional GETs receive stale responses, and in some sense the more stale the response the better. Even though updating the cache with a current response I've already generated is in some sense free, doing it cuts against this goal, both in general and in specific. In practice, Wandering Thoughts sees frequent enough conditional GETs for syndication feeds that making conditional GETs refresh the cached feed would effectively collapse the two cache validity intervals into one, which I can already do without any code changes. So if this is my main goal for cache handling of unconditional GETs of my syndication feed, the current state is probably fine and there's nothing to fix.
(A very approximate number is that about 15% of the syndication feed requests to Wandering Thoughts are unconditional GETs. Some of the offenders should definitely know and do better, such as 'Slackbot 1.0'.)
2024-11-10
Syndication feed fetchers and their behavior on HTTP 429 status responses
For reasons outside of the scope of this entry, recently I've been looking at the behavior of syndication feed fetchers here on Wandering Thoughts (which are generally from syndication feed readers), and in the process I discovered some that were making repeated requests at a quite aggressive rate, such as every five minutes. Until recently there was some excuse for this, because I wasn't setting a 'Cache-Control: max-age=...' header (also), which is (theoretically) used to tell Atom feed fetchers how soon they should re-fetch. I feel there was not much of an excuse because no feed reader should default to fetching every five minutes, or even every fifteen, but after I set my max-age to an hour there definitely should be no excuse.
Since sometimes I get irritated with people like this, I arranged to start replying to such aggressive feed fetchers with a HTTP 429 "Too Many Requests" status response (the actual implementation is a hack because my entire software is more or less stateless, which makes true rate limiting hard). What I was hoping for is that most syndication feed fetching software would take this as a signal to slow down how often it tried to fetch the feed, and I'd see excessive sources move from one attempt every five minutes to (much) slower rates.
That basically didn't happen (perhaps this is no surprise). I'm sure there's good syndication feed fetching software that probably would behave that way on HTTP 429 responses, but whatever syndication feed software was poking me did not react that way. As far as I can tell from casually monitoring web access logs, almost no mis-behaving feed software paid any attention to the fact that it was specifically getting a response that normally means "you're doing this too fast". In some cases, it seems to have caused programs to try to fetch even more than before.
(Perhaps some of this is because I didn't add a 'Retry-After' header to my HTTP 429 responses until just now, but even without that, I'd expect clients to back off on their own, especially after they keep getting 429s when they retry.)
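For what it's worth, the server side of this is simple in outline. Here is a sketch that keeps an in-memory table of last-seen times, which is exactly the kind of state my actual, more or less stateless setup doesn't have; the one-hour minimum interval is just an assumption for the example.

```typescript
// Sketch: answer overly frequent feed fetchers with 429 plus Retry-After.
import * as http from "http";

const MIN_INTERVAL_MS = 60 * 60 * 1000;       // ask for at most one fetch an hour
const lastSeen = new Map<string, number>();   // keyed by client IP, in memory only

const server = http.createServer((req, res) => {
  const ip = req.socket.remoteAddress ?? "unknown";
  const now = Date.now();
  const previous = lastSeen.get(ip);
  lastSeen.set(ip, now);

  if (previous !== undefined && now - previous < MIN_INTERVAL_MS) {
    const waitSeconds = Math.ceil((MIN_INTERVAL_MS - (now - previous)) / 1000);
    res.writeHead(429, { "Retry-After": String(waitSeconds) });
    res.end("Too Many Requests\n");
    return;
  }
  res.writeHead(200, { "Content-Type": "application/atom+xml" });
  res.end("<feed/>\n"); // stand-in for the real feed
});

server.listen(8080);
```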
Given the HTTP User-Agents presented by feed fetchers, some of this is more or less expected, for two reasons. First, some of the User-Agents are almost certainly deliberate lies, and if a feed crawler is going to actively lie about what it is there's no reason for it to respect HTTP 429s either. Second, some of the feed fetching is being done by stateless programs like curl, where the people building ad-hoc feed fetching systems around them would have to go (well) out of their way to do the right thing. However, a bunch of the aggressive feed fetching is being done by either real feed fetching software with a real user-agent (such as "RSS Bot" or the Universal Feed Parser) or by what look like browser addons running in basically current versions of Firefox. I'd expect both of these to respect HTTP 429s if they're programmed decently. But then, if they were programmed decently they probably wouldn't be trying every five minutes in the first place.
(Hopefully the ongoing feed reader behavior project by rachelbythebay will fix some of this in the long run; there are encouraging signs, as covered in eg the October 25th score report.)
2024-10-30
Keeping your site accessible to old browsers is non-trivial
One of the questions you could ask about whether or not to block HTTP/1.0 requests is what this does to old browsers and your site's accessibility to (or from) them (see eg the lobste.rs comments on my entry). The reason one might care about this is that old systems can usually only use old browsers, so to keep it possible to still use old systems you want to accommodate old browsers. Unfortunately the news there is not really great, and taking old browsers and old systems seriously has a lot of additional effects.
The first issue is that old systems generally can't handle modern TLS and don't recognize modern certificate authorities, like Let's Encrypt. This situation is only going to get worse over time, as websites increasingly require TLS 1.2 or better (and then in the future, TLS 1.3 or better). If you seriously care about keeping your site accessible to old browsers, you need to have a fully functional HTTP version. Increasingly, it seems that modern browsers won't like this, but so far they're willing to put up with it. I don't know if there's any good way to steer modern visitors to your HTTPS version instead of your HTTP version.
(This is one area where modern browsers preemptively trying HTTPS may help you.)
Next, old browsers obviously only support old versions of CSS, if they have very much CSS support at all (very old browsers probably won't). This can present a real conflict; you can have an increasingly basic site design that sticks within the bounds of what will render well on old browsers, or you can have one that looks good to what's probably the vast majority of your visitors and may or may not degrade gracefully on old browsers. Your CSS, if any, will probably also be harder to write, and it may be hard to test how well it actually works on old browsers. Some modern accessibility features, such as adjusting to screen sizes, may be (much) harder to get. If you want a multi-column layout or a sidebar, you're going to be back in the era of table based layouts (which this blog has never left, mostly because I'm lazy). And old browsers also mean old fonts, although with fonts it may be easier to degrade gracefully down to whatever default fonts the browser has.
(If you use images, there's the issue of image sizes and image formats. Old browsers are generally used on low resolution screens and aren't going to be the fastest or the best at scaling images down, if you can get them to do it at all. And you need to stick to image formats that they support.)
It's probably not impossible to do all of this, and you can test some of it by seeing how your site looks in text mode browsers like Lynx (which only really supports HTTP/1.0, as it turns out). But it's certainly constraining; you have to really care, and it will cut you off from some things that are important and useful.
PS: I'm assuming that if you intend to be as fully usable as possible by old browsers, you're not even going to try to have JavaScript on your site.
2024-10-28
The question of whether to still allow HTTP/1.0 requests or block them
Recently, I discovered something and noted it on the Fediverse:
There are still a small number of things making HTTP/1.0 requests to my techblog. Many of them claim to be 'Chrome/124.<something>'. You know, I don't think I believe you, and I'm not sure my techblog should still accept HTTP/1.0 requests if all or almost all of them are malicious and/or forged.
The pure, standards-compliant answer to this is that of course you should still allow HTTP/1.0 requests. It remains a valid standard, and apparently some things may still default to it, and one part of the web's strength is its backward compatibility.
The pragmatic answer starts with the observation that HTTP/1.1 is now 25 years old, and any software that is talking HTTPS to you is demonstrably able to deal with standards that are more recent than that (generally much more recent, as sites require TLS 1.2 or better). And as a practical matter, pure HTTP/1.0 clients can't talk to many websites because such websites are name-based virtual hosts where the web server software absolutely requires a HTTP Host header before it will serve the website to you. If you leave out the Host header, at best you will get some random default site, perhaps a stub site.
(In a HTTPS context, web servers will also require TLS SNI and some will give you errors if the HTTP Host doesn't match the TLS SNI or is missing entirely. These days this causes HTTP/0.9 requests to be not very useful.)
If HTTP/1.0 requests were merely somewhere between a partial lie (in that everything that worked was actually supplying a Host header too) and useless (for things that didn't supply a Host), you could simply leave them be, especially if the volume was low. But my examination suggests strongly that approximately everything that is making HTTP/1.0 requests to Wandering Thoughts is actually up to no good; at a minimum they're some form of badly coded stealth spiders, quite possibly from would-be comment spammers that are trawling for targets. On a spot check, this seems to be true of another web server as well.
(A lot of the IPs making HTTP/1.0 requests provide claimed User-Agent headers that include ' Not-A.Brand/99 ', which appears to have been a Chrome experiment in putting random stuff in the User-Agent header. I don't see that in modern real Chrome user-agent strings, so I believe it's been dropped or de-activated since then.)
My own answer is that for now at least, I've blocked HTTP/1.0 requests to Wandering Thoughts. I'm monitoring what User-Agents get blocked, partly so I can perhaps exempt some if I need to, and it's possible I'll rethink the block entirely.
(Before you do this, you should certainly look at your own logs. I wouldn't expect there to be very many real HTTP/1.0 clients still out there, but the web has surprised me before.)
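The check itself is trivial wherever you implement it. Here is a sketch of the idea in a small Node-based server; this is not how Wandering Thoughts actually does it, just an illustration of how little is involved.

```typescript
import * as http from "http";

// Sketch: refuse plain HTTP/1.0 requests, which in my logs are almost
// entirely badly coded spiders and other automation.
const server = http.createServer((req, res) => {
  if (req.httpVersion === "1.0") {
    res.writeHead(403, { "Content-Type": "text/plain" });
    res.end("HTTP/1.0 requests are not accepted here.\n");
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("hello\n");
});

server.listen(8080);
```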
2024-10-26
The importance of name-based virtual hosts (websites)
I recently read Geoff Huston's The IPv6 Transition, which is actually about why that transition isn't happening. A large reason for that is that we've found ways to cope with the shortage of IPv4 addresses, and one of the things Huston points to here is the introduction of TLS Server Name Indication (SNI) as drastically reducing the demand for IPv4 addresses for web servers. This is a nice story, but in actuality, TLS SNI was late to the party. The real hero (or villain) in taming what would otherwise have been a voracious demand for IPv4 addresses for websites is the HTTP Host header and the accompanying idea of name-based virtual hosts. TLS SNI only became important much later, when a mass movement to HTTPS hosts started to happen, partly due to various revelations about pervasive Internet surveillance.
In what is effectively the pre-history of the web, each website had to have its own IP(v4) address (an 'IP-based virtual host', or just your web server). If a single web server was going to support multiple websites, it needed a bunch of IP aliases, one per website. You can still do this today in web servers like Apache, but it has long since been superseded by name-based virtual hosts, which require the browser to send a Host: header with the other HTTP headers in the request. HTTP Host was officially added in HTTP/1.1, but I believe that back in the day basically everything accepted it even for HTTP/1.0 requests, and various people patched it into otherwise HTTP/1.0 libraries and clients, possibly even before HTTP/1.1 was officially standardized.
(Since HTTP/1.1 dates from 1999 or so, all of this is ancient history by now.)
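The mechanism itself is simple. Here is a sketch of Host-based dispatch with made-up host names; real web servers obviously do this from configuration rather than code like this.

```typescript
import * as http from "http";

// Sketch: one IP address, one listening socket, several websites, all told
// apart purely by the Host header the client sends.
const sites = new Map<string, string>([
  ["www.example.org", "This is the example.org site.\n"],
  ["blog.example.net", "This is the example.net blog.\n"],
]);

const server = http.createServer((req, res) => {
  // Strip any :port suffix from the Host header before looking it up.
  const host = (req.headers.host ?? "").split(":")[0].toLowerCase();
  const body = sites.get(host);
  if (body === undefined) {
    // A client that sends no Host header at all (e.g. pure HTTP/1.0) ends up here.
    res.writeHead(404, { "Content-Type": "text/plain" });
    res.end("No such site here (or no Host header was sent).\n");
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end(body);
});

server.listen(8080);
```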
TLS SNI only came along much later. The Wikipedia timeline suggests the earliest you might have reasonably been able to use it was in 2009, and that would have required you to use a bleeding edge Apache; if you were using an Apache provided by your 'Long Term Support' Unix distribution, it would have taken years more. At the time that TLS SNI was introduced this was okay, because HTTPS (still) wasn't really seen as something that should be pervasive; instead, it was for occasional high-importance sites.
One result of this long delay for TLS SNI is that for years, you were forced to allocate extra IPv4 addresses and put extra IP aliases on your web servers in order to support multiple HTTPS websites, while you could support all of your plain-HTTP websites from a single IP. Naturally this served as a subtle extra disincentive to supporting HTTPS on what would otherwise be simple name-based virtual hosts; the only websites that it was really easy to support were ones that already had their own IPs (sometimes because they were on separate web servers, and sometimes for historical reasons if you'd been around long enough, as we had been).
(For years we had a mixed tangle of name-based and IP-based virtual hosts, and it was often difficult to recover the history of just why something was IP-based instead of name-based. We eventually managed to reform it down to only a few web servers and a few IP addresses, but it took a while. And even today we have a few virtual hosts that are deliberately IP-based for reasons.)
2024-10-17
Syndication feed readers now seem to leave Last-Modified values alone
A HTTP conditional GET is a way for web clients, such as syndication feed readers, to ask for a new copy of a URL only if the URL has changed since they last fetched it. This is obviously appealing for things, like syndication feed readers, that repeatedly poll URLs that mostly don't change, although syndication feed readers not infrequently get parts of this wrong. When a client makes a conditional GET, it can present an If-Modified-Since header, an If-None-Match header, or both. In theory, the client's If-None-Match value comes from the server's ETag, which is an opaque value, and the If-Modified-Since comes from the server's Last-Modified, which is officially a timestamp but which I maintain is hard to compare except literally.
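In concrete terms, a server's side of this can be as literal as the following sketch, which treats both values as opaque strings and gives If-None-Match precedence over If-Modified-Since (as the HTTP specification says to). It's an illustration, not DWiki's actual code.

```typescript
// Sketch: deciding whether a conditional GET can be answered with a 304.
// Both comparisons are literal string comparisons, with no date parsing.
interface FeedState {
  etag: string;          // e.g. '"abc123"'
  lastModified: string;  // e.g. "Mon, 06 Jan 2025 03:10:00 GMT"
}

function isNotModified(
  feed: FeedState,
  ifNoneMatch: string | undefined,
  ifModifiedSince: string | undefined,
): boolean {
  // If the client sent If-None-Match, it takes precedence and must match our ETag.
  if (ifNoneMatch !== undefined) {
    return ifNoneMatch === feed.etag;
  }
  // Otherwise fall back to an exact match against our Last-Modified value.
  if (ifModifiedSince !== undefined) {
    return ifModifiedSince === feed.lastModified;
  }
  // No conditional headers at all: this is an unconditional GET.
  return false;
}
```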
I've long believed and said that many clients treat the If-Modified-Since header as a timestamp and so make up their own timestamp values; one historical example is Tiny Tiny RSS, and another is NextCloud-News. This belief led me to consider pragmatic handling of partial matches for HTTP conditional GET, and due to writing that entry, it also led me to actually instrument DWiki so I could see when syndication feed clients presented If-Modified-Since timestamps that were after my feed's Last-Modified. The result has surprised me. Out of the currently allowed feed fetchers, almost no syndication feed fetcher seems to present its own, later timestamp in requests, and on spot checks, most of them don't use too-old timestamps either.
(Even Tiny Tiny RSS may have changed its ways since I last looked at its behavior, although I'm keeping my special hack for it in place for now.)
Out of my reasonably well behaved, regular feed fetchers (other than Tiny Tiny RSS), only two uncommon ones regularly present timestamps after my Last-Modified value. And there are a lot of different User-Agents that managed to do a successful conditional GET of my syndication feed.
(There are, unfortunately, quite a lot of User-Agents that fetched my feed but didn't manage even a single successful conditional GET. But that's another matter, and some of them may have an extremely long polling interval. It would take me a lot more work to correlate this with which requests didn't even try any conditional GETs.)
This genuinely surprises me, and means I have to revise my belief that everyone mangles If-Modified-Since. Mostly they don't. As a corollary, parsing If-Modified-Since strings into timestamps and doing timestamp comparisons on them is probably not worth it, especially if Tiny Tiny RSS has genuinely changed.
(My preliminary data also suggests that almost no one has a different timestamp but a matching If-None-Match value, so my whole theory on pragmatic partial matches is irrelevant. As mentioned in an earlier entry, some feed readers get it wrong the other way around.)
PS: I believe that rachelbythebay's more systematic behavioral testing of feed readers has unearthed a variety of feed readers that have more varied If-Modified-Since behavior than I'm seeing; see eg this recent roundup. So actual results on your website may vary significantly depending on your readers and what they use.