2024-10-07
Things syndication feed readers do with 'conditional GET'
In HTTP, a conditional GET is a nice way of saving bandwidth (but not always work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often try fetches much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other things did conditional GETs to fetch feeds doesn't mean that they actually do it. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).
For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.
- Some or perhaps all versions of NextCloud-News send an
If-Modified-Since
header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is
always going to fail validation and turn into a regular GET
request, whether you compare If-Modified-Since values literally
or consider them as a timestamp and do timestamp comparisons.
NextCloud-News might as well not bother sending an If-Modified-Since
header at all.
- A number of feed readers appear to only update their stored ETag
value for your feed if your Last-Modified
value also changes. In practice there are a variety of things
that can change the ETag without changing the Last-Modified value,
and some of them regularly happen here on Wandering Thoughts,
which causes these feed readers to effectively decay into doing
unconditional GET requests the moment, for example, someone leaves
a new comment.
- One feed reader sends If-Modified-Since values that use a numeric
time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is
also not a reformatted version of a timestamp I've ever given
out, and is after the current Last-Modified value at the time
the request was made. This client reliably attempts to pull my
feed three times a day, at 02:00, 08:00, and 20:00, and the times
of the If-Modified-Since values for those fetches are reliably
00:00, 06:00, and 12:00 respectively.
(I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)
- Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were set after the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (ie, it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.
All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which shows a conditional GET succeeding). Probably there are more fun things lurking out there.
(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)
2024-09-17
My "time to full crawl" (vague) metric
This entry, along with all of Wandering Thoughts (this blog) and in fact the entire wiki-thing it's part of is dynamically rendered from my wiki-text dialect to HTML. Well, in theory. In practice, one of the several layers of caching that make DWiki (this software) perform decently is a cache of the rendered HTML. Because DWiki is often running as an old fashioned Apache CGI, this rendering cache lives on disk.
(DWiki runs in a complicated way that can see it operating as a CGI under low load or as a daemon with a fast CGI frontend under higher load; this entry has more details.)
Since there are only so many things to render to HTML, this on disk cache has a maximum size that it stabilizes at; given enough time, everything gets visited and thus winds up in the disk cache of rendered HTML. The render disk cache lives in its own directory hierarchy, and so I can watch its size with a simple 'du -hs' command. Since I delete the entire cache every so often, this gives me an indicator that I can call either "time to full cache" or "time to full crawl". The time to full cache is how long it typically takes for the cache to reach maximum size, which is how long it takes for everything to be visited by something (or actually, used to render a URL that something visited).
I haven't attempted to systematically track this measure, but when I've looked it usually takes less than a week for the render cache to reach its stable 'full' size. The cache stores everything in separate files, so if I was an energetic person I could scan through the cache's directory tree, look at the file modification times, and generate some nice graphs of how fast the crawling goes (based on either the accumulated file sizes or the accumulated number of files, depending on what I was interested in).
(In theory I could do this from web server access logs. This would give me a somewhat different measure, since I'd be tracking what URLs had been accessed at least once instead of which bits of wikitext had been used in displaying URLs. At the same time, it might be a more interesting measure of how fast things are visited, and I do have a catalog of all page URLs here in the form of an automatically generated sitemap.)
PS: I doubt this is a single crawler visiting all of Wandering Thoughts in a week or so. Instead I expect it's the combination of the assorted crawlers (most of them undesirable), plus some amount of human traffic.
2024-09-02
Apache's odd behavior for requests with a domain with a dot at the end
When I wrote about the fun fact that domains can end in dots and how this affects URLs, I confidently said that Wandering Thoughts (this blog) reacted to being requested through 'utcc.utoronto.ca.' (with a dot at the end) by redirecting you to the canonical form, without the final dot. Then in comments, Alex reported that they got a Apache '400 Bad Request' response when they did it. From there, things got confusing (and are still confusing).
First, this response is coming from Apache, not DWiki (the code behind the blog). You can get the same '400 Bad Request' response from https://utcc.utoronto.ca./~cks/ (a static file handled only by this host's Apache). Second, you don't always get this response; what happens depends on what you're using to access the URL. Here's what I've noticed and tested so far:
- In some tools you'll get a TLS certificate validation failure due
to a name mismatch, presumably because 'utcc.utoronto.ca.' doesn't
match 'utcc.utoronto.ca'. GNU Wget2 behaves this way.
(GNU Wget version 1.x doesn't seem to have this behavior; instead I think it may strip the final '.' off before doing much processing. My impression is that GNU Wget2 and 'GNU Wget (1.x)' are fairly different programs.)
- on some Apache configurations, you'll get a TLS certificate validation
error from everything, because Apache apparently doesn't think
that that the 'dot at end' version of the host name matches any
of its configured virtual host names, and so it falls back to a
default TLS certificate that doesn't match what you asked for.
(This doesn't happen with this host's Apache configuration but it does happen on some other ones I tested with.)
- against this host's Apache, at least lynx, curl, Safari on iOS
(to my surprise), and manual testing all worked, with the request
reaching DWiki and DWiki then generating a redirect to the canonical
hostname. By a manual test, I mean making a TLS connection to
port 443 with a tool of mine
and issuing:
GET /~cks/space/blog/ HTTP/1.0 Host: utcc.utoronto.ca.
(And no other headers, although a random User-Agent doesn't seem to affect things.)
- Firefox and I presume Chrome get the Apache '400 Bad Request' error (I don't use Chrome and I'm not going to start for this).
I've looked at the HTTP headers that Firefox's web developer tools says it's sending and they don't look particularly different or unusual. But something is getting Apache to decide this is a bad request.
(It's possible that some modern web security related headers are triggering this behavior in Apache, and only a few major browsers are sending them. I am a little bit surprised that Safari on iOS doesn't trigger this.)
2024-08-29
The web fun fact that domains can end in dots and canonicalization failures
Recently, my section of the Fediverse learned that the paywall of a large US-based news company could be bypassed simply by putting a '.' at the end of the website name. That is to say, you asked for 'https://newssite.com./article' instead of 'https://newssite.com/article'. People had a bit of a laugh (myself included) and also sympathized, because this is relatively obscure DNS trivia. Later, I found myself with a bit of a different view, which is that this is a failure of canonicalization in the web programming and web server environment.
(One theory for how this issue could happen is that the news company runs multiple sites from the same infrastructure and wants the paywall to only apply to some of them. Modern paywalls are relatively sophisticated programming, so I can easily imagine listing off the domains that should be affected by the paywall and missing the 'domain.' forms, perhaps because the people doing the programming simply don't know that bit of trivial.)
At the textual level, there are a lot of ways to vary host names and URLs. Hostnames are case independent, characters in URLs can be %-encoded, and so on (and I'm going to leave out structural modifications like '/./' and '/../' URL path elements or adding random query parameters). Web programming and web server environments already shield people from some of those by default; for example, if you configure a name-based virtual host, I think basically every web server will treat the name you provided as a case-independent one. Broadly we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.
It's my view that this canonicalization should also happen for host and domain names with dots at the end. Your web programming code should not have to even care about the possibility by default, any more than you probably have to care about it when configuring virtual hosts. If you really wanted to know low-level details about the request you should be able to, but the normal, easily accessible information you use for comparing and matching and so on should be canonicalized for you. This way it can be handled once by experts who know all of the crazy things that can appear in URLs, instead of repeatedly by web programmers who don't.
(Because if we make everyone handle this themselves we already know what's going to happen; some of them won't, and then we'll get various sorts of malfunctions, bugs, and security issues.)
PS: I've probably written some web related code that gets this wrong, treating 'domain.' and 'domain' as two separate things (and so probably denying access to the 'domain.' form as an unknown host). In fact if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain, but this is because I put in a general 'redirect all weird domain variations to the canonical domain' feature a long time ago.
(My personal view is that redirecting to the canonical form of the domain is a perfectly valid thing to do in this situation.)
2024-08-02
Modern web PKI (TLS) is very different than it used to be
In yesterday's entry on the problems OCSP Stapling always faced, I said that OCSP Stapling felt like something from an earlier era of the Internet. In a way, this is literally true. The OCSP Stapling RFC was issued in January 2011, so the actual design work is even older. In 2011, Let's Encrypt was a year away from being started and the Snowden leaks about pervasive Internet interception (and 'SSL added and removed here') had not yet happened. HTTPS was a relative luxury, primarily deployed for security sensitive websites such as things that you had to log in to (and even that wasn't universal). Almost all Certificate Authorities charged money (and the ones that had free certificates sometimes failed catastrophically), the shortest TLS certificate you could get generally lasted for a year, and there were probably several orders of magnitude fewer active TLS certificates than there are today.
(It was also a different world in that browsers were much more tolerant of Certificate Authority misbehavior, so much so that I could write that I couldn't think of a significant CA that had been de-listed by browsers.)
The current world of web PKI is a very different place from that. Let's Encrypt, the current biggest CA, currently has almost 380 million active TLS certificates, HTTPS is increasingly expected and required by people and browsers (in order to enable various useful new bits of Javascript and so on), and a large portion of web traffic is HTTPS instead of HTTP. For good reasons, it's become well understood that everything should be HTTPS if at all possible. Commercial Certificate Authorities (ones that charge money for TLS certificates) face increasingly hard business challenges, since Let's Encrypt is free, and even their volume is probably up. With HTTPS connections being dominant, everything related to that is now on the critical path to them working and being speedy, placing significant demands on things like OCSP infrastructure.
(These demands would be much, much worse if Chrome, the dominant browser, checked the OCSP status of certificates. We don't really have an idea of how many CAs could stand up to that volume and how much it would cost them.)
In the before world of 2011, being a Certificate Authority was basically a license to print money if you could manage some basic business and operations competence. In the modern world of 2024, being a general Certificate Authority is a steadily increasing money sink with a challenging business model.
2024-08-01
OCSP Stapling always faced a bunch of hard problems
One reaction to my entry about how the Online Certificate Status Protocol (OCSP) is basically dead is to ask why OCSP Stapling was abandoned along with OCSP and why it didn't catch on. The answer, which will not please people who liked OCSP Stapling, is that OCSP Stapling was always facing a bunch of hard problems and it's not really a surprise that it failed to overcome them.
If OCSP Stapling was to really deliver serious security improvements, it had to be mandatory in the long run. Otherwise someone who had a stolen and revoked certificate could just use the certificate without any stapling and have you fall back to trusting it. The OCSP standard provided a way to do this, in the form of the 'OCSP Must Staple' option that you or your Certificate Authority could set in the signed TLS certificate. The original plan with OCSP Stapling was that it would just be an optimization to basic OCSP, but since basic OCSP turned out to be a bad idea and is now dead, OCSP Stapling must stand on its own. As a standalone thing, I believe that OCSP Stapling has to eventually require stapling, with CAs normally or always providing TLS certificates that set the 'must staple' option.
Getting a web server to do OCSP Stapling requires both software changes and operational changes. The basic TLS software has to provide stapled OCSP responses, getting them from somewhere, and then there has to be something that fetches signed OCSP responses from the CA periodically and stores them so that the TLS software could use them. There are a lot of potential operational changes here, because your web server may go from a static frozen thing that does not need to contact things in the outside world or store local state to something that needs to do both. Alternately, maybe you need to build an external system to fetch OCSP responses and inject them into the static web server environment, in much the same way that you periodically have to inject new TLS certificates.
(You could also try to handle this at the level of TLS libraries, but things rapidly get challenging and many people will be unhappy if their TLS library starts creating background threads that call out to Certificate Authority sites.)
There's a lot of web server software out there, with its development moving at different speeds, plus people then have to get around to deploying the new versions, which may literally take a decade or so. There are also a whole lot of people operating web servers, in a widely varied assortment of environments and with widely varied level of both technical skill and available time to change how they operate. And in order to get people to do all of this work, you have to persuade them that it's worth it, which was not helped by early OCSP stapling software having various operational issues that could make enabling OCSP stapling worse than not doing so.
(Some of these environments are very challenging to operate in or change. For example, there are environments where what is doing TLS is an appliance which only offers you the ability to manually upload new TLS certificates, and which is completely isolated from the Internet by design. A typical example is server management processors and server BMC networks. Organizations with such environments were simply not going to accept TLS certificates that required a weekly, hands-on process (to load a new set of OCSP responses) or giving BMCs access to the Internet.)
All of this created a situation where OCSP Stapling never gathered a critical mass of adoption. Software for it was slow to appear and balky when it did appear, many people did not bother to set stapling up even when what they were using eventually supported it, and it was pretty clear to everyone that there was little benefit to setting up OCSP stapling (and it was dangerous if you told your CA to give you TLS certificates with OCSP Must Staple set).
Looking back, OCSP Stapling feels like something designed for an earlier Internet, one that was both rather smaller and much more agile about software and software deployment. In the (very) early Internet you really could roll out a change like this and have it work relatively well. But by the time OCSP Stapling was being specified, the Internet was lot like that any more.
PS: As noted in the comments on my entry on OCSP's death, another problem with OCSP Stapling is that if used pervasively, it effectively requires CAs to create and sign a large number of mini-certificates on a roughly weekly basis, in the form of (signed) OCSP responses. These signed responses aren't on the critical latency path of web browser requests, but they do have to be reliable. The less reliable CAs are about generating them, the sooner web servers will try to renew them (for extra safety margin if it takes several attempts), adding more load.
2024-07-24
The Online Certificate Status Protocol (OCSP) is basically dead now
The (web) TLS news of the time interval is that Let's Encrypt intends to stop doing OCSP more or less as soon as Microsoft will let them. Microsoft matters because they are apparently the last remaining major group that requires Certificate Authorities to support OCSP in order for the CA's TLS root certificates to be supported. This is functionally the death declaration for OCSP, including OCSP stapling.
(The major '(TLS) root programs' are all associated with browsers and major operating systems; Microsoft for Windows and Edge, Apple for macOS, iOS, and Safari, Google for Chrome and Android, and Mozilla for Firefox and basically everyone else.)
Let's Encrypt is only one TLS Certificate Authority so in theory other CAs could keep on providing OCSP. However, LE is the dominant TLS CA, responsible for issuing a very large number of the active TLS certificates, plus CAs don't like doing OCSP anyway because it takes a bunch of resources (since you have to be prepared for a lot of browsers and devices to ask you for the status of things). Also, as a practical matter OCSP has been mostly dead for a long time because Chrome hasn't supported OCSP for years, which means that only a small amount of traffic will be affected by the OCSP status of TLS certificates used for the web (which has periodically led to OCSP breaking and causing problems for people using browsers that do check, like Firefox; I've disabled OCSP in my Firefox profiles for years).
I suspect that Let's Encrypt's timeline of three to six months after Microsoft allows them to stop doing OCSP is better understood as 'one to two Let's Encrypt certificate rollovers', since all of LE's certificates are issued for 90 days. I also suspect that people will have enough problems with web servers (and perhaps client programs) that it will wind up being more toward the six month side.
Personally, I'm glad that OCSP is finally and definitely dying, and not just because I haven't had good experiences with it myself (as a Firefox user; as a website operator we never tried to add OCSP stapling). Regardless of its technical design, OCSP as an idea and a protocol is something that doesn't fit well into the modern Internet and how we understand the political issues involved with Internet-scale things (like how much they cost and who pays for them, what information they leak, what the consequences of an outage are, how much they require changes to slow-moving server software, and so on).
2024-07-14
The Firefox source code's 'StaticPrefs' system (as of Firefox 128)
The news of the time interval is that Mozilla is selling out Firefox
users once again (although Firefox remains far better than Chrome),
in the form of 'Privacy-Preserving Attribution', which you might
impolitely call 'browser managed tracking'. Mozilla enabled this
by default in Firefox 128 (cf), and if you
didn't know already you can read about how to disable it here
or here.
In the process of looking into all of this, I attempted to find
where in the Firefox code the special
dom.private-attribution.submission.enabled
was actually used,
but initially failed. Later, with guidance from @mcc's information, I managed to
find the code
and learned something about how Firefox handles certain 'about:config'
preferences through a system called 'StaticPrefs'.
The Firefox source code defines a big collection of statically known
about:config preferences, with commentary, default values, and their
types, in modules/libpref/init/StaticPrefList.yaml
(reading the comments for preferences you're interested is often
quite useful). Our dom.private-attribution.submission.enabled
preference is defined there. However, you will search the Firefox
source tree in vain for any direct reference to accessing these
preferences from C++ code, because their access functions are
actually created as part of the build process, and even in the build
tree they're accessed through #defines that are in StaticPrefListBegin.h.
In the normal C++ code, all that you'll see is calls to
'StaticPrefs:<underscore_name>()', with the name of the function
being the name of the preference with .'s converted to '_'
(underscores), giving names like
dom_private_attribution_submission_enabled. You can see this
in dom/privateattribution/PrivateAttribution.cpp
in functions like 'PrivateAttribution::SaveImpression()'
(for as long as this source code lives in Firefox before Mozilla
rips it out, which I hope is immediately).
(In the Firefox build tree, the generated file to look at is modules/libpref/init/StaticPrefList_dom.h.)
Some preferences in StaticPrefList.yaml aren't accessed this way by C++ code (currently, those with 'mirror: never' that are used in C++ code), so their name will appear in .cpp files in the Firefox source if you search for them. I believe that Firefox C++ code can also use additional preferences not listed in StaticPrefList, but it will obviously have to access those preferences using their preferences name. There are various C++ interfaces for working with such preferences, so you'll see things like a preference's value looked up by its name, or its name passed to methods like nsIPrincipal's 'IsURIInPrefList()'.
A significant amount of Firefox is implemented in JavaScript. As far as I know, that JavaScript doesn't use StaticPrefs or any equivalent of it and always accesses preferences by their normal about:config name.
2024-06-19
It seems routine to see a bunch of browser User-Agents from the same IP
The sensible thing to do about the plague of nasty web spiders and other abusers is to ignore it unless it's actually affecting your site. There is always some new SEO optimizer or marketer or (these days) LLM dataset collector crawling your site, and trying to block all of them is a Sisyphean labour. However, I am not necessarily a sensible person, and sometimes I have potentially clever ideas. One of my recent clever ideas was to look for IP addresses that requested content here using several different HTTP 'User-Agent' values, because this is something I see some of the bad web spiders do. Unfortunately it turns out that this idea does not work out, at least for me and based on my traffic.
Some of the IP addresses that are using multiple User-Agent are clearly up to no good; for example, right now there appears to be an AWS-based stealth spider crawling Wandering Thoughts with hundreds of different User-Agents from its most prolific AWS IPs. Some of its User-Agent values are sufficiently odd that I suspect it may be randomly assembling some of them from parts (eg, pick a random platform, pick a random 'AppleWebKit/' value, pick a random 'Chrome/' value, pick a random 'Safari/' value, and put them all together regardless of whether any real browser ever used the combination). This crawler also sometimes requests robots.txt, for extra something. Would it go away if you got the right name for it? Would it go away if you blocked all robots? I am not going to bet on either.
Some of the sources of multiple User-Agent values are legitimate robots that are either presenting variants of their normal User-Agent, all of which clearly identify them, or are multiple different robots operated by the same organization (Google has several different robots, it turns out). A few sources appear to have variant versions of their normal User-Agent string; one has its regular (and clear) robot identifier and then a version that puts a ' Bot' on the end. These are small scale sources and probably small scale software.
Syndication feed fetchers (ie, RSS or Atom feed fetchers) are another interesting category. There are a number of feed aggregators pulling various of my syndication feeds for different people, which they put some sort of identifier for in their User-Agent (or a count of subscribers), along with their general identification. At the small scale, some people seem to be using more than one feed reader (or feed involved thing) on their machines, with each program fetching things independently and using its own User-Agent. Some of this could also be several different people behind the same IP, all pulling my feed with different programs.
This is in fact a general thing. If you have multiple different devices at home, all of them behind a single IPv4 address, and you visit Wandering Thoughts from more than one, I will see more than one User-Agent from the same IP. The same obviously happens with larger NAT environments.
An interesting and relatively new category is the Fediverse. When a Fediverse message is posted that includes a URL, many Fediverse servers will fetch the URL in order to generate a link preview for their users. To my surprise, a surprising number of these fetches seem to be routed through common front-end IPs. Each Fediverse server is using a User-Agent that identifies it specifically (as well as its Fediverse software), so I see multiple User-Agents from this front-end. Today, the most active front end IP seems to have been used by 39 different Mastodon servers. Meanwhile, some of the larger Fediverse servers use multiple IPs for this link preview generation.
The upshot of all of this is that looking at IPs that use a lot of different User-Agents is too noisy to be useful for me to identify new spider IPs. Something that shows up with a lot of different User-Agents might be yet another bot IP, or it might be legitimate, and it's too much work to try to tell them apart. Also, at least right now there are a lot of such bot IPs (due to the AWS-hosted crawler).
Oh well, not all clever ideas work out (and sometimes I feel like writing up even negative results, even if they were sort of predictable in hindsight).
2024-06-14
Mixed content upgrades on the web in mid 2024
To simplify, mixed content happens when a web page is loaded over HTTPS but it uses 'http:' URLs to access resources like images, CSS, Javascript, and other things included on the page. Mixed content is a particular historical concern of ours for moving our main web server to HTTPS, because of pages maintained by people here that were originally written for a non-HTTPS world and which use those 'http:' URLs. Mixed content came to my mind recently because of Mozilla's announce that Firefox will upgrade more Mixed Content in Version 127, which tells you, in the small print, that Firefox 127 normally now either upgrades mixed content to HTTPS or blocks it; there is no more mixed content warnings. This is potentially troublesome for us when you couple it with Firefox moving towards automatically trying the main URL using HTTPS.
However, it turns out that this change to mixed content behavior is probably less scary to us than I thought, because it looks like Firefox is late to this party. Per this Chromium blog entry and the related Chromium feature status page, Chrome has been doing this since Chrome 86, which it appears was released back in late 2020. As for Safari, it appears that Safari just unconditionally blocks mixed content without trying to upgrade it under default circumstances (based on some casual Internet searching).
(The non-default circumstance is if the web server explicitly says to upgrade things with a 'upgrade-insecure-requests' Content-Security-Policy, which has been supported on all browsers for a long time. However this only applies to the website's own URLs; if the web page fetches things from other URLs as 'http:', I'm not sure if this will upgrade them.)
So people accessing our sites over HTTPS have probably mostly been subjected to mixed content upgrades and blocks for years. Only the Firefox people have been seeing mixed content (with mixed content warnings), and now they're probably getting a better experience.
What we really have to look out for is when browsers will start trying HTTP to HTTPS upgrades for URLs that are explicitly entered as 'http:' URLs. For people hitting our websites, such 'http:' URLs could come from bookmarks, links on other (old) websites, or URLs in published papers (or just papers that circulate online) or other correspondence.
(As long as browsers preserve a fallback to HTTP, this won't strictly be the death knell of HTTP on the web. It will probably be a death knell of things like old HTTP only image URLs that assorted people on the net keep using as image sources, but the people with those URLs may consider this a feature.)