Wandering Thoughts

2022-09-23

Browsers and them 'supporting' TLS certificate transparency

Certificate Transparency involves all Certificate Authorities logging newly issued TLS certificates in various public 'CT Logs', and then generally adding some Signed Certificate Timestamps (SCTs) to the issued TLS certificate to demonstrate that they've done this. Interested parties can then watch the CT logs to look for bad or mis-issued TLS certificates, and TLS clients can take steps to check that TLS certificates are in the logs. Famously, Firefox currently does not 'support' Certificate Transparency. But what does this actually mean?

The "Browser Requirements" section of the MDN page on Certificate Transparency somewhat gives the game away here. Let me quote:

Firefox does not currently check or require the use of CT logs for sites that users visit.

The minimal meaning of 'supporting' CT in a browser is that the browser verifies that alleged SCTs from some supported CT logs are in fact validly signed, if there are any present in the TLS certificate. Chrome and Safari go further than this, requiring that there be some number of such verified SCTs from approved CT logs in order to accept the TLS certificate.

This is a non-trivial operational issue for a browser, partly because CT logs come and go over time. The browser maker needs to establish a procedure for evaluating and qualifying CT logs as ones that it will check SCTs from, and then it must keep watching and updating its list of these CT logs (which has to be known to the browser). You probably also want a fast update mechanism so that you can rapidly push out changes to this list without a full browser upgrade (and the list of qualified CT logs may depend on when the certificate was issued).
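
As a small concrete illustration of the pieces involved, here is a minimal sketch of pulling the embedded SCTs out of a certificate, using Python's 'cryptography' package (which is not what any browser actually uses, and the file name here is made up). Actually verifying the SCT signatures additionally requires each log's public key, which is part of what a browser's list of qualified CT logs has to supply:

    # Sketch only: list the embedded SCTs in a PEM certificate.
    from cryptography import x509

    with open("example-cert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    scts = cert.extensions.get_extension_for_class(
        x509.PrecertificateSignedCertificateTimestamps
    ).value

    for sct in scts:
        # log_id identifies which CT log signed this SCT; a browser has to
        # recognize it as one of its qualified logs before the SCT counts.
        print(sct.log_id.hex(), sct.timestamp, sct.version)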

To gain more assurance, the browser can potentially try to verify that the certificate is actually present in the trusted CT logs it has SCTs from. The best current option appears to be getting this information and 'proof' from the web server, which requires adding a TLS extension to the TLS handshake (and parsing the result), parsing an additional extension in any stapled OCSP response, or both. It's not clear how many web servers support either at the moment, so any code added for this might not be widely exercised (and might be hard to test). It's not clear to me whether either Chrome or Safari attempts to do this at the moment, and in any case stopping here buys the browser almost no more assurance than validating the SCTs (for reasons covered in yesterday's entry).

(If TLS servers don't provide this information to browsers, the browser can query CT logs itself for the 'proof', but doing so leaks the web site the user is visiting. This is probably not going to be a popular information leak to add to browsers.)

The browser can go further to try to detect CT log malfeasance (or compromise), but this involves at least one background web request every so often, combined with a bunch of code to verify the proof (provided by the CT log) that one Signed Tree Head (STH) from a CT log is a subset of another STH. If TLS servers don't provide the extra CT information to browsers, doing this additional verification requires leaking to the CT log what website a user visited (okay, what TLS certificate the website used). And if this work detects CT log malfeasance, it's not clear what the browser could do; at a minimum, reporting anything useful to the browser vendor probably requires breaking some degree of user privacy in order to say that you got a 'bad' STH from such and such a website at such and such a time. That also means routinely keeping this information, although browsers could manage it alongside the other history information they currently keep and throw it away with that history if asked to.

Chrome and Safari require the presence of some number of valid and verified SCTs from supported CT logs in order to accept TLS certificates, and as a corollary of that both have some program to decide which CT logs they support (and to update that list from time to time). To do even this, Firefox would need a similar program to manage its own list (and then it would have to add additional cryptographic verification code to the browser).

BrowsersAndCertTrans written at 22:36:20; Add Comment

2022-09-19

Tangled issues with what status we should use for our HTTP redirects

We have a general purpose web server, which includes user home pages. Historically, every so often people moved on but wanted their home pages to redirect elsewhere, and we generally obliged, using various Apache mechanisms to set up HTTP redirections (most recently with Apache's RewriteMap). However, we haven't had any such new requests for years and years, which means that by now all of our existing redirections are very old (and, naturally, not all of them still go to working destinations).

When we set up HTTP redirections, we have historically tended to initially make them 'temporary' redirections (ie, HTTP status 302). Partly this is because that's usually the Apache default, and partly this is because we're concerned that we may have made a mistake (either in configuration or in intentions); historically, permanent redirects could be cached in browsers, although I'm not sure how much that happens today. Our most recent version of the redirections for people's old home pages was set up this way, and that's how they've stayed for four years.
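
For illustration, here is a minimal sketch of this sort of RewriteMap based redirection (the file name and patterns are made up, not our actual configuration). Changing R=302 to R=301 is all it takes to turn these into permanent redirections:

    # Map of login names to destination URLs, one "login URL" pair per line.
    RewriteMap homeredirs "txt:/etc/apache2/home-redirects.txt"

    RewriteEngine On
    # Redirect only if the login name is actually in the map.
    RewriteCond "${homeredirs:$1|NONE}" "!=NONE"
    RewriteRule "^/~([^/]+)" "${homeredirs:$1}" [R=302,L]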

Recently we had cause to look at how frequently these old redirections were still being used. To my surprise, a fair number of them were being used fairly often, and not just by search engines crawling them. Some of these uses may be from old URLs embedded in various places, but some of them seem to come from people following search engine links. I don't know for sure that search engines wouldn't still be providing these links if we'd been using permanent HTTP redirections, but it probably wouldn't hurt. So, more than four years after we set up things as temporary redirections just in case, we got around to making them permanent redirections. Quite possibly we should have left ourselves a note to do it sooner than that, once things were all proven and working.

Except, of course, there is a catch. Every so often we want to remove such a redirection (for example, because it's broken or no longer desired), and then perhaps later the login name, and thus the home page URL, will be reused for another person. When that happens, we definitely don't want search engines (or browsers) to be convinced that '<us>/~user/' is permanently redirected elsewhere, and to refuse to index or use the new, real, non-redirected version. If permanent HTTP redirections make this more likely, we should probably keep our redirections as temporary ones, even if this has other effects.

In part this is a conflict between the needs of the old and the new users of these URLs (or of any URLs). Permanent redirects may help the old users but hurt the new users, while temporary redirects may be the reverse. In theory this means that we should prioritize the needs of new users (who will be our current users) and use temporary redirects, but on the other hand the new users are generally only a theoretical future thing while the redirections for the old users exist now. I don't think I have any simple answers here.

(Let's take it as a given that the redirections will eventually go away and the URLs will eventually be reused. In some ideal worlds, URLs would be permanently claimed by and for their first use, but this is not the world we exist in in practice.)

HTTPRedirectsTypeIssues written at 22:04:46; Add Comment

2022-09-13

My Firefox addons as of Firefox 104 (they haven't changed in a while)

I last wrote about what Firefox addons I used back in the era of Firefox 86, about a year and a half ago. I haven't written about this since then not because I don't care (addons are actually central to my Firefox experience), but because my set of addons basically hasn't changed since then. There are multiple reasons for this: I try to be conservative about adding addons (partly because of past bad experiences with instability and memory leaks), my needs and interests haven't changed very much, and my current set of addons has been trouble free as far as I can see.

The short list is that I (still) use Foxy Gestures, uBlock Origin, uMatrix (which is still not quite dead), Cookie AutoDelete, Stylus, Textern, Cookie Quick Manager, Certainly Something, HTTP/2 Indicator, ClearURLs, and Open in Browser, although I'm not sure that's doing anything for me. I'm still using HTTPS Everywhere in some of my browser instances, although I've started to turn it off since the EFF is deprecating it. These addons are all stable enough that I can have Firefox running for days at a time without visible memory leaks or performance issues.

(On the other hand, it's been a long time since I looked at how much memory my Firefox instances are using, and now that I do it's not really a small amount of virtual memory. However, the resident set size is under 1 GB. Since I'm not feeling any sluggishness in performance, I'm probably going to let this sleeping dog lie.)

In the past, I've used Decentraleyes in some browser profiles. Based on things I've read about it being outdated and not so useful, I've increasingly wound up turning it off. I've also experimented with Right-Click Borescope (Github) as one way to be able to see bigger versions of images. I'd like to find a good extension that just enlarges an image for me (or all images on the page), but everything I've tried so far has had various bad effects. It annoys me that the best way to do this is to have 'zoom text only' turned off in Firefox's settings and then zoom the page, but such is life.

Sometimes I wonder if I'm missing out on things by not seeking out more addons. On the other hand, I suspect that I'm using more addons than average, and there are things to be wary of in using addons (and using lots of them).

Firefox104AddonsUnchanged written at 22:26:10; Add Comment

2022-08-29

A thought on presentational versus semantic HTML

One of the long running argument topics in web design is semantic versus presentational HTML, which is to say a split between writing HTML (and CSS) purely for how the result looks and authoring HTML that tries to put its semantic meaning first and then style the semantic meaning with CSS. There is a wide spectrum between these two poles, of course, especially once you start caring about issues like (HTML) accessibility. Recently, I had a thought about why this issue persists and why we don't all write semantic HTML and be done with it, especially since semantic HTML is often easier and simpler.

One of the realities of life is that as people, we care about how things look, partly because in practice you can't divorce content from presentation. This means that most people are always going to care about how their HTML looks. If you write semantic HTML, making your HTML look right is a two step process; first you write carefully taxonomized ('semantic') HTML, and then you get it to look right with CSS or whatever. If you write presentational HTML, you have only one step; you write your HTML (and CSS) and directly tweak things if necessary.
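
To make the two approaches concrete, here is a tiny artificial illustration (the class name and the styling are made up):

    <!-- Semantic: the markup says what the text is, the CSS says how it looks. -->
    <style> em.warning { color: red; font-style: normal; } </style>
    <p>Do <em class="warning">not</em> power off the server.</p>

    <!-- Presentational: the markup directly says how it should look. -->
    <p>Do <span style="color: red;">not</span> power off the server.</p>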

People don't always have the capability, the interest, or the time to take these two steps instead of one (this is especially the case for making it look right everywhere). Brute force works pretty broadly; a complex process to turn semantic markup into what looks good (or what you want) doesn't necessarily work so broadly. If you don't actually need semantic HTML for some other reason, such as content reuse (or you only need a little bit of it), writing presentational HTML is the easier way; it can often get you better looking pages with less work.

(An even easier way is to write HTML within the confines of something that provides the style you want, or that looks good to you. Or to not write HTML, for example by writing Markdown and then letting something translate it for you. But people don't always have such a thing readily available.)

I also think that one of the reasons that presentational HTML works so well is that browsers have been implicitly punished for rendering such HTML in a divergent way, or at least in a divergent way that makes it look worse. If you try some new browser and it makes web pages look bad to you, you're probably not really going to stick with it; you're going to go back to your previous one (a similar force acts to keep successive browser versions from changing their rendering). People grouse about bug for bug compatibility, but I think there's a real argument that it's generally the right choice.

(The short version is that there's a huge amount of value in all of the existing HTML out in the world, and degrading that value is a bad thing.)

HTMLSemanticVsVisualThought written at 22:40:06; Add Comment

2022-08-12

My adventure with URLs in a Grafana that's behind a reverse proxy

I was oblique in yesterday's entry, but today I'm going to talk about the concrete issue I'm seeing because it makes a good illustration of how modern web environments can be rather complicated. We run Grafana behind a reverse proxy as part of a website, with all of Grafana under the /grafana/ path. One of the things you can add to a Grafana dashboard is links, either to other dashboards or to URLs. I want all of our dashboards to have a link to the front page of our overall metrics site. The obvious way to configure this is to tell Grafana that you want a link to '/', which as a raw link in HTML is an absolute path to the root of the current web server in the current scheme.

When I actually do this, the link is actually rendered (in the resulting HTML) as a link to '/grafana/', which is the root of the Grafana portion of the website. Grafana is partially configured so that it knows what this is, in that on the one hand it knows what the web server's root URL for it is, but on the other hand its own post-proxy root is '/' (in Apache terminology, we do a ProxyPass of '/grafana/' to 'localhost:3000/'). This happens in both Firefox and Chrome, and I've used Firefox's developer tools to verify that the 'href' of the link in the HTML is '/grafana/' (as opposed to, eg, the link getting rewritten by Javascript on the fly when you hover or click on it).
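
For concreteness, the general shape of such a setup looks something like this (a sketch with an assumed grafana.ini root_url and host name, not our exact configuration; the Apache side matches the ProxyPass described above):

    # Apache: everything under /grafana/ is proxied to a local Grafana.
    ProxyPass        "/grafana/" "http://localhost:3000/"
    ProxyPassReverse "/grafana/" "http://localhost:3000/"

    # Grafana (grafana.ini): it is told its public root URL, but its own
    # post-proxy root is still '/'.
    #   [server]
    #   root_url = https://mymetrics.example.org/grafana/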

Grafana dashboards are not served as straight HTML at a given URL. Instead they are created through copious application of Javascript; the HTML you get served initially is just a vague starting point to load the Javascript. This makes it quite difficult to see what the source of the '/grafana/' link is. The information about the link could be sent from the Grafana server to the in-browser Javascript as either HTML or JSON, and it might then be rewritten by the Javascript (from either form of the information). If it's rewritten by Javascript, this could be a general rewriting of URLs that's necessary to make other URLs in the dashboard work; the Grafana server could be generating all URLs using its own root of '/' and then counting on the Javascript to fix all of them up.

(Alternately, the Grafana server could simply have decided (perhaps by accident) that all 'absolute' URLs that you provide are relative to its root, and some combination of the backend server and the frontend Javascript will rewrite them all for you.)

What I take from all of this is that a modern web application is a complicated thing and putting it behind a reverse proxy makes it more so, at least if it's sharing your web server with anything else. Of course, neither of these two things are exactly news. Now that I know a little bit more about how much 'rehydration' Grafana does to render dashboards, I'm a bit more amazed at how seamlessly it works behind our Apache reverse proxy.

PS: Configuring the link value in Grafana to be 'https:/' defeats whatever rewriting is going on. The HTML winds up with that literal text as the 'href' value, and then the pragmatics of how browsers interpret this take over.

GrafanaReverseProxyAndURLs written at 22:34:26; Add Comment

2022-08-11

My uncertainty over whether an URL format is actually legal

I was recently dealing with a program that runs in a configuration that sometimes misbehaves when you ask it to create and display a link to a relative URL like '/'. My vague memory suggested an alternative version of the URL that might make the program leave it alone, one with a scheme but no host, so I tried 'https:/' and it worked. Then I tried to find out if this is actually a proper legal URL format, as opposed to one that browsers just make work, and now I'm confused and uncertain.

The first relatively definite thing that I learned is that file URLs don't need all of those slashes; a URL of 'file:/tmp' is perfectly valid and is interpreted the way you'd expect. This is suggestive but not definite, since the "file" URL scheme is a pretty peculiar thing.

An absolute URL can leave out the scheme; '//mozilla.org/' is a valid URL that means 'the root of mozilla.org in whichever of HTTP and HTTPS you're currently using' (cf). Wikipedia's section on the syntax of URLs claims that the authority section is optional. The Whatwg specification's section on URL writing requires anything starting with 'http:' and 'https:' to be written with the host (because scheme relative special URL strings require a host). This also matches the MDN description. I think this means that my 'https:/path' trick is not technically legal, even if it works in many browsers.

Pragmatically, Firefox, Chrome, Konqueror, and Lynx (all on Linux) support this, but Links doesn't (people are extremely unlikely to use Lynx or Links with this program, of course). Safari on iOS also supports this, which is the extent of my easy testing. Since Chrome on Linux works, I assume that Chrome on other platforms, including Android, will; similarly I assume desktop Safari on macOS will work, and Firefox on Windows and macOS.
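
As another pragmatic data point, Python's urllib.parse treats such a URL the same way the working browsers do, although that says nothing about whether the format is technically legal (the example.org URL here is purely illustrative):

    from urllib.parse import urlsplit, urljoin

    # A scheme with no host parses as the scheme plus an absolute path:
    print(urlsplit("https:/"))
    # SplitResult(scheme='https', netloc='', path='/', query='', fragment='')

    # Resolved against a page's URL, it acts like '/' on the same host:
    print(urljoin("https://example.org/grafana/d/abc/some-dash", "https:/"))
    # https://example.org/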

(I turned to specifications because I'm not clever enough at Internet search terms to come up with a search that wasn't far, far too noisy.)

PS: When I thought that 'https:/path' might be legal, I wondered if ':/path' was also legal (with the meaning of 'the current scheme, on the current host, but definitely an absolute path'). But that's likely even less legal than 'https:/path' and probably less well supported; I haven't even tried testing it.

Sidebar: Why I care about such an odd URL

The obvious way to solve this problem would just be to put the host in the URL. However, this would get in the way of how I test new versions of the program in question, where I really do want a URL that means 'the root of the web server on whatever website this is running on'. Yes, I know, that should be '/', but see above about something mis-handling this sometimes in our configuration.

(I don't think it's Apache's ProxyPassReverse directive, because the URL is transformed in the HTML, and PPR doesn't touch that.)

URLFormatLegalUncertainty written at 23:50:07; Add Comment

2022-07-20

A brute force solution to nested access permissions in Apache

The simplest way to set up Grafana Loki is as a single server that handles both ingesting logs and querying them, much like Prometheus. Unlike Prometheus, Loki is a 'push' system instead of a 'pull' one, where clients send logs to Loki (via HTTP) instead of Loki reaching out to collect logs from them. This matters because of access permissions; it's one thing to allow a system to talk to Loki to send logs, but another thing to let it query everyone's logs. Loki explicitly punts on authentication and access control, leaving it up to you to put something in front of it to do this. Our solution for reverse proxying is Apache.

In a nice world, Loki's HTTP endpoints for log ingestion and log querying would be clearly separated in an URL hierarchy; you might have all push endpoints under /loki/push/ and all query related endpoints under /loki/query/, for example. In the current Loki HTTP API things are not so nicely divided. There is one HTTP endpoint for push, /loki/api/v1/push, but a bunch of other API endpoints both at the same level and above it (including not under /loki at all, for extra fun). This means that what we want to do in Apache is provide relatively open access to /loki/api/v1/push but then provide very restricted access to everything else from Loki's root URL downward, without having to inventory every URL under /loki/ and so on that we want to restrict.

There's probably a way to do this in Apache with the right set of directives, the right ordering and nesting of <Location> blocks, and so on. But it's at least not obvious to me how to do this, and while I was thinking about it I realized that there was a much simpler solution: you can have multiple reverse proxies to the same thing, under separate URLs (unless whatever is talking to you absolutely insists on speaking to one fixed URL on your server that must start at the root level).

So I have one reverse proxy for /loki/api/v1/push that talks to Loki (with that URL on Loki), and is relatively open. Then I have a completely separate top level URL, let's call it /loki-2/, that's a highly access restricted reverse proxy to all of Loki. Loki Promtail clients can use the push URL without any chance that I'll have made a mistake and given them access to anything else, while authorized external Grafana instances and other tools can connect to the /loki-2/ set of URLs.
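
In outline, the Apache side of this looks something like the following (a sketch with made-up host names and access rules, not our real configuration):

    # Relatively open: machines in this range may push logs, nothing more.
    <Location "/loki/api/v1/push">
        ProxyPass "http://lokihost.example.org:3100/loki/api/v1/push"
        Require ip 192.0.2.0/24
    </Location>

    # The entire Loki API, heavily restricted.
    <Location "/loki-2/">
        ProxyPass "http://lokihost.example.org:3100/"
        Require ip 192.0.2.10
    </Location>
    # Deliberately no ProxyPassReverse; see below.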

(Because I'm controlling access to URLs, not filesystem directories, I'm using <Location> directives for this.)

This solution is brute force, but it works and it's simple to set up and to understand. Since they're completely separate URLs, it's entirely clear how the permissions work (and how they don't interact with each other). The one little thing is that I had to avoid a ProxyPassReverse directive, which I think doesn't work here. Since Loki is an API server, I don't think it's necessary; Loki will not normally be replying with redirects and so on.

(I'd still like to figure out how to do nested permissions here in Apache, where the parent has narrower or non-overlapping permissions than children, because I'm sure that someday I'm going to need to do it for real. But I only have so much energy for wrestling with Apache and doing Internet searches on various keyword combinations.)

Sidebar: Promtail and HTTP authentication

Promtail can use various HTTP authentication methods but in our environment all of them are awkward, and they require carefully guarding the authentication secret in Promtail across our entire fleet, because the secret would give access to all of Loki, not just the push side.

In our setup we could use HTTP authentication on the push URL to make it harder for random people to push random things into Loki. At the moment I don't think this is going to be a problem, so I'm skipping the extra complexity (and extra things that could break).

ApacheNestedAccessBruteForce written at 22:28:41; Add Comment

2022-07-11

It feels surprisingly good to block Bingbot from my blog front page

Back last year I wrote about how Microsoft's Bingbot relentlessly crawled the front page of Wandering Thoughts. On pretty much every day, a single Bingbot IP would request the front page of Wandering Thoughts a thousand times or more. Back in that entry I said I was tempted to block Bingbot from doing that; recently, my vague irritation with Bingbot's ongoing behavior reached a boiling point, and I actually did that.

Since Wandering Thoughts is served through an Apache with mod_rewrite enabled, the block was relatively simple to implement. I just check for an exact match of the request URI (since Bingbot never uses variations) and then match the user agent. By default, successive RewriteCond conditions all must be true so this just works.
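
The result is only a few lines. A sketch of the general form, with a hypothetical front page path instead of the real one:

    RewriteEngine On
    # Both conditions must be true: the exact front page URL, requested by
    # something claiming to be Bingbot.
    RewriteCond "%{REQUEST_URI}" "=/blog/"
    RewriteCond "%{HTTP_USER_AGENT}" "bingbot" [NC]
    RewriteRule "^" "-" [F]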

(The hardest bit was re-reading the mod_rewrite documentation yet again to determine that I wanted to match against REQUEST_URI. This would have been faster if I'd actually fully read the documentation and followed the cross reference to expression variables.)

That Wandering Thoughts' front page now gives Bingbot 403s hasn't particularly slowed it down. Over the past almost 24 hours, Bingbot has made just under 1,400 requests for the URL from two different IPs (one of which made most of them). It doesn't yet seem to have latched on to any other page with a similar death grip, although my Linux category is somewhat popular with it right now (with 40 requests today). Probably I'm going to have to keep an eye on this.

It's felt surprisingly nice to have this little irritation pushed out of my life. I know, I shouldn't care that Bingbot is doing bad and annoying things, but I do look at what IP addresses are the most active here (excluding blocked requests) and always having Bingbot show up there was this little poke. And while the operators of Bingbot probably will never notice or know, I can feel that I did a little tiny bit to hold badly behaved web spiders to account.

PS: So far today Bingbot has made just over 1,900 successful requests (HTTP 200 result), just over 1,500 requests that were 403'd, 53 requests that got 304 Not Modified responses, and six '404 no such thing' requests. I'm most surprised at the 304 requests, seeing as Bingbot will routinely repeatedly bang on unchanging URLs without getting 304s. If it could at least conditionally request the same thing over and over so it would mostly get 304s, I would probably feel slightly happier with it. Doing 304s for a few things but not the heavily requested URLs is a bit irritating.

BingbotFrontPageBlock written at 23:10:38; Add Comment

2022-07-01

A quiet shift in what tech people build for their blogs

Tech people have always had a certain attraction to building their own blogs instead of using a canned platform. Not every tech person, by any means (there are plenty of people who use readily available platforms because they have better things to spend time and energy on), but there's generally been enough tech people that there have been tendencies and trends. Back when I started Wandering Thoughts, the in thing to do was to build a dynamic blog engine. DWiki, the engine behind Wandering Thoughts, was such a dynamic engine, and it was somewhat modeled on others that I saw at the time.

As anyone who's read people's entries on 'I built my own blog/blog engine' knows, things have shifted a lot since the old days of the mid 00s. For some time now, the in thing to build has been a static site or a static site generator (possibly using existing components with different connections; people don't often write new Markdown to HTML renderers from scratch). This has gone along with a general shift in the style of blogs that tech people build for themselves even if they don't write anything new. It at least feels as if a new tech blog is much more likely to be built with static site generation tools than it is to be an installation of Wordpress or another dynamic blogging platform.

(It's entirely possible that this (apparent) shift in general new tech blogs is an artifact of what sort of new tech blogs I wind up seeing, and that there's a great dark matter of such blogs where the authors go with Wordpress or something else simple. Certainly I think that non-tech people starting new blogs generally don't go with static site generators.)

I don't know if there are any particular strong technical reasons for the shift. If anything, it feels like it should have become easier to host a dynamic blog since the mid 00s, due to the vastly increased availability of dedicated (virtual) machine hosting. My perception is that it's basically a shift in the culture, although somewhat pushed by an increasing emphasis on website speed (both normally and under load), with the perception that static sites are faster and 'less wasteful'.

To the extent that I have feelings about this at all, I find it a little bit regrettable that tech people have moved away from building dynamic blogs (and dynamic sites). Building Wandering Thoughts and DWiki has taught me a bunch more than I would have learned from writing a static site generator and then letting Apache (or something else) serve it for me.

(This entry was sparked by reading Cool Things People Do With Their Blogs (via) and seeing it call out 'write a custom dynamic blog engine from scratch'. Back in the mid 00s, it at least felt like that was a routine thing to do; nowadays, not so much. Also, see The Demise of the Mildly Dynamic Website (via).)

BuildingBlogFashionShifts written at 21:33:15; Add Comment

2022-06-04

Web URL paths don't quite map cleanly onto the abstract 'filesystem API'

Generally, the path portion of web URLs maps more or less onto the idea of a hierarchical filesystem, partly because the early web was designed with that in mind. However, in thinking about this I've realized that there is one place where web paths are actually a superset of the broad filesystem API, and this place causes some amount of heartburn and divergent design decisions in web servers when they serve static files.

The area of divergence is that in the general filesystem API, directories don't have contents, just children. Only files have contents. In web paths, of course, directories very frequently have contents as well as children (if anything, a web path directory that refuses to have contents is rarer than one that does). This is quite convenient for people using the web, but requires web servers to invent a convention for how path directories get their contents (for example, the 'index.html' convention).
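
A minimal sketch of the sort of convention a static file server has to invent, in illustrative Python rather than any particular server's actual logic:

    import os.path

    DOCROOT = "/var/www/html"     # assumed document root
    INDEX_NAME = "index.html"     # the invented 'directory contents' convention

    def resolve(url_path):
        # NB: a real server must also defend against '..' traversal and so on.
        fspath = os.path.join(DOCROOT, url_path.lstrip("/"))
        if os.path.isdir(fspath):
            # Directories have no contents of their own in the filesystem API,
            # so serve a designated child file instead, if there is one.
            index = os.path.join(fspath, INDEX_NAME)
            return index if os.path.isfile(index) else None
        return fspath if os.path.isfile(fspath) else None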

(There's no fundamental reason why filesystem directories couldn't have contents as well as children; they just don't. And there are other environments with hierarchical namespaces where people not infrequently would like 'directories' with contents; one example is IMAP.)

One possible reason for this decision in web paths (other than user convenience) is the problem that the root of a web site would otherwise present. The root of a web site almost always has children (otherwise it's a very sparse site), so it must be a directory. If web directories had no contents, the way filesystem directories don't, either the web root would have to be special somehow or people would have a bad experience visiting 'http://example.org/'.

(This bad experience would probably drive browsers to assume a convention for the real starting page of web sites, such as automatically trying '/index.html'.)

PS: Another reason for the 'decision' is that any specification would have to go out of its way to say that directories in web paths couldn't have contents and should return some error code if you requested them. Not saying anything special about requesting directories is easier.

WebPathsNotQuiteFilesystemAPI written at 21:13:17; Add Comment
