2023-09-21
HTTP Basic Authentication and your URL hierarchy
We're big fans of Apache's implementation of HTTP Basic Authentication, but we recently discovered that there are some subtle implications of how Basic Authentication can interact with your URL hierarchy within a web application. This is because of when HTTP Basic Authentication is and isn't sent to you, and specifically that browsers don't preemptively send the Authorization header when they are moving up your URL hierarchy (well, for the first time). That sounds abstract, so let's give a more concrete example.
Let's suppose you have a web application that allows authenticated people to interact with it at both a high level, with a URL of '/manage/', and at the level of dealing with a specific item, with an URL of, say, '/manage/frobify/{item}'. You would like a person to frobify some item, so you (automatically) send them email saying 'please visit <url> to frobify {item}'. They visit that URL while not yet authenticated, which causes the web server to return a HTTP 401 and gets their browser to ask them for their login and password on your site (for a specific 'realm', effectively a text label). Their re-request with an Authorization header succeeds, and the person delightedly frobifies their item. At the end of this process, your web application redirects them to '/manage/'. Because this URL is above the URL the person has been dealing with, their browser will not preemptively send the Authorization header, and your web server will once again respond with a HTTP 401.
Because this is all part of the same web application, your HTTP Basic Authentication will use the same realm setting for both URLs in your web server, and thus in your WWW-Authenticate header. In theory the browser can see that it already knows an authentication for this realm and automatically retry with the Authorization header. In practice a browser may not always do this in all circumstances, and may instead stop to ask the person for their login and password again. With this URL design you're at the mercy of the browser to do what you want.
(This can be confusing to the person, especially if (from their perspective) they just pressed a form button that said 'yes, really frobify {item}' and now they're getting challenged again. They may well think that their action failed; after all, successful actions don't usually cause you to get re-challenged for authentication.)
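(To make the mechanics concrete, here's a minimal sketch of the server side of this challenge dance, written in Go purely for illustration and nothing to do with Apache's actual implementation; checkPassword(), the realm name, and the port are all made up for the example. The point it shows is that every URL under '/manage/' hands out the same realm in its WWW-Authenticate header, which is what in theory lets a browser reuse the credentials it already has.)

package main

import "net/http"

// checkPassword is a made-up stand-in for however logins are actually
// verified (Apache would consult your htpasswd file or the like).
func checkPassword(user, pass string) bool {
    return user == "someone" && pass == "their-password"
}

// requireAuth issues the HTTP Basic Authentication challenge. The key
// detail is that every URL under /manage/ uses the same realm in its
// WWW-Authenticate header, so a browser can in theory see that its
// existing credentials apply and retry on its own.
func requireAuth(w http.ResponseWriter, r *http.Request) bool {
    user, pass, ok := r.BasicAuth()
    if !ok || !checkPassword(user, pass) {
        w.Header().Set("WWW-Authenticate", `Basic realm="manage"`)
        http.Error(w, "authentication required", http.StatusUnauthorized)
        return false
    }
    return true
}

func main() {
    http.HandleFunc("/manage/", func(w http.ResponseWriter, r *http.Request) {
        if !requireAuth(w, r) {
            return
        }
        w.Write([]byte("<p>the management pages</p>"))
    })
    http.ListenAndServe(":8080", nil)
}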
Unfortunately it's hard to see how to get out of this while still having a sensible URL hierarchy, short of never sending people direct links for actions and always having them enter at the top level of your application. One not entirely great option is that when people frobify their items, they are never automatically redirected up; instead they just wind up back on '/manage/frobify/{item}' except that now the page says 'congratulations, you have frobified this item, click here to go to the top level'. This is slightly less convenient (well, if people actually want to go to your '/manage/' page) but won't leave people in doubt about whether or not they really did successfully frobify their item.
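(As a sketch of that option, again in Go purely for illustration and with frobifyItem() standing in for whatever does the real work, the handler renders the confirmation in place instead of redirecting up to '/manage/'.)

package main

import (
    "fmt"
    "html"
    "net/http"
    "strings"
)

// frobifyItem is a made-up stand-in for actually frobifying the item.
func frobifyItem(item string) {}

func handleFrobify(w http.ResponseWriter, r *http.Request) {
    item := strings.TrimPrefix(r.URL.Path, "/manage/frobify/")
    if r.Method != http.MethodPost {
        // Show the 'yes, really frobify {item}' form on a plain GET.
        fmt.Fprintf(w, `<form method="post"><button>Really frobify %s</button></form>`,
            html.EscapeString(item))
        return
    }
    frobifyItem(item)
    // Stay on /manage/frobify/{item} rather than redirecting to /manage/,
    // so the browser never has to (re-)send Authorization for a higher URL.
    fmt.Fprintf(w, `<p>Congratulations, you have frobified %s. <a href="/manage/">Go to the top level</a>.</p>`,
        html.EscapeString(item))
}

func main() {
    http.HandleFunc("/manage/frobify/", handleFrobify)
    http.ListenAndServe(":8080", nil)
}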
When you look at your logs, this behavior may be surprising to you if you've forgotten the complexities of when browsers preemptively send HTTP Basic Authentication information. HTTP Basic Authentication doesn't work like a regular cookie, where you can set it once and then assume it will always come back, which is the model of authentication we're generally most familiar with.
2023-09-16
Apache's HTTP Basic Authentication could do with more logging
Suppose, not entirely hypothetically, that you use Apache and have an area of your website protected with Apache's HTTP Basic Authentication. A user comes to you with a problem report; while interacting with this area of the site, they unexpectedly got re-challenged for authentication. In fact, in your Apache logs you can see that they made an authenticated request that returned a HTTP redirect and literally moments later their browser's GET of the redirection target was met with a HTTP 401 response, indicating that Apache didn't think they were authenticated or maybe authorized. Unfortunately, our options for understanding exactly what happened are limited, because Apache doesn't really do logging about the Basic Authentication process.
There is one useful (or even critical) piece of information that Apache does log in the standard log format, and that is whether the HTTP 401 was because of a failure of authentication or a lack of authorization. Both normally get HTTP 401 responses (although you can change that with AuthzSendForbiddenOnFailure and perhaps should), but they appear differently in the normal access log. If there was a successful authentication but the user was not authorized, you will see their name in the log file:
192.168.1.1 - theuser [...] "GET /restricted HTTP/1.1" 401 ....
If they are not authenticated (for whatever reason), then there will be no user name logged; the leading bit will just be '192.168.1.1 - -'.
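(If you want to count these cases in bulk, a little filtering is enough. Here's a minimal sketch in Go that assumes the common log format above, with the authenticated user in the third field; it can only split 'authenticated but not authorized' from 'not authenticated', because that's all the access log can tell you.)

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    var denied, unauth int
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        line := scanner.Text()
        // Crude check for a 401 status right after the quoted request.
        if !strings.Contains(line, `" 401 `) {
            continue
        }
        fields := strings.Fields(line)
        if len(fields) < 3 {
            continue
        }
        if fields[2] == "-" {
            // No user name logged: not authenticated, for one of several
            // possible reasons that the log can't distinguish.
            unauth++
        } else {
            // A user name is logged: authenticated but not authorized.
            denied++
        }
    }
    fmt.Printf("authenticated but not authorized: %d\nnot authenticated: %d\n", denied, unauth)
}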
However, there are at least five reasons why this request was not authenticated (in Apache's view) and you can't tell them apart. The five reasons are the browser didn't send an Authorization header, the header or a part of it was malformed, the authorization source (such as your htpasswd file) was missing or unreadable, the header contained a user name that Apache didn't find or recognize, or the password in the header was incorrect. It would be nice to know which one of these had happened, because they lead to quite different causes and fixes.
(Apache may log errors if the authorization source is missing or unreadable; I haven't tested. That still leaves the other cases.)
For example, if your logs say that the browser didn't send the header at all, that is probably not a problem on your side, although the rules for when browsers decide to send this header are a bit complex and potentially surprising; the header doesn't work like cookies, which a browser will always send once they're set. And browsers make their own decisions about how to react to HTTP 401 responses on requests where they didn't send Authorization headers, so they may decide to re-ask the person for a name and password even though they have Basic Authentication credentials they could try.
(Having discovered AuthzSendForbiddenOnFailure, I am probably going to set it on several limited-access areas in our Apache configuration, because it's rather more user friendly. It's not an information disclosure for us because there are authenticated but otherwise unrestricted areas on the web server with the same credentials, so an attacker can already validate guessed passwords.)
2023-08-29
Experiencing the increase in web bandwidth usage for myself
Recently, for reasons outside the scope of this entry (cf), I found myself using tethered cellular Internet at home instead of my regular DSL Internet. In many parts of the West this wouldn't be much of a problem, but in Canada our cellular Internet plans are all what you would politely call 'cramped' in terms of monthly transfer limits, and needing to use cellular Internet on a regular basis for what turned out to be more than two weeks really made me watch my usage nervously.
(I've used cellular Internet before during brief interruptions in my DSL connection and during a vacation, but both are different than this time. And in the latter situation I looked into ways to turn Fedora's bandwidth usage down.)
There were certain things I could cut right out, like fetching Fedora package updates, VCS repository updates, and so on. After those, a lot of my remaining use of the Internet was visiting and using websites, including the Fediverse (I don't have a Linux client I like yet, so I use the Mastodon web interface). This, plus monitoring how much I'd transferred, gave me a front row seat on how much bandwidth the modern web casually uses. Often the answer was 'a lot', at least by my standards.
One reason for this is modern web design's love for what is apparently called 'hero images', which are images thrown into text articles to add nominal interest. Hero images often appear at the top and can also be added part way through when the article's creator decides they want to give you something else to look at; they add nothing to the article except more data transferred (and a visual break), and often are relatively large. Modern JavaScript doesn't help, but hero images are a significant reason that even ordinary looking web pages can be 10 Mbytes or more a pop once the dust settles (on a HiDPI display, which may not help either).
I was going to say something about how much bandwidth our Grafana dashboards use, but although I perceived them as heavyweight and avoided looking at them over cellular, now that I've looked our main dashboard is only about 1.8 Mbytes when fully loaded. In one sense that's a lot (and certainly it wasn't fast over my cellular Internet), but compared to other websites it wasn't all that bad, and certainly we nominally get plenty of value from it (unlike hero images).
All of this is far from news; for years, people have been writing about how heavy web pages are and how this affects people with slow Internet and expensive bandwidth. But it had never really affected me, and now it sort of did. Certainly I became much more conscious of just how much bandwidth I could casually go through in a day, even a day when I was consciously trying to stick to the text-focused web. Even just using the Internet from home in the evenings, it was hard not to use over 512 Mbytes, and easy to hit 1 Gbyte.
(I have a personal Prometheus and Grafana setup on my home machine for reasons, and it's tempting to add some data panels for 'per day total bandwidth usage'. I'd probably be more routinely aware of it, at least. Although with my DSL back, now I'm doing Fedora package updates and other non-web things that use up bandwidth.)
2023-08-10
Browsers barely care what HTTP status code your web pages are served with
Back when I wrote an entry on issues around the HTTP status code for a web server's default front page, I said in passing that the HTTP status code mostly doesn't matter to browsers. More exactly, the status code for a web page mostly doesn't matter to people looking at web pages in a browser (this has come up before). This is well known in some circles and probably surprising in others.
Certain HTTP status codes cause web browsers to do specific things; there are the HTTP 3xx redirections, HTTP 401 Unauthorized, and some others. However, in general if you respond to a request for a web page with some HTML and a HTTP 200, 4xx, or 5xx status code outside of these specific ones, almost all browsers will display the HTML to the user and not expose the actual HTTP status code to them in any obvious way. If the HTML says 'HTTP 500 internal failure', they'll assume one thing; if the HTML says 'welcome to the default server page', they'll assume another thing.
(I'm not sure there's any way to find the HTTP status code in a modern Firefox environment short of using web developer tools. It's not in places like 'Page Info' as far as I can see.)
This is not so true in other browser contexts. If a web page is trying to fetch CSS, JavaScript, or images as sub-resources, I believe that browsers will react very differently to a HTTP 200 response than to the exact same content with the exact same Content-Type but with a HTTP 4xx or 5xx status code; only the successful HTTP 200 response will work. Similarly if there's JavaScript running to fetch HTML chunks and stuff them into the page, it's likely to care (and not work) if you return the same HTML with a HTTP 404 instead of a HTTP 200.
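(You can see both halves of this with a small test server. Here's a minimal sketch in Go, with the page text and port invented for the example, that serves the same HTML with two different status codes; a person looking at either URL in a browser sees the same page, while a sub-resource fetch or other program treats them quite differently.)

package main

import "net/http"

const page = `<html><body><p>HTTP 500 internal failure (or is it?)</p></body></html>`

func main() {
    http.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(page)) // implicit HTTP 200
    })
    http.HandleFunc("/broken", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusInternalServerError) // HTTP 500, same HTML
        w.Write([]byte(page))
    })
    http.ListenAndServe(":8080", nil)
}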
It's a convention (and a useful one) that the HTML served for a web server's error pages will include the HTTP status code in the text (and often the <title> as well). But it's only a convention and it can be violated, both accidentally and deliberately (both in omitting the status code and listing the wrong one). If it is violated, you probably won't notice for a while (if ever).
PS: It turns out that our Django based web application doesn't actually list the HTTP status codes on its various custom error pages, although it does have appropriate text that says you've hit a nonexistent page or an internal error. I probably should at least add a footer saying '(This is a HTTP status 404 error)' (with the correct code for the specific error page).
2023-08-01
Turning off the sidebar of Firefox's built in PDF viewer
Over on the Fediverse, I said:
Firefox tip: do you hate that the built-in PDF viewer opens the space-eating sidebar all the time (I do)? If so, go to about:config and set pdfjs.sidebarViewOnLoad to '0' from its default. As far as I know this is not exposed as a Preferences setting, so you have to use about:config (and search the magic numbers in the Firefox source).
Apparently this behavior may actually be something that defaults to whatever the PDF asks for (also, also). I run into this routinely because a lot of academic papers in PDF form trigger this Firefox behavior (for whatever reason), which is possibly some default inherited through LaTeX or some other common scientific PDF creator.
(PDF documents can apparently have some or all of outlines (which I would call tables of contents), attachments, and 'layers', whatever those are. Looking into how PDF works is always an adventure. I'm not sure if Firefox ever shows the sidebar by default if the PDF doesn't have any of them.)
Firefox's built in PDF viewer is called 'pdf.js' (also), and is nominally sort of a stand alone project that Firefox incorporates (in this case, in toolkit/components/pdfjs). Currently the values for this can be found in the 'const SidebarView' declaration in eg viewer.js. As of writing this entry, the values are:
UNKNOWN: -1, NONE: 0, THUMBS: 1, OUTLINE: 2, ATTACHMENTS: 3, LAYERS: 4
The default value of pdfjs.sidebarViewOnLoad is -1, which apparently leaves it to whatever the PDF asked for (if the information above is correct). I believe that this default (and the default for all of the other pdfjs preferences) can be seen in PdfJsDefaultPreferences.sys.mjs, although maybe modules/libpref/init/all.js and browser/app/profile/firefox.js are also relevant. Don't ask me, I just run searches over the Firefox source code tree. In any case, your 'about:config' for 'pdfjs.' entries is the authoritative source for your current values.
As far as I know, there's no Preferences UI for any pdf.js settings and pdf.js itself doesn't seem to have such a thing, although I believe that it may remember per-document settings until you restart your Firefox.
(A number of things in Firefox remember per-document or per-host settings in memory, so more or less until Firefox restarts. Other per-host settings, such as your zoom level, are stashed away somewhere that's not exposed in Preferences or about:config.)
2023-07-15
Static websites have a low barrier to entry
Yesterday I wrote about a criticism of the usual static versus dynamic website divide, where I noted a number of (what I see as) significant differences between static and dynamic websites in practice. But there is another important difference in practice, and that's the amount of expertise that's needed to create each of them. Specifically, static websites don't require much expertise to create. Given a suitable general web server environment, which is easy to provide, all you need to set up a static website is the ability to write HTML, or even an authoring and editing tool that will do it for you.
(Well, you need to be able to put your edited files on your web server, but generally this is straightforward, especially at small scale.)
There are dynamic website environments that are almost this simple, but they're also necessarily relatively canned, fixed function environments. The obvious examples of this are the various providers of hosted blogs and similar things such as Dreamwidth. These are all dynamic sites, but your options for changing how they work are usually both relatively modest and more complicated than 'write HTML'. In addition, you have the hidden complexity of picking which of the many varieties best meets your needs.
(Then, over a long enough time span you may have to deal with moving from one canned service to another.)
A fully custom dynamic website requires quite a lot of expertise, because you need to be able to program (in some environment). When Wesley Aptekar-Cassels' There is no such thing as a static website looks forward to a future where there is a standard API between web servers and dynamic websites so that dynamic websites can be readily moved between providers, it's still seeing a future where you have to program to create your portable but dynamic website. You can't get away from this; it's intrinsic in the extra freedom that being dynamic gives you. Either you have to write code or you have to sort through other people's code to find something that does what you want.
In a sense, static websites have already done this sorting and arrived at a feature set for you. But it happens to be a feature set that's both fairly simple and fairly general, one that's proven to be quite adaptable in practice over the time the web has been here.
Another aspect of this is when you need the expertise. If you have only static HTML and CSS, with no JavaScript, you put in essentially all of the expertise and work up front, in creating the HTML and CSS. Once created, it's static and won't actively need changes (although you may want to change the look every so often even if the content doesn't get modified). All of the ongoing work lives on the other side of the static website split, in maintaining the web server itself. Pragmatically, this helps a lot to enable low-attention websites, ones that you can mostly ignore for long lengths of time and don't have to constantly look after.
(If you run the web server itself you do have to keep up with things there. But there are ways of offloading this work to someone else, and there are lots of providers of the service.)
2023-07-14
The theory versus the practice of "static websites"
A while back I read Wesley Aptekar-Cassels' There is no such thing as a static website (via) and had some hard to articulate feelings about it. As I read that article, Aptekar-Cassels argues that there is less difference between static and dynamic websites because on the one hand, a static website is more dynamic and complicated than it looks, and on the other hand, it's easier than ever before to build and operate a dynamic web site.
This is the kind of article that makes me go 'yes, but, um'. The individual points are all well argued and they lead to the conclusion, but I don't feel the conclusion is right. Ever since I read the article I've been trying to figure out how to coherently object to it. I'm not sure I have succeeded yet, but I do have some thoughts (which I'm finally pushing out to the world as this entry).
The first thought is that in practice, things look different on a long time scale. The use of static files for web content has proven extremely durable over the years. Although the specific web servers and hosts may have changed, both the static file content and the general approach of 'put your static files with .type extensions in a directory tree' have lived on basically since the beginning of the web. One pragmatic reason for this is that serving static files is both common and very efficient. Since it's commonly in demand even in dynamic websites, people who only have static files can take advantage of this. Being common and 'simple' has meant that serving only static content creates a stable site that's easy to keep operating. This is historically not the case with dynamic websites.
The second thought is that one reason for this is that static websites create a sharp boundary of responsibilities with simple, strong isolation. On the one side is all of the complexity of the static web server (which, today, involves a bunch of dynamic updates for things like HTTPS certificates). On the other side are those static files, and in the middle is some filesystem or filesystem-like thing. What each side needs from the other is very limited. Any environment for dynamic websites necessarily has no such clear, small, and simple boundary between the web server and your code, and on top of that we're unlikely to ever be able to standardize on a single boundary and API for it.
(This is not an accident; the web was, in a sense, designed to serve static files.)
As a result, I believe it will always be easier to operate and to provide a static files web server than a dynamic web server and its associated environment. In turn, this makes static files web servers easier to find. This leads to the durability of static websites themselves; they're easier to operate and easier to re-home if the current operator decides to stop doing so. Or at least it leads to the durability of moderate sized or small sized static websites, ones that can fit on a single server.
This leads to my final thought, which is that the distinction between static and dynamic websites is not blurry but is in fact still quite sharp. The distinction is not about how much work is involved in building and operating the site, or how much changes on it on a regular basis (such as HTTPS certificate renewal). Instead, the distinction is about where the boundaries are and what the sides have to care about. A static website has two clear sides and draws a sharp boundary between them that allows them to be quite independent (even if they're operated by the same people, the two sides can be dealt with independently). A dynamic website has no such sharp boundaries or clear sides, although you can create them artificially by drawing lines somewhere.
2023-07-08
The HTTP status code for a web server's default "hello" front page
In a comment on my entry on how web servers shouldn't give HTTP 200 results for random URLs, Jonathan reported something that I find fascinating:
This reminds me of a personal bugbear with the RHEL httpd package, which is the inverse situation: OOTB it’s configured to serve a “hello” page on / via an error handler, so you get an error code for a success.
I personally find this fascinating and can't really vote against it (in contrast to Jonathan). To me, it raises the interesting question of whether a web server's default 'hello I am <X>' front page should actually exist, in the sense of what HTTP status code it should use.
On the one hand, the front page is there and there's often some traditional content to it (announcing the web server, host OS, and so on, although how wise that is these days is an open question). On the other hand, no one has actually set up this front page; the web server is mostly showing it to be friendly, especially in a completely stock configuration as installed by a package manager (where everyone can assume that the configuration itself is working). Since no actual person has deliberately set up the front page, I can see an argument that the right HTTP response code is a 404 not found. In the sense of deliberate content put there by the website operator, there is no front page.
As with other HTTP error codes, the real answer is that one should probably use whatever status code is most convenient. On the one hand, the returned HTTP status code mostly doesn't matter to browsers and thus the people using them; most browsers just display the HTML of the HTTP error page with no UI indication of the actual status code. On the other hand, the HTTP status code does matter (sometimes a lot) to programs that hit the URL, including status monitoring programs; these will probably consider their checks to fail if the web server returns a 404 and succeed if it returns a 200. If you're pointing status checking programs at the front page of your just set up web server to make sure it's up, probably you want a HTTP 200 code (although not if the real thing you're checking is whether or not the web server and the site have been fully set up).
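(Here's a minimal sketch in Go of the sort of check such a monitoring program might make, with the URL as a placeholder; anything other than a HTTP 200 counts as 'down', no matter how friendly the returned HTML is.)

package main

import (
    "fmt"
    "net/http"
)

// frontPageUp reports whether the front page answered with a HTTP 200.
// The HTML of the response is irrelevant to this kind of check.
func frontPageUp(url string) bool {
    resp, err := http.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}

func main() {
    fmt.Println(frontPageUp("http://www.example.org/"))
}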
(The actual default front page behavior of various web server setups is something I'd probably never count on. All of our web servers have specifically created front pages, even if the front page just says 'there's nothing here'. These days I'd only leave a default front page in place if I was creating some sort of honeypot web server where I wanted to lure attackers in with the prospect of an un-configured server.)
2023-07-05
The mere 'presence' of an URL on a web server is not a good signal
There are a variety of situations where you (in the sense of programs and systems) want to know if a web server supports something or is under someone's control. One traditional way is to require the publication of specific URLs on the web server, often URLs with partially random names. The simplest way to implement this is to simply require the URL to exist and be accessible, which is to say that fetching it returns a HTTP 200 response. However, in light of web server implementations which will return HTTP 200 responses for any URL, or at least many of them, this simple check is clearly not sufficient in practice. The mere 'presence' of a URL on a web server proves very little.
If you need to implement this sort of protocol, you should require the URL to contain some specific contents. Since web servers may echo some or all of the URL and any attribute of the HTTP request into the synthetic page they helpfully generate for you on such phantom URLs, the specific contents you require shouldn't appear in any of those. It's probably safe to deterministically derive them from some of what you send, although complete independence of the URL, the HTTP request, and the required contents is best.
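(Here's a minimal sketch in Go of the checking side. The URL and token in the example are made up, and the token is assumed to have been generated independently of both the URL and the HTTP request, so a web server that merely echoes things back can't pass by accident.)

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

// proveControl fetches the challenge URL and requires the body to be
// exactly the expected token; a mere HTTP 200 is not enough.
func proveControl(url, expected string) (bool, error) {
    resp, err := http.Get(url)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return false, nil
    }
    // Read a bounded amount; a real token is short.
    body, err := io.ReadAll(io.LimitReader(resp.Body, 4096))
    if err != nil {
        return false, err
    }
    return strings.TrimSpace(string(body)) == expected, nil
}

func main() {
    ok, err := proveControl("http://www.example.org/.well-known/check/rAnd0mName", "sOmeOtherRand0mT0ken")
    fmt.Println(ok, err)
}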
I don't think any existing, widely used 'prove something on the web server' protocol uses mere URL presence, so this is an oversight that's more likely to be made in locally developed systems. For example, the ACME TLS certificate issuance protocol requires that some additional data be returned in the response, and I believe it implicitly requires that nothing else be returned (see section 8.1 and section 8.3).
Production web servers that are intended to serve real data are probably not going to be vulnerable to this sort of issue. The danger is more likely to come from systems and devices that are running web servers as an incidental effect of allowing remote management or using HTTP as a communication protocol (as in the case of Prometheus's host agent). However, there are a variety of ways people might be able to exploit such devices in combination with presence-based protocols. Plus there are accidents, where some auto-checking program decides that some host has some capability just because it seems to have some URL active.
(This feels obvious now that I've written it out, but until I ran into the issue recently I might have made this mistake if I was designing some sort of simple HTTP probe check.)
2023-07-04
Web servers should refuse requests for random, unnecessary URLs
We periodically check our own networks with an (open source) vulnerability scanner, whose rules get updated from time to time. Recently a scan report lit up with a lot of reports to the effect of 'a home directory is accessible via this web server' for our machines. The web servers in question were all on port 9100, and the reason the security scanner triggered this alert is that it could successfully request a couple of URLs like '/.bash_history' from them.
As you might guess, this is a false positive. On our machines, TCP port 9100 is where the Prometheus host agent listens so that it can be scraped by our Prometheus server, and it definitely wasn't serving anyone's home directories (although the host agent is a HTTP server, because HTTP is basically the universal protocol at this point). What was happening instead is that the Prometheus host agent's HTTP server code will give you a HTTP 200 answer (with a generic front page) for any URL except the special URL for its metrics endpoint. Since the security scanner asked for various URLs like '/.bash_history' and got a HTTP 200 response, it decided each of the machines it checked on port 9100 had that vulnerability.
Neither party is exactly wrong here, but the result is not ideal. Given that security scanners and other things like them aren't uncommon, my view is that web servers should try to be more selective. A web server like this can actually be selective without even changing the HTML served; all it would need to do is only give a HTTP 200 response for '/' and then a 404 (with the same HTML) for everything else that it answers with the generic front page. This would have the same functional result (visitors would get a page with the URL of the metrics endpoint), but avoid false positives from security scanners and anything else poking around.
(In practice, web browsers and people mostly don't care about or notice the HTTP return code. The browser presentation of the HTML of a HTTP error page is generally identical to the presentation of the same HTML from an URL that had a HTTP 200 success reply.)
Ideally, the APIs of web service libraries would make it easy to do this. Here, the Go net/http ServeMux API is less than ideal, since there's no simple way to register something that handles only the root URL and not everything under it. Instead, your request handler has to specifically check for this case (as covered in the example for ServeMux's Handle() method).
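(Here's a sketch of that check in Go, with the page text and port invented for the example; the handler is registered for '/', which matches everything, but it demotes anything that isn't exactly the root URL to a HTTP 404 while still serving the same generic page.)

package main

import "net/http"

const frontPage = `<html><body><p>Metrics are at /metrics.</p></body></html>`

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path != "/" {
            // Same generic HTML, but an honest 404 for phantom URLs like
            // /.bash_history, so scanners don't think they really exist.
            w.WriteHeader(http.StatusNotFound)
        }
        w.Write([]byte(frontPage))
    })
    http.ListenAndServe(":9100", mux)
}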
PS: Security scanners and other tools could adopt various heuristics to detect this sort of situation and reduce false positives, but ultimately they're only heuristics, which means they'll always be incomplete and sometimes may be wrong. Dealing with this in the web server is the better way.