Web URL paths don't quite map cleanly onto the abstract 'filesystem API'

June 4, 2022

Generally, the path portion of web URLs maps more or less onto the idea of a hierarchical filesystem, partly because the early web was designed with that in mind. However, in thinking about this I've realized that there is one place where web paths are a superset of the broad filesystem API, and this difference causes some amount of heartburn and differing design decisions in web servers when they serve static files.

The area of divergence is that in the general filesystem API, directories don't have contents, just children. Only files have contents. In web paths, of course, directories very frequently have contents as well as children (if anything, a web path directory that refuses to have contents is rarer than one that does). This is quite convenient for people using the web, but requires web servers to invent a convention for how path directories get their contents (for example, the 'index.html' convention).
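As a rough illustration of that convention, here is a minimal Python sketch of how a static file server might turn a URL path into a file to serve. The function name `resolve_static`, the `doc_root` parameter, and the single fixed `INDEX_NAME` are assumptions made for the example; real servers make the index name configurable and handle many more cases.

```python
from pathlib import Path
from typing import Optional

# Assumed convention for this sketch; real servers let you configure it.
INDEX_NAME = "index.html"


def resolve_static(doc_root: Path, url_path: str) -> Optional[Path]:
    """Map a URL path to a file under doc_root, or None if there is nothing to serve."""
    root = doc_root.resolve()
    candidate = (root / url_path.lstrip("/")).resolve()
    # Refuse anything that escapes the document root (for example via '..').
    if candidate != root and root not in candidate.parents:
        return None
    if candidate.is_dir():
        # A filesystem directory has no contents of its own,
        # so the server borrows the contents of a specially named child.
        candidate = candidate / INDEX_NAME
    return candidate if candidate.is_file() else None
```

With this sketch, both '/' and '/index.html' resolve to the same file, while a directory without an index file resolves to nothing (a real server might generate a listing or return an error at that point).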

(There's no fundamental reason why filesystem directories couldn't have contents as well as children; they just don't. And there are other environments with hierarchical namespaces where people not infrequently would like 'directories' with contents; one example is IMAP.)

One possible reason for this decision in web paths (other than user convenience) is the problem that the root of a web site would otherwise present. The root of a web site almost always has children (otherwise it's a very sparse site), so it must be a directory. If web directories had no contents in the way of filesystem directories, either the web root would have to be special somehow or people would have a bad experience visiting 'http://example.org/'.

(This bad experience would probably drive browsers to assume a convention for the real starting page of web sites, such as automatically trying '/index.html'.)

PS: Another reason for the 'decision' is that any specification would have to go out of its way to say that directories in web paths couldn't have contents and should return some error code if you requested them. Not saying anything special about requesting directories is easier.


Comments on this page:

By Michael at 2022-06-05 04:21:28:

Another problem would likely be that given just a URL, you can't reliably tell if something is intended to be a directory or a file.

Quick, now, given only client-side knowledge: does http://www.example.com/subscribe refer to a directory or to a file? So should a web browser first try /subscribe/index.html, or /subscribe? (Getting this wrong means you wasted a request.) If one fails, what type of request should be performed to the other? What if the request itself is large; say, a POST that submits a lot of content? What if it contains sensitive information (for some definition thereof)? What if it's one of those broken sites which returns HTTP 200 OK instead of HTTP 404 Not Found? What if instead of a 404 you get, say, a 503? And so on. This seems like it places an enormous burden on the client to always do the right thing, and puts a lot of unnecessary restrictions in place on what the server can do. (What if the server's file system can't represent the client's idea of the default page's file name?)

Moving the problem to the web server side conveniently sidesteps many of those issues, with the slight downside that there are now two ways to access the same content: http://www.example.com/ and http://www.example.com/index.html. Such path normalization can be configured pretty easily with conditional redirects; say, any received URL that ends in a directory index page file name (such as index.html) is redirected to the same path without that file name at the end. That very few people (even of the subset who still serve web sites with such files) seem to do something like this might be an indication that even though it can cause a slight search engine scoring penalty, this is not a big problem in practice.
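The redirect rule described above could be sketched roughly as follows in Python. The function name `canonical_redirect` and the `INDEX_NAMES` tuple are hypothetical, and the sketch assumes it is handed only the path portion of the URL, with no query string attached.

```python
from typing import Optional

# Assumed set of directory index file names; adjust to the server's configuration.
INDEX_NAMES = ("index.html", "index.htm")


def canonical_redirect(url_path: str) -> Optional[str]:
    """Return the path to redirect to, or None if url_path is already canonical."""
    head, sep, last = url_path.rpartition("/")
    if sep and last in INDEX_NAMES:
        # Drop the index file name but keep the trailing slash of the directory.
        return head + "/"
    return None
```

A server would answer with a redirect whenever this returns a path, and serve the request normally when it returns None.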

"There's no fundamental reason why filesystem directories couldn't have contents as well as children; they just don't."

Not in theory. But a filesystem has to exist in practice, and the list of children typically is the content of a directory. This was famously literally the case on early UNIX, where you would read() a directory like a file and deal directly with the raw on-disk format. Only later were getdents()/getdirentries() invented.
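To see the modern split between the two operations, here is a small Python sketch that tries both on the same path. The helper name `probe_directory` is made up for the example, and the exact exception is platform dependent (on Linux it is normally IsADirectoryError), so the sketch catches the general OSError.

```python
import os
from typing import List, Tuple


def probe_directory(path: str) -> Tuple[str, List[str]]:
    """Try to read a directory as bytes, then list its children."""
    try:
        with open(path, "rb") as f:
            f.read(1)
        read_result = "read succeeded"
    except OSError as exc:
        # Reading a directory as a byte stream is refused on current systems.
        read_result = "read refused: " + type(exc).__name__
    # Enumerating children goes through a separate API entirely.
    return read_result, sorted(os.listdir(path))
```

The point is only that contents and children are reached through different calls, and that a directory answers the second but not the first.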

A filesystem doesn’t have to implement its hierarchy mapping storage this way, of course. But historically most did, and that shaped OS APIs. I can’t think of any platform whose filesystem API does not model files and directories as distinct types of object where directories can have children but not content and files have content but not children. I suspect this now in turn makes it impractical to design a filesystem without designing in these constraints.

Of course a URL space does not impose them. Every URL can have content, and every URL can have children.
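A minimal Python sketch of such a namespace, where any node may carry content, children, or both, could look like this. The names `Node`, `put` and `get` are invented for the example, and the sketch deliberately ignores the difference between a path with and without a trailing slash.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # Unlike a filesystem object, one node may have both of these at once.
    content: Optional[bytes] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def put(root: Node, path: str, content: bytes) -> None:
    """Store content at path, creating intermediate nodes as needed."""
    node = root
    for part in (p for p in path.split("/") if p):
        node = node.children.setdefault(part, Node())
    node.content = content


def get(root: Node, path: str) -> Optional[bytes]:
    """Return the content stored at path, or None if there is none."""
    node = root
    for part in (p for p in path.split("/") if p):
        node = node.children.get(part)
        if node is None:
            return None
    return node.content
```

Here '/subscribe' can hold a page of its own while '/subscribe/confirm' exists beneath it, which is exactly what a files-and-directories model does not allow.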

But an upshot of all this is that filesystems have ls and the web doesn’t. The content of a URL does not have to bear any relationship to the set of its children. You may get a directory listing if you retrieve the URL, but you also may not; if you do get one, it may be in any format whatsoever; even if you know how to parse it, it may or may not be a complete list of the children available to you; and all this varies on a URL-by-URL basis, even within a single site. With filesystems you can have rather stronger expectations in this regard.

If you wanted to provide child enumeration in a URL space, you would have to allow for contents and children to be aspects that any object can have both of, and which can be examined independently. So basically you need another verb (for enumerating children) next to GET (which is for content), with a standardized response format… and before you realise it, you are in the middle of reinventing WebDAV.
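For comparison, WebDAV's extra verb is PROPFIND, and a request for a collection's immediate members can be sketched in Python as below. The helper name `build_propfind` is hypothetical, and the sketch only constructs the request without sending it; a WebDAV server would be expected to answer with a 207 Multi-Status XML document.

```python
import urllib.request


def build_propfind(url: str) -> urllib.request.Request:
    """Build (but do not send) a WebDAV request that enumerates a URL's children."""
    return urllib.request.Request(
        url,
        method="PROPFIND",
        # Depth: 1 asks about the collection itself plus its immediate members.
        headers={"Depth": "1"},
    )
```

A plain GET on the same URL would still return whatever content the server chooses, independently of this listing.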

By Michael at 2022-06-06 04:49:51:

"I can’t think of any platform whose filesystem API does not model files and directories as distinct types of object where directories can have children but not content and files have content but not children."

Unfortunately, I don't think the distinction is that simple. This model breaks down pretty quickly once you consider such things as NTFS alternate data streams or Linux extended file attributes, which are both, in a sense, "children" of files. Even HPFS, per Wikipedia, could store up to 64 KiB of extended attributes for a file.

That said, they are usually yet another type of file system entry, managed through their own set of system calls, and as such are easy to overlook in naive implementations, which is partly why they tend to see limited use.

"I don't think the distinction is that simple."

In general, sure. For the purposes of mapping between a URL space and a user-controlled filesystem, though, only what you can express via path passed to open() and which the user can freely manipulate by e.g. rename() is relevant. As you say, that excludes complications to the model like alternate streams and extended attributes.

Note that the model is not that simple: it admits other types of object besides files and directories, e.g. device nodes, sockets, (sym)links, etc. The only point of simplicity is really that it doesn’t allow non-directory objects to have arbitrary child objects addressable by path. All filesystems seem to adhere to that – and maybe not without a reason: e.g. what would ensue if a symlink to a directory could itself have its own children?

By JRS at 2022-06-11 11:31:30:

Another well-known example of a hierarchical system where directories have data is the DNS.

In fact it may be better to say that such systems don't have directories. There is probably a CS paper on what data structures for hierarchies require directories.

Written on 04 June 2022.

Last modified: Sat Jun 4 21:13:17 2022