2009-05-20
Why directory URLs have to have trailing slashes
Web servers and many web applications are quite insistent that URLs that represent 'directories' have to end in a slash; if you request the URL without the trailing slash, they just give you a redirection to the same URL with a slash on the end instead of the actual content. One might well wonder why they have this neurotic insistence, especially when it complicates URL rewriting for applications that can do whatever they want with incoming URLs anyways.
(The other side of this question is whether web applications have to care about this, and if they do, why.)
The answer is that it's required in order to correctly resolve any
relative URLs in the document. Consider a 'directory' page /a/b that
contains the relative link <a href="c.html">. If the URL of the
page the browser is dealing with doesn't have the slash (if it is just
/a/b), the browser will make the relative link point to /a/c.html,
but if the URL has a slash on the end (if it is /a/b/), the same link
will point to /a/b/c.html. Presumably only one of these is correct and
intended.
(The official source for this whole process is RFC 3986, which obsoleted RFC 2396, which in turn updated the original RFC 1808.)
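The difference is easy to see with a quick sketch in Python, whose standard urljoin() follows the RFC resolution rules. The example.com host is just a placeholder I made up; the paths are the ones from the example above.

```python
from urllib.parse import urljoin

# Hypothetical host; only the paths matter here.
base_without_slash = "http://example.com/a/b"
base_with_slash = "http://example.com/a/b/"

# Without the trailing slash, 'b' is treated as a file name and is
# replaced by the relative link.
without_slash = urljoin(base_without_slash, "c.html")

# With the trailing slash, 'b/' is treated as a directory and the
# relative link is resolved inside it.
with_slash = urljoin(base_with_slash, "c.html")

print(without_slash)  # http://example.com/a/c.html
print(with_slash)     # http://example.com/a/b/c.html
```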
Whether this matters to your web application depends on what sort of links you generate. If you always use absolute paths, then you don't need to care; you can ignore the situation and give people the same contents regardless of the presence or absence of the trailing slash. If you do use relative links, then you need to notice the situation and either force the redirection or generate slightly different content.
(I would suggest forcing the redirection on the grounds that it is less confusing to both Google and users; otherwise you have two URLs that are the same thing.)
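As a rough illustration of forcing the redirection, here is a minimal sketch of the check an application might make. The function name and the is_directory flag are hypothetical; how your application decides that a URL is a 'directory' is up to it, and I'm assuming the path has already had any query string split off.

```python
def trailing_slash_redirect(path, is_directory):
    """Return the path to redirect to, or None to serve the page as-is.

    path is the URL path without any query string (an assumption of
    this sketch); is_directory is whatever the application uses to
    decide that the URL represents a 'directory'.
    """
    if is_directory and not path.endswith("/"):
        return path + "/"
    return None


# A 'directory' requested without the slash gets redirected...
print(trailing_slash_redirect("/a/b", True))
# ...while the slash version, and ordinary pages, are served directly.
print(trailing_slash_redirect("/a/b/", True))
print(trailing_slash_redirect("/a/b/c.html", False))
```

The application would then send an HTTP redirect to the returned path instead of generating the page.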
(This is one of those entries that I write to tack things down firmly in my mind, after a co-worker had to remind me of all of this.)
2009-05-19
Some notes on rewrites in Apache .htaccess files
Since I keep rediscovering this every so often, here's what I know about
rewrite rules in .htaccess files so that I can just read it here the
next time around.
Some basics:
- you need a 'RewriteEngine on' statement, even if the rewrite engine is already on in the main configuration.
- the 'URLs' that you match against in RewriteRule are relative to the directory the .htaccess is in. However, Apache variables like %{REQUEST_FILENAME} that you use in RewriteCond are full paths, not paths relative to the directory. This makes sense, but does mean one has to keep track of it all.
Suppose that you want to have a 'directory' that is actually a CGI-BIN. There are two ways to do this:
- make an actual directory, and put a .htaccess in it that has:
  RewriteRule ^(.*)$ /cgis/my-cgi/$1 [PT]
  Apache itself will then handle generating a redirect for people who ask for the directory without the trailing slash; your CGI-BIN does not have to worry about it.
- put a .htaccess in the directory that is one level up. This should have something like:
  RewriteRule ^foo$ /cgis/my-cgi [PT]
  RewriteRule ^foo/(.*)$ /cgis/my-cgi/$1 [PT]
  Your CGI will have to generate the redirect when people ask for the directory without the trailing slash (or, well, do whatever you want with their requests); Apache won't do anything special for you.
It is common to implement the latter approach with a single rewrite rule:
RewriteRule ^foo(.*)$ /cgis/my-cgi/$1 [PT]
However, this is incorrect because it matches too much; it will send
any URL in that directory that starts with foo off to your CGI-BIN,
including things like a request for 'foobar'.
(You may not care about this. I do, partly because I don't like handing my CGIs URLs that they're not actually supposed to be handling.)
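To see the over-matching concretely, here is a small sketch that runs the patterns from the rules above against a few request paths. It only models the pattern matching and the textual $1 substitution, using Python's re module as a stand-in for Apache's regular expressions (fine for patterns this simple); the first_match() helper is something I made up for illustration, not anything Apache provides.

```python
import re


def first_match(rules, path):
    """Return the substituted target of the first matching rule, or None."""
    for pattern, template in rules:
        m = re.search(pattern, path)
        if m:
            captured = m.group(1) if m.groups() else ""
            return template.replace("$1", captured)
    return None


# The two-rule version and the tempting single-rule version.
two_rules = [
    (r"^foo$", "/cgis/my-cgi"),
    (r"^foo/(.*)$", "/cgis/my-cgi/$1"),
]
one_rule = [
    (r"^foo(.*)$", "/cgis/my-cgi/$1"),
]

for path in ["foo", "foo/x/y", "foobar"]:
    print(path, "->", first_match(two_rules, path), "|", first_match(one_rule, path))
```

With the two rules, 'foobar' matches nothing and is left alone; with the single rule it gets handed to the CGI.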
PS: the very similar looking destination '/cgis/my-cgi$1' is very
much not what you want; in fact, I believe that it's a security risk,
as I think it means that Apache can be tricked into running things like
'/cgis/my-cgi.old' with a suitable request.
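Here is a sketch of just the string substitution involved, to show how the missing slash lets the request pick the end of the CGI's name. It assumes the rule lives in the directory's own .htaccess (so the pattern is ^(.*)$) and that someone requests '.old' inside that directory; the substitute() helper is hypothetical and this says nothing about what Apache then does with the result.

```python
import re


def substitute(pattern, template, path):
    """Apply one rule's pattern and $1 substitution to path, textually."""
    m = re.search(pattern, path)
    if m is None:
        return None
    return template.replace("$1", m.group(1))


# Destination without the separating slash: the captured text is glued
# directly onto the CGI's name.
risky = substitute(r"^(.*)$", "/cgis/my-cgi$1", ".old")

# Destination with the slash: the captured text stays after the CGI's name.
safe = substitute(r"^(.*)$", "/cgis/my-cgi/$1", ".old")

print(risky)  # /cgis/my-cgi.old
print(safe)   # /cgis/my-cgi/.old
```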