Wandering Thoughts archives

2013-08-31

HTML quoting as I currently understand it

Since I was just doing some work with DWiki where I needed to refresh my memory of this, I want to write down what I know, remember, and have worked out before I forget it again. First off there are effectively three areas where you (or at least I) want to quote and escape text in HTML:

  1. When outputting things that are not supposed to be interpreted as HTML, such as in a form <textarea> or just in any situation where they are supposed to be plain text even if a user gave you funny characters.
  2. When embedding things into attribute values, such as the initial or current value of form <input> elements.
  3. In the special sub-case of putting a URL into a link (where you embed it as the href attribute value).

The first two cases must use HTML character entities to escape a number of dangerous characters. In theory which characters you need to escape varies by context; in practice you might as well have a single function that escapes the union of what you need because over-escaping things doesn't hurt (the browser will happily convert everything back). My current belief is that the maximal escaping is to encode &, <, >, ', and " as character entities.
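
To make this concrete, a maximal escaping function along these lines might look like the following Python sketch (my own illustration, not DWiki's actual code; the name is made up). Python's own html.escape() does essentially the same thing.

    # Sketch of a maximal HTML escaper: encode the union of dangerous
    # characters (&, <, >, " and ') as character entities. Mapping character
    # by character avoids the ordering problems of chained str.replace().
    _ENTITIES = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',   # &apos; is not defined in HTML 4, so use the numeric form
    }

    def html_escape(s):
        return ''.join(_ENTITIES.get(c, c) for c in s)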

(DWiki effectively has two HTML escaping functions. One is a minimal one for large scale use in rendered DWikiText (where excessive escaping bulks up the HTML and makes it look bad) and the other one is a maximal one for small-scale use in other contexts.)

Escaping URLs is complicated because it depends on how much escaping you can assume has been applied to the URL before it was handed to you and that is effectively a social question. In general use I assume that the URL I've been handed is in a shape where it could be pasted into a browser's location bar and work, which means that it has been %-encoded to some degree and any remaining characters with special meaning in URLs (like ?, &, =, +, and #) are supposed to be there. At that point I want to entity-encode & and %-encode ", ', and > (the latter to be friendly).

(The full list of things you must or should %-escape in URLs is much longer. If you are neurotic it includes things like ~. & must be entity-encoded instead of %-encoded because %-encoding it would remove its special meaning in URLs.)
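
As a sketch of what this works out to in code (my illustration, under the assumption above that the URL is already %-encoded to the degree its author intended):

    # Quote an already-mostly-encoded URL for embedding as an href value.
    # & is entity-encoded so it keeps its special meaning in the URL;
    # ", ' and > are %-encoded so they can't break out of the attribute.
    def quote_url_for_href(url):
        out = []
        for c in url:
            if c == '&':
                out.append('&amp;')
            elif c == '"':
                out.append('%22')
            elif c == "'":
                out.append('%27')
            elif c == '>':
                out.append('%3E')
            else:
                out.append(c)
        return ''.join(out)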

A URL should not be subject to this encoding until you are actually embedding it in a link. If you have a form field where people enter and re-enter a URL (for example a 'what is your website?' field in a comment form) you want to do HTML entity (form) encoding on it. The reason is that HTML entity encoding is reversible in forms; if you entity-encode something, put it in a form, and then the form is resubmitted you will get back exactly what you originally encoded. If you %-encode something this does not happen.
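
A small demonstration of the difference, using Python 3's html and urllib.parse modules on a made-up URL:

    import html
    import urllib.parse

    url = 'http://example.com/?a=1&b=2'

    # Entity-encoded for a value="..." attribute: the browser decodes this
    # when it parses the form, so the user sees and resubmits the original.
    print(html.escape(url, quote=True))
    # -> http://example.com/?a=1&amp;b=2

    # %-encoded: the browser has no reason to undo this, so the user sees
    # and resubmits the mangled version.
    print(urllib.parse.quote(url))
    # -> http%3A//example.com/%3Fa%3D1%26b%3D2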

(If you are showing a URL as plain text I think it depends on where the URL comes from and what use you expect people to make of it. If you are just showing a user-entered URL to them I would entity-encode it so that the browser shows it to them exactly as they entered it. If you expect them to copy it and actually use it, %-encode things.)

Sidebar: But what if people give you URL paths with funny characters?

If you have to worry about things like a % or a ? appearing in the URL path (where it should be %-escaped so that it isn't interpreted as separating the query parameters from the path) my opinion is that you need an API that clearly separates the components of a URL and leaves it to you to glue them back together. At this point you can %-encode away to make sure that the browser interprets everything exactly right.
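
For illustration, a component-based approach might look like this sketch (the function and its arguments are invented for the example, not an existing API):

    import urllib.parse

    def build_url(path_segments, query=None):
        # %-encode each path segment separately so that a '%', '?' or '/'
        # inside a segment can't be misread as URL structure.
        path = '/' + '/'.join(urllib.parse.quote(seg, safe='')
                              for seg in path_segments)
        if query:
            # urlencode %-encodes the keys and values for us.
            return path + '?' + urllib.parse.urlencode(query)
        return path

    # A path with a literal '?' and '%' in its segments:
    print(build_url(['dir', 'odd?name', '50%off'], {'page': '2'}))
    # -> /dir/odd%3Fname/50%25off?page=2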

If you get the URL as a single blob, the only sane way to go is to assume that it is basically correctly formatted apart from some stray characters that you may need to quote mostly for convenience. Doing anything else requires heuristics and guesswork.

HTMLQuoting written at 01:13:31

2013-08-30

I'm done with feeling guilty about using HTML tables for layout

Just over six years ago I wrote that using CSS instead of tables was a hack. As far as I can tell nothing has fundamentally changed since then; using CSS to create flexible, powerful column and row based layouts is still awkward and a hack. The promised high level CSS layout support is not particularly closer to materializing in real browsers now than it was then.

(One difference is that you can apparently now fake it with JavaScript, just in case you would like to have two baroque hacks instead of one.)

I don't know about anyone else, but I've spent a certain amount of the last six years feeling vaguely guilty when I periodically resorted to using tables for column-based layouts. As of now, I'm done with that. Until CSS gets its act together (if ever) tables are going to be my first and guilt-free choice for any grid-based things I need, all the way from just lining up form labels and form fields to full-page layout. And if the tables need additional styling I'm going to add bits of CSS without even thinking twice.

Ultimately what it comes down to is simple: HTML tables make it easier to create grid-like layouts and they work better than at least basic CSS. Letting the ghostly voice of CSS get to me about this is stupid and counterproductive (in that it pushes me away from better, more usable designs).

(As a side note, I'm not at all interested in recreating <table>, <tr>, <td> and so on as suitably marked up <div> and <span> elements plus CSS. I already have the table HTML tags and they work fine. Carefully redoing them in CSS just so that I can say I'm using <div> instead of <table> strikes me as the height of something under almost all circumstances.)

Sidebar: why the 'obvious' CSS solutions are not suitable

It's relatively easy to construct a CSS-based grid layout if you force specific column and/or row sizes (in points, pixels, em, or your absolute unit of choice). I refuse to do that for two reasons. First, forced sizing has various bad side effects; my absolute requirement for a grid-based layout is that the grid spacing flexes in and out based on the browser size. Second, absolute sizing is a pain in the rear to specify and to test. The advantage of a flexible grid is that I don't need to worry about more than a bit of this.

(The lack of worry is especially acute when I have very simple design rules like 'split the space in half' or 'just line up the left edge of all of these things'.)

NoMoreTableGuilt written at 00:30:05

2013-08-14

The pragmatics of an HTTP to HTTPS transition

It started on Twitter:

@thatcks: All things considered I've decided it's time my personal website went not just https-available but all-https (with redirects from http).

@zaitcev: wait a moment, didn't you write on your blog how evil it was to redirect http?

Pete Zaitcev is quite correct; I wrote about the issue back in this entry, yet here I am redirecting everything from HTTP to HTTPS myself. There are two answers here. I'll start with the long, rambling one.

The long answer starts with the fact that I'm not doing this for security. There's almost nothing 'secure' on my personal website and right now the only person who can do anything security related on it is me (and I can move myself to HTTPS directly). I'm doing this for privacy because I feel like making a point, however pointless it is on the grander scale of things.

Of course it would still be a good idea to not redirect from HTTP if you really care about privacy; every HTTP request that gets redirected tells what is now an entirely non-hypothetical eavesdropper the URL that your visitor was requesting and often things like referers. But this is where we run into pragmatic issues, namely backwards compatibility: there are a certain number of HTTP URLs for my site out there and I would like to not break them. Certainly not right away and probably not ever (because cool URLs don't go away).

Security (in the broad sense) is always about tradeoffs. The most secure, most privacy-enhancing option today would be to move my personal website to being a Tor hidden service with no automatic redirection, but as a side effect this would reduce my traffic to essentially nil. The most clearly usable option would be to continue using HTTP and never mind any (quixotic) privacy concerns on behalf of my few visitors. Somewhere in the middle is the right balance between security (in the form of privacy) and real usability by the visitors that I care about. This balance may be different for every situation and thus for every website.

(Part of what makes it different is what will be revealed about your visitors by this sort of initial traffic analysis. Revealing that they go to your login form is a lot different than revealing that they are trying to look up potentially sensitive things.)

Now it's time for the short one: once people have made a HTTP request it's too late for full security. It doesn't matter what reply your web server gives for the request because a non-hypothetical eavesdropper has already seen the requested URL and other associated data (including at least some of the POST body if any). If they care enough they can reverse engineer what your visitor was trying to visit even if your web server denies the URL's (HTTP) existence; this is especially the case if your web server is willing to confirm that the HTTPS version of the URL actually exists.

In short, the real purpose of refusing to redirect HTTP requests is to force people to stop making them in the first place. It adds no real security until (and unless) people do this.

(It follows that the really secure approach is to shut off your HTTP site entirely; don't even have a web server responding on port 80. If people can't connect they can't send a HTTP request to be snooped on.)

Once people have blown a certain amount of privacy by making that initial HTTP request, how your web server responds is partly a pragmatic question of how effective not redirecting them is going to be in getting rid of those inbound HTTP requests versus how usable you want to be.

In my case my guess is that almost all inbound requests will shift to HTTPS soon even if I do a friendly HTTP to HTTPS redirection and the remaining amount of inbound requests will basically never shift. The requests that will shift come from search engine lookups, links in my syndication feeds being followed, and new links that I and other people tweet, link to, and so on. The requests that won't shift are from the existing links on Twitter, in other people's blog entries, and so on. Even if I broke those links they would be unlikely to go away and to stop generating inbound requests. So on the whole I might as well redirect them; the privacy leak has already happened by the time that my webserver can do anything (assuming I keep a HTTP site at all).

PragmaticHTTPtoHTTPS written at 01:17:20

2013-08-05

Who or what your website is for and more on HTTP errors

Aristotle Pagaltzis commented on my entry about the pragmatics of HTTP errors and I want to reply to a few things.

First off, I want to say that I fully agree with Aristotle's characterization that the real question to ask is what the practical effects of using any particular status code will be. This is an excellent way of putting it and if I'd been clever enough to think of it I would have framed my entire (accidental) series around this question.

    A "website used by people" is really a discoverable, self-documenting web API. The distinction isn't between whether they are used by people or programs, it is really only whether the primary generated content format is more human- or more machine-friendly (HTML vs JSON, say).

I disagree with this on a philosophical basis. A website used by people can be treated as a discoverable web API but it is not one; it has not been designed as one and it probably won't evolve as one. To put it one way, people will read and machines won't. A real web API needs machine parseable results (including HTTP error codes), stability, versioning, and a bunch of other things. A website designed for people is unlikely to have those (for good reasons).

(Yes, search engines parse HTML and that's a good thing. But I think that this is worlds away from an actual API.)

I think that this distinction is important to draw because it drastically shapes how your web application responds to errors (at least for general errors). If you really are creating an API then you need to somehow make the responses machine-parseable and unambiguous, which may even require making up your own new HTTP error codes (or the less extreme version of embedding an additional status header with more details in the HTTP response). If you're creating a web application for people what matters is what people will read; actual HTTP error codes are important only for their effects (if any) on caches, web crawlers, and so on if you care about any of those.
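
As an illustration of the 'additional status header' approach for the API case (a sketch only; the header name and detail codes here are invented):

    import json

    def api_error(start_response, status, app_code, message):
        # Send a generic HTTP status but put the machine-readable detail
        # in an extra header and a structured JSON body.
        body = json.dumps({'error': app_code, 'message': message}).encode('utf-8')
        start_response(status, [('Content-Type', 'application/json'),
                                ('X-App-Error', app_code),
                                ('Content-Length', str(len(body)))])
        return [body]

    # e.g. return api_error(start_response, '403 Forbidden', 'robot-suspected',
    #                       'requests that look automated are not allowed')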

(You may not. A web application that is used over HTTPS and only interacts with authenticated users makes caches and web crawlers irrelevant.)

In response to my hypothetical of getting a DELETE for a non-existent URL when your application doesn't even support DELETE, Aristotle gave the web-standards-correct answer:

    There is no resource at the URL in the request, so 405 would be sort of perverse to respond with. I'd argue against 403 on similar grounds but couldn't really object to it. Between 404 and 501 it's a toss-up though.

Here is where web standards run into security engineering. If your application doesn't support DELETE at all, the wise thing to do is to reject all DELETE requests out of hand before you attempt to parse the URL, decode query arguments, and so on. Often this is also by far the easiest thing. There is also a strong argument that the specific error code chosen (and the text that accompanies the HTTP response) should be as uninformative as possible, since anyone who tries a DELETE against your web app is trying a destructive operation that you do not support at all.
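
As a sketch of what rejecting such requests up front can look like, here is a minimal WSGI-style gate (the method list and the deliberately terse response are my own choices, not DWiki's):

    ALLOWED_METHODS = {'GET', 'HEAD', 'POST'}

    def method_gate(app):
        # WSGI middleware: refuse unsupported methods before the wrapped
        # application ever looks at the URL or the request body.
        def wrapper(environ, start_response):
            if environ.get('REQUEST_METHOD') not in ALLOWED_METHODS:
                # Deliberately uninformative; we owe no explanation to
                # someone trying DELETE against an app that never supports it.
                start_response('405 Method Not Allowed',
                               [('Allow', 'GET, HEAD, POST'),
                                ('Content-Type', 'text/plain')])
                return [b'no\n']
            return app(environ, start_response)
        return wrapper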

In general when web clients are attempting something which you don't support, have never advertised, and that can't be an innocent mistake I think that you have almost completely free license to do whatever is convenient from a programming or security standpoint. People who are trying to rattle the doorknobs or even kick the door in do not get any courtesies.

HTTPErrorsAndWebsitePurpose written at 22:28:50

2013-08-03

The paucity of generally useful HTTP error codes

One of the things that I didn't appreciate until I really looked at HTTP error codes is how few generally useful ones there are. To start with, we can divide HTTP error codes into two categories: specific technical failings and general errors. Specific technical failings are things like an Accept: header that the server can't satisfy. There's a bunch of 4xx errors for these cases (and a few 5xx errors), but they aren't useful in general since you're only supposed to generate them in specific technical circumstances.

Once you get into the officially specified general errors, though, there simply aren't that many: 403 Forbidden, 404 Not Found, 410 Gone, 429 Too Many Requests, and maybe 451 Unavailable For Legal Reasons (if you accept Internet drafts) and 400 Bad Request (if you stretch it). On the server error side, 500 Internal Server Error and 501 Not Implemented are basically it. Of the 4xx errors, only 403 and 404 are really general.

(It's striking how many unofficial HTTP error codes there are in the Wikipedia list. Apparently a lot of people have found the current set inadequate.)

This limited set of error responses means that a web application can't really tell clients very much about what went wrong using error codes alone (at least officially assigned ones). Consider, for instance, a web application that both has access-restricted content and blocks certain clients from some or all content. HTTP error codes alone provide no real way to distinguish between 'you can't have this content because you aren't properly authenticated' and 'you can't have this content because I think you're a robot and robots shouldn't be asking for this' (especially if the web app also has rate limiting and so uses 429).

This has wound up tied into my feeling that specific HTTP errors may not matter that much. If the available HTTP error codes are too limited to really communicate what you mean to the client, your choice of what specific error code you use from the limited general-use set is not necessarily very important.

Sidebar: technical failings versus general errors

I've realized that I draw a big, personal distinction between these two that doesn't necessarily exist. I consider technical failings to be the job of the web server (and the framework if any) to worry about and I basically ignore them when writing an application. The errors I care about are general errors.

Thus I need to clarify and effectively walk back some stuff I said when I asked whether specific HTTP error codes mattered. Getting the error code right for specific technical failings does matter (at least in theory). I was intending to focus on general, application-level errors, but I both didn't make that clear and didn't appreciate just how many 4xx errors there are for technical failings until I'd looked at a list.

HTTPErrorPaucity written at 23:34:51

The pragmatic issues around HTTP error codes mattering (or not)

When I posed the question of whether specific HTTP error codes actually mattered I put the question rather abstractly. But it's really a pragmatic question which I can put this way: how much effort is it worth putting into a web application that is used by people to get your HTTP error codes exactly and completely 'right'?

(I'm biased towards the server perspective because that's what I write, but if you write HTTP clients there's a mirror image question of how much sense it makes to write code that takes action based on fine distinctions in the response codes you get.)

I'll use DWiki (the ball of code behind my techblog) as an example. Broadly speaking DWiki can generate what are conceptually errors when people feed it all sorts of bad or mangled requests, when they ask for URLs that do not exist, when there are internal errors in page rendering (such as a bad template), when they are people we don't like, and when they don't have permission to access the URL. Today DWiki responds to almost all of these situations with either a 403 or a 404 status code, and some permission failures don't generate errors at all (instead you get a page with a message about it). Usually (but not always) DWiki generates 404 errors if the problem is something that could plausibly happen innocently and 403 errors otherwise.

(It would be nice to generate 403 errors for all permission denied situations but DWiki's architecture makes it quite hard for reasons that don't fit in the margins of this entry.)

Could DWiki do 'better', whatever that means? Perhaps. It could use error codes 400, 405, and maybe 505 in some situations, but these are all around the edges. Some uncommon issues should perhaps produce some 5xx error instead of a 404 because they are really a server-side problem with the page.

(DWiki also punts completely on a lot of semi-impossible situations. For example it assumes that either all clients can handle any format it wants to return or that the real web server will worry about checking for this and generating 406 errors when appropriate.)

I could go through every error generation in DWiki along with the list of HTTP status codes and try to match up each error with the closest match (which is often not very close; there are very few HTTP codes for relatively generic errors). But it very much appears to me that this would be both a lot of work for very little gain and also quite subjective (in that what I think is the right error code for a situation might not match up with what someone else expects).

Note again that this is for typical web apps and websites that are used by people instead of web APIs that are used by programs. Web APIs need to have as much thought put into error codes as any API needs into its error responses. But again, the best way to communicate most errors to API clients may not be through the very limited channel of HTTP error codes.

Sidebar: the subjectivity of HTTP error codes

Here are three examples of what I mean by subjectivity in HTTP error code choice.

  • A client requests a non-directory page but puts a '/' on the end of the URL (eg '/page/' instead of '/page'). Is this 400, 404, ignored (with the plain page being served), a redirection to the slashless version, or something else?

  • The server validates pages in some way before serving them to clients and a requested page exists but fails to validate. Is this 500, 503, 404, or something else?

    (Note that 404 is basically the generic error response and is documented as such in the HTTP/1.1 RFC.)

  • A client does a DELETE for a URL that doesn't exist and your web app doesn't support DELETE; in fact it only does plain old HEAD, GET, and maybe POST. Bearing in mind that this is extremely unlikely to be just an innocent mistake, is this 405, 501, 404, 403, or something else?

If you poll a bunch of web developers I think you will get a bunch of different answers for all of these.

PragmaticHTTPErrorCodes written at 02:28:37

