HTML quoting as I currently understand it
Since I was just doing some work with DWiki where I needed to refresh my memory of this, I want to write down what I know, remember, and have worked out before I forget it again. First off there are effectively three areas where you (or at least I) want to quote and escape text in HTML:
- When outputting things that are not supposed to be interpreted as
HTML, such as in a form
<textarea> or just in any situation where they are supposed to be plain text even if a user gave you funny characters.
- When embedding things into attribute values, such as the initial
or current value of form
- In the special sub-case of putting a URL into a link (where you
embed it as the
The first two cases must use HTML character entities to escape a number of dangerous characters. In theory which characters you need to escape varies by context; in practice you might as well have a single function that escapes the union of what you need because over-escaping things doesn't hurt (the browser will happily convert everything back). My current belief is that the maximal escaping is to encode &, <, >, ', and " as character entities.
(DWiki effectively has two HTML escaping functions. One is a minimal one for large scale use in rendered DWikiText (where excessive escaping bulks up the HTML and makes it look bad) and the other one is a maximal one for small-scale use in other contexts.)
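As a rough illustration of the minimal versus maximal split, here is a sketch in Python using the standard library's `html.escape`. The function names `escape_minimal` and `escape_maximal` are made up for this example and are not DWiki's actual functions; in particular, exactly what the minimal version escapes is an assumption on my part.

```python
import html

def escape_minimal(text: str) -> str:
    """Hypothetical minimal escaper: only &, <, and > (enough for body text)."""
    return html.escape(text, quote=False)

def escape_maximal(text: str) -> str:
    """Hypothetical maximal escaper: &, <, >, ", and ' as character entities."""
    return html.escape(text, quote=True)

sample = '<a href="x">Tom & Jerry\'s</a>'
print(escape_minimal(sample))
print(escape_maximal(sample))
# Over-escaping is harmless: decoding the entities gives back the original text.
print(html.unescape(escape_maximal(sample)) == sample)
```

The maximal version is safe both in ordinary text and inside quoted attribute values, which is why it can serve as the single general-purpose function.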
Escaping URLs is complicated because it depends on how much escaping you can assume has been applied to the URL before it was handed to you and that is effectively a social question. In general use I assume that the URL I've been handed is in a shape where it could be pasted into a browser's location bar and work, which means that it has been %-encoded to some degree and any remaining characters with special meaning in URLs (like ?, &, =, +, and #) are supposed to be there. At that point I want to entity-encode & and %-encode ", ', and > (the latter to be friendly).
(The full list of things you must or should %-escape in URLs is much longer. If you are neurotic it includes things like ~. & must be entity-encoded instead of %-encoded because %-encoding it would remove its special meaning in URLs.)
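A minimal sketch of this link-time escaping, assuming the URL already works when pasted into a location bar. The name `escape_url_for_href` is invented for the example, and the function deliberately touches only the four characters discussed above.

```python
def escape_url_for_href(url: str) -> str:
    """Prepare an already-usable URL for embedding in a quoted link attribute.

    Hypothetical helper: existing %-escapes and URL-special characters
    (?, =, +, #) are assumed to be intentional and are left alone.
    """
    # %-encode characters that could end the attribute value or confuse HTML.
    for char, encoded in (('"', "%22"), ("'", "%27"), (">", "%3E")):
        url = url.replace(char, encoded)
    # Entity-encode & so it keeps its meaning as a query separator.
    return url.replace("&", "&amp;")

print(escape_url_for_href('http://example.com/s?q="a"&r=b>c'))
# http://example.com/s?q=%22a%22&amp;r=b%3Ec
```

Note that a URL which already contains %-escapes, such as `http://example.com/a%20b`, passes through unchanged; the function never re-encodes the % itself.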
A URL should not be subject to this encoding until you are actually embedding it in a link. If you have a form field where people enter and re-enter a URL (for example a 'what is your website?' field in a comment form) you want to do HTML entity (form) encoding on it. The reason is that HTML entity encoding is reversible in forms; if you entity-encode something, put it in a form, and then the form is resubmitted you will get back exactly what you originally encoded. If you %-encode something this does not happen.
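The difference in reversibility can be shown with a small sketch. Here `html.unescape` stands in for what the browser does to a form value before resubmitting it, and `pct_encode_for_link` is a made-up helper that applies the %-encoding described earlier.

```python
import html

def pct_encode_for_link(url: str) -> str:
    """Hypothetical helper: %-encode ", ', and > as for a link."""
    return url.replace('"', "%22").replace("'", "%27").replace(">", "%3E")

# Entity encoding round-trips: decoding yields exactly what was entered.
entered = 'http://example.com/?q="x"&y=%22'
assert html.unescape(html.escape(entered, quote=True)) == entered

# %-encoding does not: two different inputs collapse to the same output,
# so there is no way to recover which one the user originally typed.
typed_with_quotes = 'http://example.com/?q="x"'
typed_with_escapes = 'http://example.com/?q=%22x%22'
assert pct_encode_for_link(typed_with_quotes) == pct_encode_for_link(typed_with_escapes)
```

If the form field were %-encoded, a user who typed literal quotes would get back `%22` on the next round and could not tell it apart from an escape they had entered themselves.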
(If you are showing a URL as plain text I think it depends on where the URL comes from and what use you expect people to make of it. If you are just showing a user-entered URL to them I would entity-encode it so that the browser shows it to them exactly as they entered it. If you expect them to copy it and actually use it, %-encode things.)
Sidebar: But what if people give you URL paths with funny characters?
If you have to worry about things like a % or a ? appearing in the URL path (where it should be %-escaped so that it isn't interpreted as separating the query parameters from the path) my opinion is that you need an API that clearly separates the components of a URL and leaves it to you to glue them back together. At this point you can %-encode away to make sure that the browser interprets everything exactly right.
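One way such a component-wise API could look, sketched with Python's `urllib.parse`. The `build_url` function and its parameters are assumptions made for illustration, not an existing interface.

```python
from urllib.parse import quote, urlencode, urlunsplit

def build_url(scheme, host, path_segments, query_params):
    """Hypothetical builder: glue a URL together from separate components.

    Each path segment is %-encoded on its own (safe="" so that even / and ?
    inside a segment get escaped), and the query is encoded separately.
    """
    path = "/" + "/".join(quote(segment, safe="") for segment in path_segments)
    query = urlencode(query_params)
    return urlunsplit((scheme, host, path, query, ""))

print(build_url("http", "example.com", ["files", "100% sure?.txt"], {"q": "a&b"}))
# http://example.com/files/100%25%20sure%3F.txt?q=a%26b
```

Because the caller says which part is path and which is query, the ? inside the file name is escaped while the ? that starts the query string is not, with no guessing needed.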
If you get the URL as a single blob, the only sane way to go is to assume that it is basically correctly formatted apart from some stray characters that you may need to quote mostly for convenience. Doing anything else requires heuristics and guesswork.