Wandering Thoughts archives

2012-01-30

HTML is not an SGML dialect and never really has been

There is a persistent story that makes the rounds in the web specification world (for example, in this otherwise realistic article on XHTML) that HTML is an SGML dialect but web browsers persistently mishandle and mis-parse certain SGML features such as minimization. Although I have pandered to this belief before, it is false in practice.

HTML is really a documentation standard; the standard followed existing practice rather than preceding it. In the very beginning, people just created browsers and a vague format that the browsers understood. This format was inspired by SGML, but it was never an SGML dialect and as such it never had various obscure SGML features. At some point, when people in the W3C were writing down the HTML standard of the time (or perhaps evolving it), they decided to 'fix' this obvious omission by writing into the new version of the HTML specification that it was an SGML dialect.

(Looking at the historical specifications via Wikipedia, this appears to go as far back as HTML 2.0.)

You can guess what happened next. All of the browsers of the time promptly ignored this new bit of the standard, and pretty much every browser written since then has as well; none of them has ever parsed HTML as SGML, with all of the odd little SGML features that implies. HTML may be an SGML dialect as far as the W3C standards and their validator are concerned, but it is not in real life, and anyone who writes HTML believing otherwise is going to have problems.
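
(One concrete case is SGML's null end tag minimization. As it is generally described, a strict SGML parser working from the HTML 4 SGML declaration reads `<br/>` as a `<br>` tag followed by a literal '>' character, which is not how browsers treat it. Here is a minimal sketch of the browser-style reading. It uses Python's standard html.parser module as a stand-in for a browser's parser, which is an assumption made for illustration, and the TagAndText and parse names are made up for this example.)

```python
from html.parser import HTMLParser


class TagAndText(HTMLParser):
    """Record start tags and text the way a tag-soup style parser sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Also reached for '<br/>', because the default handle_startendtag
        # calls handle_starttag and then handle_endtag.
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    """Return (list of start tags, concatenated text) for a markup string."""
    parser = TagAndText()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.text)


tags, text = parse("<p>Hello<br/>there</p>")
print(tags, repr(text))
```

The text that comes out has no stray '>' in it; the '/>' is simply swallowed as part of the tag, which is the non-SGML behaviour that people writing HTML actually rely on.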

As you might expect, HTML5 very firmly drives a stake through this particular issue; the current spec draft says explicitly (emphasis mine):

For compatibility with existing content and prior specifications, this specification describes two authoring formats: one based on XML (referred to as the XHTML syntax), and one using a custom format inspired by SGML (referred to as the HTML syntax).

Perhaps someday all of the common HTML validators will be updated to understand HTML as it really is.

HTMLAndSGML written at 15:33:28

2012-01-19

How not to do repeated fields in web forms

There's a certain sort of web form which really wants to make sure that you've entered something correctly, so it asks you to enter it twice in two different fields. You've probably seen this in some web form sooner or later; this is the 'please enter your password again in this field too' or 'please re-enter your email address' field. I tend to think that this is bad on its own, but I've now seen an even worse implementation of this basic idea, which I'll call an anti-confirmation field, one that's practically designed to create errors.

What the people behind this did was quite simple: they made it so that their second fields would not accept pasted input (probably using JavaScript, which I had on because I didn't feel like finding out which bits of the registration process required it). I had to retype both my email address and my password by hand, which was especially annoying because I was pasting both of them from elsewhere. I call this an anti-confirmation field because of course retyping things by hand is more error-prone than pasting things in; in fact, I twice made a mistake retyping the password.

(My web password for this site was a strong random password, as usual. Random jumbles are hard to transcribe accurately by hand, especially when they jump back and forth between character case.)

I suspect that the website designers justified this by saying that they were worried about people entering a bad email address by hand in the first field and then 'confirming' it by just copying and pasting it into the second field. However, even at its best this logic doesn't work for password fields, since browsers don't let you copy the plaintext content of a password field once you've entered it. I also suspect that the designers do not have any actual data on how many genuine errors this prevents (versus how many artificial errors are created).

Sidebar: how to measure the numbers

Assuming that you've committed yourself to (anti-)confirmation fields in the first place, you just need to track field values over time when a submission fails because of mismatched fields. In a transcription error the first of the two fields will turn out to be correct (ie, the same as the final submitted value) and the second field will change. In a genuine error the first field will be different between the failed submission and a subsequent valid one.
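
(A rough sketch of that classification rule in Python. The function name, its arguments, and the two labels are all invented for illustration; how you actually capture the failed and final values depends entirely on your form handling code.)

```python
def classify_mismatch(failed_first, failed_second, final_value):
    """Classify one failed submission where the two fields did not match.

    failed_first / failed_second are the two field values from the failed
    submission; final_value is the value from the later successful one.
    """
    if failed_first == failed_second:
        raise ValueError("fields matched, so this was not a mismatch failure")
    if failed_first == final_value:
        # The first field was right all along; only the retyped copy was
        # wrong. This is an error created by the confirmation field.
        return "transcription"
    # The first field itself changed, so the confirmation field caught
    # a real mistake.
    return "genuine"


print(classify_mismatch("me@example.com", "me@exmaple.com", "me@example.com"))
print(classify_mismatch("me@exmaple.com", "me@example.com", "me@example.com"))
```

Counting the two labels over all failed submissions gives you the ratio of artificial errors to genuine ones.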

Doing this with email addresses raises basically no security issues. If you do this with the password field you'll want to one-way hash the values somehow in your tracking data.
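
(One possible way to do the hashing, as a sketch: a keyed HMAC-SHA256 with a random key that exists only for one registration session. The per-session key is an assumption of this example, not something the approach requires; it means fingerprints can be compared within a session but are useless for comparing or guessing passwords across sessions once the key is thrown away. The function names are made up.)

```python
import hashlib
import hmac
import secrets


def new_tracking_key():
    """Generate a random key to use for a single registration session."""
    return secrets.token_bytes(32)


def fingerprint(value, key):
    """Return a one-way keyed fingerprint of a field value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = new_tracking_key()
first_try = fingerprint("s3cr3t-Passw0rd", key)
final = fingerprint("s3cr3t-Passw0rd", key)
# Equal fingerprints mean the underlying values were equal, which is all
# the error tracking needs to know.
print(first_try == final)
```

Since the tracking only ever asks 'are these two values the same', equality of fingerprints is enough and the plaintext password never has to be stored.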

AntiConfirmationFields written at 22:59:00

2012-01-17

The first browser blinks on XHTML parsing

I'm late to the party, but Opera has decided to stop strict parsing of XHTML (via Sam Ruby):

[...] we've decided to stop throwing draconian XML parsing failed error messages [on invalid XHTML], and instead, attempt to reparse the document automatically as HTML.

I have long said that draconian XHTML error handling is an unstable equilibrium and that it would only last as long as none of the browser vendors blinked. Well, Opera has blinked; they've picked the user-friendly alternative over the strictly standards-compliant one (or semi-strict, since they apparently already offered an option to reinterpret the page as HTML, unlike eg Firefox). It now remains to be seen how long it will be before other browser vendors do the same thing.
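
(To illustrate the general shape of 'try strict XML, then fall back to HTML', here is a small Python sketch using only the standard library. This is not Opera's implementation, which I have not seen; the extract_text function and the TextCollector class are made up for this example, and Python's html.parser is only a loose stand-in for a real browser HTML parser.)

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Forgiving parser that just gathers the text content of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(document):
    """Return (mode, text), trying strict XML first and HTML second."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Draconian handling would stop here with an error page.
        # Instead, reparse the same bytes with a forgiving HTML parser.
        collector = TextCollector()
        collector.feed(document)
        collector.close()
        return "html-fallback", "".join(collector.chunks)
    return "xml", "".join(root.itertext())


print(extract_text("<html><body><p>Hi</p></body></html>"))
print(extract_text("<html><body><p>Hi<br></p></body></html>"))
```

The second document has an unclosed `<br>`, which is fine as HTML but is not well-formed XML, so it takes the fallback path and the reader still gets the content.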

(I expect Firefox to be the last holdout because Firefox people are in some ways very user-hostile in the name of doing 'the right thing'.)

While this Opera blog entry was about a development snapshot, the announcement for Opera 11.60 mentions this as a feature of 11.60. So this is now out there in the wild in a general release browser.

(Now I'm wondering if someone has made or could make a Firefox extension to do the same thing.)

XHTMLSomeoneBlinks written at 15:23:22


This is part of CSpace, and is written by ChrisSiebenmann.
Twitter: @thatcks
