HTML is not an SGML dialect and never really has been

January 30, 2012

There is a persistent story that makes the rounds in the web specification world (for example, in this otherwise realistic article on XHTML) that HTML is an SGML dialect but web browsers persistently mishandle and mis-parse certain SGML features such as minimization. Although I have pandered to this belief before, it is false in practice and in reality.

HTML is really a documentation standard; the standard followed existing practice rather than preceding it. In the very beginning, people just created browsers and a vague format that the browsers understood. This format was inspired by SGML, but it was never an SGML dialect and as such it never had various obscure SGML features. At some point, when people in the W3C were writing down the HTML standard of the time (or perhaps evolving it), they decided to 'fix' this obvious omission by writing into the new version of the HTML specification that it was an SGML dialect.

(Looking at the historical specifications via Wikipedia, this appears to go as far back as HTML 2.0.)

You can guess what happened next. All of the browsers of the time promptly ignored this new bit of the standard, and pretty much every browser written since then has as well; none of them has ever parsed HTML as real SGML, with support for all of the odd little SGML features that this implies. HTML may be an SGML dialect as far as the W3C standards and their validator are concerned, but it is not one in real life, and anyone who writes HTML believing otherwise is going to have problems.

As you might expect, HTML5 very firmly drives a stake through this particular idea; the current spec draft says explicitly (emphasis mine):

For compatibility with existing content and prior specifications, this specification describes two authoring formats: one based on XML (referred to as the XHTML syntax), and one using a custom format inspired by SGML (referred to as the HTML syntax).

Perhaps someday all of the common HTML validators will be updated to understand HTML as it really is.

Last modified: Mon Jan 30 15:33:28 2012
