Wandering Thoughts archives

2009-11-30

Using content hashing to avoid the double post problem

For those who have not encountered it, the double post problem (or the double comment problem) happens when your web system is just slow enough to respond that the user clicks 'Post <whatever>' again in their browser and re-submits the same post/comment/what have you. In a straightforwardly implemented system, this results in a second copy of the comment or post appearing.

(This is of course a specific instance of a general double submission problem for all web forms.)

I worried about this problem when writing DWiki's comment system, and the way I chose to deal with it was to use a (cryptographic) hash of the comment's content as the internal name of the comment. Since the contents of repeated posts are the same, they will all have the same name, and so no matter what, there will only ever be one copy of the comment.

(DWiki's code detects the case of trying to post a comment that already exists and quietly tells people that they succeeded.)

To me, the appeal of this approach is that I get all of this for free. I have to generate some internal name for the comment; by making it a hash of the content, I get duplicate suppression without having to do anything extra.
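(As a rough illustration of the idea, here is a minimal Python sketch. It is not DWiki's actual code; the `CommentStore` class, its in-memory dictionary, and the choice of SHA-256 are all assumptions made for the example.)

```python
import hashlib


class CommentStore:
    """Hypothetical in-memory comment store; a stand-in, not DWiki's code."""

    def __init__(self):
        # Maps internal comment name (a hex digest) to the comment text.
        self.comments = {}

    def post(self, content: str) -> str:
        # The internal name is a cryptographic hash of the content.
        # SHA-256 is an assumption; the original does not name a hash.
        name = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # A repeated submission hashes to the same name, so it cannot
        # create a second copy. Either way we just report success.
        self.comments.setdefault(name, content)
        return name


store = CommentStore()
first = store.post("This is my comment.")
second = store.post("This is my comment.")  # the impatient second click
```

Both calls return the same name and the store holds a single comment, without any explicit duplicate-detection step.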

When you take this approach, one of the important things that you need to decide is what makes a comment or a post 'the same', such that two separate submissions should hash to the same name and turn into one. Is it the contents alone, the contents plus the authorship (and if so, what elements of authorship for unauthenticated comments), or the contents plus the authorship plus the time to some resolution?

(For comments specifically, I think that this is going to depend to some extent on what sort of environment you want. Choosing to hash only comment content will have the effect of suppressing duplicate short posts such as 'me too', 'I agree', and so on, even if they're written by different people at different times.)

For DWiki, I chose to hash on the comment content plus the authorship, which includes the IP address. This will usually suppress real duplicate posts but in theory could fail if the comment is being submitted through something where the IP address keeps changing (such as a revolving web proxy, or from a machine that changed IP addresses between two submission attempts).
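(The effect of the different choices of what counts as 'the same' can be sketched in Python. The `comment_name` function, its parameters, and the length-prefixed field encoding are hypothetical, chosen for the example; they are not DWiki's real scheme.)

```python
import hashlib


def comment_name(content, author="", ip="", when=None, bucket_seconds=None):
    """Hypothetical naming function: hash the fields that define 'sameness'."""
    parts = [content, author, ip]
    if when is not None and bucket_seconds:
        # Optional time component, rounded down to some resolution.
        parts.append(str(int(when) // bucket_seconds))
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # cannot collapse into the same byte stream.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


# Content only: every 'me too' collapses into one comment.
same = comment_name("me too") == comment_name("me too")

# Content plus authorship: different people get different names.
alice = comment_name("me too", author="alice", ip="192.0.2.1")
bob = comment_name("me too", author="bob", ip="198.51.100.7")

# The failure case: same person, same text, but the IP address changed.
alice_moved = comment_name("me too", author="alice", ip="192.0.2.99")
```

Here `alice` and `bob` differ, which is what you want, but `alice` and `alice_moved` also differ, so a resubmission from a changed IP address would be stored as a second comment.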

HashingForDoublePosts written at 22:35:57

2009-11-10

HTML5 may end up giving us real, working XHTML

Here is a thesis: the potential success of HTML5 is the web's best chance to get XHTML (in some version, in this case 'XHTML5') in a usable form.

In theory, XHTML 1.0 was just an XML serialization of HTML 4. In practice, this never was the case; there were real semantic changes between HTML 4.01 strict and XHTML 1.0, things that the XHTML people just couldn't resist changing. This meant that supporting XHTML 1.0 required real work in browsers; you could not just de-serialize XHTML into the same internal format as you used for HTML 4.01 and treat it as such afterwards.

However, there is no semantic difference between HTML5 and XHTML5; they really are the same thing being serialized in a different way (partly because both forms are being specified in one go, by the same standards group). This makes it much easier for browser vendors to support XHTML5; if you can handle HTML5 you can handle XHTML5 (and if you can handle HTML 4.01 you can pretty much handle HTML5). This means that, at least in theory, all that browsers need to do to support XHTML5 is to enable XML de-serializing.

(Now, I have to admit that I suspect that XML de-serializing is not trivial in real browsers. XML allows a wide variety of games to be played with namespaces and the like, plus strict XML error handling likely has unpleasant implications for what you can do to start processing an XHTML5 web page before you've received all of it and thus can verify that it is indeed well-formed.)
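(The incremental processing issue can be seen in miniature with Python's standard library XML parser. This is only an analogy for what a browser faces, not a description of any browser's internals; the `parse_in_chunks` helper is made up for the example.)

```python
import xml.etree.ElementTree as ET


def parse_in_chunks(chunks):
    """Feed a document piece by piece, as if it were arriving over the network.

    Returns (tags seen so far, whether the document turned out well-formed).
    """
    parser = ET.XMLPullParser(events=("start",))
    seen = []

    def drain():
        # Collect whatever start-tag events the parser has produced so far.
        for _event, elem in parser.read_events():
            seen.append(elem.tag)

    try:
        for chunk in chunks:
            parser.feed(chunk)
            drain()
        parser.close()
        drain()
    except ET.ParseError:
        # Strict XML handling: the error only surfaces once the bad part
        # arrives, after earlier elements have already been processed.
        return seen, False
    return seen, True


good = parse_in_chunks(["<html><body><p>one</p>", "<p>two</p></body></html>"])
bad = parse_in_chunks(["<html><body><p>one</p>", "<p>two</body></html>"])
```

In the `bad` case the parser has already reported elements from the first chunk by the time the mismatched tag in the second chunk makes the whole document invalid, which is exactly the awkward situation for a program that wants to start rendering early.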

HTML5AndXHTML written at 00:46:56

2009-11-02

XHTML vs HTML5

Every so often these days, an XHTML fanatic goes off on a rant about how HTML5 is all a horrible mistake and a nightmare (this one is about typical for the breed, and is what set me off in turn). These people are committing a simple mistake: they misunderstand the nature of the world.

It's really simple; XHTML and HTML5 are entirely different sorts of standards. XHTML is an invented standard and has failed because it had very high costs of implementation and use and provided almost no functional difference from other standards (both formal and de facto) in a crowded and mature field (that field being web pages). This is pretty much what you'd expect.

(Some people will protest most strongly that XHTML has not failed. To them, I note that something like 70% of the web browsers currently in use can't display XHTML by default, even if content authors sometimes get fooled about it.)

By contrast, HTML5 is mostly taking a documentation and coordination approach; this makes it much less risky and much more likely to succeed. Since it is taking these approaches, it is in no position to throw away backwards compatibility with HTML4 and do things like venture into the grand world of completely well-formed XML.

(Never mind that the grand world of completely well-formed XML is not realistic.)

Complaining about HTML5 not inventing things is missing the point (and misunderstanding the world). It's not that sort of standard, and it's not that sort of standard precisely because invention standards have failed in the browser; manifestly, you cannot get browser vendors to pay much attention to them (especially the 70% gorilla of Microsoft IE). The people behind HTML5 learned from the failure of XHTML, even if the XHTML fanatics have not, and to be blunt, the result is that HTML5 might just get implemented in a useful way.

(See also Mark Pilgrim, where you can find out how to use HTML5 today on common browsers.)

XHTMLMisunderstanding written at 00:53:16


