Using content hashing to avoid the double post problem

November 30, 2009

For those who have not encountered it, the double post problem (or the double comment problem) happens when your web system is just slow enough to respond that the user clicks 'Post <whatever>' again in their browser and re-submits the same post/comment/what have you. In a straightforwardly implemented system, this results in a second copy of the comment or post appearing.

(This is of course a specific instance of a general double submission problem for all web forms.)

I worried about this problem when writing DWiki's comment system, and the way I chose to deal with it was to use a (cryptographic) hash of the comment's content as the internal name of the comment. Since the contents of repeated posts are the same, they all get the same name, so no matter what happens there will only ever be one copy of the comment.

(DWiki's code detects the case of trying to post a comment that already exists and quietly tells people that they succeeded.)
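As a rough illustration of the idea (not DWiki's actual code), here is a minimal sketch that assumes comments are stored as one file per comment in a directory. The names `comment_name` and `post_comment`, the choice of SHA-256, and the file-per-comment layout are all assumptions made for the example.

```python
import hashlib
import os


def comment_name(content: str) -> str:
    """Name a comment after a cryptographic hash of its text.

    SHA-256 is an assumption here; any cryptographic hash would do.
    """
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def post_comment(store_dir: str, content: str) -> bool:
    """Store a comment under its content-derived name.

    Returns True if the comment was newly written and False if a
    comment with the same name already existed. Either way the caller
    can tell the user that the post succeeded.
    """
    path = os.path.join(store_dir, comment_name(content))
    try:
        # Mode "x" creates the file exclusively and raises
        # FileExistsError if it is already there, so a re-submission
        # can never produce a second copy.
        with open(path, "x", encoding="utf-8") as f:
            f.write(content)
    except FileExistsError:
        return False
    return True
```

Calling `post_comment` twice with the same text leaves exactly one file in the directory; the second call just reports that the comment was already there.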

To me, the appeal of this approach is that I get all of this for free. I have to generate some internal name for the comment; by making it a hash of the content, I get duplicate suppression without having to do anything extra.

When you take this approach, one of the important things that you need to decide is what makes a comment or a post 'the same', such that two separate submissions should hash to the same name and turn into one. Is it the contents alone, the contents plus the authorship (and if so, what elements of authorship for unauthenticated comments), or the contents plus the authorship plus the time to some resolution?

(For comments specifically, I think that this is going to depend to some extent on what sort of environment you want. Choosing to hash only comment content will have the effect of suppressing duplicate short posts such as 'me too', 'I agree', and so on, even if they're written by different people at different times.)

For DWiki, I chose to hash on the comment content plus the authorship, which includes the IP address. This will usually suppress real duplicate posts but in theory could fail if the comment is being submitted through something where the IP address keeps changing (such as a revolving web proxy, or from a machine that changed IP addresses between two submission attempts).
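To make the trade-offs concrete, here is a small sketch of how the choice of hash input decides what counts as 'the same' comment. The function `comment_key`, its parameters, and the JSON encoding of the fields are hypothetical, chosen only for this example; they are not how DWiki does it.

```python
import hashlib
import json
from typing import Optional


def comment_key(content: str,
                author: Optional[str] = None,
                ip: Optional[str] = None,
                timestamp: Optional[float] = None,
                resolution: int = 3600) -> str:
    """Hash whichever fields should define comment identity.

    Fields left as None are not part of the identity. If a timestamp
    (in seconds) is given, it is rounded down to a bucket that is
    `resolution` seconds wide.
    """
    bucket = None if timestamp is None else int(timestamp // resolution)
    # Encode the fields as a JSON list so that field boundaries are
    # unambiguous (plain concatenation would let "ab" + "c" collide
    # with "a" + "bc").
    blob = json.dumps([content, author, ip, bucket])
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


# Content only: two different people saying 'me too' collapse into one.
same_text = comment_key("me too") == comment_key("me too")

# Content plus authorship: different authors get different names.
by_author = (comment_key("me too", author="alice")
             != comment_key("me too", author="bob"))

# Authorship that includes the IP address: a genuine re-submission
# from a changed address is no longer recognized as a duplicate.
ip_changed = (comment_key("hi", author="alice", ip="192.0.2.1")
              != comment_key("hi", author="alice", ip="192.0.2.2"))
```

All three of `same_text`, `by_author`, and `ip_changed` come out true. Adding time to the identity has its own wrinkle: two submissions a second apart can still land in different buckets if they straddle a bucket boundary.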

Written on 30 November 2009.
