Wandering Thoughts archives

2009-11-30

Using content hashing to avoid the double post problem

For those who have not encountered it, the double post problem (or the double comment problem) happens when your web system is just slow enough to respond that the user clicks 'Post <whatever>' again in their browser and re-submits the same post/comment/what have you. In a straightforwardly implemented system, this results in a second copy of the comment or post appearing.

(This is of course a specific instance of a general double submission problem for all web forms.)

I worried about this problem when writing DWiki's comment system, and the way I chose to deal with it was to use a (cryptographic) hash of the comment's content as the internal name of the comment. Since the contents of repeated posts are the same, they will all have the same name, and so no matter what, there will only be one copy of the comment.

(DWiki's code detects the case of trying to post a comment that already exists and quietly tells people that they succeeded.)

To me, the appeal of this approach is that I get all of this for free. I have to generate some internal name for the comment; by making it a hash of the content, I get duplicate suppression without having to do anything extra.
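(As a rough illustration, here is a minimal Python sketch of the idea. It is not DWiki's actual code; the CommentStore class, its post method, the in-memory dictionary, and the choice of SHA-256 are all assumptions made for the example.)

```python
import hashlib


class CommentStore:
    """Toy in-memory store that names each comment by a hash of its content."""

    def __init__(self):
        self._comments = {}  # internal name -> comment text

    def post(self, text):
        # The internal name is derived purely from the content, so a
        # re-submitted comment maps to the same name as the first copy.
        name = hashlib.sha256(text.encode("utf-8")).hexdigest()
        is_new = name not in self._comments
        if is_new:
            self._comments[name] = text
        # Either way the caller can tell the user that the post succeeded.
        return name, is_new

    def __len__(self):
        return len(self._comments)


store = CommentStore()
first = store.post("Nice article!")
second = store.post("Nice article!")  # the impatient second click
```

Both calls return the same name and the store ends up holding a single comment; the second return value only tells the caller whether anything new was written.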

When you take this approach, one of the important things that you need to decide is what makes a comment or a post 'the same', such that two separate submissions should hash to the same name and turn into one. Is it the contents alone, the contents plus the authorship (and if so, what elements of authorship for unauthenticated comments), or the contents plus the authorship plus the time to some resolution?

(For comments specifically, I think that this is going to depend to some extent on what sort of environment you want. Choosing to hash only comment content will have the effect of suppressing duplicate short posts such as 'me too', 'I agree', and so on, even if they're written by different people at different times.)

For DWiki, I chose to hash on the comment content plus the authorship, which includes the IP address. This will usually suppress real duplicate posts, but in theory it could fail if the comment is being submitted through something where the IP address keeps changing (such as a revolving web proxy, or a machine that changed IP addresses between two submission attempts).
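(To make the choice of 'what is the same' concrete, here is a hedged Python sketch of hashing content plus authorship, including the IP address. The comment_name function, its parameters, the length-prefix encoding, and SHA-256 are assumptions for illustration, not how DWiki actually does it.)

```python
import hashlib


def comment_name(content, author, ip):
    """Return an internal name built from the fields that define 'the same' comment."""
    h = hashlib.sha256()
    for field in (content, author, ip):
        data = field.encode("utf-8")
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # cannot run together into the same hash input.
        h.update(str(len(data)).encode("ascii") + b":")
        h.update(data)
    return h.hexdigest()


# A double click from the same person and address yields the same name...
a = comment_name("me too", "cks", "192.0.2.10")
b = comment_name("me too", "cks", "192.0.2.10")
# ...while the same text from a different address does not.
c = comment_name("me too", "cks", "192.0.2.99")
```

Dropping author and ip from the hashed fields would give the content-only behaviour, where every 'me too' collapses into one comment; adding a truncated timestamp would give the time-limited variant.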

web/HashingForDoublePosts written at 22:35:57

Poking around the OpenSolaris codebase (for sysadmins)

If you do much work with Solaris things like DTrace and mdb -k, you are sooner or later going to want to poke around the OpenSolaris code, both for kernels and for utilities and so on. If you do this very much, you are going to want your own local copy of the OpenSolaris codebase (while you can use the OpenSolaris website, sooner or later navigating through it will drive you mad). You can get a copy with Mercurial; see here for instructions on how. For spelunking purposes, there is little to no reason to get the binary-only bits.

(Just to confuse you, OpenSolaris is called 'ON/Nevada' in much of this.)

Now, there's an important caution to this: OpenSolaris source is not the same thing as Solaris source. You can generally use OpenSolaris source as a guide to what you'll find with DTrace et al, but it's not a sure thing (even if you go back in version history), and sometimes you will find important differences. Some of these can be spotted by looking at structure definitions in your own system's include files in /usr/include, but not all of the interesting header files make it there. In some cases you may have to resort to dumping structures with mdb's ::print operation.

Everything useful in the onnv-gate repository lives in usr/src, and I'm going to quote paths relative to this from now on.

  • In general, everything has most of its code in a <whatever>/common subdirectory (for 'code common across all architectures', I assume).

  • mdb source is in cmd/mdb, and the mdb modules are mostly in the common/modules subdirectory. Reading mdb module source can be the best way to find out interesting mdb commands and exactly what they do; this can lead to useful discoveries.

  • Most interesting kernel source is in uts/common in a relatively obvious layout. Many internal header files are in the sys/ subdirectory here; others can be found in the source area for their code, e.g. fs/zfs/sys for internal ZFS headers.

  • ZFS commands rely on some ZFS libraries to do most of the work; they're in lib/libzpool and lib/libzfs. These are what you need to look at if you want to figure out the division between user and kernel space, and also what limitations are artificially imposed by zpool and zfs and what limitations are real.

In general, the history in the onnv-gate repository is not very useful. However, sometimes you can use 'hg log -v' and so on to pick out the specific code change that fixed a bug number that you're interested in, and thus see how applicable it may be to your particular circumstances.

(The other thing I've used the repo history for is to trace the code for a particular ZFS kernel feature that I wanted to use back in time to establish that I would have to use a relatively recent OpenSolaris build in order to get it.)

solaris/PokingOpenSolarisSource written at 01:03:26


