What you need for migrating web content

April 16, 2012

In light of the wiki trap and the question of how to manage content, let's start with a basic question: suppose that you want to migrate from one way of storing and managing web content to another (such as into or out of a wiki). What things do you need from your current content system?

Based on our experiences here, my answer is that the minimum you need is:

  • a list of the important URLs you have now, especially URLs with essentially static content (let's call these 'content pages').

    (Don't assume that this list of URLs is obvious or that of course any random content handling system will give it to you. Neither is true, and if you don't have a list of URLs, you get to spider your own site to recover it; there's a sketch of such a spider at the end of this entry.)

  • each chunk of content in basic, straightforward HTML, labeled with the URL it belongs to. When a page is actually composed of multiple chunks of content (for example, a blog entry and a series of comments on it), you should get the HTML for each chunk of content separately. (One possible dump format for this is sketched just after this list.)

  • the metadata for each chunk of content.

    (What exact metadata depends on what metadata your current system collects and maintains. A basic set of HTML pages probably has very little metadata; a blog-based site with comments might have a lot.)
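
To make the second and third items concrete, here's one possible dump format: a JSON object per chunk of content with its URL, its HTML, and its metadata, one object per line. This is a hypothetical sketch in Python; the field names, the metadata keys, and the 'JSON Lines' choice are my illustration, not anything that any particular system actually produces.

    import json

    # One hypothetical export record for a single chunk of content. The
    # field names ('url', 'html', 'metadata') and the metadata keys are
    # illustrative only.
    record = {
        "url": "/blog/2012/04/16/some-entry",
        "html": "<h1>Some entry</h1>\n<p>The entry itself, as basic HTML.</p>",
        "metadata": {
            "author": "cks",
            "published": "2012-04-16",
            "kind": "entry",   # eg 'entry' versus 'comment'
        },
    }

    # One JSON object per line is a simple, durable dump format.
    with open("export.jsonl", "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record) + "\n")

The exact format doesn't matter much; what matters is that the HTML is plain and that each chunk is tied to its URL.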

With HTML for your content, information about what URLs you need to either recreate or redirect, and the important metadata, you can rebuild a basic version of your web area in either another system or with plain HTML pages. How much you need the metadata depends on how much use you make of it. A support web area may not really care about things like authorship and publication date, while a blog cares a lot, especially once you get to comments.
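
As a sketch of how little it takes to do that rebuild, here's what regenerating plain HTML pages (plus a list of redirects) from the hypothetical JSON Lines dump above might look like. The file layout and the redirect file format are assumptions made up for illustration.

    import json
    import os

    redirects = []
    with open("export.jsonl", encoding="utf-8") as fp:
        for line in fp:
            rec = json.loads(line)
            # Map the URL to a file path; '/foo/bar' becomes site/foo/bar.html.
            path = os.path.join("site", rec["url"].strip("/") + ".html")
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w", encoding="utf-8") as out:
                out.write("<html><body>\n%s\n</body></html>\n" % rec["html"])
            # Plain HTML pages change the URL (they grow a '.html'), so
            # note an old-to-new redirect for each such page.
            new_url = "/" + rec["url"].strip("/") + ".html"
            if new_url != rec["url"]:
                redirects.append((rec["url"], new_url))

    # Dump the redirects in a form you can later turn into web server rules.
    with open("redirects.txt", "w", encoding="utf-8") as fp:
        for old, new in redirects:
            fp.write("%s %s\n" % (old, new))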

(I've reluctantly concluded that exporting any edit history is not usually important enough to be included in the minimum list. It's nice to have, but in general not having it is not going to break your site, because usually the current page contents are the important stuff.)

The immediate corollary is that if you are picking a system to hold your web content and you care about not getting caught in the wiki trap, you want to make sure that your chosen system can give you all three of these things. Having these things won't make conversion painless, but it makes it much more feasible.

(In practice the question is never whether or not conversion is possible; it's always possible if you work hard enough, because in the limit case you manually crawl your website, copy the HTML of your content, de-mangle it, and so on. The real question is whether a conversion is easy enough to actually be done.)
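
For what it's worth, the limit-case spider doesn't have to be fancy. Here's a bare-bones sketch using only the Python standard library; 'https://www.example.com/' is a stand-in for your own site, and a real run would want politeness delays, robots.txt handling, and better error handling.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    # A stand-in for your own site's base URL.
    BASE = "https://www.example.com/"

    class LinkParser(HTMLParser):
        # Collect the href of every <a> tag on a page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    seen = set()
    queue = [BASE]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(url) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            # Resolve relative links and drop fragments, then stay on-site.
            absolute = urljoin(url, link).split("#")[0]
            if urlparse(absolute).netloc == urlparse(BASE).netloc:
                queue.append(absolute)

    # 'seen' is now your recovered list of URLs.
    for url in sorted(seen):
        print(url)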
