Wandering Thoughts archives

2007-10-28

The inconvenience of some DWiki design choices

When I designed DWiki back in the mists of time, I decided that it would not require any writable data store (and certainly not an actual database, because that would have required me to configure and run one); I wanted to be able to run it completely read-only, so I could worry less about its security implications. I also decided that it should not have explicit metadata embedded in the page data, which is all stored in regular text files.

Blog entries have one primary time associated with them, their publication date; Atom feed entries have two, their original publication date and their most recent (significant) update. Normally you drive these from explicit entry metadata, or keep track of them internally if the user doesn't supply them, but DWiki has neither option.

With very few sources of entry timestamps, DWiki has to use the Unix file modification time of the entry's file as a blog entry's publication date. This leaves it short of an Atom last updated time; the only even vaguely sensible choice is the file 'inode change time' (the ctime), which updates any time you sneeze on the file. Unfortunately this means that the last updated time can change when there was no actual update, because something touched the file in a way that changed its ctime; things usually work out in practice because this is pretty rare.
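As a sketch of what the filesystem actually offers here (this is an illustration of the approach described above, not DWiki's real code; the function name is my own):

```python
import os

def entry_times(path):
    """Derive both Atom timestamps for an entry purely from os.stat():
    st_mtime (file modification time) stands in for the publication
    date, and st_ctime (inode change time) for the last updated time.
    Returns (published, updated) as Unix timestamps."""
    st = os.stat(path)
    published = st.st_mtime  # changes only when the file contents change
    updated = st.st_ctime    # also bumped by chmod, rename, restore, etc.
    return published, updated
```

The weakness follows directly from the code: anything that touches the inode, not just the contents, moves the "updated" time forward.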

(I cannot drop the last updated time entirely because it is mandatory for Atom feed entries; I cannot make it the same as the original publication date without it sometimes being a lie.)

The big drawback for all of this came up last Tuesday, when this server was migrated to new hardware (running a different OS), complete with copying all of our files over. This updated the ctime on all of my files to right then, when they were restored on the new machine, which promptly 'updated' every entry in my Atom feed to right then (with peculiar consequences for at least the LiveJournal feed, which scrambled the entry order and as a result reran some old entries).

Sidebar: post-publication editing of entries

One inconvenience of this collection of design choices is that if I edit an entry after publication, I have to carefully reset its file mtime afterwards (I have a script for this). This is part of why the file ctime has to be the last updated time, because it is the only timestamp that changes when I do this.
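The core of such a reset script is small; a minimal sketch (my own reconstruction, not the actual script) looks like this:

```python
import os

def edit_preserving_mtime(path, rewrite):
    """Record a file's mtime, apply an edit, then put the mtime back
    with os.utime(). The ctime still moves forward (os.utime itself
    changes the inode), which is exactly why the ctime is the only
    timestamp left that reflects the post-publication edit."""
    st = os.stat(path)
    rewrite(path)
    # restore atime and mtime; the ctime cannot be set from userspace
    os.utime(path, (st.st_atime, st.st_mtime))
```

`rewrite` here is any hypothetical function that modifies the file in place.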

Fortunately I rarely feel the urge (or have the need) to edit entries after publication. Arguably this is a good thing.

DesignInconvenience written at 23:30:51

2007-10-22

The dangerous appeal of the obvious

For reasons that do not fit into the margins of this entry, DWiki sometimes needs to get the current Unix load average. I developed DWiki on Linux, and getting the load average on Linux is really easy; you read a line from /proc/loadavg, split it into three floating point numbers, and you're done. So I wrote a get_load function that did that and forgot about the whole thing.
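A function along those lines (a hypothetical reconstruction of that get_load, with the paranoia mentioned below folded in) might look like:

```python
def get_load():
    """Linux-only: read the three load averages from /proc/loadavg.
    Falls back to zeros if the read or parse fails, on the theory
    that system problems shouldn't take the whole app down."""
    try:
        with open("/proc/loadavg") as f:
            fields = f.read().split()
        return tuple(float(x) for x in fields[:3])
    except (OSError, ValueError):
        # no /proc/loadavg (non-Linux) or unparseable contents
        return (0.0, 0.0, 0.0)
```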

Then this server got changed from Linux to FreeBSD, and suddenly that code didn't work any more. (Because I had been paranoid, it didn't fail explosively; I had assumed that someday system problems might cause the read to fail, and had written the code to cope with that.)

FreeBSD doesn't have a /proc/loadavg; instead it has a getloadavg(3) function in the C library. I gloomily contemplated how to make a C library call from Python, and on the off chance someone had already written an extension module to do it I did a Google search on [python getloadavg].

Which promptly turned up the general and supported Python function to do just this, os.getloadavg(). This not only solved my problem but would have saved me the effort of writing my get_load function in the first place, if only I had thought to look for it instead of leaping on the obvious way of getting the load average on Linux that I already knew about.
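With os.getloadavg() (which wraps the C library's getloadavg(3)), the portable version collapses to almost nothing; a sketch, keeping the same defensive fallback:

```python
import os

def get_load():
    """Return the one-minute load average via the portable
    os.getloadavg(), which works on both Linux and FreeBSD.
    Falls back to 0.0 where the OS can't provide it."""
    try:
        return os.getloadavg()[0]
    except (OSError, AttributeError):
        # os.getloadavg is absent or failing on this platform
        return 0.0
```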

This is the dangerous appeal of the obvious: just because I know how to do something doesn't mean that I know the best way to do it. Maybe I should keep looking slightly harder, just to make sure.

(You could say my issue with rounding up was the same effect in action, although it was less obvious to me then.)

DangerousObviousAppeal written at 21:42:03

2007-10-18

The Python marshal module versus the cPickle module

The marshal module looks interesting for persisting and retrieving lightweight data, but the big question to me has always been whether in exchange for constraining your data down to simple structures of primitive types you got something that was actually faster than the cPickle module.

So today I decided to finally answer the question by doing some timing tests. I won't claim that these are comprehensive or entirely scientific, but I do have some results:

  • the speed difference is mostly in dumping things; marshal and cPickle generally load things as fast as each other (sometimes cPickle has the edge, sometimes marshal).

  • marshal is significantly faster on nested dictionaries.

  • marshal is generally a bit faster than cPickle.

  • however, marshal really suffers on long strings, and it gets worse the longer the string is; for example, it dumps 2048-byte strings ten times slower than cPickle does.

  • neither marshal nor cPickle is very good at Unicode strings. cPickle suffers the worse relative slowdown, especially for loading; it becomes 18 times slower on my sample string, although this is still no slower than marshal's time.

    (This is one of the rare cases when cPickle dumps much faster than it loads.)

  • marshal is also significantly worse on floating point numbers, especially lists of them, although it doesn't suffer as badly as on strings (it's only about twice as slow as cPickle for a single floating point number).

Since DWiki's cache layer spends a lot of time writing and reading long strings, it looks like I made the right decision way back when. The combination of long strings and floating point numbers meant that marshal was significantly slower than cPickle for a simulated DWiki cache object.

The general parity in loading time suggests that even for simple data structures, for a caching layer you might as well use cPickle; you are not particularly slower for the thing you're going to be doing a lot, and you get a bunch of (potential) benefits in return.
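The kind of spot check behind these numbers is easy to reproduce; a rough harness (my own sketch, using Python 3's pickle, whose C implementation is the successor to cPickle) looks like this:

```python
import marshal
import pickle
import timeit

def compare_dumps(obj, number=1000):
    """Return rough per-call dumps() times in seconds for marshal
    and pickle on the same object. Deliberately unscientific; it
    just times repeated serialization of one value."""
    m = timeit.timeit(lambda: marshal.dumps(obj), number=number) / number
    p = timeit.timeit(lambda: pickle.dumps(obj), number=number) / number
    return m, p
```

Feeding it, say, a long string versus a nested dictionary makes the asymmetries described above easy to see on your own machine.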

MarshalVsCPickle written at 23:19:41

