The rewriting problem on ZFS and other 'log structured' filesystems

December 4, 2008

When planning how to organize their IO, programs (and their designers) normally assume that rewriting (overwriting) an existing, already written file is the fastest sort of write operation possible, and that it will not change the file's existing layout (for example, if the file was created to be sequential). This is because all of the data blocks have already been allocated and all of the metadata has been set up; if you write in block-aligned units, pretty much all the operating system has to do is shovel data into disk sectors.

(None of this applies if you do something that makes the operating system discard the existing data block allocations, for example truncating the file before starting to rewrite it.)
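To make the assumption concrete, here is a minimal sketch of the sort of block-aligned, in-place rewrite that programs expect to be cheap. The block size and the `rewrite_block` helper are my own illustrative assumptions, not anything a particular filesystem requires. The important detail is opening the file with `"r+b"` instead of `"wb"`, because `"wb"` truncates the file and throws away the existing allocations.

```python
BLOCK_SIZE = 4096  # assumed block size; real filesystems vary


def rewrite_block(path, block_index, data):
    """Overwrite one block-aligned block of an existing file in place.

    Hypothetical helper for illustration only.
    """
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block long")
    # "r+b" opens for update WITHOUT truncating, so the file keeps its
    # current length (and, on a traditional filesystem, its block layout).
    with open(path, "r+b") as f:
        f.seek(block_index * BLOCK_SIZE)
        f.write(data)
```

On a traditional overwrite-in-place filesystem this touches only the data block itself; the rest of this entry is about why that is not true everywhere.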

Filesystems like ZFS break this assumption, because one of their fundamental principles is that they never overwrite existing blocks (this gives them great resilience, enables cheap snapshots, and so on). On what I am inaccurately calling a 'log structured' filesystem, rewriting a block requires allocating a new block and hooking it into the file's metadata, which is at least as expensive and slow as writing the block in the first place. As a side effect of allocating new blocks, a rewrite will change the file layout in somewhat complicated ways, depending on exactly how you rewrite and how much you rewrite.
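Here is a toy model of the never-overwrite behaviour, to show how the layout changes. This is a sketch under heavy assumptions: the `ToyCowFile` class and its simple 'next free block' allocator are made up for illustration and are not how ZFS (or any real filesystem) actually allocates space.

```python
class ToyCowFile:
    """Toy model of a file on a never-overwrite filesystem.

    block_map[i] is the physical address of logical block i.
    """

    def __init__(self, nblocks):
        # The file starts out sequential: logical block i is at physical i.
        self.block_map = list(range(nblocks))
        # Assumed bump allocator: new blocks go at the next unused address.
        self.next_free = nblocks

    def rewrite(self, logical_block):
        # Never overwrite in place: allocate a fresh physical block and
        # repoint the file's metadata at it.
        self.block_map[logical_block] = self.next_free
        self.next_free += 1

    def fragments(self):
        # Number of contiguous physical runs seen when reading the
        # logical blocks in order.
        runs = 1
        for prev, cur in zip(self.block_map, self.block_map[1:]):
            if cur != prev + 1:
                runs += 1
        return runs
```

In this model, rewriting a single block in the middle of an 8-block file splits it into three runs, while rewriting every block in order leaves you with one run again (just somewhere else on disk). That is the 'depends on how you rewrite and how much' part.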

If you are rewriting randomly there are two major scenarios, depending on whether you're going to do random reads or sequential reads from the rewritten file; call these the database case and the BitTorrent case respectively. The BitTorrent case is horrible, with slower than expected rewrites followed by read speeds that will probably be at least an order of magnitude slower than expected. The database case is just slower rewrites than you expected, with read times no worse than before (since we assume random access already). But remember that even databases periodically do sequential table scans (probably especially for backups).
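To get a rough feel for the BitTorrent case, this sketch counts how many times a sequential read has to jump to a non-adjacent physical block after the file has been rewritten in a given order. The `jumps_after_rewrite` function and its bump allocator are assumptions for illustration; real allocators are more sophisticated, so treat the numbers as a worst-case flavour and not a measurement.

```python
import random


def jumps_after_rewrite(nblocks, order):
    """Count discontinuities a sequential read hits after rewriting the
    logical blocks listed in `order` on a toy never-overwrite layout."""
    block_map = list(range(nblocks))  # file starts out sequential
    next_free = nblocks               # assumed bump allocator
    for logical in order:
        block_map[logical] = next_free  # each rewrite gets a new block
        next_free += 1
    return sum(1 for a, b in zip(block_map, block_map[1:]) if b != a + 1)


n = 1000
in_order = list(range(n))
shuffled = in_order[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

print("rewritten in order:  ", jumps_after_rewrite(n, in_order))
print("rewritten at random: ", jumps_after_rewrite(n, shuffled))
```

Rewriting in order gives zero jumps in this model, while a random rewrite order leaves almost every adjacent pair of logical blocks physically separated. On a spinning disk each of those jumps is a potential seek, which is where the order of magnitude goes.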

(If you preallocate a file through whatever special OS interface is available for this, it's possible that a log structured filesystem could still preserve a sequential layout while letting you do random writes. I don't know if filesystems are this smart yet, and specifically I don't know if ZFS even has interfaces for this. And this only works if you are rewriting each block only once, so it doesn't help the general database case.)
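For reference, this is roughly what asking for preallocation looks like from a program, using POSIX `posix_fallocate()` through Python's `os` module. The `preallocate` helper is a hypothetical name of mine, and this is only a sketch of the request: whether a never-overwrite filesystem honours it in a way that keeps the layout sequential is exactly the open question above.

```python
import os


def preallocate(path, size):
    """Ask the OS to reserve `size` bytes for `path` before writing data.

    Hypothetical helper; how the filesystem lays out the reserved space
    (if it reserves any) is entirely up to the filesystem.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        try:
            # Only available on platforms that provide posix_fallocate().
            os.posix_fallocate(fd, 0, size)
        except (AttributeError, OSError):
            # Fallback: just set the length. This may produce a sparse
            # file with no blocks actually reserved.
            os.ftruncate(fd, size)
    finally:
        os.close(fd)
```

A BitTorrent-style program would call something like this once, up front, and then write each piece into place as it arrives.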


Last modified: Thu Dec 4 01:02:22 2008