Files and fundamental filesystem activities (on Unix)
November 8, 2011
Back in a discussion of filesystem deduplication I said that writing blocks is a fundamental filesystem activity while writing files is not. On the surface this sounds like a strange thing to say, so today I'm going to defend it.
At one level, it's clear how writing blocks is a fundamental filesystem activity. Filesystems allocate disk space in blocks and pretty much only write blocks; if you try to write less than a block, the filesystem usually does a 'read modify write' cycle. You might think this is forced by physical disk constraints, but it isn't; until recently, disks used smaller physical blocks than the filesystem block size, so the filesystem could do sub-block writes if it wanted to. Filesystems just don't, by and large.
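To make the 'read modify write' cycle concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular filesystem is implemented: the 4096-byte BLOCK_SIZE is an assumed value, the bytearray is a stand-in for a disk, and write_sub_block is a hypothetical name.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size, for illustration only


def write_sub_block(device: bytearray, offset: int, data: bytes) -> int:
    """Write data at offset by rewriting every whole block it touches.

    device is a stand-in for a disk whose length is a multiple of
    BLOCK_SIZE. Returns the number of whole blocks that were rewritten.
    """
    if not data:
        return 0
    first = offset // BLOCK_SIZE
    last = (offset + len(data) - 1) // BLOCK_SIZE
    pos = 0  # how much of data has been consumed so far
    blocks_written = 0
    for blockno in range(first, last + 1):
        start = blockno * BLOCK_SIZE
        # Read: fetch the whole existing block.
        block = bytearray(device[start:start + BLOCK_SIZE])
        # Modify: overlay only the bytes that fall inside this block.
        lo = max(offset, start) - start
        hi = min(offset + len(data), start + BLOCK_SIZE) - start
        block[lo:hi] = data[pos:pos + (hi - lo)]
        pos += hi - lo
        # Write: put the whole block back.
        device[start:start + BLOCK_SIZE] = block
        blocks_written += 1
    return blocks_written
```

In this sketch, a ten-byte write that straddles a block boundary rewrites two full blocks, which is the cost the filesystem pays for dealing only in blocks.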
What's not clear is why writing files is not. To see why, let's ask a
question: what does it mean to write a file, and when are you done? In
the simple case the answer is that you write all of the data in the
file in sequential order, and then close the file descriptor. This
probably describes a huge amount of the file writes done on a typical
Unix system, and it's certainly what most people think of, since this
describes things like saving a file in an editor or writing out an image
in your image editor. But there are a lot of files on Unix that aren't
'written' this way. Databases (SQLite included) are the classic case,
but there are other examples; even
The result is that writing files is a diffuse activity while writing blocks is a very sharp one. You can clearly write to a file in a way that touches only a small portion of the file, and if the end of writing a file is when you close it you can write files very slowly, with huge gaps between your actual IO. And the system makes all of these patterns relatively efficient, unlike partial-block writes.
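As an illustration of how small a file 'write' can be, here is a sketch in Python that changes a few bytes in the middle of an existing file without touching the rest of it. The name patch_in_place is hypothetical; os.pwrite is the standard Python wrapper around the Unix pwrite() call.

```python
import os


def patch_in_place(path: str, offset: int, data: bytes) -> int:
    """Overwrite len(data) bytes at offset in an existing file.

    Nothing else in the file is rewritten, and the file's size does not
    change as long as the write stays within its current length.
    Returns the number of bytes actually written.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        # pwrite() writes at an explicit offset and does not move the
        # file position, so no seek and no sequential pass is needed.
        return os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
```

From the filesystem's point of view, a call like patch_in_place("some.db", 500000, b"DATA") is a complete and perfectly ordinary write, yet nothing about it says whether the file as a whole has now been 'written'. A program could equally well keep the descriptor open and issue the next such write hours later.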
This causes problems for a number of things that want to react when a file is written. File level deduplication is one example; another is real time virus scanners, even with system support to hook events.
(The more I think about it, the more I think that this is not just a Unix thing. Although I may have blinkered vision due to Unix, it's hard to see a viable API that could make writing files a fundamental activity. There are many situations where you just can't pregenerate all of the file before writing it even if you're writing things sequentially, plus there's random write IO to consider unless you make that an entirely separate 'database' API.)