Files and fundamental filesystem activities (on Unix)

November 8, 2011

Back in a discussion of filesystem deduplication I said that writing blocks is a fundamental filesystem activity while writing files is not. On the surface this sounds like a strange claim, so today I'm going to defend it.

At one level, it's clear how writing blocks is a fundamental filesystem activity. Filesystems allocate disk space in blocks and pretty much only write blocks; if you try to write less than a block, the filesystem usually does a 'read modify write' cycle. This isn't forced on it by physical disk constraints; until recently, disks used smaller physical sectors than the filesystem block size, so the filesystem could do sub-block writes if it wanted to. Filesystems just don't, by and large.
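
As a rough sketch of what this looks like at the system call level (the file name and offsets here are made up): the write() itself is small, but a filesystem with, say, 4 KB blocks will normally read the affected block, patch in the new bytes, and write the whole block back out.

    /* Minimal sketch: a 100-byte write into the middle of an existing
     * file. The kernel accepts it as-is, but underneath the filesystem
     * typically turns it into a read-modify-write of a whole block. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[100];
        memset(buf, 'x', sizeof(buf));

        int fd = open("somefile", O_WRONLY);  /* hypothetical existing file */
        if (fd < 0)
            return 1;
        /* Offset 1000 is well inside a filesystem block, not on a block
         * boundary, and the write is much smaller than a block. */
        if (pwrite(fd, buf, sizeof(buf), 1000) < 0)
            return 1;
        close(fd);
        return 0;
    }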

What's not clear is why writing files is not. To see why, let's ask a question: what does it mean to write a file, and when are you done? In the simple case the answer is that you write all of the data in the file in sequential order and then close the file descriptor. This probably describes a huge amount of the file writes done on a typical Unix system, and it's certainly what most people think of, since it covers things like saving a file in an editor or writing out an image from your image editor. But there are a lot of files on Unix that aren't 'written' this way. Databases (SQLite included) are the classic case, but there are other examples; even rsync may 'write' files in non-sequential chunks in some situations. Some of these programs may not close the file for days or weeks, even though they may go idle for significant stretches of time.
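
To make the contrast concrete, here is a sketch (with an invented file name and offsets) of the kind of write pattern a database-like program can produce: scattered updates to arbitrary parts of the file, with the descriptor held open indefinitely, so there is no obvious moment at which the file has been 'written'.

    /* Sketch of a scattered, long-lived 'write': update a few arbitrary
     * regions of a file and then keep the file descriptor open, the way
     * a database or an rsync-style program might. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        static const off_t offsets[] = { 131072, 8192, 65536, 4096 };
        char chunk[512];
        memset(chunk, 'd', sizeof(chunk));

        int fd = open("data.db", O_RDWR | O_CREAT, 0644);  /* hypothetical file */
        if (fd < 0)
            return 1;

        /* Writes land in no particular order and touch only parts of
         * the file. */
        for (unsigned i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
            if (pwrite(fd, chunk, sizeof(chunk), offsets[i]) < 0)
                return 1;

        /* The file is never 'finished'; the program may sit idle for
         * hours or days and come back to write more before it ever
         * gets around to close(). */
        pause();
        return 0;
    }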

The result is that writing files is a diffuse activity while writing blocks is a very sharp one. You can clearly write to a file in a way that touches only a small portion of the file, and if the end of writing a file is when you close it you can write files very slowly, with huge gaps between your actual IO. And the system makes all of these patterns relatively efficient, unlike partial-block writes.

This causes problems for a number of things that want to react when a file is written. File-level deduplication is one example; another is real-time virus scanning, even with system support for hooking file events.
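
For instance, with Linux's inotify (a sketch; the watched directory is made up), the two obvious hooks are awkward in opposite ways: IN_MODIFY fires on every individual write(), while IN_CLOSE_WRITE only fires when a writer finally closes the file, which for a database may be weeks later.

    /* Sketch: watching for 'a file was written' with inotify (Linux).
     * Neither event maps cleanly onto the idea of a finished write:
     * IN_MODIFY can fire thousands of times for one logical update,
     * and IN_CLOSE_WRITE can be delayed for as long as the writer
     * keeps the file open. */
    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        int ifd = inotify_init();
        if (ifd < 0)
            return 1;
        if (inotify_add_watch(ifd, "/some/dir", IN_MODIFY | IN_CLOSE_WRITE) < 0)
            return 1;

        char buf[4096] __attribute__((aligned(8)));
        for (;;) {
            ssize_t n = read(ifd, buf, sizeof(buf));
            if (n <= 0)
                break;
            /* Walk all events returned by this read(). */
            for (char *p = buf; p < buf + n; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                printf("%s: %s\n", ev->len ? ev->name : "(watched dir)",
                       (ev->mask & IN_CLOSE_WRITE) ? "close-write" : "modify");
                p += sizeof(*ev) + ev->len;
            }
        }
        return 0;
    }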

(The more I think about it, the more I think that this is not just a Unix thing. Although I may have blinkered vision due to Unix, it's hard to see a viable API that could make writing files a fundamental activity. There are many situations where you just can't pregenerate all of a file before writing it even if you're writing things sequentially, plus there's random write IO to consider unless you make that an entirely separate 'database' API.)


Comments on this page:

From 109.76.102.66 at 2011-11-08 07:39:04:

Good points.

A note about treating files as units: currently, as you say, cp for example will just read a stream of bytes and write a stream of bytes. The consequences are that you're not told in advance if there is no space in the destination, and there's a greater chance of fragmentation in the presence of simultaneous writing. There is a new fallocate() call, in Linux filesystems at least, which can be called first with the size of the file to avoid the above issues.
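
As a rough sketch of what this comment describes (the helper name is made up, and fallocate() is Linux-specific; posix_fallocate() is the portable equivalent), a copy program can reserve the full destination size up front so a lack of space shows up immediately and the allocation is done in one go:

    /* Sketch: preallocate the full destination size with fallocate()
     * before copying, so running out of space is reported up front and
     * the destination's blocks can be allocated together. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* copy_with_preallocation is a hypothetical helper, not a real API. */
    int copy_with_preallocation(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        if (in < 0)
            return -1;
        struct stat st;
        if (fstat(in, &st) < 0) {
            close(in);
            return -1;
        }
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            close(in);
            return -1;
        }
        /* Ask for all the space now; this fails immediately if the
         * destination filesystem can't hold st.st_size bytes. */
        if (st.st_size > 0 && fallocate(out, 0, 0, st.st_size) < 0) {
            close(in);
            close(out);
            return -1;
        }
        /* ... then copy the data with ordinary read()/write() calls ... */
        close(in);
        close(out);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3)
            return 2;
        return copy_with_preallocation(argv[1], argv[2]) == 0 ? 0 : 1;
    }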
