Journaling filesystems and the fsync() problem

March 21, 2008

Consider your ordinary journaling filesystem. For simplicity and reliability you have a single, global log in which you put transactions for all of your filesystem activity, instead of anything more complicated. One useful consequence of this global log is that you have now created a filesystem-wide global order of all filesystem events (sometimes called a 'total order'), which will be preserved even if you crash and restart.

(You implicitly had a total order before, but it didn't necessarily survive crashes.)

This sounds great until someone does an fsync() to ensure that changes to their particular file are fully stable. Because you have a global log, changes to their file are intermixed with other changes; because your log has a total order, you have to commit everything up to the last modification point of their file, regardless of what any particular change modifies.

On a sufficiently busy system, almost all of the changes in the journal log will not be to the file being fsync()'d. Flushing and committing all of these unrelated changes is overhead that just serves to slow down the fsync(), sometimes by quite a lot.
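To make the cost concrete, here is a toy model of a single global log. It is only a sketch: the GlobalJournal class, its methods, and the file names are made up for illustration and do not correspond to any real filesystem's journal. It assumes each log record belongs to exactly one file and that fsync() must commit the whole prefix of the log up to that file's last record.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalJournal:
    """Toy model of one filesystem-wide log (hypothetical, for illustration)."""
    records: list = field(default_factory=list)  # file names, in total order
    committed: int = 0                           # records[:committed] are stable

    def modify(self, name):
        # Changes to every file are appended to the same log.
        self.records.append(name)

    def fsync(self, name):
        """Commit up to the last record for `name`.

        Returns (records flushed, records flushed that were for other files).
        """
        last = -1
        # Search the uncommitted tail of the log, newest record first.
        for i in range(len(self.records) - 1, self.committed - 1, -1):
            if self.records[i] == name:
                last = i
                break
        if last < 0:
            return (0, 0)  # nothing uncommitted for this file
        # The total order forces us to commit the whole prefix.
        flushed = self.records[self.committed:last + 1]
        self.committed = last + 1
        unrelated = sum(1 for r in flushed if r != name)
        return (len(flushed), unrelated)


journal = GlobalJournal()
for i in range(1000):
    journal.modify(f"other-{i % 50}")  # busy system: unrelated activity
    if i % 100 == 99:
        journal.modify("mail.db")      # the one file we will fsync()

total, unrelated = journal.fsync("mail.db")
print(total, unrelated)  # prints "1010 1000"
```

In this run the fsync() of one file with ten pending changes has to commit 1010 records, 1000 of which belong to other files. The ratio is an artifact of the made-up workload, but the shape of the problem is the same.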

You can get around this, but it generally requires a significantly more complicated filesystem and journal design, which may or may not be considered worth it in general. (Not that many applications actually use fsync(), and many of them are not all that speed sensitive. On the other hand, the exceptions tend to be pretty important.)
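For reference, this is roughly what one of those exceptions looks like from the application side. It is a minimal sketch in Python; durable_append and queue.log are hypothetical names, and a real program (a mail server or database, say) would have more error handling around this. The os.fsync() call is where the filesystem's journal commit cost gets paid.

```python
import os
import tempfile


def durable_append(path, data):
    """Append bytes to path and return only after fsync() has completed."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to force the file to stable storage


# Usage: write to a scratch directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "queue.log")
    durable_append(path, b"message accepted\n")
    with open(path, "rb") as f:
        contents = f.read()
```

The application only asked for its own file to be made stable; how much other work the filesystem has to do to honor that is invisible at this level.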

Written on 21 March 2008.