The story of leaving N% of your filesystem unused for performance

April 5, 2011

In a comment on 'Please don't alert based on percentages', Joce wrote:

I don't know if the example given is correct. I've heard that performances will drop dramatically when disk usage is above 80%.

Maybe, but probably not. This is almost one of those Unix legends like 'sync; sync; sync', but it's not quite entirely legendary. To explain, I need to go back to where this story comes from.

The original V7 Unix filesystem had extremely simple block and inode allocation policies and as a result it fragmented essentially instantly, ensuring a rapid degradation in performance. In his work on the Fast File System (FFS) for 4.x BSD, McKusick fixed this by dividing the filesystem up into allocation groups (FFS itself called them cylinder groups); this let BSD try to allocate related inodes and blocks close to each other, reducing fragmentation and keeping up the performance of the filesystem over time.

Once you have allocation groups, allocating blocks and inodes is no longer as fast and simple as 'find the first free object and go'. Instead you have to go through a multi-step procedure where you pick your best candidate allocation group, check to see if it has enough space, and either find where the space is inside or go try another. Since the BSD people were thinking ahead, they didn't assume that all of the data structures necessary to do this would fit into memory at once; sometimes doing this searching requires some disk IO.

(Or at least I think that 4.x BSD didn't assume that all information for all allocation groups for all filesystems could fit into kernel memory.)
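To make the multi-step procedure concrete, here is a minimal sketch in Python. It is a toy model, not FFS's actual code: the names `AllocGroup` and `allocate` are made up for illustration, a group is reduced to a bare count of free blocks, and the fallback order (simply trying the next group in sequence) is an assumption chosen for simplicity rather than what any real filesystem does.

```python
from dataclasses import dataclass


@dataclass
class AllocGroup:
    """Hypothetical toy allocation group: only tracks its free block count."""
    free_blocks: int


def allocate(groups, preferred, nblocks):
    """Try the preferred group first, then the following groups in order.

    Returns (group_index, groups_examined) on success, or
    (None, groups_examined) if no group has enough free blocks.
    The linear fallback order is an assumption of this sketch.
    """
    n = len(groups)
    for probes in range(1, n + 1):
        idx = (preferred + probes - 1) % n
        # In a real filesystem this check might need disk IO to read
        # the group's summary information; here it is a memory lookup.
        if groups[idx].free_blocks >= nblocks:
            groups[idx].free_blocks -= nblocks
            return idx, probes
    return None, n
```

With `groups = [AllocGroup(0), AllocGroup(0), AllocGroup(8), AllocGroup(100)]`, calling `allocate(groups, 0, 4)` returns `(2, 3)`: the preferred group and its neighbour are full, so three groups are examined before the space is found. Each of those examinations is where the extra work (and possibly the extra disk IO) comes from.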

When most allocation groups have free space, especially lots of free space, this approach runs pretty fast. Even if your primary allocation group is full, you'll probably only have to search one or two more in order to find a place to put stuff. But the fuller the filesystem is and the fewer allocation groups there are left with free space, the longer this approach has to search in order to find free space. Therefore, BSD advised people to leave 10% of the space free in order for the FFS allocation algorithms to still run well enough even as your filesystem filled up. I don't know if they based the 10% figure on solid experimentation or theory, or if they just took an educated guess.

(Beyond any write slowdown, as the filesystem filled up the data you were writing might become more and more fragmented as it had to be placed wherever there was space, instead of grouped together where it 'should' go.)
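The way search length grows as free groups become scarce can be illustrated with a small simulation. This is only a sketch under loud assumptions: groups are either completely full or have room, the search tries groups in linear order from the preferred one, and every group is equally likely to be the preferred one. The function names `average_probes` and `probe_curve` are hypothetical, and the numbers it produces say nothing about any real filesystem; only the shape of the curve is the point.

```python
import random


def average_probes(full, n_groups):
    """Average number of groups examined, taken over every possible
    preferred group, before one with free space is found.

    `full` is the set of indexes of completely full groups; at least
    one group must not be in it, or the search would never end.
    """
    total = 0
    for preferred in range(n_groups):
        probes = 1
        while (preferred + probes - 1) % n_groups in full:
            probes += 1
        total += probes
    return total / n_groups


def probe_curve(n_groups=200, fractions=(0.0, 0.5, 0.8, 0.9, 0.95), seed=1):
    """Fill groups in one fixed random order and return a list of
    (fraction_full, average_groups_examined) pairs."""
    order = list(range(n_groups))
    random.Random(seed).shuffle(order)
    curve = []
    for frac in fractions:
        # Always leave at least one group with free space.
        n_full = min(round(frac * n_groups), n_groups - 1)
        full = set(order[:n_full])
        curve.append((frac, average_probes(full, n_groups)))
    return curve
```

Printing the result of `probe_curve()` shows an average of exactly one group examined when nothing is full, and a figure that never decreases as more groups fill up, because the full groups at each step are a superset of those at the previous step. How steeply it climbs, and whether that matters, depends on the details that this toy model leaves out.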

This 10% was (perhaps) valid for 4.x BSD's FFS, with its code, its algorithms, its in-memory and on-disk data, and the machines and disks (and disk sizes and allocation group sizes and so on) of the time. All of those things have changed since then.

The problem is not that the 10% rule (or the 20% rule) is wrong or right today; it is that there is no general rule. Some filesystems, in some environments, on some sizes of disks and with certain filesystem parameters, may have this issue where a sufficiently full filesystem significantly degrades allocation speed and fragments data that you care about. Others will not. At a minimum, whether or not this happens is specific to the filesystem implementation, because a lot depends on how much the filesystem keeps in memory (and how it does allocations).

(A filesystem that can immediately find an allocation group that has enough space is in much better shape than one that may need to check several allocation groups, potentially reading things off disk each time. And little details can matter here.)

It's also worth noting that whether or not this even matters to you depends a lot on your IO patterns and exactly how you're doing IO. For example, if your IO load is almost entirely random read IOs, you don't care much about either slow writes or filesystem fragmentation; you're probably already seek-limited no matter what the filesystem does. And while a database may be doing a lot of writes, it is probably not doing a lot of block and inode allocation through the filesystem.

(Unless, of course, you're running on a copy-on-write filesystem like ZFS. See what I mean about the filesystem mattering here?)

PS: for more background on this, see sources such as Wikipedia. Hopefully I have all of the BSD history bits correct, since I initially learned a certain amount of this through Unix folklore.

Written on 05 April 2011.

Last modified: Tue Apr 5 23:54:15 2011