Thinking about filesystem space allocation policies and SSDs

May 2, 2024

Historically, many filesystems have devoted a significant amount of effort to sophisticated space allocation policies. For example, in Unix one of the major changes from V7 to 4.x BSD was the move to the Berkeley Fast File System, with its concept of 'cylinder groups' that drastically improved the locality of file data, directory data, and inodes. Various other (Unix) filesystem allocation techniques have been developed since, for example delayed allocation, the idea of putting off the decision about where exactly data will live in the filesystem until it's about to be written out, which allows the filesystem to group data better (especially in the face of the fsync problem, where only some of the data may get written out right now).
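(As a toy illustration of why delaying the decision helps, here is a minimal sketch. It is not how any real filesystem does it; the function names, the 'disk' that is just a counter of block addresses, and the write pattern are all made up for the example. Two files are written in interleaved bursts, and we count how many contiguous extents each file ends up with.)

```python
from collections import defaultdict


def allocate_immediately(writes):
    """Give every block a disk address at the moment it is written."""
    layout = defaultdict(list)
    next_free = 0
    for name, nblocks in writes:
        for _ in range(nblocks):
            layout[name].append(next_free)
            next_free += 1
    return dict(layout)


def allocate_delayed(writes):
    """Buffer all writes, then place each file's blocks together at flush time."""
    pending = defaultdict(int)
    for name, nblocks in writes:
        pending[name] += nblocks
    layout = {}
    next_free = 0
    for name, total in pending.items():
        layout[name] = list(range(next_free, next_free + total))
        next_free += total
    return layout


def count_extents(blocks):
    """Count the contiguous runs in a list of block addresses."""
    if not blocks:
        return 0
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


# Hypothetical workload: files "a" and "b" written two blocks at a time, interleaved.
writes = [("a", 2), ("b", 2), ("a", 2), ("b", 2)]
immediate = allocate_immediately(writes)
delayed = allocate_delayed(writes)
```

With this write pattern, immediate allocation leaves each file in two separate extents, while the delayed version puts each file in a single one.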

Traditionally, filesystems really cared about this (and spent so much effort on allocation policies) because disk seeks (on HDDs) were very expensive and issuing extra commands to disks was somewhat expensive even when they didn't require seeks. Solid state disks demolish much of this. Obviously they don't 'seek' as such, and their internal divisions are opaque (and they change, as logical blocks are rewritten on different areas of internal flash). SATA SSDs do still have some limits on the number of commands that can be issued to them, and I believe SAS SSDs do as well. NVMe SSDs famously can handle huge numbers of commands and I believe generally do better with multiple commands being issued to them at once. I believe that there is still an advantage on NVMe SSDs to doing relatively large IOs, so even an SSD-focused filesystem would like to store data in large contiguous chunks rather than scattering its data randomly across the NVMe's storage in 4 Kbyte chunks.
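(To put rough numbers on the 'large contiguous chunks' point, here is a small sketch that counts how many IO commands it takes to read a set of 4 Kbyte blocks when adjacent blocks can be merged into one command. The function name and the 32-block (128 Kbyte) cap on a single IO are assumptions for the example, not properties of any particular drive or kernel.)

```python
def count_io_commands(block_addrs, max_blocks_per_io):
    """Count IO commands needed if runs of adjacent blocks are merged,
    up to max_blocks_per_io blocks per command."""
    commands = 0
    run_len = 0
    prev = None
    for addr in block_addrs:
        if prev is not None and addr == prev + 1 and run_len < max_blocks_per_io:
            run_len += 1      # extend the current command
        else:
            commands += 1     # start a new command
            run_len = 1
        prev = addr
    return commands


# 1 Mbyte of data as 256 blocks of 4 Kbytes each.
contiguous = list(range(256))
scattered = list(range(0, 512, 2))  # same number of blocks, none adjacent

contiguous_cmds = count_io_commands(contiguous, max_blocks_per_io=32)
scattered_cmds = count_io_commands(scattered, max_blocks_per_io=32)
```

Under these assumptions the contiguous layout needs 8 commands and the scattered one needs 256. How much that matters on a given SSD is a separate question, but it shows what the filesystem is trying to buy with contiguous allocation.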

Where this becomes potentially relevant to ordinary people running systems (as opposed to filesystem authors) is that some filesystems will switch between different space allocation strategies depending on various things, like how much free space is left on the filesystem. If you're using SATA/SAS SSDs or especially NVMe SSDs, it may make sense to change when this strategy shift occurs. However, if you have a generally low rate of writes, it's probably not going to make much of a difference (this is Amdahl's Law poking its head up again).

(However, you may have occasional periods of high write rates where you really care about the write latency and thus you care about this issue along with things like disk write buffering and its interactions with write flushes.)
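(The Amdahl's Law point can be made concrete with a little arithmetic. The sketch below uses the standard formula; the fractions and the 2x figure are made-up numbers for illustration, not measurements of anything.)

```python
def amdahl_speedup(fraction_affected, local_speedup):
    """Overall speedup when only fraction_affected of the time gets
    faster by a factor of local_speedup (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)


# Hypothetical: a better allocation strategy makes writes 2x faster.
quiet = amdahl_speedup(fraction_affected=0.05, local_speedup=2.0)  # writes are 5% of the time
burst = amdahl_speedup(fraction_affected=0.60, local_speedup=2.0)  # writes are 60% of the time

print(f"low write rate: {quiet:.3f}x overall")
print(f"write burst:    {burst:.3f}x overall")
```

With these numbers the overall improvement is under 3% when writes are a small part of the load, but over 40% during a write-heavy burst, which is why the bursts are where you might care.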

In addition, sometimes what the filesystem is switching between is not really a faster and a slower allocation strategy but instead strategies that differ in, say, how fragmented free space gets (as with ZFS space allocation from metaslabs). Even if the 'more fragmented' option is faster, you may not want to change where that mode starts (or ends) unless you really know what you're doing.
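(A strategy switch of this general kind can be sketched as follows. This is a toy allocator over a list of free extents, loosely in the spirit of switching from first-fit to best-fit as free space runs low; it is not ZFS's actual code, and the function names and the 10% threshold are invented for the example.)

```python
def first_fit(free_extents, size):
    """Index of the first free extent that is big enough, or None."""
    for i, (_start, length) in enumerate(free_extents):
        if length >= size:
            return i
    return None


def best_fit(free_extents, size):
    """Index of the smallest free extent that is big enough, or None."""
    candidates = [(length, i) for i, (_start, length) in enumerate(free_extents)
                  if length >= size]
    return min(candidates)[1] if candidates else None


def allocate(free_extents, size, total_blocks, switch_below=0.10):
    """Allocate size blocks, using first-fit while free space is plentiful
    and best-fit once the free fraction drops below switch_below.
    Returns the starting block, or None if nothing fits."""
    free = sum(length for _start, length in free_extents)
    pick = first_fit if free / total_blocks >= switch_below else best_fit
    i = pick(free_extents, size)
    if i is None:
        return None
    start, length = free_extents[i]
    if length == size:
        del free_extents[i]
    else:
        free_extents[i] = (start + size, length - size)
    return start


# Same free extents, different filesystem sizes (so different free fractions).
mostly_empty = allocate([(0, 50), (100, 8)], size=8, total_blocks=100)
nearly_full = allocate([(0, 50), (100, 8)], size=8, total_blocks=1000)
```

In the mostly empty case the allocator takes the first extent that fits and carves 8 blocks off the big 50-block extent (starting at block 0). In the nearly full case it uses the exactly sized extent at block 100 and leaves the large extent intact. The `switch_below` parameter stands in for the kind of threshold you might be tempted to tune.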

(Space allocation isn't the only place where filesystems have or had tuning and settings for HDDs that aren't necessarily applicable to SSDs.)
