Ext4, SSDs, and RAID stripe parameters

October 20, 2017

I was recently reading "Testing disks: Lessons from our odyssey selecting replacement SSDs" (via). In this article, the BBC technical people talk earnestly about carefully picking stride and stripe width values for ext4 on their SSDs and point to a blog post on the subject. Me being me, I immediately wondered what effects these RAID-related settings actually had in ext4, so I headed off to the kernel source code to take a look. The short spoiler is 'not as much as you think'.

First, setting both the stride and the stripe width is redundant as far as the kernel's ext4 block allocation goes; the kernel code only uses one of the two, preferring the stripe width if possible (see ext4_get_stripe_size in fs/ext4/super.c). Setting the stride as well does have a small effect on the layout of an ext4 filesystem; it appears to cause some metadata structures to be pushed up to start on a stride boundary when mke2fs creates the filesystem.

(In the kernel, the stripe width and stride are ignored if they're larger than the number of blocks per block group. According to Ext4 Disk Layout and various other sources, there are normally 32,768 filesystem blocks per block group, which with 4 KByte blocks gives a block group size of 128 MBytes, so this probably won't be an issue for you.)
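As a rough illustration of the selection logic described above, here is a simplified Python model. It is not the kernel code: the function and parameter names are mine, it covers only the behaviour described here (prefer the stripe width, fall back to the stride, ignore values larger than a block group), and it assumes the usual layout where a single bitmap block tracks each block group.

```python
def blocks_per_group(block_size: int) -> int:
    """One bitmap block tracks a block group, using one bit per block."""
    return block_size * 8


def effective_stripe(stride: int, stripe_width: int, block_size: int = 4096) -> int:
    """Simplified model of the stripe value (in blocks) used for allocation.

    Returns 0 when neither setting is usable.
    """
    limit = blocks_per_group(block_size)
    # The stripe width is tried first, so it wins whenever it is usable.
    for candidate in (stripe_width, stride):
        if 0 < candidate <= limit:
            return candidate
    return 0


bpg = blocks_per_group(4096)
print("blocks per group:", bpg)
print("block group size in MiB:", bpg * 4096 // 2**20)
print("both set:", effective_stripe(stride=128, stripe_width=256))
print("stride only:", effective_stripe(stride=128, stripe_width=0))
```

In this model, setting the stride only matters for allocation when the stripe width is unset or too large to use.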

As far as I can tell from trying to understand mballoc.c, the stripe size only has a few effects on block allocation. First, if your write is for an exact multiple of the stripe size, ext4 will generally try to align it to a stripe boundary if possible (assuming there's sufficient unfragmented free space). This is especially likely if you write exactly one stripe's worth of data.
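To make this first effect concrete, here is a toy model; it is not the mballoc algorithm. It treats a block group as a list of free/used flags and, only when the request length is an exact multiple of the stripe size, looks for a free run that starts on a stripe boundary. Otherwise, or if no aligned run exists, it falls back to plain first fit. All names here are hypothetical.

```python
from typing import List, Optional


def find_run(free: List[bool], length: int, step: int = 1) -> Optional[int]:
    """Return the first start index (a multiple of step) with `length` free blocks."""
    for start in range(0, len(free) - length + 1, step):
        if all(free[start:start + length]):
            return start
    return None


def allocate(free: List[bool], length: int, stripe: int) -> Optional[int]:
    """Toy allocator: try stripe-aligned placement only for stripe-multiple requests."""
    if stripe > 1 and length % stripe == 0:
        start = find_run(free, length, step=stripe)
        if start is not None:
            return start             # aligned placement found
    return find_run(free, length)    # otherwise plain first fit


# 32 blocks with the first three in use, and a stripe of 8 blocks.
group = [False] * 3 + [True] * 29
print(allocate(group, 8, stripe=8))   # stripe-multiple request
print(allocate(group, 6, stripe=8))   # not a multiple of the stripe
```

The point of the sketch is the condition on the request length: a six-block request never even attempts the aligned search.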

The second use is more complicated (and I may not understand it correctly). For small files, ext4 allocates space out of 'locality groups', which are given preallocated space in bulk that they can then parcel out (among other things, this keeps small files together on disk). When you have a stripe size set, the size of each locality group's preallocated space is rounded up to a multiple of the stripe size, and I believe it's aligned with stripe boundaries. Individual allocations within a locality group's preallocated space don't seem to be aligned to the stripe size if they're not multiples of it.
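The rounding described above is simple arithmetic, sketched below. The default locality group preallocation of 512 blocks is my recollection of the kernel default rather than something established above, so treat it as a placeholder and check your own kernel; the function names are made up for illustration.

```python
DEFAULT_GROUP_PREALLOC = 512  # blocks; assumed default, verify against your kernel


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def group_prealloc(stripe: int, base: int = DEFAULT_GROUP_PREALLOC) -> int:
    """Locality group preallocation size in blocks for a given stripe size."""
    if stripe > 1:
        return round_up(base, stripe)
    return base


for stripe in (0, 128, 384, 1000):
    print(stripe, group_prealloc(stripe))
```

Under this assumption, a stripe size that already divides the base value changes nothing, while an awkward one (384 blocks, say) grows the preallocation to the next multiple.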

Comments in the source code suggest that the goal in both cases is to avoid fragmenting stripes and fragmenting things across stripes. However, it's not clear to me that most allocations particularly avoid doing either; certainly they don't explicitly look at the relevant C variable that holds the stripe size.

Having gone through reading the ext4 kernel code, my overall conclusion is that you should benchmark things before you assume that setting the RAID stripe width and stride is doing anything meaningful on ext4 on an SSD. Also, for maximum benefit it seems very likely that you want your applications to do their large writes in multiples of whatever stripe width you set. Of course, writing data out in erase-block sized chunks seems like a good idea in general; regardless of alignment issues, it probably gives the SSD firmware its best chance to avoid read-modify-write cycles.
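If you want an application to write in multiples of the stripe width, the sizing can be sketched as follows. This is only an illustration with hypothetical helper names; it assumes the stripe width is expressed in filesystem blocks (as it is for mke2fs) and a 4 KByte block size by default.

```python
from typing import List


def stripe_bytes(stripe_width_blocks: int, block_size: int = 4096) -> int:
    """Size of one full stripe in bytes."""
    return stripe_width_blocks * block_size


def write_sizes(total: int, stripe_width_blocks: int, block_size: int = 4096,
                stripes_per_write: int = 1) -> List[int]:
    """Split `total` bytes into stripe-multiple writes plus one final remainder."""
    unit = stripe_bytes(stripe_width_blocks, block_size) * stripes_per_write
    if unit <= 0:
        raise ValueError("stripe width and stripes_per_write must be positive")
    full, tail = divmod(total, unit)
    return [unit] * full + ([tail] if tail else [])


# A 2.5 MiB file with a stripe width of 256 blocks (1 MiB at 4 KiB per block).
print(write_sizes(2 * 2**20 + 512 * 1024, stripe_width_blocks=256))
```

Only the final, short write falls outside the stripe-multiple pattern; whether that matters is exactly the sort of thing to benchmark.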

(When you test this, you may want to use blktrace to make sure that ext4 is actually issuing large right-sized writes out to the SSD and isn't doing something problematic like slicing them up into smaller chunks. Some block IO tuning may turn out to be necessary.)
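One way to check the sizes is to summarize the output of blkparse. The sketch below assumes blkparse's default text format, where an issue ('D') event looks roughly like "dev cpu seq time pid D W sector + nsectors [process]" and sizes are in 512-byte sectors; verify this against your own trace before trusting the numbers. The function names are mine.

```python
from collections import Counter
from typing import Iterable

SECTOR_BYTES = 512  # blktrace reports sizes in 512-byte sectors


def issued_write_sizes(lines: Iterable[str]) -> Counter:
    """Count the sizes (in bytes) of writes issued to the device ('D' events)."""
    sizes: Counter = Counter()
    for line in lines:
        f = line.split()
        # Expected fields: dev cpu seq time pid action rwbs sector + nsectors [proc]
        if len(f) < 10 or f[5] != "D" or "W" not in f[6] or f[8] != "+":
            continue
        sizes[int(f[9]) * SECTOR_BYTES] += 1
    return sizes


def off_size_fraction(sizes: Counter, stripe_bytes: int) -> float:
    """Fraction of issued writes whose size is not a multiple of stripe_bytes."""
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    bad = sum(n for size, n in sizes.items() if size % stripe_bytes)
    return bad / total


sample = [
    "  8,0    1     3187     0.001234567  2154  D  WS 1234567 + 2048 [dd]",
    "  8,0    1     3188     0.001300000  2154  D   W 2048 + 8 [dd]",
    "  8,0    1     3189     0.001400000  2154  C  WS 1234567 + 2048 [0]",
]
print(issued_write_sizes(sample))
```

This only looks at write sizes, not at where the writes start; checking the starting sector against the stripe size would be a natural extension.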

Written on 20 October 2017.

Last modified: Fri Oct 20 01:03:17 2017