The details behind zpool list's new fragmentation percentage
In an earlier entry I explained that
zpool list's new FRAG field is a measure of how fragmented
the free space in the pool is, but I ignored all of the actual
details. Today it's time to fix that, and to throw in the general
background on top of it. So first we need to start by talking
about free (disk) space.
All filesystems need to keep track of free disk space somehow. ZFS does so using a number of metaslabs, each of which has a space map; simplifying a bunch, spacemaps keep track of segments of contiguous free space in the metaslab (up to 'the whole metaslab'). A couple of years ago, a new ZFS feature called spacemap_histogram was added as part of a spacemap/metaslab rework. Spacemap histograms maintain a powers-of-two histogram of how big the segments of free space in metaslabs are. The motivation for this is, well, let me just quote from the summary of the rework:
The current [pre-histogram] disk format only stores the total amount of free space [in a metaslab], which means that heavily fragmented metaslabs can look appealing, causing us to read them off disk, even though they don't have enough contiguous free space to satisfy large allocations, leading us to continually load the same fragmented space maps over and over again.
(Note that when this talks about 'heavily fragmented metaslabs' it means heavily fragmented free space.)
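To make the 'powers-of-two histogram' idea concrete, here is a minimal Python sketch of bucketing free segments by size. The function name and the input format (a plain list of segment lengths in bytes) are made up for illustration, and it assumes that a segment lands in the bucket for the largest power of two that is no bigger than its length. The real code works on on-disk space maps, not Python lists.

```python
from collections import Counter

def free_space_histogram(segment_lengths):
    """Map bucket number -> number of free segments in that bucket.

    Assumption: a segment of `length` bytes goes into bucket
    floor(log2(length)), so bucket 9 holds 512..1023 byte segments,
    bucket 10 holds 1024..2047 byte segments, and so on.
    """
    hist = Counter()
    for length in segment_lengths:
        if length <= 0:
            raise ValueError("segment lengths must be positive")
        hist[length.bit_length() - 1] += 1
    return dict(hist)

# Hypothetical free segments: two small ones, one 128 KB, one 16 MB.
segments = [512, 700, 128 * 1024, 16 * 1024 * 1024]
print(free_space_histogram(segments))  # {9: 2, 17: 1, 24: 1}
```

A histogram like this only records how many segments fall into each size class, not where they are, which is enough to tell a metaslab with one big free chunk apart from one with the same amount of free space in tiny pieces.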
To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from '0' for the 16 MB and larger buckets up to '100' for the 512 byte bucket. And then, well, once again I'll just quote directly from the source:
This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.
My first entry summarized the
current values in the table, or you can read the actual
table in the source code. There is one important bit that is not
in the table at all, which is that a metaslab with no free space
left is considered 0% fragmented.
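The calculation described in that comment can be sketched in a few lines of Python. This is an illustration, not the ZFS code: `FRAG_TABLE` and `metaslab_fragmentation` are hypothetical names, and only the endpoints of the table (100 for the 512 byte bucket, 0 for 16 MB and up) come from the text above. The intermediate values are placeholders that should be checked against the real zfs_frag_table in the source. The sketch also assumes each segment in bucket N counts as exactly 2^N bytes of free space.

```python
# Illustrative per-bucket fragmentation percentages. Index 0 is the
# 512 byte (2^9) bucket; the last entries cover 16 MB (2^24) and larger.
# Only the first and last values are taken from the text; the rest are
# placeholders, not a verified copy of zfs_frag_table.
FRAG_TABLE = [100, 100, 98, 95, 90, 80, 70, 60, 50, 40, 30, 20, 15, 10, 5, 0, 0]
MIN_BUCKET = 9  # 2^9 = 512 bytes

def metaslab_fragmentation(histogram):
    """histogram maps bucket number (a power of two) -> segment count."""
    total_space = 0
    weighted = 0
    for bucket, count in histogram.items():
        if bucket < MIN_BUCKET:
            raise ValueError("buckets smaller than 512 bytes are not expected")
        # Approximate the free space in this bucket as count * 2^bucket.
        space = count << bucket
        # Buckets past the end of the table use the last table entry.
        idx = min(bucket - MIN_BUCKET, len(FRAG_TABLE) - 1)
        total_space += space
        weighted += space * FRAG_TABLE[idx]
    if total_space == 0:
        # A metaslab with no free space counts as 0% fragmented.
        return 0
    return weighted // total_space

# Half the free space in 512 byte segments, half in one 16 MB segment.
print(metaslab_fragmentation({9: 32768, 24: 1}))  # 50
```

Note that the weighting is by free space in each bucket, not by segment count, so a single 16 MB segment balances out 32768 separate 512 byte ones.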
A pool's fragmentation value is derived in a two-step process, because metaslabs are actually grouped together in 'metaslab groups' (I believe each vdev gets one). All metaslabs in a metaslab group are the same size, so the fragmentation for a metaslab group is just the average fragmentation over all metaslabs with valid spacemap histograms. The overall pool fragmentation is then derived from the metaslab group fragmentations, weighted by how much total space each metaslab group contributes (not how much free space).
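As a rough sketch of this two-step roll-up (with hypothetical function names and made-up numbers, and using `None` to stand in for a metaslab that has no valid spacemap histogram):

```python
def group_fragmentation(metaslab_frags):
    """Plain average over the metaslabs that have a fragmentation value.

    metaslab_frags is a list of per-metaslab percentages, with None for
    metaslabs lacking a valid spacemap histogram (a convention of this
    sketch, not of ZFS).
    """
    valid = [f for f in metaslab_frags if f is not None]
    if not valid:
        return None
    return sum(valid) // len(valid)

def pool_fragmentation(groups):
    """Average of group values weighted by each group's TOTAL space.

    groups is a list of (group_fragmentation, total_space_bytes) pairs.
    """
    total_space = sum(space for _, space in groups)
    weighted = sum(frag * space for frag, space in groups)
    return weighted // total_space

# Two hypothetical vdevs: a big lightly fragmented one and a small
# heavily fragmented one.
big = group_fragmentation([10, 30, None])    # 20
small = group_fragmentation([50, 70])        # 60
print(pool_fragmentation([(big, 3000), (small, 1000)]))  # 30
```

Because the weighting uses total space rather than free space, a large and nearly full vdev still pulls the pool number strongly toward its own fragmentation value.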
A sufficiently recent pool will have spacemap histograms for all
metaslabs. A pool that was created before this feature was added
but then upgraded may not have spacemap histograms created for all
of its metaslabs yet (I believe that a spacemap histogram is only
added if the metaslab spacemap winds up getting written out with
changes). If too many metaslabs in any single metaslab group lack
spacemap histograms, the pool is considered to not have an overall
fragmentation percentage (zpool list will report this as a FRAG
value of '-', even though the pool has the spacemap_histogram feature).
(Currently 'too many metaslabs' is 'half or more of the metaslabs in a metaslab group', but this may change.)
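The 'half or more' rule can be written down as a small check. This is a sketch of the rule as described here, with hypothetical names, and as noted the threshold may change in the real code:

```python
def group_has_fragmentation(metaslab_frags):
    """False if half or more of the group's metaslabs lack a histogram.

    None marks a metaslab without a spacemap histogram (a convention of
    this sketch only).
    """
    missing = sum(1 for f in metaslab_frags if f is None)
    return missing * 2 < len(metaslab_frags)

def pool_has_fragmentation(groups):
    """The pool gets a FRAG value only if every metaslab group has one."""
    return all(group_has_fragmentation(g) for g in groups)

# One group is fine, the other has 2 of 4 metaslabs without histograms,
# so the whole pool would show '-' for FRAG.
print(pool_has_fragmentation([[10, 20, 30, None], [5, None, None, 15]]))  # False
```

The practical consequence is that a single old, rarely written vdev is enough to blank out the pool-wide number.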
You can inspect raw metaslab spacemap histograms through
'zdb -mm <POOL>'. Note that the on-disk histogram has more
buckets than the fragmentation percentage table does (it has 32
to zfs_frag_table's 17). The bucket numbers printed
represent raw powers of two, eg a bucket number of 10 is 2^10 bytes
or 1 KB; this implies that you'll never see a bucket number smaller
than the vdev's ashift. Zdb also reports the calculated fragmentation
percentage for each metaslab.
(It looks like mdb can also dump this information when it is
reporting on appropriate vdevs, via '::vdev -m'. I have not
investigated this, just noticed it in the source.)
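When reading zdb's histogram output, it can help to turn bucket numbers back into sizes. Here is a small Python helper for that. The names are hypothetical, and the upper bound of each bucket is an assumption based on the histogram being powers of two (so bucket N would cover 2^N up to just under 2^(N+1) bytes).

```python
def bucket_size_range(bucket):
    """Return (smallest, largest) segment size in bytes for a bucket."""
    low = 1 << bucket
    return low, (low << 1) - 1

def human(nbytes):
    """Format an exact power-of-two byte count, e.g. 1024 -> '1 KB'."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    value = nbytes
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024

# Bucket 10 is 2^10 bytes, ie 1 KB; with ashift=12 the smallest bucket
# you would expect to see is 12 (4 KB).
for bucket in (9, 10, 12, 24):
    print(bucket, human(bucket_size_range(bucket)[0]))
```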
The metaslab fragmentation number is used for more than just reporting
a metric in
zpool list. There are a number of bits of ZFS block
allocation that pay attention to it when deciding what metaslab to
allocate new space from. There are also some ZFS global variables
related to this, but since I haven't dug into this area at all I'm
not going to say anything about them.
(In the Illumos source, all of this is in uts/common/fs/zfs/metaslab.c; you want to search for all of the things that talk about fragmentation. Note that there are multiple levels of functions involved in this.)