
2015-12-05

The details behind zpool list's new fragmentation percentage

In this entry I explained that zpool list's new FRAG field is a measure of how fragmented the free space in the pool is, but I ignored all of the actual details. Today it's time to fix that, and to throw in the general background on top of it. So we need to start by talking about free (disk) space.

All filesystems need to keep track of free disk space somehow. ZFS does so using a number of metaslabs, each of which has a space map; simplifying a bunch, space maps keep track of segments of contiguous free space in the metaslab (up to 'the whole metaslab'). A couple of years ago, a new ZFS feature called spacemap_histogram was added as part of a spacemap/metaslab rework. A spacemap histogram is a powers-of-two histogram of how big the segments of free space in a metaslab are. The motivation for this is, well, let me just quote from the summary of the rework:

The current [pre-histogram] disk format only stores the total amount of free space [in a metaslab], which means that heavily fragmented metaslabs can look appealing, causing us to read them off disk, even though they don't have enough contiguous free space to satisfy large allocations, leading us to continually load the same fragmented space maps over and over again.

(Note that when this talks about 'heavily fragmented metaslabs' it means heavily fragmented free space.)
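(To make the histogram idea concrete, here's a small Python sketch of the concept. The function name and the dict representation are mine, made up for illustration; real ZFS keeps a fixed-size array in each spacemap, not anything like this.)

    # Bucket each contiguous free segment by floor(log2(size)).
    # This is the idea only, not ZFS's actual on-disk format.
    def segment_histogram(free_segment_sizes, min_shift=9):
        """Count free segments per power-of-two bucket.

        min_shift is the smallest possible bucket (2^9 = 512 bytes)."""
        histogram = {}
        for size in free_segment_sizes:
            bucket = max(size.bit_length() - 1, min_shift)
            histogram[bucket] = histogram.get(bucket, 0) + 1
        return histogram

    # Three 4 KB holes and one 1 MB hole:
    #   segment_histogram([4096, 4096, 4096, 1 << 20]) -> {12: 3, 20: 1}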

To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from '100' for the 512-byte bucket down to '0' for the 16 MB and larger buckets. Then, well, once again I'll just quote directly from the source:

This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.

My first entry summarized the current values in the table, or you can read the actual zfs_frag_table table in the source code. There is one important bit that is not in the table at all, which is that a metaslab with no free space left is considered 0% fragmented.
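Put into a rough Python sketch (my names, simplified to assume 512-byte sectors), the per-metaslab calculation looks something like this. The table values are what I see in the Illumos zfs_frag_table as I write this, so check the source for the current numbers.

    # Fragmentation percentage per bucket, from 512 bytes at index 0 up
    # to the 16 MB and 32 MB buckets at the end; mirrors zfs_frag_table.
    FRAG_TABLE = [100, 100, 98, 95, 90, 80, 70, 60, 50, 40,
                  30, 20, 15, 10, 5, 0, 0]

    def metaslab_fragmentation(histogram, min_shift=9):
        """histogram: {bucket: count}, counting segments of ~2^bucket bytes."""
        total_space = 0
        weighted = 0
        for bucket, count in histogram.items():
            space = count << bucket                  # bytes free in this bucket
            # Buckets past the end of the table clamp to the final 0% entry.
            idx = min(bucket - min_shift, len(FRAG_TABLE) - 1)
            weighted += space * FRAG_TABLE[idx]
            total_space += space
        if total_space == 0:
            return 0        # a metaslab with no free space is 0% fragmented
        return weighted // total_space

(Feeding in the little histogram from the earlier sketch gives 20%, because the single 1 MB segment is most of the free space.)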

A pool's fragmentation value is derived in a two-step process, because metaslabs are actually grouped together into 'metaslab groups' (I believe each vdev gets one). All metaslabs in a metaslab group are the same size, so the fragmentation of a metaslab group is just the average fragmentation over all of its metaslabs that have valid spacemap histograms. The overall pool fragmentation is then derived from the metaslab group fragmentations, weighted by how much total space each metaslab group contributes (not how much free space).

A sufficiently recent pool will have spacemap histograms for all metaslabs. A pool that was created before this feature was added but then upgraded may not have spacemap histograms for all of its metaslabs yet (I believe a spacemap histogram is only added when a metaslab's spacemap winds up getting written out with changes). If too many metaslabs in any single metaslab group lack spacemap histograms, the pool is considered not to have an overall fragmentation percentage (zpool list will report this as a FRAG value of '-', even though the spacemap_histogram feature is active).

(Currently 'too many metaslabs' is 'half or more of the metaslabs in a metaslab group', but this may change.)
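In rough Python again (made-up names, with None standing in for 'no fragmentation value'), the two steps plus the validity rule look something like this:

    def group_fragmentation(metaslab_frags):
        """metaslab_frags: one value per metaslab, None = no histogram yet."""
        valid = [f for f in metaslab_frags if f is not None]
        # Half or more of the metaslabs lacking histograms invalidates
        # the whole group.
        if 2 * len(valid) <= len(metaslab_frags):
            return None
        # All metaslabs in a group are the same size, so a plain
        # (unweighted) average is enough.
        return sum(valid) // len(valid)

    def pool_fragmentation(groups):
        """groups: (total_space, group_frag) pairs, one per metaslab group."""
        total = weighted = 0
        for space, frag in groups:
            if frag is None:
                return None          # zpool list shows this as a FRAG of '-'
            weighted += space * frag
            total += space
        return weighted // total if total else 0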

You can inspect raw metaslab spacemap histograms through zdb, using 'zdb -mm <POOL>'. Note that the on-disk histogram has more buckets than the fragmentation percentage table does (32 entries versus zfs_frag_table's 17). The bucket numbers printed are raw powers of two, e.g. a bucket number of 10 means segments of 2^10 bytes (1 KB); this implies that you'll never see a bucket number smaller than the vdev's ashift. zdb also reports the calculated fragmentation percentage for each metaslab (as 'fragmentation NN').
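(Getting from a raw zdb bucket number to an index in the fragmentation table is then just a subtraction and a clamp; the 9 below is the 512-byte floor, SPA_MINBLOCKSHIFT in the source. A sketch:)

    def bucket_to_table_index(bucket, min_shift=9, table_size=17):
        """Map a raw power-of-two bucket from 'zdb -mm' onto the table."""
        return min(bucket - min_shift, table_size - 1)

    # bucket 10 (2^10 = 1 KB) -> index 1 (currently 100%)
    # bucket 30 (2^30 = 1 GB) -> index 16 (clamped into the final 0% bucket)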

(It looks like mdb can also dump this information when it is reporting on appropriate vdevs, via '::vdev -m'. I have not investigated this, just noticed it in the source.)

The metaslab fragmentation number is used for more than just reporting a metric in zpool list. There are a number of bits of ZFS block allocation that pay attention to it when deciding what metaslab to allocate new space from. There are also some ZFS global variables related to this, but since I haven't dug into this area at all I'm not going to say anything about them.

(In the Illumos source, all of this is in uts/common/fs/zfs/metaslab.c; you want to search for all of the things that talk about fragmentation. Note that there's multiple levels of functions involved in this.)
