2015-12-28
The limits of what ZFS scrubs check
In the ZFS community, there is a widespread view that ZFS scrubs
are the equivalent of fsck for ordinary filesystems and so check
for and find at least as many error conditions as fsck does.
Unfortunately this view of ZFS scrubs is subtly misleading and can
lead you to expect them to do things that they simply don't.
The simple version of what a ZFS scrub does is that it verifies the checksum for every copy of every (active) block in the ZFS pool. It also explicitly verifies parity blocks for RAIDZ vdevs (which a normal error-free read does not). In the process of doing this verification, the scrub must walk the entire object tree of the pool from the top downwards, which has the side effect of more or less verifying this hierarchy; certainly if there's something like a directory entry that points to an invalid thing, you will get a checksum error somewhere in the process.
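To make the distinction concrete, here is a deliberately simplified sketch (in Python) of the kind of check a scrub performs; the block structure and the use of SHA-256 here are my invention for illustration and look nothing like the real ZFS code.

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class Block:
        copies: list        # raw bytes of each stored copy of this block
        checksum: bytes     # checksum recorded for it by its parent
        children: list = field(default_factory=list)   # blocks this one points to

    def scrub(block, errors):
        # Verify every copy of every active block against the checksum
        # recorded for it, and recurse down the object tree.
        for data in block.copies:
            if hashlib.sha256(data).digest() != block.checksum:
                errors.append(block)
        for child in block.children:
            scrub(child, errors)
        # Note what is *not* here: no check that the decoded contents make
        # sense (sane inode modes, consistent free space accounting, and
        # so on). Corrupt data written with a matching checksum passes.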
However, this is all that a ZFS scrub verifies. In particular, it
does not check the consistency and validity of metadata that isn't
necessary to walk the ZFS object tree. This includes things like
much of the inode data that is returned by stat() calls, and also
internal structural information that is not necessary to walk the
tree. Such information is simply tacitly assumed to be correct if
its checksum verifies.
What this means at a broad level is that while a ZFS scrub guards
against on-disk corruption of data that was correct when it was
written, it does not protect against data that was already corrupt
when it was written. If RAM errors or ZFS bugs
cause corrupt data to be written out, a ZFS scrub will not detect it
even though the corruption may be obvious in, for example, an 'ls -l'.
This is not just a theoretical issue;
it has been encountered on multiple platforms.
(I also believe that ZFS scrubs don't try to do full consistency checks on ZFS's tracking of free disk blocks. I'm not sure if they even try to check that all in-use blocks are actually marked that way.)
This means that a ZFS scrub performs somewhat different checks than
a traditional fsck does. Unlike a scrub, a traditional fsck can't
verify block integrity except indirectly, but it does a lot of
explicit consistency checks on things like inode modes to make sure
they're sane, and it does verify that the filesystem's idea of free
space is correct.
It would be possible to make ZFS scrubs do additional checks, and this may happen at some point. But it is not the state of affairs today, so right now you can have a ZFS pool with corruption that nevertheless passes ZFS scrubs with no errors. In extreme cases, you may wind up with a pool that panics the system. You can do a certain amount of verification yourself, for example by writing a program that walks the entire filesystem to verify that there are no inodes with crazy modes. And if you make your backups with a conventional system that works through the filesystem (instead of with ZFS snapshot replication), your backups will do a certain amount of verification themselves just by walking the filesystem and trying to read all of the files (sooner or later).
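As an illustration of the do-it-yourself verification idea, here is a minimal sketch of such a program in Python (entirely my invention; what counts as a 'crazy mode' for you may well be different, and this only checks the file type bits):

    #!/usr/bin/env python3
    # Walk a filesystem tree and complain about any inode whose st_mode
    # doesn't decode to a known file type.
    import os
    import stat
    import sys

    KNOWN_TYPES = (stat.S_ISREG, stat.S_ISDIR, stat.S_ISLNK, stat.S_ISFIFO,
                   stat.S_ISSOCK, stat.S_ISBLK, stat.S_ISCHR)

    def check_tree(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError as e:
                    print(f"cannot stat {path}: {e}", file=sys.stderr)
                    continue
                if not any(check(st.st_mode) for check in KNOWN_TYPES):
                    print(f"odd mode {oct(st.st_mode)} on {path}")

    if __name__ == "__main__":
        check_tree(sys.argv[1] if len(sys.argv) > 1 else ".")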
2015-12-05
The details behind zpool list's new fragmentation percentage
In this entry I explained that
zpool list's new FRAG field is a measure of how fragmented
the free space in the pool is, but I ignored all of the actual
details. Today it's time to fix that, and to throw in the general
background on top of it. So first we need to start by talking
about free (disk) space.
All filesystems need to keep track of free disk space somehow. ZFS does so using a number of metaslabs, each of which has a space map; simplifying a bunch, spacemaps keep track of segments of contiguous free space in the metaslab (up to 'the whole metaslab'). A couple of years ago, a new ZFS feature called spacemap_histogram was added as part of a spacemap/metaslab rework. Spacemap histograms maintain a powers-of-two histogram of how big the segments of free space in metaslabs are. The motivation for this is, well, let me just quote from the summary of the rework:
The current [pre-histogram] disk format only stores the total amount of free space [in a metaslab], which means that heavily fragmented metaslabs can look appealing, causing us to read them off disk, even though they don't have enough contiguous free space to satisfy large allocations, leading us to continually load the same fragmented space maps over and over again.
(Note that when this talks about 'heavily fragmented metaslabs' it means heavily fragmented free space.)
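As an illustration of the basic idea only (the real on-disk format and the Illumos code are rather more involved), a power-of-two histogram of free segment sizes can be sketched in a few lines of Python:

    # Count free segments by the power of two of their size; eg a
    # 5000-byte segment lands in bucket 12 (the 4 KB bucket).
    def segment_histogram(free_segment_sizes):
        hist = {}
        for size in free_segment_sizes:
            bucket = size.bit_length() - 1    # floor(log2(size))
            hist[bucket] = hist.get(bucket, 0) + 1
        return hist

    # segment_histogram([512, 4096, 5000, 1 << 24]) -> {9: 1, 12: 2, 24: 1}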
To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from '0' for the 16 MB and larger buckets down to '100' for the 512 byte bucket. Beyond that, well, once again I'll just quote directly from the source:
This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.
My first entry summarized the
current values in the table, or you can read the actual zfs_frag_table
table in the source code. There is one important bit that is not
in the table at all, which is that a metaslab with no free space
left is considered 0% fragmented.
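Putting the quoted description and the table together, the per-metaslab calculation is essentially a weighted average. Here is a sketch of it in Python; the bucket-to-percentage numbers come from zfs_frag_table, but everything else (the names and the shape of the 'histogram' argument) is purely my illustration, not the real Illumos code:

    # Fragmentation percentages by power-of-two bucket (2^9 = 512 B up to
    # 2^23 = 8 MB); 16 MB and larger buckets are 0%. Buckets below 2^9
    # don't occur, since 512 bytes is the minimum block size.
    FRAG_PCT = {9: 100, 10: 100, 11: 98, 12: 95, 13: 90, 14: 80, 15: 70,
                16: 60, 17: 50, 18: 40, 19: 30, 20: 20, 21: 15, 22: 10,
                23: 5}

    def metaslab_fragmentation(histogram):
        # histogram maps a power-of-two bucket to the number of free
        # segments in it; the free space in a bucket is approximated
        # here as count * 2^bucket.
        total_space = 0
        weighted = 0
        for bucket, count in histogram.items():
            space = count * (1 << bucket)
            total_space += space
            weighted += space * FRAG_PCT.get(bucket, 0)
        if total_space == 0:
            return 0   # a metaslab with no free space is 0% fragmented
        return weighted // total_space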
A pool's fragmentation value is derived in a two step process, because metaslabs are actually grouped together in 'metaslab groups' (I believe each vdev gets one). All metaslabs in a metaslab group are the same size, so the fragmentation for a metaslab group is just the average fragmentation over all metaslabs with valid spacemap histograms. The overall pool fragmentation is then derived from the metaslab group fragmentations, weighted by how much total space each metaslab group contributes (not how much free space).
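Continuing the illustrative sketch (this is my reading of the description above, not the actual metaslab.c code):

    def group_fragmentation(metaslab_frags):
        # Fragmentation values for the metaslabs in one group that have
        # valid spacemap histograms; all metaslabs in a group are the
        # same size, so a plain average is enough.
        return sum(metaslab_frags) // len(metaslab_frags)

    def pool_fragmentation(groups):
        # groups is a list of (group_total_space, group_fragmentation)
        # pairs, one per metaslab group (ie per vdev). The pool number
        # is weighted by each group's *total* space, not its free space.
        total = sum(space for space, _ in groups)
        return sum(space * frag for space, frag in groups) // total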
A sufficiently recent pool will have spacemap histograms for all
metaslabs. A pool that was created before this feature was added
but then upgraded may not have spacemap histograms created for all
of its metaslabs yet (I believe that a spacemap histogram is only
added if the metaslab spacemap winds up getting written out with
changes). If too many metaslabs in any single metaslab group lack
spacemap histograms, the pool is considered to not have an overall
fragmentation percentage (zpool list will report this as a FRAG
value of '-', even though the spacemap_histogram feature is
active).
(Currently 'too many metaslabs' is 'half or more of the metaslabs in a metaslab group', but this may change.)
You can inspect raw metaslab spacemap histograms through zdb,
using 'zdb -mm <POOL>'. Note that the on-disk histogram has more
buckets than the fragmentation percentage table does (it has 32
entries versus zfs_frag_table's 17). The bucket numbers printed
represent raw powers of two, eg a bucket number of 10 is 2^10 bytes
or 1 KB; this implies that you'll never see a bucket number smaller
than the vdev's ashift. Zdb also reports the calculated fragmentation
percentage for each metaslab (as 'fragmentation NN').
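As a trivial helper (my own, not anything zdb itself provides), turning those raw bucket numbers into human-readable segment sizes is just a power-of-two conversion:

    def bucket_size(bucket):
        # Bucket n holds free segments of roughly 2^n bytes.
        units = ["B", "KB", "MB", "GB", "TB"]
        size = 1 << bucket
        i = 0
        while size >= 1024 and i < len(units) - 1:
            size //= 1024
            i += 1
        return f"{size} {units[i]}"

    # bucket_size(10) -> '1 KB'; bucket_size(24) -> '16 MB'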
(It looks like mdb can also dump this information when it is
reporting on appropriate vdevs, via '::vdev -m'. I have not
investigated this, just noticed it in the source.)
The metaslab fragmentation number is used for more than just reporting
a metric in zpool list. There are a number of bits of ZFS block
allocation that pay attention to it when deciding what metaslab to
allocate new space from. There are also some ZFS global variables
related to this, but since I haven't dug into this area at all I'm
not going to say anything about them.
(In the Illumos source, all of this is in uts/common/fs/zfs/metaslab.c; you want to search for all of the things that talk about fragmentation. Note that there are multiple levels of functions involved in this.)
2015-12-02
What zpool list's new FRAG fragmentation percentage means
Recent versions of 'zpool list' on Illumos (and elsewhere) have
added a new field of information called 'FRAG', reported as a
percentage, which the zpool manpage will tell you is 'the amount
of fragmentation in the pool'. To put it politely, this is very
under-documented (and in a misleading way). Based on an expedition
into the current Illumos kernel code, as far as I can tell:
zpool list's FRAG value is an abstract measure of how fragmented the free space in the pool is.
A pool with a low FRAG percentage has most of its remaining free space in large contiguous segments, while a pool with a high FRAG percentage has most of its free space broken up into small pieces. The FRAG percentage tells you nothing about how fragmented (or not fragmented) your data is, and thus how many seeks it will take to read it back. Instead it is an indication of how hard ZFS will have to work to find space for large chunks of new data (and how fragmented those chunks may be forced to be when they get written out).
(How hard ZFS has to work to find space is also influenced by how much total free space is left in your pool. There's likely to be some correlation between low free space and higher FRAG numbers, but I wouldn't assume that they're inextricably yoked together.)
FRAG also doesn't tell you how evenly the free space is distributed across your disk(s). As far as I know, adding a new vdev or expanding an existing one will generally result in the new space being seen as essentially unfragmented; this can drop your overall FRAG percent even if your old disk space had very fragmented free space. In practice this probably doesn't matter, since ZFS will generally prefer to write things to that new (and unfragmented) space.
(Such a drop in FRAG is 'fair' in the sense that the chances that ZFS will be able to find a large chunk of free space have gone way up.)
How the percentages relate to the average segment size of free space goes roughly like this. Based on the current Illumos kernel code, if all free space was in segments of the given size, the reported fragmentation would be:
- 512 B and 1 KB segments are 100% fragmented.
- 2 KB segments are 98% fragmented; 4 KB segments are 95% fragmented.
- 8 KB to 1 MB segments start out at 90% fragmented and drop 10% for every power of two (eg 16 KB is 80% fragmented and 1 MB is 20%); 128 KB segments are 50% fragmented.
- 2 MB, 4 MB, and 8 MB segments are 15%, 10%, and 5% fragmented respectively.
- 16 MB and larger segments are 0% fragmented.
Of course the free space is probably not all in segments of one size. ZFS does the obvious thing and weights each segment size bucket by the amount of free space that falls into that range. This makes FRAG essentially an average, which means it has the usual hazards of averages.
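As a small worked example with entirely made-up numbers: if half of a pool's free space is in 1 MB segments (20% by the list above) and half is in 8 KB segments (90%), the weighted result is 55%:

    # Made-up example: free space split evenly between 1 MB segments
    # (20% fragmented) and 8 KB segments (90% fragmented).
    free_space = [(50, 20), (50, 90)]   # (GB of free space, table percent)
    frag = sum(gb * pct for gb, pct in free_space) / sum(gb for gb, _ in free_space)
    print(frag)   # 55.0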
Note that these fragmentation percents are relatively arbitrary, as comments in the Illumos kernel code admit; they are designed to produce what the ZFS developers feel is a useful result, not by following any strict mathematical formula. They may also change in the future. As far as relative values go, according to comments in the source code, 'a 10% change in fragmentation equates to approximately double the number of segments'.
(The source code explicitly calls the fragmentation percentage a 'metric' as opposed to a direct measurement.)
I believe that one interesting consequence of the current OmniOS code is that a pool on 4K sector disks (a pool with ashift=12) can never be reported as more than 95% fragmented, because 4K is the minimum allocation size and thus the minimum free segment size. I would not be surprised if in the future ZFS modifies the fragmentation percents reported for such pools so that 4K segments become '100% fragmented'.
(Technically it would be a per-vdev thing, but in practice I think that very few people mix vdevs with different ashifts and block sizes.)
I was initially planning on writing up the technical details too, but this entry is already long enough as it is so I'm deferring them to another entry.