Wandering Thoughts archives


How zpool status reports on ZFS scrubs and resilvers

A recent thread on the ZFS mailing list concerned inaccurate or misleading reports from zpool status about the progress of scrubs (both bad time estimates and a scrub that wasn't finishing despite claiming to be 100% done). As it happens, I know something about how this works because I recently went digging into what information the kernel actually reports to userland.

The kernel reports the following information:

  • the number of bytes currently allocated in the pool and in each vdev.

  • how many bytes in the pool have been examined by the scrub or resilver; on current versions of Solaris, it also reports per-vdev bytes examined.

    (When resilvering the kernel also reports how many bytes have been repaired in the pool and on each vdev.)

  • the time that the scrub or resilver has been running.

Because zpool status only gets a snapshot of one moment in the scrub, it can only make a straightforward extrapolation of how much time scrubbing the whole pool will take. So if the scrub rate starts out quite fast but then slows down later, whether due to fragmentation (because ZFS does not scrub linearly) or to a bunch of user IO interfering with the scrub, your estimated time to completion will bounce around significantly.
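To make the extrapolation concrete, here is a sketch of the only estimate that a single snapshot of the kernel's numbers allows (this is my own illustration, not the actual zpool code): assume the scrub rate seen so far continues unchanged.

```python
def estimate_seconds_left(alloc_bytes, examined_bytes, elapsed_secs):
    """Extrapolate time remaining from one snapshot of scrub progress."""
    if examined_bytes == 0:
        return None  # no progress yet, so no basis for an estimate
    rate = examined_bytes / elapsed_secs       # bytes per second so far
    remaining = alloc_bytes - examined_bytes   # assumed bytes left to scrub
    return remaining / rate

# 100 GB examined in an hour on a pool with 2 TB allocated:
print(estimate_seconds_left(2 * 10**12, 10**11, 3600))  # 68400.0 seconds
```

If the first stretch scrubbed fast and the scrub then hits fragmented data, the same formula suddenly produces a much larger number, which is exactly why the reported ETA jumps around.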

(Current versions of zpool also report how long the scrub has been running, which is in many ways a much more useful number.)

Now, observe something important: there is no explicit count of how many bytes there are left to scrub. zpool status simply assumes that 'allocated - scrubbed' is how many bytes are remaining to scrub, but this isn't necessarily the case. In some situations it's possible for ZFS to have scrubbed more bytes than are actually allocated in the pool, resulting in scrubs reaching 100% without finishing (per this bug report).

(This doesn't necessarily mean that ZFS scrubs chase updates; you could equally well get into this situation by deleting a big snapshot after ZFS had scrubbed it but before the scrub finished. That immediately drops the bytes-allocated count without affecting bytes scrubbed.)
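A small illustration (again mine, not zpool's code) of why 'examined / allocated' can reach or pass 100% without the scrub being done: deleting already-scrubbed data lowers the allocated count while the examined count stays put.

```python
def percent_done(alloc_bytes, examined_bytes):
    """The naive progress figure: examined as a fraction of allocated."""
    return 100.0 * examined_bytes / alloc_bytes

alloc, examined = 1000, 800
print(percent_done(alloc, examined))   # 80.0: scrub looks 80% done

alloc -= 300   # delete a big, already-scrubbed snapshot
print(percent_done(alloc, examined))   # ~114.3: past '100%' yet still scrubbing
```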

PS: on inspecting current OpenSolaris source code, I see that it's recently moved to a more complicated scheme where it tries to keep better track of these numbers, likely as a consequence of the above bug. This will presumably appear in Solaris in some future patch or update.

PPS: contrary to what various sources will tell you (including the above bug report), the Solaris kernel really does report these statistics in bytes, not blocks.

solaris/ZFSReportingScrubs written at 13:30:17

ZFS scrubs and resilvers are not sequential IO

Here is something that is probably not as widely known as it could be: ZFS scrubs and resilvers are not done with sequential IO, the way conventional RAID resynchronization and checking are done.

Conventional RAID resyncs run in linear order from the start of each disk to the end. This means that they're strictly sequential IO if they're left undisturbed by user IO (which is one reason that you can destroy conventional RAID resync performance by doing enough random user IO).

ZFS scrubs and resilvers don't work like this; instead, they effectively walk 'down' the data structures that make up the ZFS pool and its filesystems, starting from the uberblocks and ending up at file data blocks. This isn't surprising for scrubs; this pattern is common in fsck-like consistency checking that needs to verify metadata consistency. Resilvering is apparently done this way partly because it makes the code simpler (resilvering and scrubbing are basically the same code in the kernel) and partly because this restores redundancy to the most important data first (redundant file contents are pointless if you can't find them, whereas an error in a non-redundant uberblock could ruin your entire day).
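A much-simplified sketch of this traversal order (my own illustration, not the kernel's code): the scrub walks depth-first from the top-level metadata down to file data, rather than linearly across the disk. Here a "block" is just an offset plus the blocks it points to; a real scrub walks block pointers, verifies checksums, and deals with snapshots and clones.

```python
def scrub_walk(block, read_and_check):
    """Visit a block, then recurse into the blocks it points to."""
    read_and_check(block)                  # issue IO at this block's offset
    for child in block.get("children", []):
        scrub_walk(child, read_and_check)

# A toy pool: note the child offsets are scattered over the 'disk'.
uber = {"offset": 0, "children": [
    {"offset": 900, "children": [{"offset": 50, "children": []}]},
    {"offset": 300, "children": []},
]}
order = []
scrub_walk(uber, lambda b: order.append(b["offset"]))
print(order)   # [0, 900, 50, 300] -- pool structure order, not disk order
```

The offsets visited follow the pool's structure, not disk layout, which is why a fragmented pool turns a scrub into mostly random IO.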

This has a number of consequences. One of them is that the more fragmented your pool is, the more you have randomly created and deleted and overwritten files and portions of files, the slower it will likely scrub and resilver. This is because fragmentation causes things to be scattered over the disk(s), which requires more seeks and gives the scrubbing process less chance for fast sequential IO. (Remember that modern disks can only do about 100 to 120 seeks a second.)
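Some back-of-the-envelope arithmetic shows how bad fully seek-bound scrubbing gets (the figures here are my assumptions, not measurements): if fragmentation forces one seek per 128 KB block read, a disk doing about 100 seeks a second scrubs only around 12.5 MB/s, versus perhaps 100+ MB/s streaming the same disk sequentially.

```python
seeks_per_sec = 100                 # a typical modern disk's seek rate
block_size = 128 * 1024             # assume one 128 KB block read per seek
random_rate = seeks_per_sec * block_size
print(random_rate / (1024 * 1024))  # 12.5 MB/s when fully seek-bound
```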

(I think that a corollary to this is that lots of little files will make your ZFS pools scrub slower, especially if you create and delete them randomly all over the filesystem. An old-style Usenet spool filesystem would probably be a ZFS worst case.)

I'm not sure how (or if) ZFS scrubbing deals with changes to the ZFS pool. ZFS's design means that scrubbing won't get confused by updates, but if it chases them it could do a potentially unbounded amount of work if you keep deleting old data and creating new data fast enough; if it doesn't chase updates, it may miss recent problems.

(This information rattles around the ZFS mailing list, which is where I picked it up from.)

solaris/ZFSNonlinearScrubs written at 00:41:24
