How ZFS scrubs routinely save us

December 26, 2013

A while back I wrote about how ZFS resilvering saved us and mentioned in passing that there are a number of ways that ZFS routinely saves us in the small. Today I want to talk about one of them, namely ZFS scrubs.

Put simply, a ZFS scrub is an integrity check of a ZFS pool. When you scrub a pool, ZFS reads and verifies every copy of all pool data to make sure it is fully intact. When it finds a checksum inconsistency it repairs it; if things are really bad and repair isn't possible, it tells you what got damaged so you can restore it from backups. If a scrub hits a read error it generally won't try to rewrite the data, but it will at least tell you about it.
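To make the repair-or-report behaviour concrete, here is a toy model in Python. It is only a sketch of the idea, not how ZFS is implemented: the `Block` class, `make_block` and `scrub` are hypothetical names, and SHA-256 stands in for whatever checksum a real pool uses.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    # SHA-256 is used purely for convenience in this toy model.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Block:
    name: str
    expected: str   # checksum recorded separately from the data itself
    copies: list    # one bytes object per redundant copy


def make_block(name: str, data: bytes, ncopies: int = 2) -> Block:
    return Block(name, checksum(data), [data] * ncopies)


def scrub(blocks):
    """Verify every copy of every block against its recorded checksum.

    Returns (repaired, damaged): names of blocks that were rewritten from
    a good copy, and names of blocks with no good copy left.
    """
    repaired, damaged = [], []
    for blk in blocks:
        good = [c for c in blk.copies if checksum(c) == blk.expected]
        if len(good) == len(blk.copies):
            continue                      # every copy verified
        if good:
            # At least one copy is intact: overwrite the bad ones with it.
            blk.copies = [good[0]] * len(blk.copies)
            repaired.append(blk.name)
        else:
            # Nothing to repair from: report it for restore from backups.
            damaged.append(blk.name)
    return repaired, damaged
```

A block with one corrupted copy ends up in `repaired` with all copies intact again, while a block whose copies are all corrupted ends up in `damaged`, which corresponds to the "restore it from backups" case.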

We scrub our pools regularly through automation. This periodically turns up transient checksum errors, which the scrub also fixes. So this is the first little save: ZFS detects and fixes potential data problems for us, and does so on an essentially ongoing basis. As a pragmatic matter the scrubs also check for read errors (although they can't fix them), and so give us early warning about disks we probably want to replace. They also give us a way to check whether a read error is transient or permanent; we simply schedule a scrub and see if it gets errors.
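The automation itself doesn't need to be elaborate. As a rough sketch (this is not our actual tooling, and the function names are made up for illustration), a Python script could list the pools with `zpool list -H -o name` and start a scrub on each with `zpool scrub`. The `run` parameter is an assumption added so the logic can be exercised without a real pool.

```python
import subprocess


def list_pools(run=subprocess.run):
    """Return pool names as printed by `zpool list -H -o name`."""
    result = run(
        ["zpool", "list", "-H", "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def scrub_command(pool):
    # `zpool scrub POOL` starts the scrub and returns; the scrub itself
    # carries on in the background.
    return ["zpool", "scrub", pool]


def start_scrubs(pools, run=subprocess.run):
    """Start a scrub on each named pool and return the pools started."""
    started = []
    for pool in pools:
        run(scrub_command(pool), check=True)
        started.append(pool)
    return started
```

Run from cron or a similar scheduler, `start_scrubs(list_pools())` would kick off a scrub of every pool. The results show up later in `zpool status`, which is where you would look for any errors the scrub found.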

(A surprisingly large amount of the time the scrub does not, either because the error was genuinely transient or because whatever object was using the bad sector has been deleted since then.)

As a corollary, forcing an immediate scrub lets us find out if there are any latent problems (which can have many potential causes). It's routine for us to force scrubs after significant outage events, such as an iSCSI backend losing power, to make sure that no data got lost in the chaos.

Of course it would be better if we didn't have checksum errors happen in the first place. But given that we have something going wrong, I'd much rather know about it and have it get fixed than not. ZFS does this for us without fuss or hassle, and that routinely saves us in the small.

(Much of this can be done by any RAID system with routine RAID array scans; Linux software RAID can do this, for example, and is often configured to do it. What is different about ZFS is that its checksums let it tell which copy of inconsistent data is correct and which isn't. Other RAID systems have to just guess.)
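The difference is easy to see in miniature. In the hedged Python sketch below (hypothetical function names, with SHA-256 as a stand-in checksum), a plain two-way mirror can only notice that its copies disagree, while a checksummed one can say which copy is still good.

```python
import hashlib


def plain_mirror_check(copy_a: bytes, copy_b: bytes) -> str:
    # Without a checksum, a mirror scan can only compare the two copies.
    if copy_a == copy_b:
        return "consistent"
    return "mismatch, correct copy unknown"


def checksummed_check(copy_a: bytes, copy_b: bytes, expected: str) -> list:
    # With a separately stored checksum, each copy can be judged on its own.
    copies = (("a", copy_a), ("b", copy_b))
    return [name for name, data in copies
            if hashlib.sha256(data).hexdigest() == expected]
```

Given one intact copy and one corrupted copy, `plain_mirror_check` reports only a mismatch, while `checksummed_check` returns the name of the intact copy, which is the one to repair from.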

(I've talked about the overall purposes of ZFS scrubs in an aside here.)
