Our ZFS spares handling system sort of relies on our patterns of disk usage

January 7, 2023

In my entry on our Linux ZFS spares handling system, I wrote about how we pick spares in a two-step preference order: first a spare on another disk connected the same way (SATA or SAS) as the failed one, and then a spare on any disk if necessary. In a comment on the entry, Simon asked:

Doesn't this mean you could end up mirroring to the same (real) disks? So the redundancy you can normally expect is severely reduced. Mirroring to the same disk only helps with localized read/write errors (like a bad sector), but not things like a failed disk.

This is a question with a subtle answer, which starts with how we use disks and what that implies for available spares. We always use disks in mirrored pairs, and the pairs are fixed; every partition of every disk has a specific partner. The first partition of the first SAS-connected disk is always mirrored with the first partition of the first SATA-connected disk, and so on. This means that in normal operation (when a disk hasn't failed), all spares also come in pairs; if the last partition of the first 'SAS' disk isn't used, neither will be the last partition of the first 'SATA' disk, so both are available as spares. In addition, we spread our partition usage across all disks, using the first partition on all pairs before we start using the second partition on any of them, and so on.
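To make the layout concrete, here is a minimal Python sketch of the allocation pattern described above. The function name, the `(type, disk, partition)` tuples, and the idea of counting "used pairs" are illustrative assumptions, not the actual fileserver tooling.

```python
def build_layout(disks_per_type, parts_per_disk, used_pairs):
    """Return (used, spare) sets of (type, disk, partition) tuples.

    Hypothetical model: partition p of SAS disk d is always mirrored with
    partition p of SATA disk d, and pairs are handed out first-partition
    across all disks, then second-partition, and so on.
    """
    # Allocation order: partition index varies slowest, disk index fastest.
    slots = [(d, p) for p in range(parts_per_disk) for d in range(disks_per_type)]
    used, spare = set(), set()
    for i, (d, p) in enumerate(slots):
        target = used if i < used_pairs else spare
        # Both sides of a pair are always used (or free) together.
        target.add(("SAS", d, p))
        target.add(("SATA", d, p))
    return used, spare


# Eight disks per type, four partitions each, 28 of 32 pairs in use.
used, spare = build_layout(8, 4, 28)
print(len(spare))                           # number of spare partitions
print(len({(t, d) for t, d, _ in spare}))   # number of distinct disks holding a spare
```

In this model the eight spares land on eight different physical disks, four SAS and four SATA, because only the last partition of the last four disk pairs is unallocated.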

Since spares come in pairs, if we have as many pairs of spares as we have partitions on a disk (so four pairs, eight spares in total, with our current 2 TB disks with four partitions), we're guaranteed to have enough spares on the same 'type' (SAS connected or SATA connected) of disk to replace a failed disk. Since the other side of every mirrored pair is on the different type, the replacement spares can't wind up on the same physical disk as the other side. Since we spread allocation across all disks instead of filling up one disk before moving on to the next, every disk has either zero or one free partition, so our spares are all on different disks.

(Now that I've written this down I've realized that it's only true as long as we have no more partitions per disk than we have disks of a particular type. We have eight disks per type so we're safe with 4 TB disks and eight partitions per disk, but we'll need to think about this again if we move beyond that.)
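The condition in that parenthetical can be checked by brute force. The following is a rough sketch under the same assumed allocation order (partition index slowest, disk index fastest); the function names are hypothetical.

```python
from collections import Counter


def free_partitions_per_disk(disks_per_type, parts_per_disk, spare_pairs):
    """Count free partitions on each disk of one type (assumed spread allocation)."""
    slots = [(d, p) for p in range(parts_per_disk) for d in range(disks_per_type)]
    # The unallocated pairs are the last ones in allocation order.
    free = slots[len(slots) - spare_pairs:]
    return Counter(d for d, _ in free)


def one_spare_per_disk(disks_per_type, parts_per_disk, spare_pairs):
    """True if no disk holds more than one free partition."""
    counts = free_partitions_per_disk(disks_per_type, parts_per_disk, spare_pairs)
    return max(counts.values(), default=0) <= 1


# Keep as many spare pairs as there are partitions per disk.
print(one_spare_per_disk(8, 4, 4))    # 2 TB disks, four partitions
print(one_spare_per_disk(8, 8, 8))    # 4 TB disks, eight partitions
print(one_spare_per_disk(8, 16, 16))  # hypothetical sixteen partitions per disk
```

With eight disks per type, the first two cases keep every spare on its own disk; the hypothetical sixteen-partition case does not, since sixteen spare pairs cannot fit one per disk on eight disks.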

If we have fewer spares than that, we could be forced to use a spare on the same type of disk as the surviving side of a pair. Even then we can try to avoid using a partition on the same disk, and often we'll be able to. If the failed disk had no free partitions, its pair also has no free partitions and we're safe. If it had one free partition and we have more spares than the number of partitions per disk (e.g. six spares with 2 TB disks), we can still find a spare on a disk other than its pair.
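One way to express "try to avoid the same disk, then prefer the same connection type" is as a sort key. This is only a sketch of a possible ordering; whether the real spares code de-prioritizes the surviving side's disk is exactly what still needs checking, and `pick_spare` and its tuple format are assumptions.

```python
def pick_spare(failed, survivor, spares):
    """Pick a spare for a failed partition; partitions are (type, disk, partition).

    Hypothetical preference order:
      1. avoid the physical disk that holds the surviving side of the mirror,
      2. prefer a spare on the same connection type as the failed partition,
      3. break ties deterministically by the tuple itself.
    """
    def rank(spare):
        on_survivor_disk = spare[:2] == survivor[:2]
        different_type = spare[0] != failed[0]
        # False sorts before True, so preferred spares get the smallest key.
        return (on_survivor_disk, different_type, spare)

    return min(spares, key=rank) if spares else None


failed = ("SAS", 0, 0)
survivor = ("SATA", 0, 0)
print(pick_spare(failed, survivor, {("SATA", 0, 3), ("SATA", 1, 3)}))
print(pick_spare(failed, survivor, {("SATA", 0, 3)}))
```

The first call picks the spare on SATA disk 1, since it is not the survivor's disk. The second call has no alternative and falls back to the survivor's own disk, which is the situation the rest of this entry worries about.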

The absolute worst case in our current setup is if we're down to four spares and we lose a disk with one of the spares. Here we need three spares (for the used partitions on the disk), we only have three spares left, and one of them is on the pair disk to the one we lost, which is the disk that needs new mirroring. In this case we'll mirror one partition on that disk with another partition on that disk. This still gives us protection against ZFS checksum errors, but it also means that we overlooked a case when we decided it was okay to drop down to a minimum of only four spares.
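The worst case can be walked through with a small simulation. This is a sketch under assumptions: the spread allocation order from earlier, a failure of the SAS disk that holds a spare, and a greedy replacement policy that avoids the survivor's disk when it can. None of the names come from the real code.

```python
def worst_case(disks_per_type=8, parts_per_disk=4, spare_pairs=2):
    """Return (partitions_needing_spares, same_disk_mirrors) after losing
    a SAS disk that holds one of the spares."""
    slots = [(d, p) for p in range(parts_per_disk) for d in range(disks_per_type)]
    free = slots[len(slots) - spare_pairs:]
    spares = {(t, d, p) for d, p in free for t in ("SAS", "SATA")}

    failed_disk = ("SAS", free[0][0])
    # The spare on the failed disk is lost along with the disk.
    spares = {s for s in spares if s[:2] != failed_disk}
    # Every used partition on the failed disk leaves a SATA survivor behind.
    survivors = [("SATA", failed_disk[1], p) for p in range(parts_per_disk)
                 if (failed_disk[1], p) not in free]

    same_disk = 0
    for survivor in survivors:
        if not spares:
            break  # out of spares entirely
        # Avoid the survivor's disk first, then prefer a SAS spare.
        choice = min(spares, key=lambda s: (s[:2] == survivor[:2], s[0] != "SAS", s))
        spares.remove(choice)
        if choice[:2] == survivor[:2]:
            same_disk += 1
    return len(survivors), same_disk


print(worst_case(spare_pairs=2))  # four spares
print(worst_case(spare_pairs=3))  # six spares
```

With four spares the simulation needs three replacements and is forced to put one of them on the survivor's own disk; with six spares all three replacements land on other disks.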

I'll have to think about this analysis for our 4 TB disk, eight partition case, but certainly for the 2 TB disk, four partition case it means that the minimum number of spares we should be keeping is six, not four. Fortunately we don't have any fileservers that have that few spares at the moment. Also, I need to re-check our actual code to see if it specifically de-prioritizes the disk of the partition we're adding a spare to.

(One fileserver wound up at four spares before we upgraded its data disks to 4 TB SSDs.)
