2020-05-08
Revisiting what the ZFS recordsize is and what it does
I'm currently reading Jim Salter's ZFS 101—Understanding ZFS
storage and performance,
and got to the section on ZFS's important recordsize
property,
where the article attempts to succinctly explain a complicated
ZFS-specific thing. ZFS recordsize is hard to explain because it's
relatively unlike what other filesystems do, and looking back I've
never put down a unified view of it in one place.
The simplest description is that ZFS recordsize is the (maximum) logical block size of a filesystem object (a file, a directory, a whatever). Files smaller than recordsize have a single logical block that's however large it needs to be (details here); files of recordsize or larger have some number of recordsize logical blocks. These logical blocks aren't necessarily that large in physical blocks (details here); they may be smaller, or even absent entirely (if you have some sort of compression on and all of the data was zeros), and under some circumstances the physical block can be fragmented (these are 'gang blocks').
(ZFS normally doesn't fragment the physical block that implements your logical block, for various good reasons including that one sequential read or write is generally faster than several of them.)
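To make the logical block arithmetic concrete, here is a small sketch of it in Python (my own illustration, not actual ZFS code; it ignores the details of how the single block of small files gets sized):

  import math

  def logical_blocks(file_size, recordsize=128 * 1024):
      # The rule described above: a file smaller than recordsize gets
      # one logical block that's only as big as it needs to be; a file
      # of recordsize or larger gets some number of recordsize blocks.
      if file_size <= recordsize:
          return (1, file_size)
      return (math.ceil(file_size / recordsize), recordsize)

  # With the default 128 KB recordsize:
  #   logical_blocks(4096)     -> (1, 4096)
  #   logical_blocks(1 << 20)  -> (8, 131072)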
However, this logical block size has some important consequences because ZFS checksums are for a single logical block. Since ZFS always verifies the checksum when you read data, it must read the entire logical block even if you ask only for a part of it; otherwise it doesn't have all the data it needs to compute the checksum. Similarly, it has to read the entire logical block even when your program is only writing a bit of data to part of it, since it has to update the checksum for the whole block, which requires the rest of the block's data. Since ZFS is a copy on write system, it then rewrites the whole logical block (into however large a physical block it now requires), even if you only updated a little portion of it.
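To put rough numbers on this read-modify-write amplification, here is a hedged sketch (my numbers; it ignores compression, metadata updates, and logical blocks that are already cached in memory):

  def rmw_io(write_size, recordsize=128 * 1024):
      # The size of your write doesn't matter; ZFS has to read the
      # whole logical block to recompute the checksum (if it's not
      # already cached) and then, being copy on write, it writes out
      # the whole new logical block.
      bytes_read = recordsize
      bytes_written = recordsize
      return (bytes_read, bytes_written)

  # Updating 4 KB in the middle of an existing 128 KB logical block:
  #   rmw_io(4096)  -> (131072, 131072), a 32x amplification both ways.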
Another consequence is that since ZFS always writes (and reads) a full logical block, it also does its compression at the level of logical blocks (and if you use ZFS deduplication, that also happens on a per logical block basis). This means that a small recordsize will generally limit how much compression you can achieve, especially on disks with 4K sectors.
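One way to see the 4K sector effect is that a compressed logical block still has to be stored in whole physical sectors. A small sketch, assuming 4K sectors and ignoring ZFS's other rules about when it keeps compressed data:

  import math

  def allocated_size(compressed_size, sector_size=4096):
      # A compressed logical block still occupies a whole number of
      # physical sectors on disk.
      return math.ceil(compressed_size / sector_size) * sector_size

  # With an 8 KB recordsize, a block must compress to 4 KB or less
  # before any space is actually saved:
  #   allocated_size(5000)       -> 8192   (no savings over 8 KB)
  #   allocated_size(4000)       -> 4096   (half the space saved)
  # With a 128 KB recordsize there is much more room; compressing to
  # 100 KB already saves space:
  #   allocated_size(100 * 1024) -> 102400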
(Using a smaller maximum logical block size may increase the amount of data that you can deduplicate, but it will almost certainly increase the amount of RAM required to get decent performance from deduplication. ZFS deduplication's memory requirements for good performance are why you should probably avoid it; making them worse is not usually a good idea. Any sort of deduplication is expensive and you should use it only when you're absolutely sure it's worth it for your case.)
Linux software RAID resync speed limits are too low for SSDs
When you add or replace a disk in Linux's software RAID, it has to
be resynchronized with the rest of the RAID array. As very briefly
covered in the RAID wiki's page on resync, this resync process
has speed limits that are controlled by the kernel sysctls
dev.raid.speed_limit_min and dev.raid.speed_limit_max (in
KBytes a second). As covered in md(4), if there's no
other relevant IO activity, resync will run up to the maximum speed;
if there is other relevant IO activity, the resync speed will
throttle down to the minimum (which many people would raise on
the fly in order to make resyncs go faster).
(In current kernels, it appears that relevant IO activity is any IO activity to the underlying disks of the software RAID, whether or not it's through the array being resynced.)
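As a quick illustration, these sysctls appear under /proc/sys/dev/raid/ like any other sysctl, so you can look at the current limits with a little Python sketch (normally you would just use sysctl itself):

  # Print the current software RAID resync speed limits, in KB/sec.
  for name in ("speed_limit_min", "speed_limit_max"):
      with open("/proc/sys/dev/raid/" + name) as f:
          print(name, f.read().strip())

  # Raising a limit means writing a new value to the same file as root,
  # which is what 'sysctl -w dev.raid.speed_limit_max=...' does.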
If you look at your system, you will very likely see that the values for the minimum and maximum speeds are 1,000 KB/sec and 200,000 KB/sec respectively; these have been the kernel defaults since at least 2.6.12-rc2 in 2005, when the Linux kernel git repository was started. These were fine defaults in 2005, in the era of hard drives that were relatively small and relatively slow, and in particular you were very unlikely to approach the maximum speed even on fast hard drives. Even fast hard drives generally only managed around 160 MBytes/sec of sustained write bandwidth, comfortably under the default and normal speed_limit_max.
This is no longer true in a world where SSDs are increasingly common (for example, all of our modern Linux servers with mirrored disks use SSDs). In theory SSDs can write at data rates well over 200 MBytes/sec; claimed data rates are typically around 500 MBytes/sec for sustained writes. In this world, the default software RAID speed_limit_max value is less than half the speed that you might be able to get, and so you should strongly consider raising dev.raid.speed_limit_max if you have SSDs.
You should probably also raise speed_limit_min, whether or not you have SSDs, because the current minimum is effectively 'stop the resync when there's enough other IO activity'; modern disks are big enough that they will often take more than a week to resync at 1,000 KB/sec, and you probably don't want to wait that long. If you have SSDs, you should raise it a lot, since SSDs don't really suffer from random IO slowing everything down the way hard drives do.
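The arithmetic here is simple but worth writing down (a rough sketch that treats a KB as 1,000 bytes and ignores everything else going on on the system):

  def resync_days(disk_bytes, kb_per_sec):
      # A rough lower bound on resync time at a given sustained speed,
      # using the KB/sec units of the speed limit sysctls.
      return disk_bytes / (kb_per_sec * 1000) / 86400

  # A 1 TB disk at the default speed_limit_min of 1,000 KB/sec:
  #   resync_days(10**12, 1000)    -> about 11.6 days
  # The same disk at the default speed_limit_max of 200,000 KB/sec:
  #   resync_days(10**12, 200000)  -> about 0.06 days (roughly an hour
  #   and a half)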
(Raising both of these significantly will probably become part of our standard server install, now that this has occurred to me.)
Unfortunately, depending on what SSDs you use, this may not do you as much good as you would like, because it seems that some SSDs can have very unimpressive sustained write speeds in practice over a large resync. We have a bunch of basic SanDisk 64 GB SSDs (the 'SDSSDP06') that we use in servers, and we lost one recently and had to do a resync on that machine. Despite basically no other IO load at the time (and 100% utilization of the new disk), the eventual sustained write rate we got was decidedly unimpressive (after an initial amount of quite good performance). The replacement SSD had been used before, so perhaps the poor SSD was busy frantically erasing flash blocks and so on as we were trying to push data down its throat.
(Our metrics system makes for interesting viewing during the resync. It appears that we wrote about 43 GB of the almost 64 GB to the new SSD at probably the software RAID speed limit before write bandwidth fell off a cliff. It's just that the remaining portion of about 16 GB of writes took several times as long as the first portion.)