Wandering Thoughts archives

2024-12-14

ZFS on Linux and block IO limits show some limits of being out of the kernel

ZFS on Linux (or, if you prefer, OpenZFS on Linux) is, famously, not included in the Linux kernel source for a number of reasons (starting with its licensing). The usual drawback of this is that (Open)ZFS can need modifications to support new Linux kernel versions because the internal kernel interfaces keep changing and, unlike in-kernel modules, there's nothing that keeps OpenZFS in sync with them. However, that ZFS is out of the kernel also has some other limits, and one of them is around cgroup v2 based block IO limits and IO priorities.

The obvious issue with these is that ZFS doesn't support them (at least not in a useful way). For instance, I believe that it doesn't particularly pass through information that would let cgroup v2 attribute all block IO to specific cgroups (for both read and write IO). This means that as far as I know, if you have something running on top of ZFS you can't use cgroup v2 to limit its IO impact, any more than you could if it were on NFS (you're left to wish for VFS level cgroup limits). ZFS being part of the Linux kernel source code wouldn't guarantee that it supported block IO cgroup accounting, but it would probably make it somewhat more likely.
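For contrast, this is roughly how cgroup v2 IO limits get applied to a workload sitting on an ordinary block device; the cgroup name and device here are illustrative, and the 'MAJ:MIN rbps=... wbps=...' format is the kernel's documented io.max syntax:

```shell
# Create a cgroup and find the major:minor numbers of the underlying
# block device (an illustrative /dev/sda here).
mkdir /sys/fs/cgroup/limited
lsblk -no MAJ:MIN /dev/sda

# Cap the cgroup at 10 MB/s of reads and 5 MB/s of writes on that
# device, assuming lsblk reported '8:0'.
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/limited/io.max

# Move the current shell into the cgroup so the limits apply to it
# (and to anything it starts).
echo $$ > /sys/fs/cgroup/limited/cgroup.procs
```

With ZFS there's no single device to name in io.max, because a pool's IO fans out across its member disks inside ZFS itself.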

The less obvious issue is that ZFS wasn't even in the room when the cgroup v2 IO controller was being discussed and designed. As hinted by how I talk about it in terms of 'block IO', the current IO controller is very strongly focused on dealing with block IO devices. Unfortunately this is a bad fit for ZFS, which integrates filesystems with storage management, so that in many typical configurations your user-level ZFS IO results in block IO to various different disks on an unpredictable basis.

(This works for filesystems that sit on top of a software RAID device, because software RAID devices are block IO devices and so you can manage IO limits at the level of the RAID device rather than its component devices. I don't know how well it works for btrfs configurations where btrfs is using multiple disks, although btrfs is listed as one of the supported filesystems for writeback limits, and btrfs's interoperability guide says the cgroup IO controller is fully supported.)

Had ZFS been in the kernel, the ZFS developers would have been in a good position to discuss how to make the cgroup v2 IO controller design work with ZFS, including possible changes in the IO controller itself (although this is no guarantee that it would have happened). I can imagine designs that at least sound plausible to me, such as each ZFS pool having a pseudo-device that all its IO can be attributed to so you can rate-limit and control a cgroup's IO to that device.

Out of kernel modules aren't just limited by having to keep up with the kernel's development; they also have relatively little ability to influence the development of kernel features that are relevant to them.

(This isn't quite inevitable due to being out of the kernel tree; it's also because of social attitudes among the Linux kernel developers. Broadly speaking, the kernel developers have made it clear that they don't care about out of kernel modules and the concerns of those modules. Had the ZFS developers shown up on the Linux kernel mailing list to try to influence the cgroup v2 IO controller or ask for features, I suspect that they would have been ignored.)

linux/ZFSOnLinuxVersusBlockIOLimits written at 22:56:23;

