Wandering Thoughts archives

2024-12-10

My wish for VFS or filesystem level cgroup (v2) IO limits

Over on the Fediverse, I wished for better IO limits than cgroup (v2) has today:

I wish Linux cgroups (v2 of course) had an option/interface that limited *filesystem* IO that you could do, read and/or write. They have 'block IO' limits but these are often ineffective for an assortment of reasons, including that you're not doing block IO (hi, NFS) or that the underlying filesystem and storage stack doesn't support them (... many). A VFS level limit would be nominally simple and very useful.

Cgroup(s) v2 have an IO controller that has both priorities (which only work in limited circumstances) and absolute limits, which are applied on a per block device basis and so appear to have some serious limitations. Based on what appears in the io.stat file, you might be able to limit bandwidth to a software RAID device, but you can't do it for ZFS filesystems (which is important to us) or for NFS filesystems, since the latter don't do 'block IO' at all.
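For concreteness, the existing block-level limits work by writing per-device lines to a cgroup's io.max file, keyed by the device's major:minor numbers (the device numbers and cgroup path below are illustrative). A minimal sketch:

```python
# Sketch of setting a cgroup v2 io.max limit: cap reads at 2 MiB/s and
# writes at 1 MiB/s on block device 8:16. The "8:16" pair and the
# "mygroup" cgroup name are illustrative; check /proc/partitions for
# your real device numbers. This requires root and a v2 cgroup hierarchy.
limit = "8:16 rbps=2097152 wbps=1048576\n"

with open("/sys/fs/cgroup/mygroup/io.max", "w") as f:
    f.write(limit)
```

Since the key is a block device's major:minor number, there is simply nothing you could write here for an NFS mount, which is the crack that network filesystems fall through.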

In theory, this could be worked around if cgroup(s) v2 also had a controller that operated at the higher level of Linux's VFS, or perhaps at the level of individual filesystems, where it would apply to things like read(), write(), and IO performed through mmap()'d files. Since all filesystems go through the VFS, these limits would naturally apply no matter what the underlying filesystem was or the storage stack it was operating on top of.

As covered in /proc/pid/mountinfo(5), filesystems (well, mounts) do have identifying numbers that could be used the same way the IO controller uses block device numbers, in order to implement filesystem-specific limits. But I'd be happy with an overall limit, and in fact I'd like one even if you could set per-filesystem limits too.
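To illustrate, the per-mount identifier is the first field of each /proc/self/mountinfo line, and a hypothetical VFS controller could key on it the way io.max keys on major:minor device numbers. (No such controller exists; this only shows where the identifiers would come from.)

```python
# Sketch: extract the mount ID and mount point from a
# /proc/self/mountinfo line. Fields are: mount ID, parent ID,
# major:minor, root, mount point, options, ... (see proc(5)).

def parse_mountinfo_line(line):
    fields = line.split()
    mount_id = int(fields[0])   # unique identifier for this mount
    mount_point = fields[4]     # where the mount is attached
    return mount_id, mount_point

# A representative mountinfo line for /proc:
sample = "26 20 0:23 / /proc rw,nosuid,nodev,noexec,relatime shared:14 - proc proc rw"
print(parse_mountinfo_line(sample))  # (26, '/proc')
```

A hypothetical 'vfs.max' file could then accept lines like "26 rbps=2097152", mirroring the io.max syntax with a mount ID in place of a device number.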

(The current memory and IO controllers cooperate to create writeback limits, but a VFS level write bandwidth limit might be easier and more direct.)

However, I suspect that even a general VFS-wide set of limits will never be implemented in cgroup v2, for two reasons. First, such a limit only cleanly applies to direct read and write IO involving files; it's at best awkward to extend it to, for example, reading directories, or worse, doing 'metadata' operations like creating, renaming, and deleting files (all of which can trigger various amounts of IO), or stat()'ing things. Second, I suspect that there would be implementation complexities in applying this to memory mapped files, although maybe you could put the relevant process (thread) to sleep for a while during page fault handling in order to implement rate limits.

(The cgroup v2 people might also consider VFS level limits to be the wrong way to go about things, but then I don't know how they expect people to rate limit networked filesystems. As far as I can tell, there is currently no network controller to rate limit overall network traffic, and it would likely need cooperation from the NFS client implementation anyway.)

linux/CgroupVFSIORatelimitWish written at 23:02:50;

