Limiting the size of things in a filesystem is harder than it looks

October 9, 2019

Suppose, not entirely hypothetically, that you want to artificially limit the size of a filesystem or perhaps something within it, for example the space used by a particular user. These sort of limits usually get called quotas. Although you might innocently think that enforcing quotas is fairly straightforward, it turns out that it can be surprisingly complicated and hard even in very simple filesystems. As filesystems become more complicated, it can rapidly become much more tangled than it looks.

Let's start with a simple filesystem with no holes in files and us only wanting to limit the amount of data that a user has (not filesystem metadata). If the user tries to write 128 Kb to some file, we already need to know where in the file it's going. If the 128 Kb is entirely overwriting existing data, the user uses no extra space, if it's being added to the end of the file, they use 128 Kb more space, and if it partly overlaps with the end of the file, they use less than 128 Kb. Fortunately the current size of a file that's being written to is generally very accessible to the kernel, so we can probably know right away whether the user's write can be accepted or has to be rejected because of quota issues. Well, we can easily know until we throw multiple CPUs into the situation, with different programs on different CPUs all executing writes at once. Once we have several CPUs, we have to worry about synchronizing our information on how much space the user is currently using.

Now, suppose that we want to account for filesystem metadata as well, and that files can have unallocated space in the middle of themselves. Now the kernel doesn't know how much space 128 Kb of file data is going to use until it's looked at the file's current indirect blocks. Writing either after the current end of the file or before it may require allocating new data blocks and perhaps new indirect blocks (in extreme cases, several levels of them). The existing indirect blocks for the file may or may not already be in memory; if they aren't, the kernel doesn't know whether it can accept the write until it reads them off disk, which may take a while. The kernel can optimistically accept the write, start allocating space for all of the necessary data and metadata, and then abort if it runs into a quota limit by the end. But if it does this, it has to have the ability to roll back all of those allocations it may already have done.

(Similar issues come up when you're creating or renaming files and more broadly whenever you're adding entries to a directory. The directory may or may not have a free entry slot already, and adding your new or changed name may cause a cascade of allocation changes, especially in sophisticated directory storage schemes.)

Features like compression and deduplication during writes complicate this picture further, because you don't know how much raw data you're going to need to write until you've gone through processing it. You can even discover that the user will use less space after the write than before, if they replace incompressible unique data with compressible or duplicate data (an extreme case is turning writes of enough zero bytes into holes).

If the filesystem is a modern 'copy on write' one such as ZFS, overwriting existing data may or may not use extra space even without compression and deduplication. Overwriting data allocates a new copy of the data (and metadata pointing to it), but it also normally frees up the old version of the data, hopefully giving you a net zero change in usage. However, if the old data is part of a snapshot or otherwise referenced, you can't free it up and so an 'overwrite' of 128 Kb may consume the same amount of space as appending it to the file as new data.

Filesystems with journals add more issues and questions, especially the question of whether you add operations to the journal before you know whether they'll hit quota limits or only after you've cleared them. The more you check before adding operations to the journal, the longer user processes have to wait, but the less chance you have of hitting a situation where an operation that's been written to the journal will fail or has to be annulled. You can certainly design your journal format and your journal replay code to cope with this, but it makes life more complicated.

At this point you might wonder how filesystems that support quotas ever have decent performance, if checking quota limits involves all of this complexity. One answer is that if you have lots of quota room left, you can cheat. For instance, the kernel can know or estimate the worst case space usage for your 128 Kb write, see that there is tons of room left in your quota even in the face of that, and not delay while it does further detailed checks. One way to deal with the SMP issue is to keep a very broad count of how much outstanding write IO there is (which the kernel often wants anyway) and not bother with synchronizing quota information if the total outstanding writes are significantly less than the quota limit.

(I didn't realize a lot of these lurking issues until I started to actually think about what's involved in checking and limiting quotas.)

Written on 09 October 2019.
« How we implement reboot notifications when our machines reboot in Prometheus
Some additional information on ZFS performance as you approach quota limits »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Oct 9 21:31:43 2019
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.