Some notes on ZFS per-user quotas and their interactions with NFS

March 3, 2017

In addition to quotas on filesystems themselves (refquota) and quotas on entire trees (plain quota), ZFS also supports per-filesystem quotas on how much space users (or groups) can use. We haven't previously used these for various reasons, but today we had a situation with an inaccessible runaway user process eating up all the free space in one pool on our fileservers and we decided to (try to) stop it by sticking a quota on the user. The result was reasonably educational and led to some additional educational experimentation, so now it's time for notes.

User quotas for a user on a filesystem are created by setting the userquota@<user> property of the filesystem to some appropriate value. Unlike overall filesystem and tree quotas, you can set a user quota that is below the user's current space usage. To see the user's current space usage, you look at userused@<user> (which will have its disk space number rounded unless you use 'zfs get -p userused@<user> ...'). To clear the user's quota limit once you no longer need it, set it to none instead of a size.
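As a rough illustration of the three operations above (set, inspect, clear), here is a minimal Python sketch that just builds the relevant 'zfs' command lines. The helper names, the filesystem 'tank/home' and the user 'alice' are all made up for the example; only the property names and the 'zfs set' / 'zfs get -p' usage come from the text above.

```python
import subprocess


def quota_set_cmd(fs, user, limit):
    """Command to set a user quota; pass limit='none' to clear it again."""
    return ["zfs", "set", f"userquota@{user}={limit}", fs]


def used_get_cmd(fs, user):
    """Command to read the user's exact space usage on one filesystem."""
    # -H: no header line, -p: exact (unrounded) numbers,
    # -o value: print only the value column.
    return ["zfs", "get", "-Hp", "-o", "value", f"userused@{user}", fs]


def run(cmd):
    """Run a command on the fileserver and return its trimmed output."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Hypothetical usage on a fileserver (not run here):
#   run(quota_set_cmd("tank/home", "alice", "10G"))   # impose a limit
#   run(used_get_cmd("tank/home", "alice"))           # exact bytes in use
#   run(quota_set_cmd("tank/home", "alice", "none"))  # lift the limit
```

Keeping the command construction separate from running it makes it easy to print what would be done before actually touching a production pool.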

(The current Illumos zfs manpage has an annoying mistake, where its section on the userquota@<user> property talks about finding out space by looking at the 'userspace@<user>' property, which is the wrong property name. I suppose I should file a bug report.)

Since user quotas are per-filesystem only (as mentioned), you need to know which filesystem or filesystems your errant user is using space on in your pool in order to block a runaway space consumer. In our case we already have some tools for this and had localized the space growth to a single filesystem; otherwise, you may want to write a script in advance so you can freeze someone's space usage at its current level on a collection of filesystems.

(The mechanics are pretty simple; you set the userquota@<user> value to the value of the userused@<user> property, if it exists. I'd use the precise value from 'zfs get -p' unless you're sure no user will ever use enough space on a filesystem to make the rounding errors significant.)
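A minimal sketch of such a freeze script might look like the following. It reads the exact userused@<user> value (the property that reports current usage) on each filesystem and sets userquota@<user> to it. The function names and the 'run' hook are my own invention for the example, and I'm assuming that a filesystem where the user has no usage reports something non-numeric (such as '-') or zero; those filesystems are skipped, since I'm not sure a zero quota would do what you want.

```python
import subprocess


def zfs(*args):
    """Run 'zfs <args>' and return its standard output."""
    result = subprocess.run(["zfs", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def parse_used(output):
    """Exact byte count from 'zfs get -Hp -o value', or None if not a number."""
    value = output.strip()
    return int(value) if value.isdigit() else None


def freeze_user(user, filesystems, run=zfs):
    """Set the user's quota to their current usage on each filesystem.

    Returns a dict of {filesystem: bytes} for the quotas that were set.
    """
    frozen = {}
    for fs in filesystems:
        used = parse_used(run("get", "-Hp", "-o", "value",
                              f"userused@{user}", fs))
        if not used:
            # No recorded usage (or zero): skipped, as an assumption.
            continue
        run("set", f"userquota@{user}={used}", fs)
        frozen[fs] = used
    return frozen
```

Note that because of the skipping, this only stops growth on filesystems where the user already has space in use; a runaway process could still start writing somewhere new. The returned dict is worth saving so you know which quotas to set back to none afterwards.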

Then we have the issue of how firmly and how fast quotas are enforced. The zfs manpage warns you explicitly:

Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message.

This is especially the case over NFS (at least NFS v3), where NFS clients may not start flushing writes to the NFS server for some time. In my testing, I saw the NFS client's kernel happily accept a couple of GB of writes before it started forcing them out to the fileserver.
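To make the client-side buffering concrete, here is a small sketch of how a program on the NFS client could force its writes out and notice the quota error early instead of gigabytes later. The function name is hypothetical, and I'm assuming the usual behavior where a server-side quota failure surfaces as EDQUOT from write(), fsync() or close() on the client.

```python
import errno
import os


def write_and_flush(path, data):
    """Append data and push it to the server right away.

    Returns 'ok' or 'over-quota'; any other error is re-raised.
    """
    try:
        with open(path, "ab") as f:
            f.write(data)
            f.flush()              # Python's buffer -> kernel
            os.fsync(f.fileno())   # kernel -> (NFS) server
        # Leaving the 'with' block closes the file, which can also
        # report a delayed write error, so it stays inside the try.
    except OSError as e:
        if e.errno == errno.EDQUOT:
            return "over-quota"
        raise
    return "ok"
```

A program that never calls fsync() may only find out about the quota when the client kernel decides to flush, or when the file is closed.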

The behavior of an OmniOS NFS server here is somewhat variable. On the one hand, we saw space usage for our quota'd user keep increasing over the quota for a certain amount of time after we applied the quota (unfortunately I was too busy to time it or carefully track it). On the other hand, in testing, if I started to write to an existing but empty file (on the NFS client) once I was over quota, the NFS server refused all writes and didn't put any data in the file. My conclusion is that at least for NFS servers, the user may be able to go over your quota limit by a few hundred megabytes under the right circumstances. However, once ZFS knows that you're over the quota limit a lot of things shut down immediately; you can't make new files, for example (and NFS clients helpfully get an immediate error about this).
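If you want to actually time and track the overshoot next time (which I didn't manage), something like the following polling sketch could be left running on the fileserver. Everything here is illustrative: the function names and the sampling parameters are made up, and it can only show what ZFS itself reports in userused@<user>, which may lag reality. How an unset quota prints under 'zfs get -p' is an assumption on my part, so both a non-number and 0 are treated as "no quota".

```python
import subprocess
import time


def get_bytes(fs, prop):
    """Read one property as an exact number, or None if it isn't numeric."""
    result = subprocess.run(["zfs", "get", "-Hp", "-o", "value", prop, fs],
                            check=True, capture_output=True, text=True)
    out = result.stdout.strip()
    return int(out) if out.isdigit() else None


def overshoot(used, quota):
    """Bytes by which usage exceeds the quota; 0 if under or no quota set."""
    if used is None or not quota:
        return 0
    return max(0, used - quota)


def watch(fs, user, interval=1.0, samples=60, read=get_bytes, sleep=time.sleep):
    """Log usage against quota every 'interval' seconds; return peak overshoot."""
    peak = 0
    for _ in range(samples):
        used = read(fs, f"userused@{user}")
        quota = read(fs, f"userquota@{user}")
        over = overshoot(used, quota)
        peak = max(peak, over)
        print(f"{time.strftime('%H:%M:%S')} used={used} quota={quota} over={over}")
        sleep(interval)
    return peak
```

Starting this just before applying the quota would give a timestamped record of how long usage keeps climbing and how far past the limit it gets.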

(I took a quick look at the kernel code but I couldn't spot where ZFS updates the space usage information in order to see what sort of lag there is in the process.)

I haven't tested what happens to fileserver performance if an NFS client keeps trying to write data after it has hit the quota limit and has started getting EDQUOT errors. You'd think that the fileserver should be unaffected, but we've seen issues when pools hit overall quota size limits.

(It's not clear if this came up today when the user hit the quota limit and whatever process(es) they were running started to get those EDQUOT errors.)

Written on 03 March 2017.
Last modified: Fri Mar 3 01:01:22 2017