Some notes on ZFS per-user quotas and their interactions with NFS
In addition to quotas on filesystems themselves (refquota) and quotas
on entire trees (plain quota), ZFS also supports per-filesystem quotas
on how much space users (or groups) can use. We haven't previously
used these for various reasons, but today we had a situation with an
inaccessible runaway user process eating up all the free space in one
pool on our fileservers and we decided to (try
to) stop it by sticking a quota on the user. The result was reasonably
educational and led to some additional educational experimentation, so
now it's time for notes.
User quotas for a user on a filesystem are created by setting the
userquota@<user> property of the filesystem to some appropriate
value. Unlike overall filesystem and tree quotas, you can set a
user quota that is below the user's current space usage. To see the
user's current space usage, you look at userused@<user> (which
will have its disk space number rounded unless you use 'zfs get
-p userused@<user> ...'). To clear the user's quota limit after
you don't need it any more, set it to none instead of a size.
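As a minimal sketch of those three operations, here are some small shell wrappers. The function names are made up for illustration; only the zfs commands inside them are the real thing, and the user and filesystem you pass in are whatever applies on your system.

```shell
# Print the user's current usage on a filesystem as an exact byte count.
# -H drops the header, -p gives the unrounded number, -o value prints
# only the value column.
show_usage() {
    zfs get -Hp -o value "userused@$1" "$2"
}

# Cap the user at a given size (for example 10G). This is allowed to be
# below the user's current usage.
set_user_quota() {
    zfs set "userquota@$1=$3" "$2"
}

# Remove the limit again once it is no longer needed.
clear_user_quota() {
    zfs set "userquota@$1=none" "$2"
}
```

With a hypothetical user 'baduser' on a hypothetical filesystem 'tank/homes', you would run 'set_user_quota baduser tank/homes 10G' and later 'clear_user_quota baduser tank/homes'.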
(The current Illumos zfs manpage has an annoying mistake, where
its section on the userquota@<user> property talks about finding
out space by looking at the 'userspace@<user>' property, which
is the wrong property name. I suppose I should file a bug report.)
Since user quotas are per-filesystem only (as mentioned), you need to know which filesystem or filesystems your errant user is using space on in your pool in order to block a runaway space consumer. In our case we already have some tools for this and had localized the space growth to a single filesystem; otherwise, you may want to write a script in advance so you can freeze someone's space usage at its current level on a collection of filesystems.
(The mechanics are pretty simple; you set the userquota@<user>
value to the value of the userused@<user> property, if it exists.
I'd use the precise value from 'zfs get -p' unless you're sure no
user will ever use enough space on a filesystem to make the rounding
errors significant.)
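A script along those lines could be sketched as follows. This is an untested outline, not what we actually run; the function name is invented, and it assumes that 'zfs get -Hp' prints a plain byte count for userused@<user> when the user has space on the filesystem. Anything else (an empty value, '-', or 0) is treated as "no usage here" and skipped.

```shell
# Freeze a user's space usage at its current level on every filesystem
# in a pool. Usage: freeze_user <user> <pool>
freeze_user() {
    user=$1
    pool=$2
    # List every filesystem in the pool, one name per line.
    zfs list -H -o name -r -t filesystem "$pool" |
    while read -r fs; do
        # Exact (unrounded) current usage for this user, in bytes.
        used=$(zfs get -Hp -o value "userused@$user" "$fs")
        case $used in
            # Assumption: non-numeric or zero means the user has no
            # space here, so there is nothing to freeze.
            ''|*[!0-9]*|0) continue ;;
        esac
        zfs set "userquota@$user=$used" "$fs"
    done
}
```

Skipping filesystems where the user has no space means they can still start writing there, so this only freezes usage where it already exists.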
Then we have the issue of how firmly and how fast quotas are enforced.
The zfs manpage warns you explicitly:
Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message.
This is especially the case over NFS (at least NFS v3), where NFS clients may not start flushing writes to the NFS server for some time. In my testing, I saw the NFS client's kernel happily accept a couple of GB of writes before it started forcing them out to the fileserver.
The behavior of an OmniOS NFS server here is somewhat variable. On the one hand, we saw space usage for our quota'd user keep increasing over the quota for a certain amount of time after we applied the quota (unfortunately I was too busy to time it or carefully track it). On the other hand, in testing, if I started to write to an existing but empty file (on the NFS client) once I was over quota, the NFS server refused all writes and didn't put any data in the file. My conclusion is that at least for NFS servers, the user may be able to go over your quota limit by a few hundred megabytes under the right circumstances. However, once ZFS knows that you're over the quota limit a lot of things shut down immediately; you can't make new files, for example (and NFS clients helpfully get an immediate error about this).
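If you want to actually track the overshoot next time instead of being too busy, something like the following could do it. It is only a sketch; the function names are invented, and it assumes 'zfs get -Hp' prints plain byte counts for both userused@<user> and userquota@<user>, treating anything non-numeric (or a zero quota) as "no quota set".

```shell
# Print how many bytes a user is over their quota on a filesystem.
# Prints 0 if they are under quota or no quota appears to be set.
# Usage: over_quota_bytes <user> <filesystem>
over_quota_bytes() {
    used=$(zfs get -Hp -o value "userused@$1" "$2")
    quota=$(zfs get -Hp -o value "userquota@$1" "$2")
    # Treat a missing or non-numeric usage value as zero bytes used.
    case $used in ''|*[!0-9]*) used=0 ;; esac
    # Assumption: a non-numeric or zero quota means no limit is set.
    case $quota in ''|*[!0-9]*|0) echo 0; return ;; esac
    if [ "$used" -gt "$quota" ]; then
        echo $((used - quota))
    else
        echo 0
    fi
}

# Print a Unix timestamp and the current overshoot once a second until
# interrupted, so you can see how long usage keeps growing.
watch_overshoot() {
    while :; do
        echo "$(date +%s) $(over_quota_bytes "$1" "$2")"
        sleep 1
    done
}
```

Run on the fileserver, this only shows what ZFS itself has accounted for; it says nothing about data still sitting unflushed on NFS clients.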
(I took a quick look at the kernel code but I couldn't spot where ZFS updates the space usage information in order to see what sort of lag there is in the process.)
I haven't tested what happens to fileserver performance if an NFS
client keeps trying to write data after it has hit the quota limit
and has started getting EDQUOT errors. You'd think that the
fileserver should be unaffected, but we've seen issues when pools
hit overall quota size limits.
(It's not clear if this came up today when the user hit the quota
limit and whatever process(es) they were running started to get
those EDQUOT errors.)