ZFS performance really does degrade as you approach quota limits
Every so often (currently monthly), there is an "OpenZFS leadership meeting". What this really means is 'lead developers from the various ZFS implementations get together to talk about things'. Announcements and meeting notes from these meetings get sent out to various mailing lists, including the ZFS on Linux ones. In the September meeting notes, I read a very interesting (to me) agenda item:
- Relax quota semantics for improved performance (Allan Jude)
- Problem: As you approach quotas, ZFS performance degrades.
- Proposal: Can we have a property like quota-policy=strict or loose, where we can optionally allow ZFS to run over the quota as long as performance is not decreased.
This is very interesting to me because of two reasons. First, in
the past we have definitely seen significant problems on our OmniOS
machines, both when an entire pool hits a
quota limit and when a single filesystem hits a
refquota limit. It's nice to know
that this wasn't just our imagination and that there is a real issue
here. Even better, it might someday be improved (and perhaps in a
way that we can use at least some of the time).
Second, any number of people here run very close to and sometimes at the quota limits of both filesystems and pools, fundamentally because people aren't willing to buy more space. We have in the past assumed that this was relatively harmless and would only make people run out of space. If this is a known issue that causes serious performance degradation, well, I don't know if there's anything we can do, but at least we're going to have to think about it and maybe push harder at people. The first step will have to be learning the details of what's going on at the ZFS level to cause the slowdown.
(It's apparently similar to what happens when the pool is almost full, but I don't know the specifics of that either.)
With that said, we don't seem to have seen clear adverse effects on our Linux fileservers, and they've definitely run into quota limits (repeatedly). One possible reason for this is that having lots of RAM and SSDs makes the effects mostly go away. Another possible reason is that we haven't been looking closely enough to see that we're experiencing global slowdowns that correlate to filesystems hitting quota limits. We've had issues before with somewhat subtle slowdowns that we didn't understand (cf), so I can't discount that we're having it happen again.