2014-10-26
Things that can happen when (and as) your ZFS pool fills up
There's a shortage of authoritative information on what actually happens if you fill up a ZFS pool, so here is what I've gathered about it from other people's information and what I've experienced myself.
The most often cited problem is bad performance, with the usual cause being ZFS needing to do an increasing amount of searching through ZFS metaslab space maps to find free space. If not all of these are in memory, a write may require pulling some or all of them into memory, searching through them, and perhaps still not finding enough space. People cite various fullness thresholds for this starting to happen, eg anywhere from 70% full to 90% full. I haven't seen any discussion of how severe this performance impact is supposed to be (and on what sort of vdevs; raidz vdevs may behave differently than mirror vdevs here).
(How many metaslabs you have turns out to depend on how your pool was created and grown.)
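As a rough illustration (with 'tank' standing in for your pool's name), 'zpool list' shows how full the pool is overall, while 'zdb -m' dumps per-metaslab information, including how much free space each metaslab has; the latter can be quite verbose on pools with many metaslabs:

    # overall pool size, allocated space, and capacity percentage
    zpool list tank
    # per-vdev, per-metaslab offsets, space map objects, and free space
    zdb -m tank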
A nearly full pool can also have (and lead to) fragmentation, where the free space is in small scattered chunks instead of large contiguous runs. This can lead to ZFS having to write 'gang blocks', which are a mechanism where ZFS fragments one large logical block into smaller chunks (see eg the mention of them in this entry and this discussion which corrects some bits). Gang blocks are apparently less efficient than regular writes, especially if there's a churn of creation and deletion of them, and they add extra space overhead (which can thus eat your remaining space faster than expected).
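As far as I can tell, sufficiently recent ZFS versions (ones with the spacemap_histogram pool feature) expose a pool-wide fragmentation metric, and zdb's whole-pool block statistics include a count of ganged blocks. A rough sketch, again with 'tank' as a stand-in pool name:

    # pool-wide fragmentation metric, if your ZFS version has it
    zpool get fragmentation tank
    # zdb -bb walks every block in the pool (slow); its summary
    # includes a 'ganged count' line
    zdb -bb tank | grep -i ganged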
If a pool gets sufficiently full, you stop being able to change most filesystem properties; for example, you can't set or modify the mountpoint or change NFS exporting. In theory it's not supposed to be possible for user writes to fill up a pool that far. In practice, all of our full pools here have left us unable to make such property changes (which can be a real problem under some circumstances).
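For illustration, here is roughly what this looks like when it happens ('tank/fs' is a hypothetical filesystem and the exact error text may vary between ZFS versions):

    # even a small property change needs to write pool metadata,
    # so on a sufficiently full pool it can fail outright:
    zfs set mountpoint=/new/place tank/fs
    # cannot set property for 'tank/fs': out of space
    zfs set sharenfs=on tank/fs
    # cannot set property for 'tank/fs': out of space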
You are supposed to be able to remove files from a full pool (unless snapshots are holding on to their space), but we've also had reports from users that they couldn't do so; their deletion attempts failed with 'No space left on device' errors. I have not been able to reproduce this, and the problem has always gone away on its own.
(This may be due to a known and recently fixed issue, Illumos bug #4950.)
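When this comes up, the usual suggestion is to free up a bit of space some other way and then retry the removal. A rough sketch, with hypothetical filesystem and snapshot names:

    # see where the pool's space is going, including snapshot usage
    zfs list -o space -r tank
    # destroying a snapshot or dropping a reservation frees space
    # without having to unlink anything:
    zfs destroy tank/fs@old-snapshot
    zfs set reservation=none tank/scratch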
I've never read reports of catastrophic NFS performance problems for all pools or of a total system lockup resulting from a full pool on an NFS fileserver. However, both of these have happened to us. The terrible performance issue only happened on our old Solaris 10 update 8 fileservers; the total NFS stalls and then system lockups have now happened on both our old fileservers and our new OmniOS-based fileservers.
(Actually let me correct that; I've seen one report of a full pool killing a modern system. In general, see all of the replies to my tweeted question.)
By the way: if you know of other issues with full or nearly full ZFS pools (or if you have additional information here in general), I'd love to know more. Please feel free to leave a comment or otherwise get in touch.