Our ZFS fileservers have a serious problem when pools hit quota limits
Sometimes not everything goes well with our ZFS fileservers. Today was one of those times and as a result this is an entry where I don't have any solutions, just questions. The short summary is that we've now had a fileserver get very unresponsive and in fact outright lock up when a ZFS pool that's experiencing active write IO runs into a pool quota limit.
Importantly, the pool has not run out of actual disk space; it has only run into the quota limit, which is about 235 GB below the space limit as 'zfs list' reports it (or would, if there was no pool quota). Given things we've seen before with full pools, I would not have been surprised to experience these problems if the pool had run itself out of actual disk space. However, it didn't; it only ran into an entirely artificial quota limit. And things exploded anyways.
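Since the trigger seems to be the quota rather than real disk space, one thing that could be watched is how much room is left before a pool's quota is hit. What follows is only a minimal sketch, not something we run: it assumes the tab-separated output of 'zfs get -Hp -o name,property,value used,quota', it assumes that an unset quota shows up as 0, 'none' or '-', and the function names and the warning threshold are made up for illustration.

```python
import subprocess


def parse_zfs_get(text):
    """Parse 'zfs get -Hp -o name,property,value ...' output.

    Returns {dataset: {property: int or None}}; values that are not
    plain integers (such as 'none' or '-') become None.
    """
    props = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip anything that is not name<TAB>property<TAB>value
        name, prop, value = fields
        try:
            number = int(value)
        except ValueError:
            number = None
        props.setdefault(name, {})[prop] = number
    return props


def quota_headroom(used, quota):
    """Bytes left before the quota is hit, or None if no quota is set."""
    if not quota:  # covers both None and 0 (assumed to mean 'no quota')
        return None
    return quota - used


def near_quota(text, warn_bytes):
    """Return [(dataset, headroom)] for datasets within warn_bytes of quota."""
    warnings = []
    for name, values in parse_zfs_get(text).items():
        used = values.get("used")
        if used is None:
            continue
        room = quota_headroom(used, values.get("quota"))
        if room is not None and room < warn_bytes:
            warnings.append((name, room))
    return warnings


def read_zfs_props(dataset):
    """Run zfs get for one dataset (hypothetical helper; needs zfs in PATH)."""
    cmd = ["zfs", "get", "-Hp", "-o", "name,property,value",
           "used,quota", dataset]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```

Something like near_quota(read_zfs_props("tank"), 10 * 2**30) would then flag a hypothetical pool called 'tank' once it is within 10 GB of its quota. This only tells you that you are close to the edge; it says nothing about why going over it locks the machine up.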
(Specifically, the pool had a
quota setting, since
on a pool where all the data is in filesystems isn't good for
Unfortunately we haven't gotten a crash dump. By the time there were
serious indications of a problem the system had locked up, and anyways
our past attempts to get crash dumps in the same situation have
been ineffective (the system would start to dump but then appear
To the extent that we can tell anything, the few console messages that get logged vaguely suggest kernel memory issues. Or perhaps I am simply reading too much into messages like 'unsolicited ack for DL_UNITDATA_REQ on e1000g1'. Since the problem is erratic and usually materializes with little or no warning, I don't think we've captured, e.g., mpstat output during the run-up to a lockup to see things like whether CPU usage is going through the roof.
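Because there is little or no warning, the only way to have run-up data is to already be collecting it when things go wrong. A rough sketch of that idea, under some assumptions: the function names, log path and intervals are all invented for illustration, the log has to live somewhere that does not depend on the affected pool, and a machine that is already locking up may well hang the sampled command (or this script) regardless of the timeout.

```python
import os
import subprocess
import time


def sample_once(command, log_path, timeout=10.0):
    """Run one diagnostic command and append its timestamped output to a log.

    The log is flushed and fsync'd after every sample so that as much as
    possible is on disk if the machine locks up shortly afterwards.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
        body = result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        body = "(command timed out)\n"
    except OSError as exc:
        body = f"(could not run command: {exc})\n"
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} {' '.join(command)}\n{body}")
        log.flush()
        os.fsync(log.fileno())
    return body


def sample_forever(commands, log_path, interval=10.0):
    """Loop over the commands indefinitely, pausing between rounds."""
    while True:
        for command in commands:
            sample_once(command, log_path)
        time.sleep(interval)
```

A call such as sample_forever([["mpstat", "1", "2"]], "/var/tmp/runup.log") would record a pair of mpstat reports every ten seconds or so. Whether that is frequent enough to catch anything useful before a lockup is an open question.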
I don't think that this happens all the time, as we've had this specific pool go to similar levels of being almost full before and the system hasn't locked up. The specific NFS IO pattern likely has something to do with it, as we've failed to reproduce system lockups in a test setup even with genuinely full pools, but of course we have no real idea what the IO pattern is. Given our multi-tenancy we can't even be confident that IO to the pool itself is the only contributor; we may need a pattern of IO to other pools as well to trigger problems.
(I also suspect that NFS and iSCSI are probably both involved in the problem. Partly this is because I would have expected a mere pool quota issue with ZFS alone to have been encountered before now, or even with ZFS plus NFS, since a fair number of people run ZFS-based NFS fileservers. I suspect we're one of the few places using ZFS with iSCSI as the backend and then doing NFS on top of it.)
One thing that writing this entry has convinced me of is that I should pre-write a bunch of questions and things to look at in a file, so that I have them on hand the next time things start going south and I don't have to rely on my fallible memory to come up with what troubleshooting we want to try. Of course these events are sufficiently infrequent that I may forget where I put the file by the time the next one happens.