
2015-02-10

Our ZFS fileservers have a serious problem when pools hit quota limits

Sometimes not everything goes well with our ZFS fileservers. Today was one of those times, and as a result this is an entry where I don't have any solutions, just questions. The short summary is that we've now had a fileserver become very unresponsive and in fact outright lock up when a ZFS pool that's experiencing active write IO runs into a pool quota limit.

Importantly, the pool has not actually run out of disk space; it has only run into its quota limit, which is about 235 GB below the space limit that 'zfs list' reports (or would report, if there were no pool quota). Given things we've seen before with full pools, I would not have been surprised to experience these problems if the pool had run itself out of actual disk space. However, it didn't; it only ran into an entirely artificial quota limit. And things exploded anyways.

(Specifically, the pool had a quota setting, since refquota on a pool where all the data is in filesystems isn't good for much.)
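
As a concrete sketch of what that looks like (the pool name 'fs1-data' and the 2.5T figure here are made up for illustration, not our real values):

    # A pool-level 'quota' caps the pool's root dataset plus everything
    # under it, which is what we want when all the data lives in child
    # filesystems. 'refquota' would only cap the root dataset itself,
    # which holds essentially nothing here.
    zfs set quota=2.5T fs1-data
    zfs get quota,refquota fs1-data
    zfs list -o name,used,avail,quota fs1-data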

Unfortunately we haven't gotten a crash dump. By the time there were serious problem indications the system had locked up, and anyways our past attempts to get crash dumps in the same situation have been ineffective (the system would start to dump but then appear to hang). To the extent that we can tell anything, the few console messages that do get logged vaguely suggest kernel memory issues. Or perhaps I am simply reading too much into messages like 'arl_dlpi_pending unsolicited ack for DL_UNITDATA_REQ on e1000g1'. Since the problem is erratic and usually materializes with little or no warning, I don't think we've captured eg mpstat output during the run-up to a lockup to see things like whether CPU usage is going through the roof.
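
If we wanted that sort of data, one approach would be to log it continuously so the run-up is already on disk when a lockup happens. A minimal sketch, assuming a Bourne-compatible shell; the log path and the five-second interval are arbitrary choices of mine, not something we actually run:

    #!/bin/sh
    # Timestamp each line of mpstat output and append it to a log so
    # there is CPU usage data covering the run-up to any lockup.
    # The path and interval are illustrative; log rotation is left out.
    LOG=/var/tmp/mpstat-capture.log
    mpstat 5 | while read line; do
        echo "`date '+%Y-%m-%d %H:%M:%S'` $line"
    done >> "$LOG"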

I don't think that this happens all the time, as we've had this specific pool get almost this full before without the system locking up. The specific NFS IO pattern likely has something to do with it, as we've failed to reproduce system lockups in a test setup even with genuinely full pools, but of course we have no real idea what the actual IO pattern is. Given our multi-tenancy we can't even be confident that IO to the pool itself is the only contributor; we may need a pattern of IO to other pools as well to trigger problems.

(I also suspect that ZFS, NFS, and iSCSI are probably all involved in the problem. Partly this is because I would have expected a mere pool quota issue with ZFS alone to have been encountered before now, or even with ZFS plus NFS, since a fair number of people run ZFS based NFS fileservers. I suspect we're one of the few places using ZFS with iSCSI as the backend and then doing NFS on top of it.)

One thing that writing this entry has convinced me of is that I should pre-write a bunch of questions and things to look at in a file, so that I have them on hand the next time things start going south and don't have to rely on my fallible memory to come up with what troubleshooting we want to try. Of course these events are sufficiently infrequent that I may forget where I put the file by the time the next one happens.

solaris/ZFSNFSPoolQuotaProblem written at 00:38:04

