ZFS's 'panic on on-disk corruption' behavior is a serious flaw
Here's a Twitter conversation from today:
@aderixon: For a final encore at 4pm today, I used a corrupted zpool to kill an entire Solaris database cluster, node by node. #sysadmin
@thatcks: Putting the 'fail' in 'failover'?
@aderixon: Panic-as-a-service. Srsly, "zpool import" probably shouldn't do that.
@thatcks: Sadly, that's one of the unattractive sides of ZFS. 'Robust recovery from high-level on-disk metadata errors' is not a feature.
@aderixon: Just discovering this from bug reports. There will be pressure to go back to VXVM now. :-(
Let me say this really loudly:
Panicking the system is not an error-recovery strategy.
That ZFS is all too willing to resort to system panics instead of having real error handling or recovery for high-level metadata corruption is a significant blemish. Here we see a case where this behavior has had a real impact on a real user, and it may cause people to give up on ZFS entirely. They are not necessarily wrong to do so, either, because they've clearly hit a situation where ZFS can seriously damage their availability.
In my jaundiced sysadmin view, OS panics are for temporary situations where the entire system is sufficiently corrupt or unrecoverable that there is no way out. When ZFS panics on things that are recoverable with more work, it's simply being lazy and arrogant. When the issue is with a single pool, ZFS panicking converts a single-pool issue into an entire-server issue, and servers may have multiple pools and all sorts of other activities going on.
Panicking due to on-disk corruption is even worse, as it converts a lack of error recovery into permanent unavailability (often for the entire system). A temporary situation at least might clear itself when you panic the system and reboot, as you can hope that a corrupted in-memory data structure will be rebuilt in non-corrupted form when the system comes back up. But a persistent condition like on-disk corruption will never go away just because you reboot the server, so there is very little hope that ZFS's panic has worked around the problem. At best, it's still lurking there like a landmine waiting to blow your system up later. At worst, in single-server situations you can easily get the system locked into a reboot loop, where it boots, starts an import, and panics again. In clustering or failover environments, you can wind up taking the entire cluster down (along with all of its services) as the pool with corruption successively poisons every server that tries to recover it.
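To make the cluster failure mode concrete, here is a toy simulation of it. None of this is ZFS code or any real cluster manager's API; Node, PoolCorruptError, import_pool and fail_over are all hypothetical names made up for illustration, and the only thing being modeled is the difference between 'panic the node' and 'fail the import with an error'.

```python
from dataclasses import dataclass, field


class PoolCorruptError(Exception):
    """Hypothetical error for a pool whose on-disk metadata is damaged."""


@dataclass
class Node:
    """Hypothetical cluster node; 'pools' are the pools it currently serves."""
    name: str
    up: bool = True
    pools: list = field(default_factory=list)


def import_pool(node, pool, corrupt, panic_on_corruption):
    """Try to import 'pool' on 'node'. 'corrupt' is the set of damaged pools."""
    if pool in corrupt:
        if panic_on_corruption:
            # The panic takes down the whole node, including its healthy pools.
            node.up = False
            node.pools.clear()
            return False
        # Error handling instead: only this one import fails.
        raise PoolCorruptError(pool)
    node.pools.append(pool)
    return True


def fail_over(nodes, pool, corrupt, panic_on_corruption):
    """Offer 'pool' to each live node in turn; return the nodes still up."""
    for node in nodes:
        if not node.up:
            continue
        try:
            if import_pool(node, pool, corrupt, panic_on_corruption):
                break
        except PoolCorruptError:
            # The pool stays offline, but nothing else is affected.
            break
    return [n.name for n in nodes if n.up]
```

In this sketch, failing over a corrupt pool with panic_on_corruption=True walks through every node and leaves none of them up, while the same failover with panic_on_corruption=False leaves every node running its other pools and only the damaged pool unavailable.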
Unfortunately, none of this is likely to change any time soon, at least in the open source version of ZFS. ZFS has been like this from the start, and no one appears to care enough to fund the significant amount of work that would be necessary to fix its error handling.
(It's possible that Oracle will wind up caring enough about this to do the work in a future Solaris version, but I'm dubious even of that. And if they do, it's not like we can use it.)
(I had my own experience with this sort of thing years ago; see this 2008 entry. As far as I can tell, very little has changed in how ZFS reacts to such problems since then.)