My problem with ZFS
The more I use ZFS the less happy I become with it. This isn't because of any issues with ZFS's features or with how it works in practice; ZFS isn't one of those things that look better in theory than in practice. Based on my exposure so far it pretty much does live up to its promise, and the ZFS people are busy sanding off the rough edges and adding useful features.
My problem with ZFS is that if something goes wrong it all too often goes wrong in a big way; there are very few states between 'your pool is fine' and 'your pool is totally dead and probably panicking the system'. In practice, ordinary filesystems almost always degrade gradually, so that when something goes wrong (even fairly severe things) the system keeps lurching on as best it can and keeps giving you as much data as possible. In the same sort of situations ZFS just shatters; your data is entirely lost and your entire system probably just went down. This might be tolerable if things only went badly wrong once in a while, but this is not the case; it is distressingly easy to run into situations that ZFS doesn't know how to cope with (I seem to have a talent for it).
So what about ZFS's vaunted resilience? My strong impression is that ZFS is only resilient against very limited sorts of damage; outside of those areas, ZFS makes no attempt to cope and just gives up entirely. This is no doubt easy to code, but does not strike me as at all the right thing to do.
This is especially bad because ZFS has no usable tools for recovering data from damaged filesystems, no equivalents of fsck and dump. With traditional filesystems, even badly damaged ones, you at least have tools that you can use to try to salvage something. With ZFS, damaged is dead; restore from backups.
(And ZFS snapshots will not help you, because those are stored within the ZFS pool; lose the pool and you lose the snapshots as well.)
Or in short: if something goes wrong with ZFS, it generally goes badly wrong. And things go wrong with ZFS too often for real comfort.
Sidebar: what went wrong today
I was testing what would happen if a ZFS pool on shared iSCSI storage was accidentally active on two systems at once, for example if you had a fumbled failover situation of some sort. So I did:
- import the pool on both systems (forcing it on the second one)
- read a 10 GB test file on both systems
- export the pool on one system, ending that system's use of it
- scrub the pool on the other system to see what happened
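In concrete terms, the test sequence above corresponds to something like the following zpool commands (the pool name `tank` and the test file path are placeholders for illustration; `-f` is what forces the import on the second system even though the pool is marked as active elsewhere):

```sh
# On host A: import the shared pool normally.
zpool import tank

# On host B: force the import even though host A still has the
# pool active (this is the dangerous part of the test).
zpool import -f tank

# On both hosts: read the large test file to generate I/O.
cat /tank/testfile > /dev/null

# On host A: export the pool, ending this host's use of it.
zpool export tank

# On host B: scrub the pool and see what state it is in.
zpool scrub tank
zpool status tank
```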
What I hoped to happen was nothing much. What I got was not just a dead pool but a pool that panicked any system that tried to import it.