My problem with ZFS

June 3, 2008

The more I use ZFS the less happy I become with it. It's not from any issues with ZFS's features or with how it works in practice; ZFS isn't one of those things that look better in theory than in practice. Based on my exposure so far it pretty much does live up to its promise, and the ZFS people are busy sanding off the rough edges and adding useful features.

My problem with ZFS is that if something goes wrong it all too often goes wrong in a big way; there are very few states between 'your pool is fine' and 'your pool is totally dead and probably panicking the system'. In practice, ordinary filesystems almost always degrade gradually, so that when something goes wrong (even something fairly severe) the system keeps lurching on as best it can and keeps giving you as much data as possible. In the same sort of situations ZFS just shatters; your data is entirely lost and your entire system probably just went down. This might be tolerable if things only went badly wrong once in a while, but this is not the case; it is distressingly easy to run into situations that ZFS doesn't know how to cope with (I seem to have a talent for it).

So what about ZFS's vaunted resilience? My strong impression is that ZFS is only resilient against very limited sorts of damage; outside of those areas, ZFS makes no attempt to cope and just gives up entirely. This is no doubt easy to code, but does not strike me as at all the right thing to do.

This is especially bad because ZFS has no usable tools for recovering data from damaged filesystems, no equivalents of fsck and dump. With traditional filesystems, even badly damaged ones, you at least have tools that you can use to try to salvage something. With ZFS, damaged is dead; restore from backups.

(And ZFS snapshots will not help you, because those are stored within the ZFS pool; lose the pool and you lose the snapshots as well.)

Or in short: if something goes wrong with ZFS, it generally goes badly wrong. And things go wrong with ZFS too often for real comfort.

Sidebar: what went wrong today

I was testing what would happen if a ZFS pool on shared iSCSI storage were accidentally active on two systems at once, for example after a fumbled failover of some sort. So I did:

  • import the pool on both systems (forcing it on the second one)
  • read a 10 GB test file on both systems
  • export the pool on one system, ending that system's use of it
  • scrub the pool on the other system to see what happened

What I hoped would happen was nothing much. What I got was not just a dead pool but a pool that panicked any system that tried to import it.
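
For anyone who wants to see the shape of the test, the four steps above can be sketched as a dry-run plan that only builds and prints the commands; it does not run them. This is a rough sketch under assumptions, not the exact commands I typed: the pool name, host names, and file path are hypothetical, reading the file with dd is just one way to do it, and which host exports and which one scrubs is an arbitrary choice here. The zpool subcommands themselves (import, import -f, export, scrub) are the standard ones.

```python
import shlex


def dual_import_test_plan(pool, host_a, host_b, test_file):
    """Build the (host, argv) steps for the dual-import test as a dry run.

    All arguments are hypothetical placeholders. Nothing is executed here.
    """
    # Assumption: the test file is read with dd; any full read would do.
    read_file = ["dd", f"if={test_file}", "of=/dev/null", "bs=1024k"]
    return [
        (host_a, ["zpool", "import", pool]),        # normal import on the first system
        (host_b, ["zpool", "import", "-f", pool]),  # forced import on the second system
        (host_a, list(read_file)),                  # read the test file on both systems
        (host_b, list(read_file)),
        (host_a, ["zpool", "export", pool]),        # one system stops using the pool
        (host_b, ["zpool", "scrub", pool]),         # the other scrubs it
    ]


def render(plan):
    """Format each step as a 'host# command' line for a human to review."""
    return [f"{host}# {shlex.join(argv)}" for host, argv in plan]


if __name__ == "__main__":
    plan = dual_import_test_plan("testpool", "hostA", "hostB", "/testpool/bigfile")
    for line in render(plan):
        print(line)
```

Only try this against a scratch pool on machines you can afford to have panic.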

Comments on this page:

From at 2008-06-04 09:39:10:

wow! Nothing says "high availability" like a filesystem that commits suicide by taking out everything it touches.


By cks at 2008-06-06 00:07:27:

To be fair to ZFS, I don't think ZFS has ever claimed to be a high availability filesystem. I don't know if it's even claimed to be 'enterprise ready'.

(Whether ZFS is production ready in its current state is another argument, one that I don't currently have an answer to since we haven't had to make that decision yet.)

From at 2008-06-12 20:08:02:

Of course, all the docs warn you about importing a zfs pool on multiple systems. The -f force flag is there to let you import it if you know absolutely for sure that the pool is not imported elsewhere.

If you intentionally shoot yourself in the foot, it's hard to put the blame on zfs for that.

By cks at 2008-06-12 23:49:40:

The short answer is that you have to use -f in failover situations, and accidents happen (especially in situations where things have already gone wrong and people are under a lot of stress, ie many failover situations). The longer answer is in ZFSSanFailoverProblem.

What I would like is for accidents to have less catastrophic consequences in ZFS, because accidents always happen sooner or later. ZFS really appears to be a filesystem where you must do everything right, or else.

From at 2008-06-13 04:24:42:

We've seen this issue also, in a very similar setup to the one you're developing. Our only solution is the fencing you mention in your latest ZFSSanFailoverProblem entry. We have to ensure that the LUNs presented to each server are only visible to the system that is currently importing the zpool via management scripts which are used to perform failovers. They simply try to unmap LUNs for the same partitions that are mapped to the other host from the shared SAN.

This went horribly wrong recently due to a bug in one of the scripts, which triggered exactly this catastrophic failure: the zpool was imported on both hosts. The machines then panicked on every reboot as they attempted to import the zpools, until the LUNs were unmapped, and the zpool had to be rebuilt from backups.

Glad to hear we're not alone though!


Written on 03 June 2008.
Last modified: Tue Jun 3 23:51:59 2008