The current weak areas of ZFS on Linux

September 2, 2013

I've been looking into ZFS on Linux for a while as a potential alternative to Illumos for our next generation of fileservers (FreeBSD is unfortunately disqualified). As part of that I have been working to understand ZoL's current weak areas so that I can better understand where it might cause us problems.

The following is the best current information I have; it comes from reading the ZoL mailing list (and at one point asking the ZoL mailing list this exact question).

The weak areas that I know about:

  • Using ZFS as your root filesystem requires wrestling with GRUB, GRUB scripts, initramfs-building scripts, and support in installers (if you want to install the system as ZoL-root from the start). How well this works depends on your distribution; some have good support (eg Gentoo), others have third party repositories with prebuilt packages, and still others leave you on your own.

  • There are periodic problem reports about getting ZFS filesystems reliably mounted on boot.
  • In some environments ZoL can have problems reliably finding the disk devices for your pools on boot. This is especially likely if you use /dev/sd* device names but apparently sometimes happens to people who use more stable identifiers.

    (Apparently part of the likely solution is to hook ZoL into udev so that as disks are discovered ZoL checks to see if a pool now has a full set of devices and can be brought up.)

  • ZoL lacks a number of standard Linux filesystem features, including support for O_DIRECT, asynchronous IO, and POSIX ACLs. It also lacks support for issuing TRIM commands to drives (this is apparently only present in the FreeBSD version of ZFS so far).

  • There is no 'event daemon' to handle events like disks going away. The most significant result of this is that ZFS pool spares do not get activated on disk failure (making them basically pointless).

  • ZFS's use of kernel memory is not well integrated with the Linux kernel memory system, resulting in runaway memory usage in some situations. Apparently metadata intensive workloads (such as rsync runs) are especially prone to this.
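
On the device-discovery item in the list above: one way to see which stable identifiers exist for each disk is to walk the udev-managed symlink directory and group the links by the device node they resolve to. This is only a minimal sketch, assuming the usual /dev/disk/by-id layout; the function name stable_names is made up for illustration and nothing here is part of ZoL.

```python
import os
from collections import defaultdict


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each real device node to the sorted symlink names that point at it.

    by_id_dir is assumed to be a directory of symlinks such as
    /dev/disk/by-id; entries that are not symlinks are ignored.
    """
    mapping = defaultdict(list)
    for name in os.listdir(by_id_dir):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the underlying node (eg /dev/sda)
            mapping[os.path.realpath(link)].append(name)
    return {dev: sorted(names) for dev, names in mapping.items()}
```

Calling stable_names() on a real machine and printing the result shows, for each /dev/sd* node, the names that should survive controller changes and device renumbering.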

The last issue, ZFS's use of kernel memory, deserves more discussion. All of this is what I've gathered from the mailing list and from looking at the ZFS on Linux source code.

To start with, ZFS on Linux is not really ZFS ported to Linux; instead it's mostly the Illumos ZFS code dropped on top of a layer of code to translate and emulate the Solaris kernel APIs that ZFS needs (the SPL, short for 'Solaris Porting Layer'). This includes a great deal of kernel memory handling. The unfortunate result of this is a series of mismatches between what ZFS thinks is going on with kernel memory and what is actually going on, due to the translation and emulation that is required. Through fragmentation that's invisible to ZFS and other issues, ZFS can wind up using a lot more memory for things like the ARC than it is supposed to (because ZFS thinks it's using a lot less memory than it actually is).

(I suspect that ZFS itself still has some degree of the ZFS level fragmentation problems we've seen but that's much less dangerous because it just leaves the ARC smaller than it should be. The ZoL problem is that the ARC and related things can eat all of your RAM and make your kernel explode.)

Whether this happens to you (and how much it affects you) is unpredictable because it depends very much on the details of how your system uses memory. As mentioned, people seem to have problems with metadata heavy workloads but not everyone reporting problems on the ZoL mailing lists is in this situation.
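
If you want to at least watch what ZFS believes about its own memory use, ZoL exposes the ARC counters as text in /proc/spl/kstat/zfs/arcstats. The following is a rough sketch, not a monitoring tool: it parses that text and reports the ARC's self-reported size relative to its maximum. The function names are invented for illustration. Note the important caveat from the discussion above: this is ZFS's own accounting, so it cannot show memory lost to fragmentation that ZFS does not see.

```python
def parse_arcstats(text):
    """Parse kstat text of the form 'name type data' into {name: int}.

    Header lines (the kstat banner and the column titles) are skipped
    because they do not have exactly three fields ending in an integer.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_fill_ratio(stats):
    """Return the ARC's self-reported size as a fraction of c_max."""
    return stats["size"] / stats["c_max"]


def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    """Read and parse the live counters (only works with ZoL loaded)."""
    with open(path) as f:
        return parse_arcstats(f.read())
```

Comparing this self-reported figure against overall system memory use over time is one way to notice when the two start to diverge.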

PS: if you are coming here from Internet searches, please pay attention to the date of this entry. I certainly hope that all of these issues will get dealt with over time.

Comments on this page:

By chris2 at 2013-09-04 10:04:50:

In my limited experiments, I had no problem with a zfs / and an ext2 /boot and an initramfs with zfs included (obviously). No GRUB tweaks required.

From at 2013-09-05 04:50:27:

TRIM is also available in ZFS on Solaris 11.1

By trx at 2013-09-19 10:03:52:

Thank you very much for this compact overview.

It seems like, in most cases, the only show-stopper is the last problem: I can boot from another device/partition/file system, I can postpone mounting of ZFS file systems on boot, I can use a more persistent device naming scheme, I'll find a way around performance-related issues for a start, and I'll monitor the output of 'zpool status' and the SMART daemon to find failed drives, but I cannot deal with random memory exhaustion. That's just not acceptable for a file system meant to be reliable.
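
The "monitor zpool status" workaround could look roughly like the sketch below. It assumes the script-friendly output of 'zpool list -H -o name,health' (one pool per line, name and health separated by whitespace); the function names are hypothetical, and how you alert on the result is left open.

```python
import subprocess


def unhealthy_pools(listing):
    """Return {pool: health} for every pool whose health is not ONLINE.

    listing is the text produced by 'zpool list -H -o name,health'.
    """
    bad = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "ONLINE":
            bad[fields[0]] = fields[1]
    return bad


def zpool_listing():
    """Run zpool and return its output (requires ZFS to be installed)."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Run from cron, unhealthy_pools(zpool_listing()) gives you a dictionary that is empty when everything is fine. This only detects a failed disk; it does not activate a spare.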

Hope that will be the highest priority for ZoL contributors. Everything else can be improved later...

By cks at 2013-09-19 14:28:18:

The memory issues were also the killer for us, unfortunately. Part of the problem (for us) was that we wouldn't have been able to test for them in advance and in fact a production fileserver might initially work and then fall over later as the usage patterns change. We might have been fine, we might not have been, and the risk and uncertainty were too high.

By bassu at 2013-10-04 01:32:18:

It is quite rudimentary to keep your OS and storage separate -- simple as that. I am not sure why people are so obsessed with mixing ZFS with root and boot when its main purpose is just to do storage!

As for the memory leaks, they are everywhere; I have seen much worse in Xen, KVM, Apache and you-name-it other common apps on Linux. They might be too common with ZFS, but keep in mind that new technologies on existing platforms take time to mature, like any other OSS project out there!

I've had ZFS on Linux in production systems backed by RHEL/CentOS for over a year, moving quickly away from my NexentaStor installations.

  • I don't think it makes sense to leverage ZFS as a boot OS when Linux has other stable/proven alternatives. ZFS is for the data drives only.

  • Getting the filesystems to mount deterministically usually requires setting an /etc/hostid value.

  • For pool creation and device naming, I use the WWNs of the devices found in /dev/disk/by-id rather than the typical /dev/sdX entries. This makes the pools somewhat portable and immune to problems that come from adding/removing controllers and device renaming.

  • I can't speak to TRIM, ACLs, etc. It hasn't been a problem yet in my usage.

  • I need to double-check my disk-failure history. I'm not running spares on most of my ZFS on Linux data pools, but I don't believe you need the FMA to trigger things like a spare rebuild. I do think the zpool "autoreplace" property handles this.

  • For memory, I manually limit ARC size to about 40%-45% of available RAM since ZFS and the Linux virtual memory subsystem tend to fight. This resolved long-term issues with things like rsyncing large file trees. There are a few other knobs that need twisting, but performance has been great.
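
As a rough illustration of the ARC limit described in the last item above, the sketch below works out a byte value for the zfs_arc_max module parameter from the MemTotal line of /proc/meminfo and formats it as a modprobe options line. The 40% default mirrors the figure mentioned above; the function names are made up for this example, and the right percentage for any given machine is an open question.

```python
def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo text and convert kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def arc_max_option(total_bytes, percent=40):
    """Format a modprobe options line capping the ARC at percent of RAM.

    Integer arithmetic is used so the result is exact and rounds down.
    """
    return "options zfs zfs_arc_max=%d" % (total_bytes * percent // 100)
```

The resulting line would go in a file under /etc/modprobe.d/ (the file name is up to you) and takes effect the next time the zfs module is loaded.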

By linux-user at 2013-12-19 11:56:04:

I've started using ZFS a lot and like it. However here are two problems no one else has mentioned.

1. Putting a ZFS vdev on a LUKS encrypted partition doesn't work well. The problem is that the mount has to be deferred until a password is requested and the partition unlocked. I believe that the problem is solvable; the solution is probably quite simple. However, I have spent too much time researching and trying, without luck.

2. Since Ubuntu boot ISOs don't have ZFS, booting from a flash drive to rescue your system won't let you access your ZFS files. I know that you're supposed to be able to build a customized ISO with extra modules. However the process is quite tedious and often fails. E.g., for a long time, the GNOME usb-creator program produced garbage.

By Evert Wiesenekker at 2014-05-08 18:07:20:

Well, thanks to your posting I now know why my first experiences with ZFS on Linux (I tried CentOS and Ubuntu 12) gave me memory problems. I was running three virtual machines on VirtualBox, and when I shut down VMs their memory was not released. Sometimes VMs got aborted.

I decided to drop ZFS and am now running the same VMs on the ext filesystem without any problems.

Written on 02 September 2013.

Last modified: Mon Sep 2 22:13:52 2013