The current weak areas of ZFS on Linux
I've been looking into ZFS on Linux for a while as a potential alternative to Illumos for our next generation of fileservers (FreeBSD is unfortunately disqualified). As part of that I have been working to understand ZoL's current weak areas so that I can better understand where it might cause us problems.
The following is the best current information I have; it comes from reading the ZoL mailing list (and at one point asking the ZoL mailing list this exact question).
The weak areas that I know about:
- Using ZFS as your root filesystem requires wrestling with GRUB, GRUB scripts, initramfs-building scripts, and support in installers (if you want to install the system as ZoL-root from the start). How well this works depends on your distribution; some have good support (eg Gentoo), others have third party repositories with prebuilt packages, and still others leave you on your own.
- There are periodic problem reports about getting ZFS filesystems reliably mounted on boot.
- In some environments ZoL can have problems reliably finding the disk devices for your pools on boot. This is especially likely if you use /dev/sd* device names, but apparently it sometimes happens to people who use more stable identifiers.
(Apparently part of the likely solution is to hook ZoL into udev so that as disks are discovered ZoL checks to see if a pool now has a full set of devices and can be brought up.)
- ZoL lacks a number of standard Linux filesystem features, including O_DIRECT, asynchronous IO, and POSIX ACLs. It also lacks support for issuing TRIM commands to drives (this is apparently only present in the FreeBSD version of ZFS so far).
- There is no 'event daemon' to handle events like disks going away.
The most significant result of this is that ZFS pool spares do not
get activated on disk failure (making them basically pointless).
- ZFS's use of kernel memory is not well integrated with the Linux kernel memory system, resulting in runaway memory usage in some situations. Apparently metadata-intensive workloads (such as rsync runs) are especially prone to this.
The last issue deserves more discussion; what follows is what I've gathered from the mailing list and from looking at the ZFS on Linux source code.
To start with, ZFS on Linux is not really ZFS ported to Linux; instead it's mostly the Illumos ZFS code dropped on top of a layer of code to translate and emulate the Solaris kernel APIs that ZFS needs (the SPL, short for 'Solaris Porting Layer'). This includes a great deal of kernel memory handling. The unfortunate result of this is a series of mismatches between what ZFS thinks is going on with kernel memory and what is actually going on, due to the translation and emulation that is required. Through fragmentation that's invisible to ZFS and other issues, ZFS can wind up using a lot more memory for things like the ARC than it is supposed to (because ZFS thinks it's using a lot less memory than it actually is).
(I suspect that ZFS itself still has some degree of the ZFS level fragmentation problems we've seen but that's much less dangerous because it just leaves the ARC smaller than it should be. The ZoL problem is that the ARC and related things can eat all of your RAM and make your kernel explode.)
Whether this happens to you (and how much it affects you) is unpredictable because it depends very much on the details of how your system uses memory. As mentioned, people seem to have problems with metadata heavy workloads but not everyone reporting problems on the ZoL mailing lists is in this situation.
PS: if you are coming here from Internet searches, please pay attention to the date of this entry. I certainly hope that all of these issues will get dealt with over time.
A little bit more on ZFS RAIDZ read performance
Back in this entry I talked about how all levels of ZFS RAIDZ had an unexpected read performance hit: they can't read less than a full stripe, so instead of the IOPS of N disks you get the IOPS of one disk. Well, it was recently pointed out to me that this is not quite correct. It is true that ZFS reads all of the stripe of a data block on reads; however, ZFS does not read the parity chunks (unless the block does not checksum correctly and needs to be repaired).
In normal RAIDZ pools the difference between 'all disks' and 'all disks except the parity disks' is small. If the parity for the stripes you're reading bits of is evenly spread over all of the disks, you might get somewhat more than one disk's IOPS in aggregate. Where this can matter is in very small RAIDZ pools, for example a four-disk RAIDZ2 pool. Here half your drives are parity drives for any particular data block and you may get something more like two disks of IOPS.
(A four-disk RAIDZ2 vdev is actually an interesting thing and potentially useful; it's basically a more resilient but potentially slower version of a two-vdev set of mirrors. You lose half of your disk space, as with mirroring, but you can withstand the failure of any two disks (unlike mirroring).)
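The reasoning above can be put into simple arithmetic. This is a rough back-of-the-envelope model (my own sketch, not anything from the ZFS code): every read busies all of the data disks of a stripe, parity chunks are skipped, and parity rotates evenly across the disks, so the leftover parity capacity is what lifts you above one disk's IOPS.

```python
def raidz_read_iops(disks, parity, per_disk_iops=100):
    """Rough estimate of aggregate random-read IOPS for one RAIDZ vdev.

    Each full-stripe read busies the (disks - parity) disks holding
    that block's data chunks; the parity chunks are not read.  Total
    capacity is disks * per_disk_iops disk-operations per second, and
    each read consumes (disks - parity) of them.
    """
    return disks * per_disk_iops / (disks - parity)

# A wide vdev barely beats a single disk:
print(raidz_read_iops(10, 2))  # 125.0 -- a bit over one disk's 100 IOPS

# A four-disk RAIDZ2 gets roughly two disks of IOPS:
print(raidz_read_iops(4, 2))   # 200.0
```

This is an idealized best case; real pools won't hit it exactly, but it shows why wide RAIDZ vdevs look like one disk for random reads while a four-disk RAIDZ2 looks more like two.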
To add some more RAIDZ parity trivia: RAIDZ parity is read and verified during scrubs (and thus likely resilvers), which is what you want. Data block checksums are verified as well, of course, which means that reads during scrubs genuinely busy all drives.
Sidebar: small write blocks and read IOPS
Another way that you can theoretically get more than one disk's IOPS from a RAIDZ vdev is if the data was written in sufficiently small blocks. As I mentioned in passing here, ZFS doesn't have a fixed 'stripe size', and a small write will only put data (and parity) on fewer than N disks. In turn, reading back this data will need fewer than the full set of data disks, meaning that with good luck you can read another small block from the other drives at the same time.
Since 'one sector' is the minimum amount of data to put on a single drive, this is probably much more likely now in the days of disks with 4096-byte sectors than it was on 512-byte sector drives. If you have a ten-disk RAIDZ2 on 4k disks, for example, it now takes a 32 KB data block to wind up on all 8 possible data drives.
(On 512-byte sector disks it would have only needed a 4KB data block.)
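The sector-size arithmetic here is just one sector per data disk; a small illustrative calculation (my own sketch, not ZFS's actual allocation logic, which also has to place parity and pad things out):

```python
def min_full_stripe_block(disks, parity, sector_size):
    """Smallest data block that puts at least one sector of data on
    every data disk of a RAIDZ vdev: one sector per data disk."""
    return (disks - parity) * sector_size

# Ten-disk RAIDZ2: 8 data disks.
print(min_full_stripe_block(10, 2, 4096))  # 32768 -- 32 KB on 4K-sector disks
print(min_full_stripe_block(10, 2, 512))   # 4096  -- 4 KB on 512-byte-sector disks
```

Anything smaller than this lands on fewer disks, which is exactly the situation where you can get concurrent small reads out of a single RAIDZ vdev.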