Things I have learned about ZFS (and a Linux iSCSI target)

May 13, 2008

I've been testing ZFS over iSCSI storage as an NFS server recently, which has caused me to discover a number of interesting things. In the order that I discovered them:

  • each ZFS pool has a cache with a minimum size that it won't shrink below, no matter what the memory pressure is; this is apparently 10 MB by default. If you have 2 GB of memory on a Solaris 10U4 x86 machine and 132 separate pools, once all of their caches fill to this level there is not enough memory left for the machine to keep running regular programs.

    (Because the caches start out empty and thus much smaller, the machine will boot and seem healthy until you do enough IO to enough pools. This can be mysterious, especially if your IO load is running 'zfs scrub' on the pools one after another.)

    Our workaround was to add more memory to the Solaris machine; it seems happy at 4GB.
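    The arithmetic behind that first point works out as follows (a back-of-the-envelope sketch using our observed numbers; the 10 MB per-pool floor is what we saw, not an official figure):

    ```python
    # Back-of-the-envelope memory math for the per-pool cache floor.
    # The 10 MB minimum is what we observed, not a documented number.
    pools = 132
    min_cache_mb = 10              # per-pool cache floor, apparently the default
    total_pinned_mb = pools * min_cache_mb
    print(total_pinned_mb)         # 1320 MB pinned once every pool's cache fills
    # Out of 2048 MB total, that leaves only ~728 MB for the kernel, the
    # rest of ZFS, and all user programs -- not enough, as we found out.
    ```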

  • a Linux iSCSI target machine needs more than 512 MB of memory to support simultaneous IO against 132 LUNs. If you have only 512 MB and throw enough IO against the machine, the kernel will get hung in an endless OOM loop.

    (Your mileage may differ depending on which iSCSI target implementation you're using.)

  • despite being released in 2007, Solaris 10 U4 still defaults to running only 16 NFS server threads.
  • however, increasing this to 1024 threads (the commonly advised starting point) and then trying to do simultaneous IO against 132 ZFS pools from an NFS client will cause your (now 4 GB) server to bog down into complete unusability. (At one point I saw a load average of 4000.)

    This appears to happen because NFS server threads are very high priority threads on Solaris, so if you have too many of them they can eat all of your server for breakfast. 1024 is definitely too many; 512 may yet prove to be too many, but has survived so far.
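    For reference, the server thread count on Solaris 10 is set in /etc/default/nfs; this is a sketch of what we're running with now (verify the parameter name against your own release):

    ```shell
    # /etc/default/nfs (Solaris 10) -- sketch only.
    # The shipped default is a mere 16 threads; 1024 buried our server,
    # so we are trying 512.
    NFSD_SERVERS=512
    ```

    After changing this, the NFS server has to be restarted (on Solaris 10, 'svcadm restart network/nfs/server') before the new thread count takes effect.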

  • ZFS has really aggressive file-level prefetching, even when used as an NFS server and even when the system is under enough pressure that most of it gets thrown away. For example, if you have 132 streams of sequential read IO, Solaris can wind up wasting 90% to 95% of the IO it does.

    (It is easiest to see this if you have a Linux iSCSI target and a Linux NFS client, because then you can just measure the network bandwidth usage of both. At 132 streams, the iSCSI target was transmitting at 118 MBytes/sec but the NFS client was receiving only 6 MBytes/sec.)

    The workaround for this is to turn off ZFS file prefetching (following the directions from the ZFS Evil Tuning Guide). Unfortunately this costs you noticeable performance on single-stream sequential IO.

    It is possible that feeding the server yet more memory would help with this, but going beyond 4 GB of memory for the hardware we're planning to use as our NFS servers will be significantly expensive (we'd have to move to 2 GB DIMMs, which are still pricey).
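    For completeness, the prefetch workaround mentioned above boils down to a single /etc/system tunable (per the ZFS Evil Tuning Guide; as always with Evil Tuning, double-check it against your Solaris release before relying on it):

    ```shell
    # /etc/system -- turn off ZFS file-level prefetching.
    # Takes effect on the next reboot.
    set zfs:zfs_prefetch_disable = 1
    ```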

Given that we have a general NFS environment, I suspect that we are going to have to accept that tradeoff; better a system that's slower than it could be when it's under low load than a system that totally goes off the cliff when it's under high load.

Comments on this page:

From James C. McPherson at 2008-05-16 01:32:52:

Just wondering why you are using 132 separate zpools?

James C. McPherson (I work for Sun)

By cks at 2008-05-17 00:08:37:

I'm testing with 132 pools now because it's a convenient large number (without tricks we get a maximum of 11 partitions/LUNs per physical disk on the iSCSI target, and we have 12 physical disks in the target). The reason we're looking at lots of pools in general is sufficiently long that I put it in an entry, WhyManyZFSPools.


