Things I have learned about ZFS (and a Linux iSCSI target)
I've been testing ZFS over iSCSI storage as an NFS server recently, which has caused me to discover a number of interesting things. In the order that I discovered them:
- each ZFS pool has a cache with a minimum size that it won't shrink
below, no matter what the memory pressure is; this is apparently
10 MB by default. If you have 2 GB of memory on a Solaris 10 U4
x86 machine and 132 separate pools, there is not enough memory
left after all their caches fill to this level to let the machine
keep running regular programs.
(Because the caches start out empty and thus much smaller, the machine will boot and seem healthy until you do enough IO to enough pools. This can be mysterious, especially if your IO load is to zpool scrub them one after another.)
Our workaround was to add more memory to the Solaris machine; it seems happy at 4 GB.
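The arithmetic behind this is simple enough to check with a quick sketch. This assumes the per-pool floor really is 10 MB (it is only 'apparently' so) and ignores everything else the kernel needs memory for, so treat the 'left over' numbers as optimistic upper bounds; the function names are made up for illustration.

```python
# Rough estimate of how much memory the per-pool cache minimums pin down.
# Assumption: every pool's cache has filled to a 10 MB floor.
MB = 1024 * 1024
GB = 1024 * MB


def cache_floor_bytes(pools, min_cache_mb=10):
    """Memory held by the caches once every pool has reached its minimum."""
    return pools * min_cache_mb * MB


def memory_left_bytes(total_gb, pools, min_cache_mb=10):
    """Memory remaining for the kernel and regular programs."""
    return total_gb * GB - cache_floor_bytes(pools, min_cache_mb)


if __name__ == "__main__":
    for total_gb in (2, 4):
        floor = cache_floor_bytes(132)
        left = memory_left_bytes(total_gb, 132)
        print(f"{total_gb} GB machine: {floor // MB} MB in cache floors, "
              f"{left // MB} MB left for everything else")
```

With 132 pools the floors alone come to 1320 MB, which is most of a 2 GB machine but a much more comfortable share of a 4 GB one.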
- a Linux iSCSI target machine needs more than 512 MB of memory to support
simultaneous IO against 132 LUNs. If you have only 512 MB and throw enough
IO against the machine, the kernel will get hung in an endless OOM loop.
(Your mileage may differ depending on which iSCSI target implementation you're using.)
- despite being released in 2007, Solaris 10 U4 still defaults to running only 16 NFS server threads.
- however, increasing this to 1024 threads (the commonly advised starting
point) and then trying to do simultaneous IO against 132 ZFS pools from
an NFS client will cause your now 4 GB server to bog down into complete
unusability. (At one point I saw a load average of 4000.)
This appears to happen because NFS server threads are very high priority threads on Solaris, so if you have too many of them they can eat all of your server for breakfast. 1024 is definitely too many; 512 may yet prove to be too many, but has survived so far.
- ZFS has really aggressive file-level prefetching, even when used as an
NFS server and even when the system is under enough pressure that
most of it gets thrown away. For example, if you have 132 streams
of sequential read IO, Solaris can wind up wasting 90% to 95% of
the IO it does.
(It is easiest to see this if you have a Linux iSCSI target and a Linux NFS client, because then you can just measure the network bandwidth usage of both. At 132 streams, the iSCSI target was transmitting at 118 MBytes/sec but the NFS client was receiving only 6 MBytes/sec.)
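For the curious, here is how those two bandwidth numbers turn into a waste figure. This is only a sketch: it assumes that essentially everything the iSCSI target transmits is read data headed for ZFS, and it ignores iSCSI and NFS protocol overhead. The function name is mine, not anything standard.

```python
# Estimate what fraction of backend read IO never reaches the NFS client.
def wasted_fraction(backend_mb_per_s, delivered_mb_per_s):
    """Fraction of data read from the backend that was not delivered."""
    if backend_mb_per_s <= 0:
        raise ValueError("backend bandwidth must be positive")
    return 1.0 - delivered_mb_per_s / backend_mb_per_s


if __name__ == "__main__":
    # Figures measured at 132 streams: iSCSI target out, NFS client in.
    waste = wasted_fraction(118, 6)
    print(f"roughly {waste:.0%} of the IO was wasted")
```

The result lands at the top of the 90% to 95% range.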
The workaround for this is to turn off ZFS file prefetching (following the directions from the ZFS Evil Tuning Guide). Unfortunately this costs you noticeable performance on single-stream sequential IO.
It is possible that feeding the server yet more memory would help with this, but going beyond 4 GB of memory for the hardware we're planning to use as our NFS servers will be significantly expensive (we'd have to move to 2GB DIMMs, which are still pricey).
Given that we have a general NFS environment, I suspect that we are going to have to accept that tradeoff; better a system that's slower than it could be when it's under low load than a system that totally goes off the cliff when it's under high load.