My ZFS On Linux memory problem: competition from the page cache
When I moved my data to ZFS on Linux on my office workstation, I didn't move the entire system to ZoL for various reasons. My old setup had my data on ext4 in LVM on a software RAID mirror, with the root filesystem and swap in separate software RAID mirrors outside of LVM. When I moved to ZoL, I converted the ext4 in LVM on MD portion to a ZFS pool with the various data filesystems (my home directory, virtual machine images, and so on), but I left the root filesystem (and swap) alone. The net effect is that I was left with a relatively small ext4 root filesystem and a relatively large ZFS pool that had all of my real data.
ZFS on Linux does its best to integrate into the Linux kernel memory management system, and these days that seems to be pretty good. But it doesn't take part in the kernel's generic filesystem page cache; instead it has its own system for this, called ARC. In effect, in a system with both conventional filesystems (such as my ext4 root filesystem) and ZFS filesystems, you have two page caches, one used by your conventional filesystems and the separate one used by your ZFS filesystems. What I found on my machine was that the overall system was bad at balancing memory usage between these two. In particular, the ZFS ARC didn't seem to compete strongly enough with the kernel page cache.
If everything was going well, what I'd have expected was for there to be relatively little kernel page cache and a relatively large ARC, because ext4 (the page cache user) held much less of my actively used data than ZFS did. Certainly I ran some things from the root filesystem (such as compilers), but I rather thought not necessarily all that much compared to what I was doing from my ZFS filesystems. In real life, things seemed to go the other way; I would wind up with a relatively large page cache and a quite tiny ARC that was caching relatively little data. As far as I could tell, over time ext4 was simply out-competing ZFS for memory despite having much less actual filesystem data.
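One rough way to watch this balance is to compare the kernel's reported page cache size with the ARC's reported size. What follows is a minimal sketch, assuming the usual /proc/meminfo and the /proc/spl/kstat/zfs/arcstats file that ZFS on Linux provides; the function names are made up for illustration. Note that the 'Cached' figure also counts tmpfs and shared memory, so it somewhat overstates what ext4 itself is using.

```shell
#!/bin/sh
# Sketch: report page cache and ZFS ARC sizes in MiB.
# The function names here are illustrative, not from any standard tool.

# Page cache size in KiB from a meminfo-style file ("Cached:  123 kB").
pagecache_kib() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# ARC size in bytes from an arcstats-style file ("size  4  123456").
arc_bytes() {
    awk '$1 == "size" { print $3 }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Print both figures on one line; arguments are optional file overrides.
report() {
    pc=$(pagecache_kib "$1")
    arc=$(arc_bytes "$2")
    echo "page cache: $((pc / 1024)) MiB, ARC: $((arc / 1024 / 1024)) MiB"
}

# Only report when both files exist (that is, on a machine with ZoL loaded).
if [ -r /proc/meminfo ] && [ -r /proc/spl/kstat/zfs/arcstats ]; then
    report
fi
```

Running this every so often (or under 'watch') should show the sort of drift described here, with the page cache figure growing over a day or two while the ARC figure shrinks.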
I assume that this is due to the page cache and the ZFS ARC being separate from each other, so that there just isn't any way of having some sort of global view of disk buffer usage that would let the ZFS ARC push directly on the page cache (and vice versa if necessary). As far as I know there's no way to limit page cache usage or push more aggressively only on the page cache in a way that won't hit ZFS at least as hard. So a mixed system with ZFS and something else just gets to live with this.
(The crude fix is to periodically empty the page cache with 'echo 1 >/proc/sys/vm/drop_caches'. Note that you don't want to use '2' or '3' here, because that will also drop ZFS ARC caches and other ZFS memory.)
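If you want to do this periodically rather than by hand, the command can be wrapped in a small helper. This is only a sketch; the function name is made up, and the optional path argument exists purely so that it can be tried against a scratch file without root.

```shell
#!/bin/sh
# Sketch: drop only the page cache. The function name and the optional
# path argument are illustrative assumptions, not an existing tool.

drop_pagecache() {
    target="${1:-/proc/sys/vm/drop_caches}"
    # Write out dirty pages first; drop_caches only discards clean ones.
    sync
    # 1 = page cache only. 2 and 3 also reclaim slab objects, which
    # (per the note above) hits ZFS ARC memory too.
    echo 1 > "$target"
}
```

Calling 'drop_pagecache' with no arguments as root, for example from root's crontab, does the real thing. This remains a crude fix; it throws away useful page cache contents along with the useless ones.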
The buildup of page cache is not immediate when the system is in use. Instead it seems to come over time, possibly as things like system backups run and temporarily pull large amounts of the root filesystem into cache. I believe it generally took a day or two for page cache usage to grow and start strangling the ARC after I explicitly dropped the caches, and in the meantime I could do relatively root filesystem intensive things like compiling Firefox from source without making the page cache hulk up.
(The Firefox source code itself was in a ZFS filesystem, but the C++ compiler and system headers and system libraries and so on are all in the root filesystem.)
Moving from 16 GB of RAM to 32 GB of RAM hasn't eliminated this problem for me as such, but what it did do was allow the ZFS ARC to use enough memory despite this that it's reasonably effective anyways. With 16 GB I could see ext4 using 4 GB of page cache or more while the ZFS ARC was squeezed to 4 GB or less and suffering for it. With 32 GB the page cache may wind up at 10 GB, but this still leaves room for the ZFS ARC itself to be 9 GB with a reasonable amount of data cached.
(And 32 GB is more useful than 16 GB anyways these days, what with OSes in virtual machines that want 2 GB or 4 GB of RAM to run reasonably happily and so on. And I'm crazy enough to spin up multiple virtual machines at once for testing certain sorts of environments.)
Presumably this is completely solved if you have no non-ZFS filesystems on the machine, but that requires a ZFS root filesystem and that's currently far too alarming for me to even consider.
Written on 27 December 2014.