My ZFS On Linux memory problem: competition from the page cache

December 27, 2014

When I moved my data to ZFS on Linux on my office workstation, I didn't move the entire system to ZoL for various reasons. My old setup had my data on ext4 in LVM on a software RAID mirror, with the root filesystem and swap in separate software RAID mirrors outside of LVM. When I moved to ZoL, I converted the ext4 in LVM on MD portion to a ZFS pool with the various data filesystems (my home directory, virtual machine images, and so on), but I left the root filesystem (and swap) alone. The net effect is that I was left with a relatively small ext4 root filesystem and a relatively large ZFS pool that had all of my real data.

ZFS on Linux does its best to integrate into the Linux kernel memory management system, and these days that seems to be pretty good. But it doesn't take part in the kernel's generic filesystem page cache; instead it has its own caching system, the ARC (Adaptive Replacement Cache). In effect, on a system with both conventional filesystems (such as my ext4 root filesystem) and ZFS filesystems, you have two page caches: one used by your conventional filesystems and a separate one used by your ZFS filesystems. What I found on my machine was that the overall system was bad at balancing memory usage between these two. In particular, the ZFS ARC didn't seem to compete strongly enough with the kernel page cache.

If everything was going well, what I'd have expected was for there to be relatively little kernel page cache and a relatively large ARC, because ext4 (the page cache user) held much less of my actively used data than ZFS did. Certainly I ran some things from the root filesystem (such as compilers), but I rather thought not necessarily all that much compared to what I was doing from my ZFS filesystems. In real life, things seemed to go the other way; I would wind up with a relatively large page cache and a quite tiny ARC that was caching relatively little data. As far as I could tell, over time ext4 was simply out-competing ZFS for memory despite having much less actual filesystem data.
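(If you want to watch this tug of war yourself, both numbers are visible in /proc. Here is a minimal sketch; the arcstats kstat file only exists when the ZoL module is loaded, and its exact field layout can vary between ZoL versions:)

```shell
#!/bin/sh
# Page cache size, from the 'Cached' field of /proc/meminfo (in kB).
awk '/^Cached:/ {print "page cache: " $2 " kB"}' /proc/meminfo

# Current ARC size, from the ZFS kstats; the 'size' line reports bytes.
# This file is only present when the zfs kernel module is loaded.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    awk '$1 == "size" {printf "ARC size: %d kB\n", $3 / 1024}' \
        /proc/spl/kstat/zfs/arcstats
fi
```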

I assume that this is due to the page cache and the ZFS ARC being separate from each other, so there is no global view of disk buffer usage that would let the ZFS ARC push directly on the page cache (and vice versa if necessary). As far as I know there's no way to limit page cache usage, or to push more aggressively on just the page cache, in a way that won't hit ZFS at least as hard. So a mixed system with ZFS and something else just gets to live with this.

(The crude fix is to periodically empty the page cache with 'echo 1 >/proc/sys/vm/drop_caches'. Note that you don't want to use '2' (which drops reclaimable slab objects such as dentries and inodes) or '3' (which drops both) here, because those will also drop ZFS ARC caches and other ZFS memory.)

The build-up of page cache is not immediate when the system is in use. Instead it seems to come over time, possibly as things like system backups run and temporarily pull large amounts of the root filesystem into cache. I believe it generally took a day or two for page cache usage to grow and start strangling the ARC after I explicitly dropped the caches, and in the meantime I could do relatively root-filesystem-intensive things like compiling Firefox from source without making the page cache hulk up.

(The Firefox source code itself was in a ZFS filesystem, but the C++ compiler and system headers and system libraries and so on are all in the root filesystem.)

Moving from 16 GB of RAM to 32 GB of RAM hasn't eliminated this problem for me as such, but what it did do was allow the ZFS ARC to use enough memory despite this that it's reasonably effective anyways. With 16 GB I could see ext4 using 4 GB of page cache or more while the ZFS ARC was squeezed to 4 GB or less and suffering for it. With 32 GB the page cache may wind up at 10 GB, but this still leaves room for the ZFS ARC itself to be 9 GB with a reasonable amount of data cached.

(And 32 GB is more useful than 16 GB anyways these days, what with OSes in virtual machines that want 2 GB or 4 GB of RAM to run reasonably happily and so on. And I'm crazy enough to spin up multiple virtual machines at once for testing certain sorts of environments.)

Presumably this is completely solved if you have no non-ZFS filesystems on the machine, but that requires a ZFS root filesystem and that's currently far too alarming for me to even consider.

Comments on this page:

By Joe Rhodes at 2016-01-02 11:40:37:

I'm running a similar setup to yours and I noticed the same thing. I think I've found a better solution than dropping the page cache: in ZFS, you can set a minimum ARC size. (By default, it seems to be set at 32 MB.) It seems the ARC will push back harder against the page cache for whatever you define as the minimum.

I'm testing it right now. I have a backup running to an ext4 filesystem that was pushing nearly all of the ARC out of RAM, but when I set the minimum to 16 GB (I have a 64 GB system) the ARC size immediately began to grow up to 16 GB, even with the backup running. I have the maximum set to 32 GB, but it's not going over 16 with the backup running. If I drop the page cache as per your post, it will grow to 32 GB, but then drop back down to 16 GB fairly quickly as the page cache fills again.

The tuneable is at /sys/module/zfs/parameters/zfs_arc_min. Obviously, you'll need to do some calculations to make sure you don't overcommit your RAM in favour of ARC.
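(A minimal sketch of that calculation and of setting the floor. The tunable takes a value in bytes, writing it needs root, and the /etc/modprobe.d line is the usual way to make a ZFS module option persist across reboots; the 16 GiB figure here is just the example from above:)

```shell
#!/bin/sh
# zfs_arc_min takes a value in bytes; compute 16 GiB.
arc_min=$((16 * 1024 * 1024 * 1024))
echo "$arc_min"    # 17179869184

# At runtime (as root):
# echo "$arc_min" > /sys/module/zfs/parameters/zfs_arc_min

# Persistently, as a module option in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_min=17179869184
```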

By cks at 2016-01-06 15:58:21:

My experience is that more recent versions of ZFS on Linux have fixed most or all of the ARC size issues for me; I wrote about this here. I think I've seen a few ARC size collapses every now and then, but nothing like what used to happen to me when I wrote about it in this entry.

I am somewhat irrationally jumpy about setting a minimum ARC size, because years ago we had serious problems when we did this on Solaris. That was years ago and Linux kernel memory management is nothing like Solaris kernel memory management, but the scars remain.

Last modified: Sat Dec 27 02:25:07 2014