Where your memory can be going with ZFS on Linux

October 10, 2014

If you're running ZFS on Linux, its memory use is probably at least a concern. At a high level, there are at least three different places where your RAM may be used or held down by ZoL.

First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer cache. A full discussion of what is included in the ARC and how you measure it and so on is well beyond the scope of this entry, but the short summary is that the ARC includes data from disk, metadata from disk, and several sorts of bookkeeping data. ZoL reports information about it in /proc/spl/kstat/zfs/arcstats, which is exactly the standard ZFS ARC kstats. What ZFS considers to be the total current (RAM) size of the ARC is size. ZFS on Linux normally limits the maximum ARC size to roughly half of memory (this is c_max).

(Some sources will tell you that the ARC size in kstats is c. This is wrong. c is the target size; it's often but not always the same as the actual size.)
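
If you want to pull these numbers out programmatically, arcstats is easy to parse; after its two header lines, every line is a 'name type data' triple. Here is a quick Python sketch that reports size, c, and c_max in GB:

    # a minimal sketch: report the ARC's actual size, target size, and
    # maximum allowed size from the ZoL ARC kstats.
    wanted = ("size", "c", "c_max")
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        for line in f.readlines()[2:]:          # skip the two header lines
            fields = line.split()
            if fields and fields[0] in wanted:
                print("%-6s %6.2f GB" % (fields[0], int(fields[2]) / 1e9))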

Next, RAM can be in slab allocated ZFS objects and data structures that are not counted as part of the ARC for one reason or another. It used to be that ZoL handled all slab allocation itself and so all ZFS slab things were listed in /proc/spl/kmem/slab, but the current ZoL development version now lets the native kernel slab allocator handle most slabs for objects that aren't bigger than spl_kmem_cache_slab_limit bytes, which defaults to 16K. Such native kernel slabs are theoretically listed in /proc/slabinfo but are unfortunately normally subject to SLUB slab merging, which often means that they get merged with other slabs and you can't actually see how much memory they're using.
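
You can at least detect when a particular cache has vanished into SLUB slab merging, because on a SLUB kernel merged caches show up as symlinks under /sys/kernel/slab instead of real directories. A small Python sketch (you supply the cache names to check; I'm not assuming any particular ZoL cache name here):

    # a sketch: report whether given slab caches have been merged away by
    # SLUB.  Merged (aliased) caches appear as symlinks in /sys/kernel/slab
    # that point at the cache they were merged into.
    import os, sys
    for name in sys.argv[1:]:
        path = os.path.join("/sys/kernel/slab", name)
        if not os.path.exists(path):
            print("%s: no such slab cache" % name)
        elif os.path.islink(path):
            target = os.path.basename(os.readlink(path))
            print("%s: merged into %s" % (name, target))
        else:
            print("%s: not merged" % name)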

As for slab objects that aren't in the ARC, I believe that zfs_znode_cache slab objects (which are znode_ts) are not reflected in the ARC size. On some machines active znode_t objects may be a not insignificant amount of memory. I don't know this for sure, though, and I'm partly reasoning from behavior we saw on Solaris.

Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4KB page of memory and often bigger in ZoL), so if only a few objects in a chunk are in use the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (ie, not any that have vanished into SLUB slab merging).
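
Since the per-cache numbers for the SPL-managed caches are all in /proc/spl/kmem/slab, you can get a rough overall wastage figure by comparing total slab bytes against allocated object bytes. Here is a hedged sketch; the column layout of this file has changed between SPL versions, and I'm assuming that the third and fourth columns are a cache's total size and allocated bytes, so check the header on your version:

    # a rough sketch: total up slab space versus space in allocated objects
    # for the SPL-managed caches.  Column positions are an assumption (see
    # above); adjust the indexes to match your /proc/spl/kmem/slab header.
    total = alloc = 0
    with open("/proc/spl/kmem/slab") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[2].isdigit() and fields[3].isdigit():
                total += int(fields[2])
                alloc += int(fields[3])
    print("%.2f GB total slab space, %.2f GB in objects, %.2f GB unused"
          % (total / 1e9, alloc / 1e9, (total - alloc) / 1e9))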

(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)

All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:

  • the ARC is 5.1 GB, which is decent. Most of that is not actual file data, though; file data is reported as 0.27 GB, then there's 1.87 GB of ZFS metadata from disk and a bunch of other stuff.

  • 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is reported in /proc/spl/kmem/slab; the remainder is in native Linux slabs in /proc/slabinfo. The znode_t slab is most of the SPL slab report, at 2 GB used.

    (This machine is using a hack to avoid the SLUB slab merging for native kernel ZoL slabs, because I wanted to look at memory usage in detail.)

  • 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that a few hundred MB of space is wasted at the moment.

If znode_t objects are not in the ARC, the ARC and active znode_t objects account for almost all of the slab space between the two of them; 7.1 GB out of 7.55 GB.

I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.

Sidebar: Accurately tracking ZoL slab memory usage

To accurately track ZoL memory usage you must defeat SLUB slab merging somehow. You can turn it off entirely with the slub_nomerge kernel parameter or hack the spl ZoL kernel module to defeat it (see the sidebar here).

Because you can set spl_kmem_cache_slab_limit as a module parameter for the spl ZoL kernel module, I believe that you can set it to zero to avoid having any ZoL slabs be native kernel slabs. This avoids SLUB slab merging entirely and also makes it so that all ZoL slabs appear in /proc/spl/kmem/slab. It may be somewhat less efficient.
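As a sketch of what this looks like in practice (I haven't carefully verified either), slub_nomerge goes on the kernel command line at boot time, while the spl module option can be set in a modprobe.d snippet so it takes effect when the module loads:

    # on the kernel command line, via your bootloader configuration:
    slub_nomerge

    # or, in something like /etc/modprobe.d/spl.conf:
    options spl spl_kmem_cache_slab_limit=0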
