Where your memory can be going with ZFS on Linux
If you're running ZFS on Linux (ZoL), its memory use is probably at least a concern to you. At a high level, there are at least three different places where ZoL may be using or holding down your RAM.
First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer cache. A full discussion of what is included in the ARC and how you measure it and so on is well beyond the scope of this entry, but the short summary is that the ARC includes data from disk, metadata from disk, and several sorts of bookkeeping data. ZoL reports information about it in /proc/spl/kstat/zfs/arcstats, which is exactly the standard ZFS ARC kstats. What ZFS considers to be the total current (RAM) size of the ARC is the size kstat. ZFS on Linux normally limits the maximum ARC size to roughly half of memory (this is the zfs_arc_max module parameter).
(Some sources will tell you that the ARC size in kstats is c. This is wrong. c is the target size; it's often but not always the same as the actual size.)
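To make the size-versus-c distinction concrete, here's a minimal sketch that parses text in the arcstats kstat format and pulls out both values. The SAMPLE data is made up for illustration; on a real system you'd read /proc/spl/kstat/zfs/arcstats instead.

```python
# Sketch: parse the kstat format used by /proc/spl/kstat/zfs/arcstats.
# The sample below is invented for illustration, not from a real machine.
SAMPLE = """\
13 1 0x01 96 4608 4294967296 1234567
name                            type data
hits                            4    123456
size                            4    5476083302
c                               4    8589934592
c_max                           4    8589934592
"""

def parse_arcstats(text):
    """Return a dict of kstat name -> integer value."""
    stats = {}
    # Skip the kstat header line and the column-heading line.
    for line in text.splitlines()[2:]:
        fields = line.split()
        if len(fields) == 3:
            name, _ktype, data = fields
            stats[name] = int(data)
    return stats

stats = parse_arcstats(SAMPLE)
gib = 1024 ** 3
print("ARC actual size: %.2f GiB" % (stats["size"] / gib))
print("ARC target size: %.2f GiB" % (stats["c"] / gib))
```

In the sample, size and c differ, which is exactly the situation where reading c as "the ARC size" misleads you.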
Next, RAM can be in slab allocated ZFS objects and data structures that are not counted as part of the ARC for one reason or another. It used to be that ZoL handled all slab allocation itself and so all ZFS slab things were listed in /proc/spl/kmem/slab, but the current ZoL development version now lets the native kernel slab allocator handle most slabs for objects that aren't bigger than spl_kmem_cache_slab_limit bytes, which is normally 16K by default. Such native kernel slabs are theoretically listed in /proc/slabinfo but are unfortunately normally subject to SLUB slab merging, which often means that they get merged with other slabs and you can't actually see how much memory they're using.
As far as slab objects that aren't in the ARC go, I believe that zfs_znode_cache slab objects (which are znode_ts) are not reflected in the ARC size. On some machines active znode_t objects may be a not insignificant amount of memory. I don't know this for sure, though, and I'm somewhat reasoning from behavior we saw on Solaris.
Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4 KB page of memory, and often bigger in ZoL), so if only a few objects in a chunk are in use, the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (i.e., not any that have vanished into SLUB slab merging).
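For slabs that do show up accurately in /proc/slabinfo, you can estimate this wastage along the following lines. This is a sketch that assumes the slabinfo version 2.1 column layout and 4 KB pages; the sample cache line is invented for illustration (zfs_znode_cache is a real ZoL cache name, but the numbers are made up).

```python
# Sketch: estimate active versus allocated slab memory from text in
# the /proc/slabinfo (version 2.1) format. Assumes 4 KB pages.
PAGE = 4096

SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
zfs_znode_cache   50000 80000 1104 29 8 : tunables 0 0 0 : slabdata 2759 2759 0
"""

def slab_usage(text):
    """Yield (name, active_bytes, total_bytes) for each slab cache."""
    for line in text.splitlines():
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue
        f = line.split()
        name = f[0]
        active_objs, objsize = int(f[1]), int(f[3])
        pagesperslab = int(f[5])
        num_slabs = int(f[14])        # <num_slabs> in the slabdata section
        active_bytes = active_objs * objsize
        total_bytes = num_slabs * pagesperslab * PAGE
        yield name, active_bytes, total_bytes

for name, active, total in slab_usage(SAMPLE):
    print("%s: %.1f MiB active of %.1f MiB allocated (%.0f%% used)"
          % (name, active / 2**20, total / 2**20, 100.0 * active / total))
```

The gap between active and allocated bytes is the (over)estimate of what's trapped in partially used slab chunks; object padding within a slab makes the true figure a bit fuzzier.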
(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)
All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:
- the ARC is 5.1 GB, which is decent. Most of that is not actual file
data, though; file data is reported as 0.27 GB, then there's 1.87
GB of ZFS metadata from disk and a bunch of other stuff.
- 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is in SPL slabs listed in /proc/spl/kmem/slab; the remainder is in native Linux slabs in /proc/slabinfo. The znode_t slab is most of the SPL slab report, at 2 GB used.
(This machine is using a hack to avoid the SLUB slab merging for native kernel ZoL slabs, because I wanted to look at memory usage in detail.)
- 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that there is a few hundred MB of space wasted at the moment.
- Assuming znode_t objects are not in the ARC, the ARC and active znode_t objects account for almost all of the slab space between the two of them: 7.1 GB out of 7.55 GB.
I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.
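For what it's worth, the arithmetic behind these figures can be spelled out explicitly (all numbers are the GB values reported above):

```python
# The reported figures from this machine, in GB.
arc_size       = 5.1    # total ARC size
active_slab    = 7.55   # RAM in active slab objects
allocated_slab = 7.81   # RAM allocated to ZoL slabs in total
znode_t_active = 2.0    # active znode_t objects (believed not in the ARC)

# Slack: allocated to slabs but not in active objects.
print("slab slack: %.2f GB" % (allocated_slab - active_slab))
# ARC plus znode_t covers almost all of the active slab space.
print("ARC + znode_t: %.1f GB of %.2f GB" % (arc_size + znode_t_active,
                                             active_slab))
```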
Sidebar: Accurately tracking ZoL slab memory usage
To accurately track ZoL memory usage you must defeat SLUB slab merging somehow. You can turn it off entirely with the slub_nomerge kernel parameter, or hack the spl ZoL kernel module to defeat it (see the sidebar here).
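As a sketch of the global approach (assuming a grub-based boot setup; paths and commands vary by distribution):

```shell
# Disable SLUB slab merging for all slab caches, not just ZoL's, by
# booting with the 'slub_nomerge' kernel parameter. With grub, add it
# to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... slub_nomerge"
# then regenerate the grub configuration and reboot. Afterwards you
# can verify that the parameter took effect with:
#   cat /proc/cmdline
```

Note that this affects every slab cache on the system, which can cost some memory overall; that's the price of accurate per-cache numbers.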
Because you can set
spl_kmem_cache_slab_limit as a module
parameter for the
spl ZoL kernel module, I believe that you can
set it to zero to avoid having any ZoL slabs be native kernel slabs.
This avoids SLUB slab merging entirely and also makes it so that
all ZoL slabs appear in
/proc/spl/kmem/slab. It may be somewhat