2023-04-17
ARC memory reclaim statistics exposed by ZFS on Linux (as of ZoL 2.1)
Yesterday I talked about some important ARC memory stats, to go with stats on how big the ARC is. The ARC doesn't just get big and have views on memory; it also has information about when it shrinks and somewhat about why. Most of these are exposed as event counters in /proc/spl/kstat/zfs/arcstats, with arc_need_free as an exception (it counts how many bytes ZFS thinks it currently wants to shrink the ARC by).
The Linux kernel's memory management has 'shrinkers', which are
callbacks into specific subsystems (like ZFS) that the memory
management invokes to reduce memory usage. These shrinkers operate
in two steps; first the kernel asks the subsystem how much memory
it could possibly return, and then it asks the subsystem to do it.
The basic amount of memory that the ARC can readily return to the
system is the sum of mru_evictable_data, mru_evictable_metadata,
mfu_evictable_data, and mfu_evictable_metadata (the actual answer
is more complicated, see arc_evictable_memory() in arc_os.c).
Normally this is limited by zfs_arc_shrinker_limit,
so any single invocation will only ask the ARC to drop at most 160
MBytes.
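As an illustration, here is a little Python sketch of mine that adds
up those four kstats; the read_arcstats() helper is my own, it assumes
the usual three column 'name type value' layout of arcstats, and it is
not the real arc_evictable_memory() calculation.

    # A minimal sketch: sum the four 'readily evictable' kstats from
    # arcstats. The real answer comes from arc_evictable_memory() in
    # arc_os.c and is more involved than this.
    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                # Data lines look like 'name  type  value'; the kstat
                # header lines don't match that shape and are skipped.
                if len(fields) == 3 and fields[2].isdigit():
                    stats[fields[0]] = int(fields[2])
        return stats

    st = read_arcstats()
    evictable = (st["mru_evictable_data"] + st["mru_evictable_metadata"] +
                 st["mfu_evictable_data"] + st["mfu_evictable_metadata"])
    print("roughly evictable ARC memory: %d MiB" % (evictable // (1024 * 1024)))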
Every time shrinking happens, the ARC target size is reduced by
however much the kernel asked ZFS to shrink, arc_no_grow is set
to true, and either memory_indirect_count
or memory_direct_count
is increased. If the shrinking is being done by Linux's kswapd, it
is an indirect count; if the shrinking is coming from a process
trying to allocate memory, finding it short, and directly triggering
memory reclaiming (a 'direct reclaim'), it is a direct count. Direct
reclaims are considered worse than indirect reclaims, because they
indicate that kswapd wasn't able to keep up with the memory demand
and the processes allocating memory were forced to stop and reclaim
memory themselves, throttling them.
(I believe the kernel may ask ZFS to drop less memory than ZFS reported it could potentially drop.)
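If you're curious which kind of reclaim your system is seeing, one
crude approach (my own sketch, not anything ZFS provides) is to sample
the two counters a minute apart and look at the deltas:

    # A sketch: watch how often the ARC shrinker is invoked from kswapd
    # (indirect) versus from direct reclaim over one minute.
    import time

    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        for line in open(path):
            fields = line.split()
            if len(fields) == 3 and fields[2].isdigit():
                stats[fields[0]] = int(fields[2])
        return stats

    before = read_arcstats()
    time.sleep(60)
    after = read_arcstats()
    for name in ("memory_indirect_count", "memory_direct_count"):
        print("%s: +%d over 60 seconds" % (name, after[name] - before[name]))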
The ARC has limits on how much metadata it will hold, both general metadata, arc_meta_limit versus arc_meta_used, and for dnodes specifically, arc_dnode_limit versus dnode_size. When the ARC shrinks metadata, it may need to 'prune' itself by having filesystems release dnodes and other things they're currently holding on to. When this triggers, arc_prune will count up by some amount; I believe this will generally be one per currently mounted filesystem (see arc_prune_async() in arc_os.c).
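A rough way to see how close you are to these limits is to compare the
usage kstats against their limit kstats and glance at arc_prune; this
is again my own sketch, not a ZFS tool:

    # A sketch: how close ARC metadata and dnodes are to their limits,
    # plus how often pruning has been triggered so far.
    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        for line in open(path):
            fields = line.split()
            if len(fields) == 3 and fields[2].isdigit():
                stats[fields[0]] = int(fields[2])
        return stats

    st = read_arcstats()
    for used, limit in (("arc_meta_used", "arc_meta_limit"),
                        ("dnode_size", "arc_dnode_limit")):
        if st[limit]:
            print("%s is %.1f%% of %s" % (used, 100.0 * st[used] / st[limit], limit))
    print("arc_prune has fired %d times so far" % st["arc_prune"])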
When the ARC is evicting data, it can increase two statistics,
evict_skip and evict_not_enough. The latter is the number of times
ARC eviction wasn't able to evict enough to reach its target amount.
For the former, let's quote arc_impl.h:
Number of buffers skipped because they have I/O in progress, are indirect prefetch buffers that have not lived long enough, or are not from the spa we're trying to evict from.
ZFS can be asked to evict a certain amount or all things of a particular
class that are evictable, such as MRU metadata. Only the former case can
cause evict_not_enough to count up.
In addition to regular data, the ARC can store 'anonymous' data. I'll quote arc_impl.h again:
Anonymous buffers are buffers that are not associated with a DVA. These are buffers that hold dirty block copies before they are written to stable storage. By definition, they are "ref'd" and are considered part of arc_mru that cannot be freed. Generally, they will acquire a DVA as they are written and migrate onto the arc_mru list.
The size of these buffers is reported in the anon_size kstat. Although
there are anon_evictable_data and anon_evictable_metadata stats, I
believe they're always zero because anonymous dirty buffers probably
aren't evictable. Some of the space counted here may be 'loaned out'
and shows up in arc_loaned_bytes.
As part of setting up writes, ZFS will temporarily reserve ARC space
for them; the current reservation is reported in arc_tempreserve.
Based on the code, the total amount of dirty data in the ARC for
dirty space limits and space accounting is arc_tempreserve plus
anon_size, minus arc_loaned_bytes.
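Spelled out as a sketch (this is my reading of the code, so treat the
exact accounting with some caution):

    # A sketch of the dirty data accounting described above.
    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        for line in open(path):
            fields = line.split()
            if len(fields) == 3 and fields[2].isdigit():
                stats[fields[0]] = int(fields[2])
        return stats

    st = read_arcstats()
    dirty = st["arc_tempreserve"] + st["anon_size"] - st["arc_loaned_bytes"]
    print("ARC dirty data (by this accounting): %d MiB" % (dirty // (1024 * 1024)))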
Under some situations that aren't clear to me, ZFS may feel it needs
to throttle new memory allocations for writes. When this happens,
memory_throttle_count will increase by one. This seems to be rare,
as it's generally zero on our systems.