ARC memory reclaim statistics exposed by ZFS on Linux (as of ZoL 2.1)

April 17, 2023

Yesterday I talked about some important ARC memory stats, to go with stats on how big the ARC is. The ARC doesn't just get big and have views on memory; it also exposes information about when it shrinks and, to some extent, why. Most of these statistics are exposed as event counters in /proc/spl/kstat/zfs/arcstats, with arc_need_free as an exception (it counts how many bytes ZFS thinks it currently wants to shrink the ARC by).
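
If you want to poke at these statistics programmatically, arcstats is easy to parse by hand. Here's a minimal Python sketch (the read_arcstats() helper is my own invention, not anything official):

    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        # The first two lines are kstat header lines; the rest are
        # '<name> <type> <value>' triples.
        stats = {}
        with open(path) as f:
            for line in f.readlines()[2:]:
                name, _kind, value = line.split()
                stats[name] = int(value)
        return stats

    st = read_arcstats()
    # arc_need_free is a byte count, not an event counter.
    print("arc_need_free:", st["arc_need_free"], "bytes")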

The Linux kernel's memory management has 'shrinkers', which are callbacks into specific subsystems (such as ZFS) that memory management invokes to reduce memory usage. These shrinkers operate in two steps: first the kernel asks the subsystem how much memory it could potentially return, and then it asks the subsystem to actually free (some of) it. The basic amount of memory that the ARC can readily return to the system is the sum of mru_evictable_data, mru_evictable_metadata, mfu_evictable_data, and mfu_evictable_metadata (the real answer is more complicated; see arc_evictable_memory() in arc_os.c). Normally the amount reported is capped by zfs_arc_shrinker_limit, so any single invocation will only ask the ARC to drop at most 160 MBytes.
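
As an illustration (and glossing over the complications in arc_evictable_memory()), here's a sketch that adds up the four evictable kstats and reads the zfs_arc_shrinker_limit module parameter, which I believe is counted in pages rather than bytes:

    # Same quick parse of arcstats as in the earlier sketch.
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        st = {name: int(val) for name, _kind, val in
              (line.split() for line in f.readlines()[2:])}

    evictable = sum(st[n] for n in ("mru_evictable_data", "mru_evictable_metadata",
                                    "mfu_evictable_data", "mfu_evictable_metadata"))
    print("readily evictable: %d MiB" % (evictable // (1024 * 1024)))

    # zfs_arc_shrinker_limit is a module parameter; my understanding is
    # that it's expressed in pages, not bytes.
    with open("/sys/module/zfs/parameters/zfs_arc_shrinker_limit") as f:
        print("zfs_arc_shrinker_limit:", f.read().strip(), "pages")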

Every time shrinking happens, the ARC target size is reduced by however much the kernel asked ZFS to shrink, arc_no_grow is set to true, and either memory_indirect_count or memory_direct_count is increased. If the shrinking is being done by Linux's kswapd, it counts as indirect; if it comes from a process that tried to allocate memory, found the system short, and directly triggered memory reclaim (a 'direct reclaim'), it counts as direct. Direct reclaims are considered worse than indirect reclaims, because they indicate that kswapd wasn't able to keep up with the memory demand and processes allocating memory were forced to stall and reclaim memory themselves.
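
One simple way to see which kind of reclaim your system is experiencing is to sample the two counters over an interval and look at the deltas. A sketch of that:

    import time

    def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        with open(path) as f:
            return {name: int(val) for name, _kind, val in
                    (line.split() for line in f.readlines()[2:])}

    before = read_arcstats()
    time.sleep(60)          # watch one minute of activity
    after = read_arcstats()
    for k in ("memory_indirect_count", "memory_direct_count"):
        print("%s: +%d" % (k, after[k] - before[k]))
    # arc_no_grow is a boolean flag (0 or 1), not a counter.
    print("arc_no_grow is currently:", after["arc_no_grow"])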

(I believe the kernel may ask ZFS to drop less memory than ZFS reported it could potentially drop.)

The ARC has limits on how much metadata it will hold, both for metadata in general (arc_meta_limit versus arc_meta_used) and for dnodes specifically (arc_dnode_limit versus dnode_size). When the ARC shrinks metadata, it may need to 'prune' itself by having filesystems release dnodes and other things they're currently holding on to. When this triggers, arc_prune will count up by some amount; I believe this will generally be one per currently mounted filesystem (see arc_prune_async() in arc_os.c).
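
As a quick (unofficial) illustration of checking how close the ARC is to these limits, you can compare the relevant kstats directly:

    with open("/proc/spl/kstat/zfs/arcstats") as f:
        st = {name: int(val) for name, _kind, val in
              (line.split() for line in f.readlines()[2:])}

    mib = 1024 * 1024
    print("metadata: %d MiB used, limit %d MiB" %
          (st["arc_meta_used"] // mib, st["arc_meta_limit"] // mib))
    print("dnodes:   %d MiB used, limit %d MiB" %
          (st["dnode_size"] // mib, st["arc_dnode_limit"] // mib))
    # arc_prune is cumulative; it goes up by roughly one per mounted
    # filesystem each time pruning is triggered.
    print("arc_prune:", st["arc_prune"])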

When the ARC is evicting data, it can increase two statistics, evict_skip and evict_not_enough. The latter is the number of times ARC eviction wasn't able to evict enough to reach its target amount. For the former, let's quote arc_impl.h:

Number of buffers skipped because they have I/O in progress, are indirect prefetch buffers that have not lived long enough, or are not from the spa we're trying to evict from.

ZFS can be asked either to evict a certain amount of a particular evictable class (such as MRU metadata) or to evict everything evictable in that class. Only the former case can cause evict_not_enough to count up.
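
Both of these are cumulative counters, so what matters is how fast they grow rather than their absolute size. A minimal sketch to snapshot them:

    with open("/proc/spl/kstat/zfs/arcstats") as f:
        st = {name: int(val) for name, _kind, val in
              (line.split() for line in f.readlines()[2:])}
    print("evict_skip:      ", st["evict_skip"])
    print("evict_not_enough:", st["evict_not_enough"])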

In addition to regular data, the ARC can store 'anonymous' data. I'll quote arc_impl.h again:

Anonymous buffers are buffers that are not associated with a DVA. These are buffers that hold dirty block copies before they are written to stable storage. By definition, they are "ref'd" and are considered part of arc_mru that cannot be freed. Generally, they will acquire a DVA as they are written and migrate onto the arc_mru list.

The size of these is the anon_size kstat. Although there are anon_evictable_data and anon_evictable_metadata stats, I believe they're always zero, because anonymous dirty buffers probably aren't evictable. Some of the space counted here may be 'loaned out' and show up in arc_loaned_bytes.

As part of setting up writes, ZFS will temporarily reserve ARC space for them; the current reservation is reported in arc_tempreserve. Based on the code, the total amount of dirty data in the ARC for dirty space limits and space accounting is arc_tempreserve plus anon_size, minus arc_loaned_bytes.
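
Expressed as a sketch (this is my reading of the accounting, not code from ZFS):

    with open("/proc/spl/kstat/zfs/arcstats") as f:
        st = {name: int(val) for name, _kind, val in
              (line.split() for line in f.readlines()[2:])}
    dirty = st["arc_tempreserve"] + st["anon_size"] - st["arc_loaned_bytes"]
    print("ARC dirty data: %d MiB" % (dirty // (1024 * 1024)))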

Under some circumstances that aren't clear to me, ZFS may feel it needs to throttle new memory allocations for writes. When this happens, memory_throttle_count will increase by one. This seems to be rare; it's generally zero on our systems.
