2023-04-18
When and how ZFS on Linux changes the ARC target size (as of ZoL 2.1)
Previously I discussed the various sizes of the ARC, some important ARC memory stats, and ARC memory reclaim stats. Today I can finally talk about how the ZFS ARC target size shrinks, and a bit about how it grows, which is a subject of significant interest and some frustration. I will be citing ZoL function names because tools like bpftrace mean you can hook into them to monitor ARC target size changes.
(Changes in the actual size of the ARC are less interesting than changes in the ARC target size. Generally the actual size promptly fills up to the target size if you're doing enough IO, although metadata versus data balancing can throw a wrench in the works.)
The ARC target size is shrunk by arc_reduce_target_size() (in arc.c), which takes as its argument the size (in bytes) to reduce arc_c by and almost always does so (unless you've hit the minimum size). There are two paths to calling it: reaping, where ZFS periodically checks to see if it thinks there's not enough memory available, and shrinking, where the Linux kernel memory management system asks ZFS to shrink its memory use.
Reaping is a general ZFS facility where a dedicated kernel thread wakes up at least once every second to check if memory_available_bytes is negative. If it is, ZFS sets arc_no_grow, kicks off reclaiming memory, waits about a second, and then potentially shrinks the ARC target size by:
( (arc_c - arc_c_min) / 128 ) - memory_available_bytes
(The divisor will be different if you've tuned zfs_arc_shrink_shift. This is done in arc_reap_cb(), and see also arc_reap_cb_check().)
Because reaping waits a second after starting the reclaim, this number may not be positive (because the reclaim raised the amount of available bytes enough); if this has happened, arc_c is left unchanged. This reaping thread ticks once a second and may also be immediately woken up by arc_adapt(), which is called when ZFS is reading a new disk block into memory and which will check to see if memory_available_bytes is below zero.
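As a rough illustration of the reaping arithmetic described above, here is a minimal Python sketch. The function name and the example sizes are made up for illustration; the shift of 7 stands in for the default divisor of 128, and this is a model of the calculation, not the actual ZFS code.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def reap_reduction(arc_c, arc_c_min, memory_available_bytes, shrink_shift=7):
    """Bytes that reaping would take off the ARC target size (0 means no change).

    shrink_shift=7 corresponds to the divisor of 128 mentioned in the text.
    """
    can_free = arc_c - arc_c_min
    if can_free <= 0:
        # Already at (or below) the minimum: nothing to take away.
        return 0
    to_free = (can_free >> shrink_shift) - memory_available_bytes
    # If the reclaim pass freed enough memory, this goes non-positive
    # and the target size is left alone.
    return max(to_free, 0)


# Hypothetical system: 64 GiB target, 2 GiB minimum, and memory is still
# 16 MiB short after the one-second wait.
print(reap_reduction(64 * GiB, 2 * GiB, -16 * MiB) // MiB)  # prints 512
```

With these made-up numbers, 1/128th of the 62 GiB above the minimum is 496 MiB, and the remaining 16 MiB shortfall is added on top of that.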
My bpftrace-based measurements so far suggest that when reaping triggers, it normally makes relatively large adjustments in the ARC target size; I routinely see 300 and 400 MiB reductions even on my desktops. Since the ARC target size reduction starts out at 1/128th of the difference between the current ARC target size and the minimum size, a system with a lot of memory and a large ARC size may experience very abrupt drops through reaping, especially if you've raised the maximum ARC size and left the minimum size alone.
The shrinking path is invoked through the Linux kernel's general memory management feature of kernel subsystems having 'shrinkers' that kernel memory management can invoke to reduce the subsystem's memory usage (this came up in memory reclaim stats). When the kernel's memory management decides that it wants subsystems to shrink, it will first call arc_shrinker_count() to see how much memory the ARC can return and then maybe call arc_shrinker_scan() to actually do the shrinking. The amount of memory ARC will claim it can return is calculated in a complex way (see yesterday's discussion) and is capped at zfs_arc_shrinker_limit pages (normally 4 KiBytes each). All of this is in arc_os.c. Shrinking, unlike reaping, always immediately reduces arc_c by however much the kernel wound up asking it to shrink by.
Although you might expect otherwise, the kernel's memory subsystem can invoke the ARC shrinker even without any particular sign of memory pressure, and when it does so it often only asks the ARC to drop 128 pages (512 KiB) of data instead of the full amount that the ARC offers. It can also do this in rapid bursts, which obviously adds up to much more than just 512 KiB of total ARC target size reduction.
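To get a feel for how bursts of small shrinker requests add up, here is a small Python model. The function names, the burst length, and the sizes are hypothetical; it assumes 4 KiB pages and that each call lowers the target by exactly the number of pages requested, stopping at the minimum size.

```python
PAGE_SIZE = 4096  # bytes; assumes the usual 4 KiB Linux page
GiB = 1024 ** 3
MiB = 1024 ** 2


def shrink_target(arc_c, arc_c_min, pages_requested, page_size=PAGE_SIZE):
    """New ARC target size after one shrinker call asking for pages_requested pages."""
    return max(arc_c - pages_requested * page_size, arc_c_min)


def burst(arc_c, arc_c_min, calls, pages_per_call=128):
    """Apply a burst of identical shrinker calls and return the final target size."""
    for _ in range(calls):
        arc_c = shrink_target(arc_c, arc_c_min, pages_per_call)
    return arc_c


start = 8 * GiB
after = burst(start, 1 * GiB, calls=1000)
print((start - after) / MiB)  # prints 500.0
```

A thousand 128-page requests is a made-up burst size, but it shows how 512 KiB at a time can turn into a reduction of hundreds of MiB.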
Every time shrinking happens, one or the other of memory_indirect_count and memory_direct_count is increased. No statistic is increased if reaping happens, or if reaping leads to the ARC target size being reduced (which it doesn't always). If you need that information, you'll have to instrument things with something like the eBPF exporter. Writing the relevant BCC or bpftrace programs is up to you.
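The two shrinker counters can at least be pulled out of the ARC kstats without any tracing. The following Python sketch assumes the usual three-column 'name type data' layout of /proc/spl/kstat/zfs/arcstats; the helper names are mine and the sample values are invented so that the example runs anywhere.

```python
def parse_arcstats(text):
    """Parse arcstats-style text into a dict of name -> integer value."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data lines look like "name type value"; the two header lines don't.
        if len(fields) == 3 and fields[2].lstrip("-").isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def shrinker_counts(stats):
    """Pick out the two counters that shrinking increases."""
    names = ("memory_direct_count", "memory_indirect_count")
    return {name: stats.get(name, 0) for name in names}


# Invented sample in the assumed layout; not real measurements.
SAMPLE = """\
9 1 0x01 123 33456 1000000 2000000
name                            type data
c                               4    4294967296
memory_direct_count             4    12
memory_indirect_count           4    3400
"""

print(shrinker_counts(parse_arcstats(SAMPLE)))
```

On a real system you would read the text from /proc/spl/kstat/zfs/arcstats instead of SAMPLE, and sample it periodically to see how fast the counters move.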
How and when the ARC target size is increased again is harder to observe, although it's more centralized. The ARC target size is grown in arc_adapt(), but unfortunately not all of the time; it's only grown if the current ARC size is within 32 MiBytes of the target ARC size (and the ARC can grow at all, ie arc_no_grow is zero and there's no reclaim needed). As of ZoL 2.1, the ARC target size is grown by however many bytes were being read from disk, which may be as small as 4 KiB; in the current development version, that's changed to a minimum of 128 KiB. As mentioned before, arc_adapt() seems to be called only when ZFS wants to read new things from disk (with a minor exception for some L2ARC in-RAM structures).
(That the growth decision is buried away inside the depths of arc_adapt() makes it hard to monitor even with bpftrace, especially since arc_c itself isn't accessible to bpftrace.)
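The growth rule as I've described it for ZoL 2.1 can be modeled in a few lines of Python. This is a simplified sketch under my reading of the behavior, with hypothetical names; it ignores any cap at the maximum ARC size and the change to a 128 KiB minimum in the development version.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2
GROW_WINDOW = 32 * MiB  # the ARC size must be within this much of the target


def grow_target(arc_c, arc_size, bytes_read, arc_no_grow=False, reclaim_needed=False):
    """Return the new ARC target size after a read of bytes_read bytes from disk."""
    if arc_no_grow or reclaim_needed:
        # The ARC isn't allowed to grow right now.
        return arc_c
    if arc_size < arc_c - GROW_WINDOW:
        # The ARC isn't close enough to its target for the target to grow.
        return arc_c
    return arc_c + bytes_read


target = 4 * GiB
# Nearly full ARC, one 4 KiB read: the target creeps up by 4 KiB.
print(grow_target(target, target - 10 * MiB, 4096) - target)  # prints 4096
# ARC well under its target: no growth at all.
print(grow_target(target, 3 * GiB, 4096) - target)  # prints 0
```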
One consequence of this is that even if the ARC target size can grow, it only grows on ARC misses that trigger disk IO. If all of your requests are being served from the current ARC, ZFS won't bother growing the target size. This makes sense, but is potentially frustrating and I believe it can cause the ARC target size to 'stick' at alarmingly low levels for a while on a system that still has high ARC hit rates even on a reduced-size ARC, or low IO levels.
Sidebar: the shrinker call stack bpftrace has observed
I had bpftrace print call stacks for arc_shrinker_scan(), and what I got in my testing was:
arc_shrinker_scan+1
do_shrink_slab+318
shrink_slab+170
shrink_node+572
balance_pgdat+792
kswapd+496
[...]
I lack the energy to try to decode why the kernel would go down this particular path and what kernel memory metrics one would look at to predict it.