The ultimate (for now) answer for our ZFS ARC size problem

April 17, 2011

I've mentioned in passing before (here and here) that we have had a long-standing problem where our ZFS ARC sizes would basically collapse; the ARC would spontaneously decide to limit itself to 2 GBytes or so despite the machines having 8 GBytes of RAM and being essentially unused apart from NFS fileservice. In the end, I believe I've figured out why this happened to us. The short answer is kernel memory fragmentation.
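
You can watch this happen in the ARC's kernel statistics, either with 'kstat -m zfs -n arcstats' or from a little C program using libkstat. Here is a minimal sketch; 'size' is the ARC's current size and 'c' is the target size that it has decided to limit itself to:

    /* arcpeek.c: print the ARC's current size, target, and maximum.
     * Build with: cc arcpeek.c -o arcpeek -lkstat
     */
    #include <stdio.h>
    #include <kstat.h>

    int main(void) {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_named_t *size, *c, *c_max;

        if (kc == NULL) {
            perror("kstat_open");
            return 1;
        }
        /* The ARC publishes its statistics as zfs:0:arcstats. */
        ksp = kstat_lookup(kc, "zfs", 0, "arcstats");
        if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
            fprintf(stderr, "cannot find or read zfs:0:arcstats\n");
            return 1;
        }
        size = kstat_data_lookup(ksp, "size");    /* current ARC size */
        c = kstat_data_lookup(ksp, "c");          /* current target size */
        c_max = kstat_data_lookup(ksp, "c_max");  /* configured maximum */
        if (size == NULL || c == NULL || c_max == NULL) {
            fprintf(stderr, "unexpected arcstats layout\n");
            return 1;
        }
        printf("arc size:   %llu MB\n",
               (unsigned long long)(size->value.ui64 >> 20));
        printf("arc target: %llu MB\n",
               (unsigned long long)(c->value.ui64 >> 20));
        printf("arc max:    %llu MB\n",
               (unsigned long long)(c_max->value.ui64 >> 20));
        kstat_close(kc);
        return 0;
    }

On our afflicted fileservers it is the target size that sits at around 2 GBytes, far below the configured maximum.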

(At this point I will pause to mention that we are running more or less Solaris 10 update 8, because this is about to become important.)

Simplifying somewhat, Solaris allocates most kernel memory structures using an arena-based slab allocator; common sorts of objects each have their own separate arena. As with all slab allocators, the memory system can only return slab pages to the free pool if all objects on a particular page are free; even a single object that is still in use will keep an entire page allocated.
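
To make this concrete, here is a toy model of the problem in plain C (nothing Solaris-specific about it). Every "page" holds several fixed-size objects; we allocate everything and then free all but one object per page, which leaves utilization tiny while making exactly nothing reclaimable:

    /* slabfrag.c: toy demonstration of slab fragmentation. */
    #include <stdio.h>

    #define PAGE_SIZE     4096
    #define OBJ_SIZE      512
    #define OBJS_PER_PAGE (PAGE_SIZE / OBJ_SIZE)
    #define NPAGES        1000

    /* live[p][o] is 1 if object o on page p is still allocated. */
    static int live[NPAGES][OBJS_PER_PAGE];

    int main(void) {
        int page, obj, in_use = 0, reclaimable = 0;

        /* Allocate every object on every page... */
        for (page = 0; page < NPAGES; page++)
            for (obj = 0; obj < OBJS_PER_PAGE; obj++)
                live[page][obj] = 1;

        /* ...then free everything except one object per page. */
        for (page = 0; page < NPAGES; page++)
            for (obj = 1; obj < OBJS_PER_PAGE; obj++)
                live[page][obj] = 0;

        /* A page can be returned only if every object on it is free. */
        for (page = 0; page < NPAGES; page++) {
            int on_page = 0;
            for (obj = 0; obj < OBJS_PER_PAGE; obj++)
                on_page += live[page][obj];
            in_use += on_page;
            if (on_page == 0)
                reclaimable++;
        }

        printf("utilization: %.1f%% of objects in use\n",
               100.0 * in_use / (NPAGES * OBJS_PER_PAGE));
        printf("reclaimable: %d of %d pages\n", reclaimable, NPAGES);
        return 0;
    }

This prints a utilization of 12.5% and zero reclaimable pages, which is more or less the shape of what we're seeing.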

ZFS has an arena for dnode_t structures, which are the rough ZFS equivalent of inodes. On the Solaris fileservers with the ARC size collapse, Solaris kernel stats show that this arena has very low utilization; 16% of the allocated dnode_t's being in use is typical. Since Solaris is nonetheless unable to reduce the size of this arena, I think the live dnode_t's must be scattered across nearly all of its pages; in other words, the arena is heavily fragmented. All of those nearly empty slab pages still count as allocated kernel memory, which is presumably what pressures the ARC into shrinking its target size.
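
That utilization figure comes from the kernel's kmem cache statistics (mdb's ::kmastat reports the same data). Here is a sketch of computing it directly, assuming the dnode_t cache exports the usual buf_inuse and buf_total counters under the 'unix' kstat module:

    /* dnodeutil.c: report dnode_t kmem cache utilization.
     * Build with: cc dnodeutil.c -o dnodeutil -lkstat
     */
    #include <stdio.h>
    #include <kstat.h>

    int main(void) {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_named_t *inuse, *total;

        if (kc == NULL) {
            perror("kstat_open");
            return 1;
        }
        /* Each kmem cache exports its statistics as unix:0:<cache>. */
        ksp = kstat_lookup(kc, "unix", 0, "dnode_t");
        if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
            fprintf(stderr, "cannot find or read the dnode_t kstat\n");
            return 1;
        }
        inuse = kstat_data_lookup(ksp, "buf_inuse");  /* live objects */
        total = kstat_data_lookup(ksp, "buf_total");  /* allocated slots */
        if (inuse == NULL || total == NULL || total->value.ui64 == 0) {
            fprintf(stderr, "unexpected kstat layout\n");
            return 1;
        }
        printf("dnode_t: %llu of %llu in use (%.1f%%)\n",
               (unsigned long long)inuse->value.ui64,
               (unsigned long long)total->value.ui64,
               100.0 * inuse->value.ui64 / total->value.ui64);
        kstat_close(kc);
        return 0;
    }

Run every few minutes from cron, this is enough to watch the utilization climb and fall back as the periodic jobs mentioned below do their thing.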

This leaves us with two puzzles: what's causing the arena to grow, and what's keeping a random scattering of dnode_t structures busy. I have a potential answer for the first puzzle; as it happens, we have a number of periodic jobs that walk all of the ZFS filesystems on a fileserver, and when they're running, the dnode_t arena utilization climbs dramatically. I have no answer for the second puzzle right now (and haven't looked very hard for one).
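
As pure illustration (this is my guess at the mechanism, not the actual code of our jobs), here is a trivial stand-in for such a walker. Simply visiting every file in a tree is enough, because nftw() stat()s each entry and a stat() on ZFS forces the file's dnode to be read in:

    /* walk.c: stat everything under a directory, a stand-in for the
     * sort of periodic filesystem-walking job described above.
     */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>

    static long nfiles;

    static int visit(const char *path, const struct stat *st,
                     int type, struct FTW *ftwp) {
        (void)path; (void)st; (void)type; (void)ftwp;
        nfiles++;       /* nftw() has already stat()ed this entry */
        return 0;       /* returning 0 keeps the walk going */
    }

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        /* FTW_PHYS: don't follow symlinks while walking. */
        if (nftw(argv[1], visit, 64, FTW_PHYS) == -1) {
            perror("nftw");
            return 1;
        }
        printf("visited %ld entries\n", nfiles);
        return 0;
    }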

There is code in OpenSolaris to support defragmenting arenas by moving allocated objects between slab pages (with the cooperation of the owner of the objects). However, this code is not in Solaris 10 update 8 (and I don't know if it's in S10U9 either, or even Solaris 11 Express).


Comments on this page:

From 24.245.7.79 at 2011-04-17 09:55:38:

Chris - there are bugs in ZFS's ARC code that may be related. We run large ARC caches (up to 100 GB) and have hit cases where 20 GB gets evicted from the cache at once. Server performance suffers greatly (to say the least) during the eviction.

Notes from a Sun engineer:

IDR145698-01 contains fixes for the following CRs:

    6950219 large ghost eviction causes high write latency
    6953403 arc_adjust might adjust MRU unnecessarily
    6951024 arc_adapt can lead to wild arc_p adjustment

As of last fall there was an IDR that could be applied; I don't recall when the bug fixes will make it into a kernel patch.

--Mike (lastinfirstout.net)

From 24.245.7.79 at 2011-04-17 09:58:55:

Sorry about the formatting. It looks like wiki-text has defeated me. :)

--Mike (lastinfirstout.net)
