2011-04-17
The ultimate (for now) answer for our ZFS ARC size problem
I've mentioned in passing before (here and here) that we have had a long-standing problem where our ZFS ARC sizes would collapse; the ARC would spontaneously decide to limit itself to 2 GBytes or so, despite the machines having 8 GBytes of memory and being used for essentially nothing apart from NFS fileservice. In the end, I believe I've figured out why this happened to us. The short answer is kernel memory fragmentation.
(At this point I will pause to mention that we are running more or less Solaris 10 update 8, because this is about to become important.)
Simplifying somewhat, Solaris allocates most kernel memory structures using an arena-based slab allocator; common sorts of objects have their own separate arenas. As with all slab allocators, the memory system can only return slab pages to the free pool if all objects on a particular page are free; even a single object still used will cause an entire page to be retained.
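To make the consequence of that rule concrete, here is a toy simulation (not the Solaris allocator; the page size of 16 objects, the object counts, and the function name are all made up for illustration). It fills a set of slab pages, frees all but a small percentage of the objects, and counts how many pages could actually be handed back, once with the survivors packed together and once with them scattered at random.

```python
import random

OBJECTS_PER_PAGE = 16  # hypothetical; real counts depend on object and page size


def simulate(n_objects=16000, keep_percent=16, scattered=True, seed=1):
    """Fill pages with objects, free all but keep_percent of them, and
    return (total_pages, freeable_pages, utilization_of_retained_pages)."""
    n_pages = n_objects // OBJECTS_PER_PAGE
    n_keep = n_objects * keep_percent // 100

    if scattered:
        # Survivors are spread at random over every page.
        rng = random.Random(seed)
        survivors = rng.sample(range(n_objects), n_keep)
    else:
        # Survivors are packed into the first few pages.
        survivors = range(n_keep)

    # live[p] is the number of still-allocated objects on page p.
    live = [0] * n_pages
    for obj in survivors:
        live[obj // OBJECTS_PER_PAGE] += 1

    # A page can only be returned if nothing on it is still in use.
    freeable = sum(1 for count in live if count == 0)
    retained = n_pages - freeable
    utilization = sum(live) / (retained * OBJECTS_PER_PAGE)
    return n_pages, freeable, utilization


if __name__ == "__main__":
    for scattered in (False, True):
        pages, freeable, util = simulate(scattered=scattered)
        label = "scattered" if scattered else "packed"
        print(f"{label}: {freeable} of {pages} pages freeable, "
              f"retained pages {util:.0%} used")
```

In the packed case 840 of the 1000 pages can be freed and the rest are completely full. In the scattered case almost every page still holds at least one live object, so very little can be returned even though the same number of objects are in use.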
ZFS has an arena for dnode_t structures, which are the rough ZFS
equivalent of inodes. On the Solaris fileservers with the ARC size
collapse, Solaris kernel stats show that this arena has very low
utilization; it's typical for only 16% of the allocated dnode_t's to
be in use. Since Solaris is evidently unable to shrink this arena
despite all of that free space, I think it must be heavily fragmented,
with the live dnode_t's scattered across most of its pages.
This leaves us with two puzzles: what's causing the arena to grow, and
what's keeping a random scattering of dnode_t structures busy. I
have a potential answer for the first puzzle; as it happens, we have
a number of periodic jobs that walk all of the ZFS filesystems on a
fileserver, and when they're running the dnode_t arena utilization
climbs dramatically. I have no answer for the second puzzle right now
(and haven't looked very hard for one).
There is code in OpenSolaris to support defragmenting arenas by moving allocated objects between slab pages (with the cooperation of the owner of the objects). However, this code is not in Solaris 10 update 8 (and I don't know if it's in S10U9 either, or even Solaris 11 Express).
2011-04-07
Why I don't like Solaris boot archives, illustrated
I've written before about how I don't like Solaris boot archives. Let me condense that previous writing and illustrate it with a recent event. First, the short reason I don't like boot archives:
The problem with boot archives is that they make my machines unbootable for unimportant reasons.
It might be okay if boot archives only got out of date because of important things and important changes, especially if things were interrupted partway through. But that's not what actually happens.
Recently I decided to rebuild the boot archives on all of our fileservers
as a precaution. As far as I knew, nothing had changed on them recently
that should affect the boot archives (the fileserver configurations are
basically static), but it seemed to be a good idea just in case.
Somewhat to my surprise, most of the fileservers needed to update their
boot archives, and they needed to do this because /etc/rtc_config
had changed. This file contains basic timezone information, and on all
of those machines it had last been updated March 14th at 2:01 am.
(People in most of North America may recognize that date and time.)
An out of date boot archive prevents your machine from rebooting unattended if it crashes, loses power, or otherwise suffers some problem. So the end result is that Solaris decided that the change from standard time to daylight saving time was sufficiently important that it should prevent our systems from automatically recovering from unexpected outages.
Good going, Solaris.
The net result of this is that our fileservers now automatically rebuild their boot archives every night. This is still only an imitation of how it should actually work (Solaris should automatically update the boot archive whenever one of its administration commands makes it out of date), but it's better than nothing.
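For illustration only, a nightly job along these lines could be as small as the sketch below. This is not our actual script; it assumes the job is driven by cron, that Python is available on the machine, and that bootadm lives at /sbin/bootadm. The one thing it relies on from Solaris is 'bootadm update-archive', which brings the boot archive up to date if it needs it. The function name and the injectable 'run' argument are made up for the example.

```python
import subprocess

# Assumed path; adjust to wherever bootadm lives on your system.
BOOTADM_CMD = ["/sbin/bootadm", "update-archive"]


def rebuild_boot_archive(run=subprocess.run):
    """Run 'bootadm update-archive' and return its exit status.

    'run' defaults to subprocess.run and can be replaced with a stub,
    so the function can be exercised on a machine without bootadm.
    """
    result = run(BOOTADM_CMD)
    return result.returncode
```

A cron entry would then run a small wrapper that calls rebuild_boot_archive() and exits with its return value, so that failures show up in cron's mail.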
(I continue to have no idea what on earth the Solaris engineers were thinking when they came up with this 'feature' of boot archives. It smells a lot like making apparently logical local decisions without thinking through their global consequences.)