2015-07-20
The OmniOS kernel can hold major amounts of unused memory for a long time
The Illumos kernel (which means the kernels of OmniOS, SmartOS, and so on) has an oversight which can cause it to hold down a potentially large amount of unused memory in unproductive ways. We discovered this on our most heavily used NFS fileserver; on a server with 128 GB of RAM, over 70 GB of RAM was being held down by the kernel and left idle for an extended time. As you can imagine, this didn't help the ZFS ARC size, which got choked down to 20 GB or so.
The problem is in kmem, the kernel's general memory allocator. Kmem is what is called a slab allocator, which means that it divides kernel memory up into a bunch of arenas for different-sized objects. Like basically all sophisticated allocators, kmem works hard to optimize allocation and deallocation; for instance, it keeps a per-CPU cache of recently freed objects so that in the likely case that you need an object again you can just grab it in a basically lock free way. As part of these optimizations, kmem keeps a cache of fully empty slabs (ones that have no objects allocated out of them) that have been freed up; this means that it can avoid an expensive trip to the kernel page allocator when you next want some more objects from a particular arena.
The problem is that kmem does not bound the size of this cache of fully empty slabs and does not age slabs out of it. As a result, a temporary usage surge can leave a particular arena with a lot of unused objects and slab memory, especially if the objects in question are large. In our case, this happened to the arena for 'generic 128 KB allocations'; we spent a long time with around six buffers in use but 613,033 allocated, and 613,033 buffers of 128 KB each is roughly 74 GB. Presumably at one time we needed that ~74 GB of 128 KB buffers (probably because of an NFS overload situation), but we certainly didn't need them any more.
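To make the mechanism concrete, here is a toy model of such an arena. It is purely illustrative (in Python, with the slab bookkeeping reduced to two counters and one buffer per slab); it is not how the real kmem code works, but it shows how a one-time surge leaves memory parked:

    # Toy model (not the real kmem code): an arena whose cache of
    # fully empty slabs is unbounded and never aged out.
    class Arena(object):
        def __init__(self, slab_bytes):
            self.slab_bytes = slab_bytes
            self.in_use = 0     # slabs currently backing live buffers
            self.empty = 0      # fully empty slabs parked for reuse

        def grow(self, n):
            # reuse parked empty slabs first, then hit the page allocator
            reused = min(n, self.empty)
            self.empty -= reused
            self.in_use += n

        def shrink(self, n):
            # freed slabs pile up on the empty list; nothing hands them
            # back to the page allocator short of a kmem_reap
            self.in_use -= n
            self.empty += n

        def held_bytes(self):
            return self.empty * self.slab_bytes

    a = Arena(128 * 1024)
    a.grow(613033)          # a surge needs ~74 GB of 128 KB buffers
    a.shrink(613033 - 6)    # the surge passes; only a handful stay in use
    print("%.1f GB parked on the empty slab list" %
          (a.held_bytes() / float(1 << 30)))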
Kmem can be made to free up these unused slabs, but in order to do
so you must put the system under strong memory pressure by abruptly
allocating enough memory to run the system basically out of what
it thinks of as 'free memory'. In our experiments it was important
to do this in one fast action; otherwise the system frees up memory
through less abrupt methods and doesn't resort to what it considers
extreme measures. The simplest way to do this is with Python; look at what 'top' reports as 'free mem'
and then use up a bit more than that in one go.
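A sketch of the sort of thing I mean (the script name, the default of 40 GB, and the sleep are all just placeholders; the real number should be a bit above whatever top shows as free memory on your machine):

    # memhog.py: grab roughly N GB in one go to push the system into
    # its panic freeing of memory. Pick N a bit above 'free mem'.
    import sys, time

    gb = int(sys.argv[1]) if len(sys.argv) > 1 else 40
    # one big allocation; the multiply writes every byte, so the pages
    # are actually touched rather than merely reserved
    hog = b'\0' * (gb * 1024 * 1024 * 1024)
    sys.stderr.write("allocated %d GB; sleeping briefly before exiting\n" % gb)
    time.sleep(30)

When it exits, the memory is given back and the system can settle down again.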
(You can verify that the full freeing has triggered by using dtrace
to look for calls to kmem_reap.)
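Something like this should show them, assuming the fbt provider's entry probe for kmem_reap is available on your system (it normally is):

    dtrace -n 'fbt::kmem_reap:entry { printf("%Y kmem_reap fired\n", walltimestamp); }'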
Unfortunately, triggering this panic freeing of memory will likely cause your system to stall significantly. When we did it on our production fileserver, we saw NFS stall for a significant amount of time, ssh sessions freeze for somewhat less time, and for a while the system wasn't even responding to pings. If you have this problem and can't tolerate your system going away for five or ten minutes until things fully recover, well, you're going to need a downtime (and at that point you might as well reboot the machine).
The simple sign that your system may need this is a persistently
high 'Kernel' memory use in mdb -k's ::memstat but a low ZFS
ARC size. We saw 95% or so in Kernel but ARC sizes on the order of 20
GB, and of course the Kernel amount never shrank. The more complex
sign is to look for caches in mdb's ::kmastat that have outsized
space usage and a drastic mismatch between buffers in use and buffers
allocated.
(Note that arenas for small buffers may be suffering from fragmentation instead of or in addition to this.)
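To spell the checks out, both dcmds can be run non-interactively, and the ARC size is also available as a kstat; something like:

    # overall breakdown of physical memory; look for a huge 'Kernel' share
    echo ::memstat | mdb -k

    # per-cache details; look for caches with a big gap between buffers
    # in use and buffers allocated, plus a large amount of memory in use
    echo ::kmastat | mdb -k

    # the current ZFS ARC size, in bytes
    kstat -p zfs:0:arcstats:size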
I think that this isn't likely to happen on systems where you have user level programs with fluctuating overall memory usages because sooner or later just the natural fluctuation of user level programs is likely to push the system to do this panic freeing of memory. And if you use a lot of memory at the user level, well, that limits how much memory the kernel can ever use so you're probably less likely to get into this situation. Our NFS fileservers are kind of a worse case for this because they have almost nothing running at the user level and certainly nothing that abruptly wants several gigabytes of memory at once.
People who want more technical detail on this can see the illumos developer mailing list thread. Now that it's been raised with the developers, this issue is likely to be fixed at some point, but I don't know when. Changes to kernel memory allocators rarely happen very fast.
2015-07-15
Mdb is so close to being a great tool for introspecting the kernel
The mdb debugger is the standard debugger on Solaris and Illumos
systems (including OmniOS). One very important aspect of mdb is
that it has a lot of support for kernel 'debugging', which for
ordinary people actually means 'getting detailed status information
out of the kernel'. For instance, if you want to know a great deal
about where your kernel memory is going you're going to want the
'::kmastat' mdb command.
Mdb is capable of some very powerful tricks
because it lets you compose its commands together in 'pipelines'.
Mdb has a large selection of commands that report information
(like the aforementioned ::kmastat) and building blocks that let
you construct your own pipelines (e.g. walkers and ::print). All
of this is great, and far better than what most other systems have.
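A classic example of what I mean (all stock dcmds and walkers, nothing custom) is dumping the kernel stack of every thread belonging to sshd processes; each stage simply hands addresses on to the next one:

    echo '::pgrep sshd | ::walk thread | ::findstack' | mdb -k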
Where mdb sadly falls down is that this is all it has; it has no
scripting or programming language. This puts an unfortunate hard
upper bound on what you can extract from the kernel via mdb without
a huge amount of post-processing on your part. For instance, as far
as I know a pipeline can't have conditions or filtering, so you can't
restrict later stages to just a selected subset of what an earlier
stage produces. In the case of listing file locks,
you're out of luck if you want to work on only selected files instead
of all of them.
I understand (I think) where this limitation comes from. Part of
it is probably simply the era mdb was written in (which was not
yet a time when people shoved extension languages into everything
that moved), and part of it is likely that the code of mdb is
also much of the code of the embedded kernel debugger kmdb. But
from my perspective it's also a big missed opportunity. An mdb
with scripting would let you filter pipelines and write your own
powerful information dumping and object traversal commands,
significantly extending the scope of what you could conveniently
extract from the kernel. And the presence of pipelines in mdb
shows that its creators were quite aware of the power of flexibly
processing and recombining things in a debugger.
(Custom scripting also has obvious uses for debugging user level programs, where a complex program may be full of its own idioms and data structures that cry out for the equivalent of kernel dcmds and walkers.)
PS: Technically you can extend mdb by writing new mdb modules in
C, since they're just .so files that are loaded dynamically; there's
even a more or less documented module API. In practice my reaction
is 'good luck with that'.