2015-08-27
Some notes on using Solaris kstat(s) in a program
Solaris (and Illumos, OmniOS, etc) has for a long time had a 'kstat'
system for systematically providing and exporting kernel statistics
to user programs. Like many such systems in many OSes, kstat doesn't
need root permissions; all or almost all of the kstats are public
and can be read by anyone. If you're a normal Solaris sysadmin,
you've mostly interacted with this system via kstat(1) (as I
have) or perhaps Perl, for which there
is the Sun::Solaris::Kstat module. Since I didn't want to write
Perl, I opted to do it the hard way and wrote a program that talks
more or less directly to the C kstat library. When you do this, you
are directly exposed to some kstat concepts that kstat(1) and the
Perl bindings normally hide from you.
The stats that kstat shows you are normally given a four-element
name of the form module:instance:name:statistic. This is actually
kind of a lie. A 'kstat' itself is the module:instance:name triplet,
and is a handle for a bundle of related statistics (for example,
all of the per-link network statistics exposed by a particular
network interface). When you work at the C library level, getting
a statistic is a three-level process: you get a handle to the kstat,
you make sure the kstat has loaded the data for its statistics, and
then you can read out the actual statistic (how you do this depends
on what type of kstat you have).
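As a concrete illustration, here's a minimal C sketch of that three-level process for a named kstat; the 'link:0:net0' kstat and its 'obytes64' statistic are just names I'm using for illustration, not necessarily something your system actually has.

#include <stdio.h>
#include <kstat.h>

int
main(void)
{
    kstat_ctl_t *kc;
    kstat_t *ksp;
    kstat_named_t *kn;

    if ((kc = kstat_open()) == NULL)
        return 1;

    /* 1: get a handle to the kstat (module:instance:name). */
    if ((ksp = kstat_lookup(kc, "link", 0, "net0")) == NULL)
        return 1;

    /* 2: have the kstat load (copy) its statistics data from the kernel. */
    if (kstat_read(kc, ksp, NULL) == -1)
        return 1;

    /* 3: read out one actual statistic from that data. */
    if ((kn = kstat_data_lookup(ksp, "obytes64")) != NULL)
        printf("obytes64 = %llu\n",
            (unsigned long long)kn->value.ui64);

    (void) kstat_close(kc);
    return 0;
}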
This arrangement makes sense to a system programmer, because if we
peek behind the scenes we can see a two stage interaction with the
kernel. When you call kstat_open() to start talking with the
library, the library loads the index of all of the available kstats
from the kernel into your process but it doesn't actually retrieve
any data for them from the kernel. You only pay the relatively
expensive cost of copying some kstat data from the kernel to user
space when you actually ask for it. Since there are a huge number
of kstats, this cost saving is quite important.
(For example, a random OmniOS machine here has 3,627 kstats right now. How many you have will vary depending on how many network interfaces, disks, CPUs, ZFS pools, and so on there are in your system.)
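As an illustration of the index (the kstat chain) that kstat_open() loads, here's a small sketch that just walks the chain and counts the kstats; note that it never copies any per-kstat statistics data from the kernel.

#include <stdio.h>
#include <kstat.h>

int
main(void)
{
    kstat_ctl_t *kc;
    kstat_t *ksp;
    int count = 0;

    if ((kc = kstat_open()) == NULL)
        return 1;
    /* kc_chain is the head of the chain of all known kstats. */
    for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next)
        count++;
    printf("%d kstats\n", count);
    (void) kstat_close(kc);
    return 0;
}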
A kstat can have its statistics data in several different forms.
The most common form is a 'named' kstat, where the statistics are
in a list of name=value structs. If you're dealing with this sort
of kstat, you can look up a specific named statistic in the data
with kstat_data_lookup() or just go through the whole list
manually (it's fully introspectable). The next most common form is
I/O statistics, where the data is simply a C struct (a
kstat_io_t). There are also a few kstats that return other
structs as 'raw data', but to find and understand them you get
to read the kstat(1) source code. Only named-data kstats really
have statistics with names; everything else just has struct
fields.
(kstat(1) and the Perl module hide this complexity from you by
basically pretending that everything is a named-data kstat.)
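To make the difference between the data forms concrete, here's a sketch of one way to handle the two common ones after a kstat_read(); it assumes you already have a read kstat_t pointer, and for named kstats it only bothers printing 64-bit unsigned counters.

#include <stdio.h>
#include <kstat.h>

void
print_kstat_data(kstat_t *ksp)
{
    if (ksp->ks_type == KSTAT_TYPE_NAMED) {
        /* A list of name=value structs that we can walk directly. */
        kstat_named_t *kn = KSTAT_NAMED_PTR(ksp);
        unsigned int i;

        for (i = 0; i < ksp->ks_ndata; i++) {
            if (kn[i].data_type == KSTAT_DATA_UINT64)
                printf("%s = %llu\n", kn[i].name,
                    (unsigned long long)kn[i].value.ui64);
        }
    } else if (ksp->ks_type == KSTAT_TYPE_IO) {
        /* The data is simply a kstat_io_t struct; the 'names' are
           just its fields. */
        kstat_io_t *io = KSTAT_IO_PTR(ksp);

        printf("nread = %llu nwritten = %llu\n",
            (unsigned long long)io->nread,
            (unsigned long long)io->nwritten);
    }
}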
Once read from the kernel, kstat statistics data does not automatically
update. Instead it's up to you to update it whenever you want to,
by calling kstat_read() on the relevant kstat again. What
happens to the kstat's old data is indeterminate, but I think that
you should assume it's been freed and is no longer something you
should try to look at.
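For example, here's a sketch of a simple polling loop for one kstat; it redoes the kstat_data_lookup() after every kstat_read() precisely because the old data may have been reallocated ('obytes64' is again just an illustrative statistic name).

#include <stdio.h>
#include <unistd.h>
#include <kstat.h>

/* Re-read a kstat and print one named statistic every 10 seconds.
   Assumes kc came from kstat_open() and ksp from kstat_lookup(). */
void
poll_stat(kstat_ctl_t *kc, kstat_t *ksp)
{
    for (;;) {
        kstat_named_t *kn;

        if (kstat_read(kc, ksp, NULL) == -1)
            break;
        if ((kn = kstat_data_lookup(ksp, "obytes64")) != NULL)
            printf("obytes64 = %llu\n",
                (unsigned long long)kn->value.ui64);
        (void) sleep(10);
    }
}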
This brings us to the issue of how the kstat library manages its
memory and what bits of memory may change out from underneath you,
and when, which is especially relevant if you're doing something
complicated while talking to it (as I am). I believe the answer
is that kstat_read() changes the statistics data for a kstat
and may reallocate it, kstat_chain_update() may cause random
kstat structs and their data to be freed out from underneath you,
and kstat_close() obviously frees and destroys everything.
(The Perl Kstat module has relatively complicated handling of its shadow references to kstat structs after it updates the kstats chain. My overall reaction is 'there are dragons here' and I would not try to hold references to any old kstats after a kstat chain update. Restarting all your lookups from scratch is perhaps a bit less efficient, but it's sure to be safe.)
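In code, the cautious approach looks something like this sketch, which redoes the lookup (for the illustrative 'link:0:net0' kstat from earlier) whenever kstat_chain_update() reports that the chain has changed.

#include <kstat.h>

/* kstat_chain_update() returns -1 on error, 0 if the chain is
   unchanged, and the new chain ID if it changed. If it changed, any
   kstat_t pointers we saved earlier must be considered invalid. */
kstat_t *
refresh_kstat(kstat_ctl_t *kc, kstat_t *ksp)
{
    kid_t kcid = kstat_chain_update(kc);

    if (kcid == -1)
        return NULL;
    if (kcid != 0)
        ksp = kstat_lookup(kc, "link", 0, "net0");
    if (ksp != NULL && kstat_read(kc, ksp, NULL) == -1)
        return NULL;
    return ksp;
}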
In general, once I slowly wrapped my mind around what the kstat library was doing I found it reasonably pleasant to use. As with the nvpair library, the hard part was understanding the fundamental ideas in operation. Part of this was (and is) a terminology issue; 'kstat' and 'kstats' are in common use as labels for what I'm calling a kstat statistic here, which makes it easy to get confused about what is what.
(I personally think that this is an unfortunate name choice, since 'kstat' is an extremely attractive name for the actual statistics. Life would be easier if kstats were called 'kstat bundles' or something.)
2015-08-07
One thing I now really want in ZFS: faster scrubs (and resilvers)
One of the little problems of ZFS is that scrubs and resilvers are random IO. I've always known this, but for a long time it hasn't really been important to us; things ran fast enough. As you might guess from me writing an entry about it, this situation is now changing.
We do periodic scrubs of each of our pools for safety, as everyone should; this has historically been reasonably important. Because of their impact on user IO, we only want to do them on weekends (and we only do one at a time on each fileserver). For a long time this was fine, but recently a few of our largest pools have started taking longer and longer to scrub. There are now a couple of pools that basically take the entire weekend to scrub, and that's if we're lucky. In fact the latest scrub of one of these pools took three and a half days (and was only completed because Monday was a holiday and then no one complained on Tuesday).
This pool is not hulkingly huge; it's 2.91 TB of allocated space spread across eight pairs of drives. But at that scrub rate it seems pretty clear that we're being killed by random IO; the aggregate scrub data rate was down in the range of a puny 10 Mbytes/sec. Yes, the pool did see activity over the weekend, but not that much activity.
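(As a rough sanity check on that number: 2.91 TB scrubbed over three and a half days is roughly 2.91e12 bytes over about 302,400 seconds, which works out to a bit under 10 Mbytes/sec.)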
(This pool seems to be an outlier in terms of scrub time. Another pool on the same fileserver with 2.53 TB used across seven pairs of drives took only 27 hours to be scrubbed during its last check.)
One of the ZFS improvements that came with Solaris 11 is sequential resilvering (via), which apparently significantly speeds up resilvering. It's not clear to me if this also speeds up scrubbing, but I'd optimistically hope so. Of course this is only in Solaris 11; I don't think anyone in the Illumos community is currently working on this, and I imagine it's a non-trivial change that would take a decent amount of development effort. Still, I can hope. Faster scrubs are not yet a killer feature for us (we have a few tricks left up our sleeves), but they would be a big improvement.
(Faster resilvers by themselves would also be useful, but we fortunately do far fewer resilvers than we do scrubs.)