Wandering Thoughts archives

2015-10-19

Installing and pinning specific versions of OmniOS packages

Suppose, not entirely hypothetically, that you want to install some additional OmniOS systems with their kernels pinned to a specific version, one that you know works because it's what you're already running on your current OmniOS machine (in fact, perhaps you are having problems with a more recent kernel on your test machine). In order to do this, you're going to need to do three things: you need to find out what versions of what packages to pin, then you need to actually install those specific versions of the packages, and finally you need to keep them from changing.

Once you have the full FMRIs of the packages you want to pin to a specific version when you install a new system, there are two cases at install time: upgrades and downgrades. The simpler case is upgrades; you start from an old OmniOS installer image for your OmniOS version, then upgrade the packages you care about to only their specific versions instead of all the way to the latest ones. This is done by giving 'pkg update' the full package FMRIs on the command line; if you have them in a file called pkg-versions, this is:

pkg update $(cat pkg-versions)
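
For illustration, the pkg-versions file is nothing special, just full FMRIs one per line. A hypothetical one pinning an older r151014 kernel might look like this (the kernel FMRI here is a real r151014 version; the driver line and its timestamp are made up):

pkg://omnios/system/kernel@0.5.11-0.151014:20150914T195008Z
pkg://omnios/driver/storage/aac@0.5.11-0.151014:20150914T195008Z
[...]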

But wait, as they say, there is a complication (at least for the kernel). Kernel upgrades require a new boot environment (and are done in that new BE) and if you're installing from an older image, you have other packages to upgrade too and some of them may also require a new boot environment. So what you really want to do is install both your specific versions and 'the latest updates to other packages' at the same time. This is done by also giving 'pkg update' an explicit "*" argument:

pkg update $(cat pkg-versions) "*"

This lets you do the whole process in a single new BE and a single reboot.

If you're installing from the very latest OmniOS installer image for your OmniOS version, you actually need to downgrade your packages. According to the documentation this is also done with 'pkg update $(cat pkg-versions)', but I haven't tested it so I don't know if it works right. My moral is to save your old OmniOS installer images, partly because I trust 'upgrade only to a specific version' more than 'start with a more recent version and downgrade'.

(It's possible that old installer images are still available somewhere, but I don't know where to find them. Old package versions are kept around in at least the r151014 repo and this will hopefully continue to be the case.)

Once you've actually installed your versions of the specific packages, you need to freeze them against further upgrades. This is done with the straightforward 'pkg freeze <FMRI> [...]':

pkg freeze -c "known to work kernel" $(cat pkg-versions)

However, once again boot environments get in the way. Installing the specific kernel packages created a new boot environment, and you have to freeze the packages in that new BE, not the current one. So the easy way to go is to reboot into the new BE before you run the 'pkg freeze'.
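
(Once you're in the new BE and have run the freeze, you can double check your work; 'pkg freeze' with no arguments lists the currently frozen packages, roughly like so, although I'm going from memory on the exact output format:)

$ pkg freeze
NAME            VERSION                            DATE    COMMENT
system/kernel   0.5.11-0.151014:20150914T195008Z   [...]   known to work kernel
[...]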

(It's possible that you could manually mount the new BE with 'beadm mount' and then point 'pkg freeze' at it with pkg's -R argument, per here. However I haven't tested it and I honestly think it's probably simpler to reboot into the new BE first under most circumstances.)

PS: It's unfortunate that 'pkg freeze' won't freeze specific versions that are later than what you have installed. Otherwise the easy approach would be to freeze first, run a general 'pkg update' (which should upgrade your packages to their frozen versions and everything else to the latest versions), and then reboot into your new BE. But that would probably make freezing more complex in the pkg code, so I can sort of see why it isn't allowed.

OmniOSPkgVersionPins written at 01:58:52; Add Comment

2015-10-15

Some notes on finding package versions in OmniOS with pkg

For reasons that don't fit within the margins of this entry, I recently had to poke around the pkg system on OmniOS in order to find out some information about packages, such as which package versions are available in the OmniOS repo, what package versions are on the system, and what packages provide certain files.

So, first, versions. pkg packages have a short version and a long one, like so:

$ pkg list kernel
NAME (PUBLISHER)  VERSION
system/kernel     0.5.11-0.151014
$ pkg list -v kernel
FMRI
pkg://omnios/system/kernel@0.5.11-0.151014:20150929T225337Z

(In all my examples, some output is condensed and fields omitted.)

As you might guess from the format of the short version, all kernel packages for OmniOS r151014 have the same short version; they differ only in the timestamp on the long version. This means that if you care about the specific kernel version for some reason you must ask for the long version.

The OmniOS r151014 repo has (at least right now) all kernel versions published for r151014, from the start onwards. You can see all of the available versions with 'pkg list -afv kernel':

$ pkg list -afv kernel
FMRI
pkg://omnios/system/kernel@0.5.11-0.151014:20150929T225337Z
pkg://omnios/system/kernel@0.5.11-0.151014:20150914T195008Z
[...]
pkg://omnios/system/kernel@0.5.11-0.151014:20150402T175237Z

If for some reason you want to install an older kernel, this is what you may need to do to find out its specific full version.

Now, the OmniOS kernel is not delivered in just the kernel package; instead there are a whole collection of packages that contain kernel drivers and other modules. So if you want 'a specific older kernel', you probably want not just the basic kernel package but all of the related drivers to be from that older kernel. This leads to the question of what installed packages on your system supply kernel drivers, and for that we turn to 'pkg contents'. To get a list of all such files along with the package names of the packages that supply them, we want:

$ pkg contents -t file -o path,pkg.name -a 'path=kernel/*'
PATH                    PKG.NAME
kernel/amd64/genunix    system/kernel
[...]
kernel/drv/aac          driver/storage/aac
[...]
kernel/drv/amd64/fct    storage/stmf
[...]
kernel/drv/amd64/fm     service/fault-management
[...]
kernel/drv/amd64/iscsi  network/iscsi/initiator
[...]
kernel/drv/amd64/zfs    system/file-system/zfs
[...]
kernel/fs/amd64/nfs     system/file-system/nfs
[...]
kernel/kmdb/amd64/arp   developer/debug/mdb
[...]

(To get long versions, ask for pkg.fmri instead of pkg.name. I've used short names because this example is already long enough.)

As this rather long example shows, packages from all over the package namespace can wind up providing kernel modules; they are by no means confined to driver/* and system/kernel* as you might innocently initially expect (although those certainly have the majority of kernel-related packages). You might wonder if the versions of all of these packages are tightly tied together so that they must be installed or updated as a set. As far as I know, the answer is that they (mostly?) aren't, apparently because Illumos has a stable kernel module API and most or all kernel modules use it. Whether or not the result works really well is an open question, but the package system itself won't prevent a mix and match situation in my brief testing.

To get just the package names (or FMRIs), we need only the second field, like so:

$ pkg contents -t file -H -o path,pkg.fmri -a 'path=kernel/*' |
      awk '{print $2}' | sort -u

This will give us a nice list of specific package versions that are responsible for files under /kernel in our current system.
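
If you're collecting these versions in order to pin them on another machine (as covered in the entry above), a minimal sketch is to simply save the list into a pkg-versions file:

$ pkg contents -t file -H -o path,pkg.fmri -a 'path=kernel/*' |
      awk '{print $2}' | sort -u > pkg-versions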

However, suppose that we've recently updated to the latest r151014 update, the new kernel seems to have problems in our testing, and what we'd really like to get is the package versions of the previous kernel. Since a kernel update makes a new boot environment, one option is to just reboot into the old pre-update boot environment and run these 'pkg contents' or 'pkg list' commands. But that might be disruptive to ongoing tests, and it turns out that we don't need to, because we can make pkg look at alternate boot environments (although not directly).

First we need to know what boot environments we have:

# beadm list
BE                Active [...]
[...]
omnios-cslab2-2   -
omnios-cslab2-3   NR

Assuming that we want the obvious previous BE, now we need to mount it somewhere:

# beadm mount -s ro omnios-cslab2-2 /mnt

Now we can look at package information for this old BE by giving pkg the -R option, for example:

$ pkg -R /mnt contents [...]
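
For example, to pull the kernel-related package versions out of the old BE in one go, the earlier pipeline works unchanged apart from adding -R:

$ pkg -R /mnt contents -t file -H -o path,pkg.fmri -a 'path=kernel/*' |
      awk '{print $2}' | sort -u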

When you're done, unmount the BE with 'beadm umount'.

This provides a handy and relatively non-intrusive way to recover specific package versions from an old boot environment (or, for that matter, just a list of installed packages).

Sidebar: My sources for this

Some of this is derived from information in the OmniOS wiki's general administration guide, which gives a number of 'pkg contents' examples that were quite helpful. In general 'pkg contents' can be used to do all sorts of things; it's not at all limited to mapping files to packages and packages to files.

Information about 'pkg list' and pointing pkg at alternate BEs is from Lauri Tirkkonen on omnios-discuss in answer to me asking how to do this sort of stuff.

OmniOSPkgVersionFinding written at 00:51:03; Add Comment

2015-10-09

How much space ZFS reserves in your pools varies across versions

Back in my entry on the difference in available pool space between zfs list and zpool list, I noted that one of the reasons the two differ is that ZFS reserves some amount of space internally. At the time I wrote that the code said it should be reserving 1/32nd of the pool size (and still allow some things down to 1/64th of the pool, like ZFS property changes) but our OmniOS fileservers seemed to be only reserving 1/64th of the space (and imposing a hard limit at that point). It turns out that this discrepancy has a simple explanation: ZFS has changed its behavior over time.

This change is Illumos issue 4951, 'ZFS administrative commands should use reserved space, not fail with ENOSPC', which landed in roughly July of 2014. When I wrote my original entry in late 2014 I looked at the latest Illumos source code at the time and so saw this change, but of course our ZFS fileservers were using a version of OmniOS that predated the change and so were using the old 1/64th of the pool hard limit.

The change has propagated into various Illumos distributions and other ZFS implementations at different points. In OmniOS it's in up to date versions of the r151012 and r151014 releases, but not in r151010 and earlier. In ZFS on Linux, it landed in the 0.6.5 release and was not in 0.6.4. In FreeBSD, this change is definitely in -current (and appears to have arrived very close to when it did in Illumos), but it postdates 10.0's release and I think arrived in 10.1.0.

This change has an important consequence: when you update across this change, your pools will effectively shrink, because you'll go from ZFS reserving 1/64th of their space to reserving 1/32nd of their space. If your pools have lots of space, well, this isn't a problem. If your pools have only some space, your users may notice it suddenly shrinking a certain amount (some of our pools will lose half their free space if we don't expand them). And if your pools are sufficiently close to full, they will instantly become over-full and you'll have to delete things to free up space (or expand the pool on the spot).

I believe that you can revert to the old 1/64th limit if you really want to, but unfortunately it's a global setting so you can't do it selectively for some pools while leaving others at the default 1/32nd limit. Thus, if you have to do this you might want to do so only temporarily in order to buy time while you clean up or expand pools.
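
(If I'm reading the 4951 change right, the global setting involved is the spa_slop_shift kernel tunable, which now defaults to 5, ie 1/32nd; setting it back to 6 should give you the old 1/64th behaviour. A sketch of doing that with mdb, assuming I have the name right and after checking the variable's actual size with ::sizeof first:)

# mdb -kw
> spa_slop_shift ::print -d
5
> spa_slop_shift/W0t6
spa_slop_shift: 0x5 = 0x6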

(Of course, by now most people may have already dealt with this. We're a bit behind the times in terms of what OmniOS version we're using.)

Sidebar: My lesson learned here

The lesson I've learned from this is that I should probably stop reflexively reading code from the Illumos master repo and instead read the OmniOS code for the branch we're using. Going straight to the current 'master' version is a habit I got into in the OpenSolaris days, when there simply was no source tree that corresponded to the Solaris 10 update whatever that we were running. But these days that's no longer the case and I can read pretty much the real source code for what's running on our fileservers. And I should, just to avoid this sort of confusion.

(Perhaps going to the master source and then getting confused was a good thing in this case, since it's made me familiar with the new state of affairs too. But it won't always go so nicely.)

ZFSReservedSpaceVaries written at 22:23:55; Add Comment

2015-09-11

ZFS scrub rates, speeds, and how fast is fast

Here is a deceptively simple question: how do you know if your ZFS pool is scrubbing fast (or slow)? In fact, what does the speed of a scrub even mean?

The speed of a scrub is reported in the OmniOS 'zpool status' as:

  scan: scrub in progress since Thu Sep 10 22:48:31 2015
    3.30G scanned out of 60.1G at 33.8M/s, 0h28m to go
    0 repaired, 5.49% done

This is reporting the scrub's progress through what 'zpool list' reports as ALLOC space. For mirrored vdevs, this is the amount of space used before mirroring overhead; for raidz vdevs, this is the total amount of disk space used including the parity blocks. The reported rate is the total cumulative rate, ie it is simply the amount scanned divided by the time the scrub has taken so far. If you want the current scan rate, you need to look at the difference in the amount scanned between two 'zpool status' commands over some time interval (10 seconds makes for easy math, if the pool is scanning fast enough to change the 'scanned' figure).
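
(For example, a quick and dirty way to get the current rate by hand; the pool name and the second set of numbers here are made up:)

$ zpool status tank | grep scanned
    3.30G scanned out of 60.1G at 33.8M/s, 0h28m to go
$ sleep 10; zpool status tank | grep scanned
    3.65G scanned out of 60.1G at 34.0M/s, 0h27m to go

Here the difference is 0.35 GB over ten seconds, ie a current scan rate of roughly 36 Mbytes/sec, a bit higher than the cumulative 34 Mbytes/sec being reported.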

This means that the scan rate means different things and has different maximum speeds on mirrored vdevs and on raidz vdevs. On mirrored vdevs, the scan speed is the logical scan speed; in the best case of entirely sequential IO it will top out at the sequential read speed of a single drive. The extra IO to read from all of the mirrors at once is handled below this level, so if you watch a mirrored vdev that is scrubbing at X MB/sec you'll see that all N of its drives are each reading at more or less X MB/sec. On raidz vdevs, the scan speed is the total physical scan speed of all the vdev's drives added together. If the vdev has N drives each of which can read at X MB/sec, the best case is a scan rate of N*X. If you watch a raidz vdev that is scrubbing at X MB/sec, each drive should be doing roughly X/N MB/sec of reads (at least for full-width raidz stripes).

(All of this assumes that the scrub is the only thing going on in the pool. Any other IO adds to the read rates you'll see on the disks themselves. An additional complication is that scrubs normally attempt to prefetch things like the data blocks for directories; this IO is not accounted for in the scrub rate but it will be visible if you're watching the raw disks.)

In a multi-vdev pool, it's possible (but not certain) for a scrub to be reading from multiple vdevs at once. If it is, the reported scrub rate will be the sum of the (reported) rates that the scrub can achieve on each vdev. I'm not going to try to hold forth on the conditions when this is likely, because it depends on a lot of things as far as I can tell from the kernel code. I think it's more likely when you have single objects (files, directories, etc) whose blocks are spread across multiple vdevs.

If your IO system has total bandwidth limits across all disks, this will clamp your maximum scrub speed. For raidz vdevs, the visible scrub rate will be this total bandwidth limit; for mirror vdevs, it will be the limit divided by how many mirrors you have. For example, we have a 200 MByte/sec total read bandwidth limit (since our fileservers have two 1Gb iSCSI links) and we use two-way mirrored vdevs, so our maximum scrub rate is always going to be around 100 MBytes/sec.

This finally gives us an answer to how you know if your scrub is fast or slow. The fastest rate a raidz scrub can report is your total disk bandwidth across all disks, and the fastest rate a mirror scrub can report is your single disk bandwidth times the number of vdevs. The closer you are to this (or to what you know is your system's overall disk bandwidth limit), the better. The further away from it you are, the worse off you are, either because your scrub has descended into random IO or because you're hitting tunable limits (or both at once, for extra fun).

(Much of this also applies to resilvers because scrubs and resilvers share most of the same code, but it gets kind of complicated and I haven't attempted to decode the resilver specific part of the kernel ZFS code.)

Sidebar: How scrubs issue IO (for more complexity)

Scrubs have two sorts of IO they do. For ZFS objects like directories and dnodes, the scrub actually needs to inspect the contents of the disk blocks so it tries to prefetch them and then (synchronously) reads the data through the regular ARC read paths. This IO is normal IO, does not get counted in the scrub progress report, and does not do things like check parity blocks or all mirrored copies. Then for all objects (including directories, dnodes, etc) the scrub issues a special scrub 'read everything asynchronously' read that does check parity, read all mirrors, and so on. It is this read that is counted in the 'amount scanned' stats and can be limited by various tunable parameters. Since this read is being done purely for its side effects, the scrub never waits for it and will issue as many as it can (up to various limits).

If a scrub is not running into any limits on how many of these scrub reads it can do, its ability to issue a flood of them is limited only by whether it has to wait for some disk IO in order to process another directory or dnode or whatever.

ZFSScrubSpeedNotes written at 00:20:27; Add Comment

2015-09-10

Changing kernel tunables can drastically speed up ZFS scrubs

We had another case where a pool scrub was taking a very long time this past weekend and week; on a 3 TB pool, 'zpool status' was reporting ongoing scrub rates of under 3 MB/s. This got us to go on some Internet searches for kernel tunables that might be able to speed this up. The results proved to be extremely interesting. I will cut to the punchline: with one change we got the pool scrubbing at roughly 100 Mbytes/second, which is the maximum scrub IO rate a fileserver can maintain at the moment. Also, it turns out that when I blithely asserted that our scrubs were being killed by having to do random IO I was almost certainly dead wrong.

(One reason we were willing to try changing tunable parameters on a live production system was that this pool was scrubbing so disastrously slow that we were seriously worried about resilver times for it if it ever needed a disk replacement.)

The two good references we immediately found for tuning ZFS scrubs and resilvers are this serverfault question and answer and ZFS: Performance Tuning for Scrubs and Resilvers. Rather than change all of their recommended parameters at once, I opted to make one change at a time and observe the effects (just in case a change caused the server to choke). The first change I made was to set zfs_scrub_delay to 0; this immediately accelerated the scrub rate to 100 Mbytes/sec.

Let's start with a quote from the code in dsl_scan.c:

int zfs_scrub_delay = 4;     /* number of ticks to delay scrub */
int zfs_scan_idle = 50;      /* idle window in clock ticks */

How these variables are used is that every time a ZFS scrub is about to issue a read request, it checks to see if some normal read or write IO has happened within zfs_scan_idle ticks. If it has, it delays zfs_scrub_delay ticks before issuing the IO or doing anything else. If your pool is sufficiently busy to hit this condition more or less all of the time, ZFS scrubs will only be able to make at most a relatively low number of reads a second; if HZ is how many ticks in a second, the issue rate is HZ / 4 by default. In standard OmniOS kernels, HZ is almost always 100; that is, there are 100 ticks a second. If your regular pool users are churning around enough to do one actual IO every half a second, your scrubs are clamped to no more than 25 reads a second. If each read is for a full 128 KB ZFS block, that's a scrub rate of about 3.2 MBytes/sec at most (and there are other things that can reduce it, too).
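
(If you want to check what HZ actually is on your system, it's held in the kernel's hz variable, which you can read with mdb; eg:)

# echo "hz ::print -d" | mdb -k
0t100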

Setting zfs_scrub_delay to 0 eliminates this clamping of scrub reads in the face of other IO; instead your scrub is on much more equal footing with active user IO. Unfortunately you cannot set it to any non-zero value lower than 1 tick, and 1 tick will clamp you to 100 reads a second, which is probably not fast enough for many people.

This does not eliminate slowness due to scrubs (and resilvers) potentially having to do a lot of random reads, so it will not necessarily eliminate all of your scrub speed problems. But if you have a pool that seems to scrub at a frustratingly variable speed, sometimes scrubbing in a day and sometimes taking all week, you are probably running into ZFS scrubs backing off in the face of other IO and it's worth exploring this tunable and the others in those links.

On the other tunables, I believe that it's relatively harmless and even useful to tune up zfs_scan_min_time_ms, zfs_resilver_min_time_ms, and zfs_top_maxinflight. Certainly I saw no problems on our server when I set zfs_scan_min_time_ms to 5000 and increased zfs_top_maxinflight. However I can't say for sure that it's useful, as our scrub rate had already hit its maximum rate from just the zfs_scrub_delay change.

(And I'm still reading the current Illumos ZFS kernel code to try to understand what these additional tunables really do and mean.)

Sidebar: How to examine and set these tunable variables

To change kernel tunables like this, you need to use 'mdb -kw' to enable writing to things. To see their value, I recommend using '::print', eg:

> zfs_scrub_delay ::print -d
4
> zfs_scan_idle ::print -d
0t50

To set the value, you should use /W, not the /w that ZFS: Performance Tuning for Scrubs and Resilvers says. The w modifier is for 2-byte shorts, not 4-byte ints, and all of these variables are 4-byte ints (as you can see with '::print -t' and '::sizeof' if you want). A typical example is:

> zfs_scrub_delay/W0
zfs_scrub_delay:0x4 = 0x0

The /W stuff accepts decimal numbers as '0tNNNN' (as '::print -d' shows them, unsurprisingly), so you can do things like:

> zfs_scan_min_time_ms/W0t5000
zfs_scan_min_time_ms: 0x3e8 = 0x1388

(Using '/w' will work on the x86 because the x86 is a little-endian architecture, but please don't get into the habit of doing that. My personal view is that if you're going to be poking values into kernel memory it's very much worth being careful about doing it right.)

ZFSScrubsOurSpeedup written at 01:33:14; Add Comment

2015-09-06

Optimizing finding unowned files on our ZFS fileservers

One of the things we do every weekend is look for files on our fileservers that have wound up being owned by people who don't exist (or, more commonly, who no longer exist). For a long time this was done with the obvious approach using find, which was basically this:

SFS=$(... generate FS list ...)
gfind -H $SFS -mount '('  -nogroup -o -nouser ')' -printf ...

The problem with this is that we have enough data in enough filesystems that running a find over the entire thing can take a significant amount of time. On our biggest fileserver, we've seen this take on the order of ten hours, which either delays the start of our weekly pool scrubs or collides with them, slowing them down (and they can already be slow enough). Recently I realized that we can do much better than this by not checking most of our filesystems.

The trick is to use ZFS's existing infrastructure for quotas. As part of this, ZFS maintains information on the amount of space used by every user and every group on each filesystem, which the 'zfs userspace' and 'zfs groupspace' commands will print out. As a side effect this gives you a complete list of every UID and GID that uses space in the filesystem, so all we have to do is scan the lists to see if there are any unknown ones in them. If all UIDs and GIDs using space on the filesystem exist, we can completely skip running find on it; we know our find won't find anything.
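
A minimal sketch of the check (the filesystem name here is a stand-in): list just the names that 'zfs userspace' and 'zfs groupspace' report and flag any that are purely numeric, since an ID that can't be mapped back to a user or group normally shows up as a bare number.

# purely numeric 'names' usually mean the UID or GID no longer resolves
zfs userspace -H -o name tank/fs | egrep '^[0-9]+$'
zfs groupspace -H -o name tank/fs | egrep '^[0-9]+$'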

Since our filesystems don't normally have any unowned files on them, this turns into a massive win. In the usual case we won't scan any filesystems on a fileserver, and even if we do scan some we'll generally only scan a handful. It may even make this particular process fast enough so that we can just run it after deleting accounts, instead of waiting for the weekend.

By the way, the presence of unknown UIDs or GIDs in the output of 'zfs *space' doesn't mean that there definitely are files that a find will pick up. The unowned files could be only in a snapshot, or they could be deleted files that are being held open by various things, including the NFS lock manager.

ZFSOptimizeFindUnowned written at 01:10:49; Add Comment

2015-08-27

Some notes on using Solaris kstat(s) in a program

Solaris (and Illumos, OmniOS, etc) has for a long time had a 'kstat' system for systematically providing and exporting kernel statistics to user programs. Like many such systems in many OSes, kstat doesn't need root permissions; all or almost all of the kstats are public and can be read by anyone. If you're a normal Solaris sysadmin, you've mostly interacted with this system via kstat(1) (as I have) or perhaps Perl, for which there is the Sun::Solaris::Kstat module. Due to me not wanting to write Perl, I opted to do it the hard way; I wrote a program that talks more or less directly to the C kstat library. When you do this, you are directly exposed to some kstat concepts that kstat(1) and the Perl bindings normally hide from you.

The stats that kstat shows you are normally given a four element name of the form module:instance:name:statistic. This is actually kind of a lie. A 'kstat' itself is the module:instance:name triplet, and is a handle for a bundle of related statistics (for example, all of the per-link network statistics exposed by a particular network interface). When you work at the C library level, getting a statistic is a three level process; you get a handle to the kstat, you make sure the kstat has loaded the data for its statistics, and then you can read out the actual statistic (how you do this depends on what type of kstat you have).
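
To make this three level process concrete, here is a minimal C sketch (built with -lkstat, with only minimal error handling); the particular kstat and statistic used (zfs:0:arcstats and 'size') are just convenient examples.

#include <stdio.h>
#include <kstat.h>

int
main(void)
{
    kstat_ctl_t *kc;
    kstat_t *ksp;
    kstat_named_t *kn;

    /* Open the kstat library; this loads the index of all kstats. */
    if ((kc = kstat_open()) == NULL) {
        perror("kstat_open");
        return 1;
    }

    /* Step 1: get a handle to the module:instance:name triplet. */
    if ((ksp = kstat_lookup(kc, "zfs", 0, "arcstats")) == NULL) {
        fprintf(stderr, "cannot find zfs:0:arcstats\n");
        return 1;
    }

    /* Step 2: copy this kstat's statistics data from the kernel. */
    if (kstat_read(kc, ksp, NULL) == -1) {
        perror("kstat_read");
        return 1;
    }

    /* Step 3: this is a named kstat, so look up one statistic in it. */
    kn = kstat_data_lookup(ksp, "size");
    if (kn != NULL)
        printf("ARC size: %llu bytes\n",
            (unsigned long long)kn->value.ui64);

    (void) kstat_close(kc);
    return 0;
}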

This arrangement makes sense to a system programmer, because if we peek behind the scenes we can see a two stage interaction with the kernel. When you call kstat_open() to start talking with the library, the library loads the index of all of the available kstats from the kernel into your process but it doesn't actually retrieve any data for them from the kernel. You only pay the relatively expensive cost of copying some kstat data from the kernel to user space when the user asks for it. Since there are a huge number of kstats, this cost saving is quite important.

(For example, a random OmniOS machine here has 3,627 kstats right now. How many you have will vary depending on how many network interfaces, disks, CPUs, ZFS pools, and so on there are in your system.)

A kstat can have its statistics data in several different forms. The most common form is a 'named' kstat, where the statistics are in a list of name=value structs. If you're dealing with this sort of kstat, you can look up a specific named statistic in the data with kstat_data_lookup() or just go through the whole list manually (it's fully introspectable). The next most common form is I/O statistics, where the data is simply a C struct (a kstat_io_t). There are also a few kstats that return other structs as 'raw data', but to find and understand them you get to read the kstat(1) source code. Only named-data kstats really have statistics with names; everyone else really just has struct fields.

(kstat(1) and the Perl module hide this complexity from you by basically pretending that everything is a named-data kstat.)

Once read from the kernel, kstat statistics data does not automatically update. Instead it's up to you to update it whenever you want to, by calling kstat_read() on the relevant kstat again. What happens to the kstat's old data is indeterminate, but I think that you should assume it's been freed and is no longer something you should try to look at.

This brings us to the issue of how the kstat library manages its memory and what bits of memory may change out from underneath you when, which is especially relevant if you're doing something complicated while talking to it (as I am). I believe the answer is that kstat_read() changes the statistics data for a kstat and may reallocate it, kstat_chain_update() may cause random kstat structs and their data to be freed out from underneath you, and kstat_close() obviously frees and destroys everything.

(The Perl Kstat module has relatively complicated handling of its shadow references to kstat structs after it updates the kstats chain. My overall reaction is 'there are dragons here' and I would not try to hold references to any old kstats after a kstat chain update. Restarting all your lookups from scratch is perhaps a bit less efficient, but it's sure to be safe.)

In general, once I slowly wrapped my mind around what the kstat library was doing I found it reasonably pleasant to use. As with the nvpair library, the hard part was understanding the fundamental ideas in operation. Part of this was (and is) a terminology issue; 'kstat' and 'kstats' are in common use as labels for what I'm calling a kstat statistic here, which makes it easy to get confused about what is what.

(I personally think that this is an unfortunate name choice, since 'kstat' is an extremely attractive name for the actual statistics. Life would be easier if kstats were called 'kstat bundles' or something.)

KStatProgrammingNotes written at 02:02:26; Add Comment

2015-08-07

One thing I now really want in ZFS: faster scrubs (and resilvers)

One of the little problems of ZFS is that scrubs and resilvers are random IO. I've always known this, but for a long time it hasn't really been important to us; things ran fast enough. As you might guess from me writing an entry about it, this situation is now changing.

We do periodic scrubs of each of our pools for safety, as everyone should; this has historically been reasonably important. Because of their impact on user IO, we only want to do them on weekends (and we only do one at a time on each fileserver). For a long time this was fine, but recently a few of our largest pools have started taking longer and longer to scrub. There are now a couple of pools that basically take the entire weekend to scrub, and that's if we're lucky. In fact the latest scrub of one these pools took three and a half days (and was only completed because Monday was a holiday and then no one complained on Tuesday).

This pool is not hulkingly huge; it's 2.91 TB of allocated space spread across eight pairs of drives. But at that scrub rate it seems pretty clear that we're being killed by random IO; the aggregate scrub data rate was down in the range of a puny 10 Mbytes/sec. Yes, the pool did see activity over the weekend, but not that much activity.

(This pool seems to be an outlier in terms of scrub time. Another pool on the same fileserver with 2.53 TB used across seven pairs of drives took only 27 hours to be scrubbed during its last check.)

One of the ZFS improvements that came with Solaris 11 is sequential resilvering (via), which apparently significantly speeds up resilvering. It's not clear to me if this also speeds up scrubbing, but I'd optimistically hope so. Of course this is only in Solaris 11; I don't think anyone in the Illumos community is currently working on this, and I imagine it's a non-trivial change that would take a decent amount of development effort. Still, I can hope. Faster scrubs are not yet a killer feature for us (we have a few tricks left up our sleeves), but they would be a big improvement for us.

(Faster resilvers by themselves would also be useful, but we fortunately do far fewer resilvers than we do scrubs.)

ZFSFasterScrubsDesire written at 02:32:18; Add Comment

2015-07-20

The OmniOS kernel can hold major amounts of unused memory for a long time

The Illumos kernel (which means the kernels of OmniOS, SmartOS, and so on) has an oversight which can cause it to hold down a potentially large amount of unused memory in unproductive ways. We discovered this on our most heavily used NFS fileserver; on a server with 128 GB of RAM, over 70 GB of RAM was being held down by the kernel and left idle for an extended time. As you can imagine, this didn't help the ZFS ARC size, which got choked down to 20 GB or so.

The problem is in kmem, the kernel's general memory allocator. Kmem is what is called a slab allocator, which means that it divides kernel memory up into a bunch of arenas for different-sized objects. Like basically all sophisticated allocators, kmem works hard to optimize allocation and deallocation; for instance, it keeps a per-CPU cache of recently freed objects so that in the likely case that you need an object again you can just grab it in a basically lock free way. As part of these optimizations, kmem keeps a cache of fully empty slabs (ones that have no objects allocated out of them) that have been freed up; this means that it can avoid an expensive trip to the kernel page allocator when you next want some more objects from a particular arena.

The problem is that kmem does not bound the size of this cache of fully empty slabs and does not age slabs out of it. As a result, a temporary usage surge can leave a particular arena with a lot of unused objects and slab memory, especially if the objects in question are large. In our case, this happened to the arena for 'generic 128 KB allocations'; we spent a long time with around six in use but 613,033 allocated. Presumably at one time we needed that ~74 GB of 128 KB buffers (probably because of a NFS overload situation), but we certainly didn't any more.

Kmem can be made to free up these unused slabs, but in order to do so you must put the system under strong memory pressure by abruptly allocating enough memory to run the system basically out of what it thinks of as 'free memory'. In our experiments it was important to do this in one fast action; otherwise the system frees up memory through less abrupt methods and doesn't resort to what it considers extreme measures. The simplest way to do this is with Python; look at what 'top' reports as 'free mem' and then use up a bit more than that in one go.
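
(For example, something like this, with the amount adjusted to a bit more than your current 'free mem'; the 30 GB here is made up, and it assumes a 64-bit Python:)

# grab the memory in one go, then hold it until Enter is pressed
python -c 'x = "a" * (30 * 1024 * 1024 * 1024); raw_input("allocated; press Enter: ")'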

(You can verify that the full freeing has triggered by using dtrace to look for calls to kmem_reap.)
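(A sketch of that, assuming dtrace's fbt provider can see kmem_reap on your system:)

# dtrace -n 'fbt::kmem_reap:entry { printf("kmem_reap at %Y\n", walltimestamp); }'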

Unfortunately triggering this panic freeing of memory will likely cause your system to stall significantly. When we did it on our production fileserver we saw NFS stall for a significant amount of time, ssh sessions stop for somewhat less time, and for a while the system wasn't even responding to pings. If you have this problem and can't tolerate your system going away for five or ten minutes until things fully recover, well, you're going to need a downtime (and at that point you might as well reboot the machine).

The simple sign that your system may need this is a persistently high 'Kernel' memory use in mdb -k's ::memstat but a low ZFS ARC size. We saw 95% or so Kernel but ARC sizes on the order of 20 GB and of course the Kernel amount never shrunk. The more complex sign is to look for caches in mdb's ::kmastat that have outsized space usage and a drastic mismatch between buffers in use and buffers allocated.
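
A sketch of both checks (output omitted here):

# echo ::memstat | mdb -k
# echo ::kmastat | mdb -k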

(Note that arenas for small buffers may be suffering from fragmentation instead of or in addition to this.)

I think that this isn't likely to happen on systems where you have user level programs with fluctuating overall memory usage, because sooner or later just the natural fluctuation of user level programs is likely to push the system to do this panic freeing of memory. And if you use a lot of memory at the user level, well, that limits how much memory the kernel can ever use, so you're probably less likely to get into this situation. Our NFS fileservers are kind of a worst case for this because they have almost nothing running at the user level and certainly nothing that abruptly wants several gigabytes of memory at once.

People who want more technical detail on this can see the illumos developer mailing list thread. Now that it's been raised to the developers, this issue is likely to be fixed at some point but I don't know when. Changes to kernel memory allocators rarely happen very fast.

KernelMemoryHolding written at 01:55:20; Add Comment

2015-07-15

Mdb is so close to being a great tool for introspecting the kernel

The mdb debugger is the standard debugger on Solaris and Illumos systems (including OmniOS). One very important aspect of mdb is that it has a lot of support for kernel 'debugging', which for ordinary people actually means 'getting detailed status information out of the kernel'. For instance, if you want to know a great deal about where your kernel memory is going you're going to want the '::kmastat' mdb command.

Mdb is capable of some very powerful tricks because it lets you compose its commands together in 'pipelines'. Mdb has a large selection of things to report information (like the aforementioned ::kmastat) and things to let you build your own pipelines (eg walkers and ::print). All of this is great, and far better than what most other systems have. Where mdb sadly falls down is that this is all it has; it has no scripting or programming language. This puts an unfortunate hard upper bound on what you can extract from the kernel via mdb without a huge amount of post-processing on your part. For instance, as far as I know a pipeline can't have conditions or filtering so that you further process only selected items that one stage of a pipeline produces. In the case of listing file locks, you're out of luck if you want to work on only selected files instead of all of them.
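
(As a concrete example of the sort of pipeline composition I mean, a classic one is dumping the kernel stack of every thread of a particular process:)

> ::pgrep sshd | ::walk thread | ::findstack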

I understand (I think) where this limitation comes from. Part of it is probably simply the era mdb was written in (which was not yet a time when people shoved extension languages into everything that moved), and part of it is likely that the code of mdb is also much of the code of the embedded kernel debugger kmdb. But from my perspective it's also a big missed opportunity. A mdb with scripting would let you filter pipelines and write your own powerful information dumping and object traversal commands, significantly extending the scope of what you could conveniently extract from the kernel. And the presence of pipelines in mdb show that its creators were quite aware of the power of flexibly processing and recombining things in a debugger.

(Custom scripting also has obvious uses for debugging user level programs, where a complex program may be full of its own idioms and data structures that cry out for the equivalent of kernel dcmds and walkers.)

PS: Technically you can extend mdb by writing new mdb modules in C, since they're just .sos that are loaded dynamically; there's even a more or less documented module API. In practice my reaction is 'good luck with that'.

MdbScriptingWish written at 02:50:04; Add Comment

