'Scanned' versus 'issued' numbers for ZFS scrubs (and resilvers)
Sufficiently recent versions of ZFS have new 'zpool status' output during scrubs and resilvers. The traditional old output looks like:
  scan: scrub in progress since Sat Feb 9 18:30:40 2019
        125G scanned out of 1.74T at 1.34G/s, 0h20m to go
        0B repaired, 7.02% done
(As you can probably tell from the IO rate, this is a SSD-based pool.)
The new output adds an additional '<X> issued at <RATE>' note in the second line, and in fact you can get some very interesting output in it:
  scan: scrub in progress since Sat Feb 9 18:36:33 2019
        215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total
        0B repaired, 12.80% done, 0 days 00:10:54 to go
Or (with just the important line):
271G scanned at 910M/s, 14.5G issued at 48.6M/s, 271G total
In both cases, this claims to have 'scanned' the entire pool but has only 'issued' a much smaller amount of IO. As it turns out, this is a glaring clue as to what is going on, which is that these are the new sequential scrubs in action. Sequential scrubs (and resilvers) split the non-sequential process of scanning the pool into two sides, scanning through metadata to figure out what IOs to issue and then, separately, issuing the IOs after they have been sorted into order (I am pulling this from this presentation, via). A longer discussion of this is in the comment at the start of ZFS on Linux's dsl_scan.c.
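As a toy illustration of the two-phase split (this is a sketch of the idea only, not ZFS's actual dsl_scan.c code):

```python
def scan_phase(block_pointers):
    """Phase 1: walk metadata (here just a list) and record where the
    data blocks live. This is what the 'scanned' figure tracks."""
    discovered = []
    for offset, size in block_pointers:
        discovered.append((offset, size))
    return discovered

def issue_phase(discovered):
    """Phase 2: sort the discovered blocks by disk offset and issue the
    reads in that order. This is what the 'issued' figure tracks."""
    return sorted(discovered)

# Blocks are reached in essentially arbitrary (metadata-walk) order:
blocks = [(7, 128), (1, 128), (4, 128), (2, 128)]
scanned = scan_phase(blocks)
issued = issue_phase(scanned)
```

The point of the sort is that the phase 2 reads are sequential on disk even though the metadata walk in phase 1 was not, which is why 'scanned' can run far ahead of 'issued'.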
This split is what the new 'issued' number is telling you about.
In sequential scrubs and resilvers, 'scanned' is how much metadata
and data ZFS has been able to consider and queue up IO for, while
'issued' is how much IO has been actively queued to vdevs. Note that it is not physical IO; instead it is progress through what 'zpool list' reports as ALLOC space, as covered in my entry on ZFS scrub rates and speeds.
(All of these pools I'm showing output from use mirrored vdevs, so the actual physical IO is twice the 'issued' figures.)
As we can see from these examples, it is possible for ZFS to completely 'scan' your pool before issuing much IO. This is generally going to require that your pool is relatively small and also that you have a reasonable amount of memory, because ZFS limits how much memory it will use for all of those lists of not yet issued IOs that it is sorting into order. Once your pool is fully scanned, the reported scan rate will steadily decay, because it's computed based on the total time the scrub or resilver has been running, not the amount of time that ZFS took to hit 100% scanned.
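The decay follows from simple arithmetic (my assumption about how the reported rate is computed, based on the observed behavior above):

```python
def reported_scan_rate(total_scanned_bytes, elapsed_seconds):
    """The rate 'zpool status' appears to show: total bytes scanned
    divided by total elapsed scrub time, not an instantaneous rate."""
    return total_scanned_bytes / elapsed_seconds

GiB = 1024 ** 3

# A pool fully scanned (271 GiB) after 5 minutes:
rate_at_5min = reported_scan_rate(271 * GiB, 5 * 60)

# Ten minutes later 'scanned' can't grow any further, so the reported
# rate has decayed to a third of what it was:
rate_at_15min = reported_scan_rate(271 * GiB, 15 * 60)
```

Once the pool is 100% scanned, the numerator is fixed while the denominator keeps growing, so the displayed scan rate steadily drops for the rest of the scrub.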
(In the current ZFS on Linux code, this memory limit appears to be a per-pool one. On the one hand this means that you can scan several pools at once without one pool limiting the others. On the other hand, this means that scanning multiple pools at once may use more memory than you're expecting.)
Sequential scrubs and resilvers are in FreeBSD 12 and will appear in ZFS on Linux 0.8.0 whenever that is released (ZoL is currently at 0.8.0-rc3). It doesn't seem to be in Illumos yet, somewhat to my surprise.
A bit of Sun's history that still lingers on in Illumos
The uname command (and system call) exist to give you various information about the machine you're on. For example, what Unix it runs, which is handy if you have scripts (or programs) that are run on multiple Unixes where you want to do different things depending on which one you're on.
The result from 'uname -s', the name of the operating system, is pretty straightforward (unlike some of the other uname options; go ahead, try to guess what 'uname -i' is going to give you on a random Unix). On FreeBSD you get FreeBSD, on OpenBSD you get OpenBSD, on Linux you get Linux or, if you insist with 'uname -o', GNU/Linux. On OmniOS and in fact any Illumos system, well:
  $ uname -s
  SunOS
Once upon a time there was Sun Microsystems, who made some of the first Unix workstations. Their Unix was a version of BSD Unix, and like basically every early Unix company they couldn't actually call it 'Unix' for various reasons. So they called it SunOS, and it had a storied history that is too long to cover here (especially SunOS 3.x and 4.x). It of course identified itself as 'SunOS' in various things, because that was its name.
In the early 1990s, Sun changed the name of their Unix from SunOS
to Solaris at the same time as they replaced the code base with one
based on System V Release 4 (which they had
had a hand in creating). Okay, officially 'SunOS 5' was there as a
component of this Solaris thing, but good luck finding much mention
of that or very many people who considered 'SunOS 5' to be a
continuation of SunOS. However, 'uname -s' (still) reported 'SunOS', possibly because of that marketing decision.
(I'm not sure if SunOS 3 or SunOS 4 had a uname command, since it came from System V. By the way, this history of 'SunOS 5' being the base component of Solaris is probably why 'uname -r' reports the release of Illumos as '5.11' instead of just '11'.)
Once the early versions of Solaris reported themselves to be 'SunOS', Sun was stuck with it in the name of backward compatibility.
Scripts and programs that wanted to check for Solaris knew to check
for an OS name of SunOS and then a 'uname -r' of 5.* (and
as SunOS 4 faded away people stopped bothering with the second
check); changing the reported operating system name would break
them all. No one was going to do that, especially not Sun.
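The check that those scripts did can be sketched in a few lines. Python's platform.system() and platform.release() report the same values as 'uname -s' and 'uname -r', so (as an illustration, with the values passed in explicitly for testing):

```python
import platform

def is_solaris_family(system=None, release=None):
    """The classic Solaris check: 'uname -s' is SunOS and 'uname -r' is
    5.* -- the second test is what distinguished Solaris from SunOS 4,
    and people dropped it as SunOS 4 faded away."""
    if system is None:
        system = platform.system()    # what 'uname -s' reports
    if release is None:
        release = platform.release()  # what 'uname -r' reports
    return system == "SunOS" and release.startswith("5.")

# Illumos still identifies itself as SunOS 5.11:
print(is_solaris_family("SunOS", "5.11"))   # True
print(is_solaris_family("Linux", "4.15.0")) # False
```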
When OpenSolaris spawned from Solaris, of course the 'uname -s' output had to stay the same. When OpenSolaris became Illumos, the
same thing was true. And so today, our OmniOS machines cheerfully
tell us that they're running SunOS, an operating system name that
is now more than 30 years old. It's a last lingering trace of a
company that changed the world.
(In Illumos, this is hard-coded in uts/common/os/vers.c.)
(I was reminded of all of this recently as I was changing one of
our fileserver management scripts so that it would refuse to run
on anything except our OmniOS fileservers. Checking 'uname -s' is really the correct way to do this, which caused me to actually
run it on our OmniOS machines for the first time in a while.)
The potential risk to ZFS created by the shift in its userbase
The obvious conclusion we can draw from FreeBSD ZFS's shift to being based on ZFS on Linux is that the center of gravity of open source ZFS development has shifted to ZFS on Linux. FreeBSD ZFS is shifting its upstream because ZFS on Linux is increasingly where new development happens. A more imprecise and deeper conclusion is that in some sense, the ZFS userbase as a whole is increasingly shifting to Linux.
(It may be that the total numbers of people using ZFS on FreeBSD and Illumos is higher than the people using ZFS on Linux. But if so, the users of FreeBSD and Illumos ZFS don't seem to drive development in the way that happens with ZFS on Linux. There are many possible explanations for this, of course, because there are many factors involved.)
Unfortunately, I think that this shift creates risk due to a consequence of ZFS on Linux not being GPL-compatible, which is that working with the Linux kernel's ever-changing API could become sufficiently much of a problem that ZFS on Linux development could end up coming to a halt. If ZFS as a whole is increasingly dependent on ZFS on Linux development and ZFS on Linux development is partly a hostage to the not particularly friendly views of Linux kernel developers, a problem there affects not just ZFS on Linux but everyone.
At least some of the users of ZFS on Linux are OS-agnostic and are using Linux because they feel it is their best overall choice (that's certainly our view for our new fileservers). If ZoL ground to a halt and stopped being viable, they'd switch over to other OSes (FreeBSD for us). But the question there is whether they'd bring the development resources with them, so that people would do and fund more ZFS development on non-Linux platforms, or if some of the work that is currently being done to ZoL would basically just evaporate.
(One of the things I wonder about is if ZFS on Linux's existence outside the kernel has helped drive development of it. If you're interested in a new ZoL feature, you can develop or fund it on your own schedule, then probably deploy it even on old Linux kernels such as those found in the various 'long term support' Linuxes (because ZoL as a whole is careful to support them). In FreeBSD and Illumos, I think you're much more locked to the overall kernel version and thus the OS version; if you fund a new ZFS feature, you could wind up needing to deploy a whole new OS release to get what you want. If nothing else, this is a longer deployment cycle.)
Given this risk, I definitely hope that people keep doing ZFS work (and funding it) outside of ZFS on Linux. But unfortunately I'm not sure how likely that is. It's quite possible that there are plenty of other people like us, making a pragmatic choice that ZoL is currently good enough and means we can use Linux instead of having to build out and maintain another OS.
(Illumos is not viable for us today for various reasons. FreeBSD might have been, but we didn't try to evaluate it since Linux worked well enough and we already run lots of other Linux machines.)
Some things on ZFS's per-pool performance statistics
Somewhat to my surprise, I recently found out that ZFS has had basic
per-pool activity and performance statistics for a while (they're
old enough that they're in our version of OmniOS, which is not
exactly current these days). On sufficiently modern versions of ZFS
(currently only the development version of ZFS on Linux), you can even get a small subset of these
per-pool stats for each separate dataset, which may be useful for
tracking activity. To be clear, these are not the stats that are made visible through 'zpool iostat'; these are a separate set of stats that are visible through (k)stats, and which can be picked up and tracked by at least some performance metrics systems on some platforms.
(Specifically, I know that Prometheus's host agent can collect them from
ZFS on Linux. In theory you could add
collecting them on OmniOS to the agent, perhaps using my Go kstat
package, but someone would
have to go to that programming work and I don't know if the Prometheus
people would accept it. I haven't looked at other metrics host
agents to see if they can collect this information on OmniOS or
other Illumos systems. I suspect that no host agent can collect the more detailed 'zpool iostat' statistics because as far as I know there's no documented API to obtain them programmatically.)
The best description of the available stats is in the kstat(3kstat) manpage, in the description of 'I/O Statistics' (for ZFS on Linux, you want to see here). Most of the stats are relatively obvious, but there are two important things to note for them. First, as far as I know (and can tell), these are for direct IO to disk, not for user level IO. This shows up in particular for writes if you have any level of vdev redundancy. Since we use mirrors, the amount written is basically twice the user level write rate; if a user process is writing at 100 MB/sec, we see 200 MB/sec of writes. This is honest but a little bit confusing for keeping track of user activity.
The second is that the stats do not distinguish between read and write IO, but between 'run' and 'wait' IO. This is best explained in a comment from the Illumos kstat manpage:
A large number of I/O subsystems have at least two basic "lists" of transactions they manage: one for transactions that have been accepted for processing but for which processing has yet to begin, and one for transactions which are actively being processed (but not done). For this reason, two cumulative time statistics are defined here: pre-service (wait) time, and service (run) time.
I don't know enough about the ZFS IO queue code to be able to tell you if a ZFS IO being in the 'run' state only happens when it's actively submitted to the disk (or other device), or if ZFS has some internal division. The ZFS code does appear to consider IO 'active' at the same point as it makes it 'running', and based on things I've read about the ZIO scheduler I think this is probably at least close to 'it was issued to the device'.
(On Linux, 'issued to the device' really means 'put in the block IO system'. This may or may not result in it being immediately issued to the device, depending on various factors, including how much IO you're allowing ZFS to push to the device.)
If you have only moderate activity, it's routine to have little or no 'wait' (w) activity or time, with most of the overall time for request handling being in the 'run' time. You will see 'wait' time (and queue sizes) rise as your ZFS pool IO level rises, but before then you can have an interesting and pretty pattern where your average 'run' time is a couple of milliseconds or higher but your average 'wait' time is in the microseconds.
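The two-list accounting the manpage describes can be sketched as follows. This is a minimal model of the bookkeeping, not ZFS's actual code; the point is just that each IO contributes to the 'wait' totals while queued and to the 'run' totals while being serviced:

```python
class IOQueueStats:
    """Cumulative wait (pre-service) and run (service) time, in the
    style of the two kstat 'lists' of transactions."""
    def __init__(self):
        self.wait_time = 0.0  # total time IOs spent queued, waiting
        self.run_time = 0.0   # total time IOs spent actively serviced

    def record(self, queued_at, started_at, done_at):
        # wait: from acceptance to start of processing
        self.wait_time += started_at - queued_at
        # run: from start of processing to completion
        self.run_time += done_at - started_at

stats = IOQueueStats()
# A lightly loaded system: an IO queued at t=0.0 is dispatched almost
# immediately (t=0.0001s) but takes 2ms of service time (done t=0.0021s),
# giving microseconds of 'wait' time but milliseconds of 'run' time.
stats.record(0.0, 0.0001, 0.0021)
```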
In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse and wuse fields, ie no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there's ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
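Given that mapping, you can derive iostat-style figures from two samples of a pool's IO kstats. A sketch, assuming the kstat io field names (rtime/rlentime for the 'run' side) and nanosecond units as on Illumos and ZFS on Linux:

```python
NS = 1e9  # kstat io times are in nanoseconds

def utilization(rtime_delta_ns, elapsed_s):
    """Fraction of the interval the 'run' queue was non-empty,
    the analogue of deriving %util from the Linux 'use' stat."""
    return (rtime_delta_ns / NS) / elapsed_s

def avg_queue_len(rlentime_delta_ns, elapsed_s):
    """Average number of 'run' IOs in flight over the interval,
    the analogue of deriving it from the Linux 'aveq' field."""
    return (rlentime_delta_ns / NS) / elapsed_s

# Over a 10 second interval the run queue was busy for 5e9 ns and
# accumulated 2e10 ns of queue-length-weighted time:
util = utilization(5e9, 10.0)     # 0.5: busy 50% of the time
qlen = avg_queue_len(2e10, 10.0)  # 2.0: two IOs in flight on average
```

The same two functions work for the 'wait' side (wtime/wlentime) if you want the queued-but-not-yet-serviced picture instead.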
For ZFS datasets, you currently get only
nwritten. For datasets, the writes appear to be
user-level writes, not low level disk IO writes, so they will track
closely with the amount of data written at the user level (or at
the level of an NFS server). As I write this here in early 2019,
these per-dataset stats aren't in any released version of even ZFS
on Linux, but I expect to see them start showing up in various
places (such as FreeBSD) before too many
years go by.
PS: I regret not knowing that these stats existed some time ago,
because I probably would have hacked together something to look at
them on our OmniOS machines, even though
we never used '
zpool iostat' very much for troubleshooting for
various reasons. In general, if you have multiple ZFS pools it's always useful to be able to see what the active things are at the moment.
Some notes on ZFS prefetch related stats
For reasons beyond the scope of this entry, I've recently been looking at ARC stats, in part using a handy analysis program by Richard Elling. This has gotten me looking again at ZFS (k)stats related to prefetching, which I touched on before in my entry on some basic ZFS ARC statistics and prefetching. So here are some notes on what I think these mean or might mean.
To be able to successfully prefetch at all, ZFS needs to recognize and predict your access pattern. The extent to which it can do this is visible in the ZFS DMU zfetchstats kstats; zfetchstats:hits is the number of reads that matched a prediction stream, while zfetchstats:misses is the number of reads that did not match one. If zfetchstats:hits is low, there are two possible reasons; you could have a mostly random IO pattern, or you could have too many different sequential streams reading from the same file(s) at once. In theory there is a kstat that counts 'you had too many streams for this file and I couldn't create a new one', zfetchstats:max_streams. In practice this seems to be useless and you can't really tell these cases apart, because as far as I can tell even random access to files creates ZFS prefetch streams.
Every file can have at most
zfetch_max_streams streams (default
8), and even streams that have never matched any reads aren't removed for zfetch_min_sec_reap seconds (default 2). So when you start
doing random reads to a new file, as far as I can tell your first
8 random reads will immediately create 8 DMU prefetch streams and
then every read after that will still try to create a new one but
fail because you've hit the maximum stream count for the file. Since
the streams are maxed out, each new random read will increment both
zfetchstats:misses (since it doesn't match any existing stream) and
zfetchstats:max_streams (since the file has 8 streams). Every
two seconds, your current streams expire and you get 8 new ones
from the next 8 random reads you do.
(This theory matches the numbers I see when I produce a flood of
random reads to a large file with
ioping. Our ZFS fileservers do
show a slowly growing difference between the two stats.)
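My theory above can be expressed as a toy model. This is my reading of the described behavior, not the actual dmu_zfetch.c code (and it ignores the two-second reaping), but it shows why a purely random read pattern bumps both zfetchstats:misses and zfetchstats:max_streams:

```python
ZFETCH_MAX_STREAMS = 8  # zfetch_max_streams default

class FileZfetch:
    """Per-file prefetch stream bookkeeping, heavily simplified."""
    def __init__(self):
        self.streams = []   # each stream is just its predicted next offset
        self.hits = 0
        self.misses = 0
        self.max_streams = 0

    def read(self, offset):
        if offset in self.streams:
            # The read matched a stream's prediction: a zfetchstats hit.
            self.hits += 1
            self.streams[self.streams.index(offset)] = offset + 1
            return
        self.misses += 1  # no stream matched this read
        if len(self.streams) >= ZFETCH_MAX_STREAMS:
            self.max_streams += 1  # file already has 8 streams; can't add
        else:
            self.streams.append(offset + 1)  # new stream predicts next block

f = FileZfetch()
# Ten random, non-sequential reads to a fresh file:
for off in [100, 7, 5000, 42, 900, 12, 77, 300, 8000, 654]:
    f.read(off)
```

After these ten reads the model has 10 misses and 0 hits, the first 8 reads each created a stream, and the last 2 reads each incremented max_streams because the per-file stream limit was already hit.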
As discussed in my previous entry, the ARC 'prefetch hits' statistics count only how many prefetch reads were found in the ARC instead of needing to be read from disk. A high prefetch ARC hit rate means that you're doing sequential reads of files that are already in the ARC and staying there (either because you've read them before or because you recently wrote them). A low prefetch ARC hit rate means that this isn't happening, but there are multiple reasons for this. Obviously, one cause is that your sequential re-reads are collectively too large for your ARC and so at least some of them are being evicted before you re-read them. Another cause is that you're mostly not re-reading things, at least not very soon; most of the time you read a file once and then move on.
If you know or believe that your workload should be in the ARC, a low ARC prefetch hit rate or more exactly a high ARC prefetch miss count is a sign that something is off, since it means that your prefetch reads are not finding things in the ARC that you expect to be there. A low ARC prefetch hit rate is not necessarily otherwise a problem.
I believe that there are situations where you will naturally get a
low ARC prefetch hit rate. For example, if you perform a full backup
of a number of ZFS filesystems with
tar, I would expect a lot of
ARC prefetch misses, since it's unlikely that you can fit all the
data from all of your filesystems into ARC. And this is in fact the
pattern we see on our ZFS fileservers during our Amanda backups.
On the other hand, you should see a lot of ARC demand data hits,
since prefetching itself should be very successful (and this is
also the pattern we see).
FreeBSD ZFS will be changing to be based on ZFS on Linux
The big news today is contained in an email thread called The future of ZFS in FreeBSD (via any number of places), where the crucial summary is:
[...] This state of affairs has led to a general agreement among the stakeholders that I have spoken to that it makes sense to rebase FreeBSD's ZFS on [ZFS on Linux]. [...]
This is not a development that I expected, to put it one way. Allan Jude had some commentary on OpenZFS in light of this that's worth reading for interested parties. There's also a discussion of this with useful information on lobste.rs, including this comment from Joshua M. Clulow mentioning that Illumos developers are likely to be working to upstream desirable ZFS on Linux changes.
(I had also not known that Delphix is apparently moving away from Illumos.)
I've used ZFS on Linux for a fairly long time now, and we settled on it for our next generation of fileservers, but all during that time I was under the vague impression that ZFS on Linux was the red-headed stepchild of ZFS implementations. Finding out that it's actually a significant source of active development is a bit of a surprise. On the one hand, it's a pleasant surprise, and it definitely helps me feel happy about our choice of ZFS on Linux. On the other hand it makes me feel a bit sad about Illumos, because upstream ZFS development in Illumos now appears to be yet another thing that is getting run over by the Linux juggernaut. Even for Linux people, that is not entirely good news.
(And, well, we're part of that Linux juggernaut, since we chose ZFS on Linux instead of OmniOS CE. For good reasons, sadly.)
On the other hand, if this leads to more feature commonality and features becoming available sooner in FreeBSD, Illumos, and ZFS on Linux, it will be a definite win. Even if it simply causes everyone to wind up with an up to date test suite with shared tests for common features that run regularly, that's probably a general win (and certainly a good thing).
(One thing that it may do in the short term is propagate FreeBSD's
support for ZFS sending
TRIM commands to disks into ZFS on Linux,
which would be useful for us in the long run.)
PS: The fundamental problem with ZFS on Linux remains that ZFS on Linux will almost certainly never be included in the Linux kernel. This means that it will always be a secondary filesystem on Linux. As a result I think it's not really a great thing if ZFS on Linux becomes the major source of ZFS development; the whole situation with Linux and ZFS is always going to be somewhat precarious, and things could always go wrong. It's simply healthier for ZFS overall if significant ZFS development continues on a Unix where ZFS is a first class citizen, ideally fully integrated into the kernel and fully supported.
Some basic ZFS ARC statistics and prefetching
I've recently been trying to understand some basic ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?
(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)
If you read the arcstat documentation or look at the raw ARC kstats, you'll very soon notice things like the 'prefetch hits percentage' arcstat field and a prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different than what you might expect.
The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.
(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)
All of this gives us four combinations of reads; demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:
- The ARC demand hit rate (for data and metadata together) is probably
the most important thing for whether the ARC is giving you good
results, although this partly depends on the absolute volume of
demand reads and a few other things. Demand misses mean that
programs are waiting on disk IO.
- The breakdown of *_miss kstats will tell you why ZFS is
reading things from disk. You would generally like this to
be prefetch reads instead of demand reads, because at least
things aren't waiting on prefetch reads.
- The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.
One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.
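The four-way breakdown and the demand hit rate can be computed directly from the arcstats counters (on ZFS on Linux they're in /proc/spl/kstat/zfs/arcstats). A sketch with made-up numbers for illustration:

```python
# Hypothetical arcstats counter values, in the four read categories:
arcstats = {
    "demand_data_hits": 9_000,    "demand_data_misses": 500,
    "demand_metadata_hits": 3_000, "demand_metadata_misses": 100,
    "prefetch_data_hits": 700,     "prefetch_data_misses": 2_000,
    "prefetch_metadata_hits": 300, "prefetch_metadata_misses": 400,
}

def hit_rate(kind):
    """Hit rate for one of the four read categories."""
    hits = arcstats[f"{kind}_hits"]
    misses = arcstats[f"{kind}_misses"]
    return hits / (hits + misses)

# The demand hit rate combines data and metadata, since demand misses
# are what programs actually wait on:
demand_hits = arcstats["demand_data_hits"] + arcstats["demand_metadata_hits"]
demand_misses = arcstats["demand_data_misses"] + arcstats["demand_metadata_misses"]
demand_hit_rate = demand_hits / (demand_hits + demand_misses)
```

With these illustrative numbers the demand hit rate is high (about 95%) while the prefetch data hit rate is low, which per the discussion above would mean prefetch reads mostly go to disk but programs rarely wait on them.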
Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:
- demand_hit_predictive_prefetch counts demand reads that found
data in the ARC from prefetch reads. This sounds exactly like what
we want, but experimentally it doesn't seem to come anywhere near
fully accounting for hits on prefetched data; I see low rates of
it when I am also seeing a 100% demand hit rate for sequentially
read data that was not previously in the ARC.
- sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.
My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.
If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.
If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.
One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.
(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)
Today I (re-)learned that top's output can be quietly system dependent
I'll start with a story that is the background. A few days ago I tweeted:
Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.
(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)
Try adding '-c email@example.com' to your SSH invocation.
See also: <pdf link>
(If you care about 10G+ SSH, you want to read that PDF.)
This made a huge difference, giving
me basically 1G wire speeds for my ZFS transfers. But that difference
made me scratch my head, because why was switching SSH ciphers
making a difference when
ssh wasn't CPU-limited in the first
place? I came up with various theories and guesses, until today I
had a sudden terrible suspicion. The result of testing and confirming
that suspicion was another tweet:
Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.
(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)
You see, the reason that I had confidently known that SSH wasn't CPU-limited on the sending machine, which was one of our OmniOS fileservers, is that I had run top and seen that the ssh process was only using 25% of the CPU. Case closed.
Except that OmniOS
top and Linux's
top report CPU usage percentages
differently. On Linux, CPU percentage is relative to a single CPU,
so 25% is a quarter of one CPU, 100% is all of it, and over 100%
is a multi-threaded program that is using up more than one CPU's
worth of CPU time. On OmniOS, the version of
top we're using comes
from pkgsrc (in what is by now a very
old version), and that version reports CPU percentage relative to
all CPUs in the machine. Our OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually 'all of a single CPU'. In other words,
I was completely wrong about the sending
ssh not being CPU-limited.
ssh was CPU limited after all, it's suddenly no surprise why
switching ciphers sped things up to basically wire speed.
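The conversion between the two conventions is trivial, which is part of why the difference is so easy to miss. As arithmetic:

```python
def pct_of_all_cpus_to_single_cpu(pct, ncpus):
    """Convert 'percent of all CPUs' (the old pkgsrc top convention on
    OmniOS) to the Linux top convention of 'percent of one CPU'."""
    return pct * ncpus

# '25%' in a whole-machine convention on a 4-CPU machine is one
# fully-busy CPU in the per-CPU convention:
print(pct_of_all_cpus_to_single_cpu(25.0, 4))  # 100.0
```

The same number on the screen can therefore mean 'a quarter of one CPU' on one system and 'one entire CPU, fully saturated' on another.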
(Years ago I established that the old SunSSH that OmniOS was using
back then was rather slow, but then later we
upgraded to OpenSSH and I sort of thought that
I could not worry about SSH speeds any more. Well, I was wrong. Of
course, nothing can beat not doing SSH at all, but using SSH also means that you can deliberately limit your transfer bandwidth to leave some room for things like NFS fileservice.)
PS: There are apparently more versions of top than you might think. On the FreeBSD 10.4 machine I have access to,
top reports CPU percentage in the
same way Linux does (100% is a single-threaded process using all
of one CPU). Although both the FreeBSD version and our OmniOS version
say they're the William LeFebvre implementation and have similar
version numbers, apparently they diverged significantly at some
point, probably when people had to start figuring out how to make
the original version of
top deal with multi-CPU machines.
Some views on more flexible (Illumos) kernel crash dumps
In my notes about Illumos kernel crash dumps, I mentioned that we've now turned them off on our OmniOS fileservers. One of the reasons for this is that we're running an unsupported version of OmniOS, including the kernel. But even if we were running the latest OmniOS CE and had commercial support, we'd do the same thing (at least by default, outside of special circumstances). The core problem is that our needs conflict with what Illumos crash dumps want to give us right now.
The current implementation of kernel crash dumps basically prioritizes
capturing complete information. There are various manifestations
of this in the implementation, starting with how it assumes that
if crash dumps are configured at all, you have set up enough disk
space to hold the full crash dump level you've set in
dumpadm, so it's sensible to not bother
checking if the dump will fit and treating failure to fit as an
unusual situation that is not worth doing much special about. Another
one is the missing feature that there is no overall time limit on
how long the crash dump will run, which is perfectly sensible if the
most important thing is to capture the crash dump for diagnosis.
But, well, the most important thing is not always to capture complete diagnostic information. Sometimes you need to get things back into service before too long, so what you really want is to capture as much information as possible while still returning to service in a certain amount of time. Sometimes you only have so much disk space available for crash dumps, and you would like to capture whatever information can fit in that disk space, and if not everything fits it would be nice if the most important things were definitely captured.
All of this makes me wish that Illumos kernel crash dumps wrote certain critical information immediately, at the start of the crash dump, and then progressively extended the information in the crash dump until they either ran out of space or ran out of time. What do I consider critical? My first approximation would be, in order, the kernel panic, the recent kernel messages, the kernel stack of the panicing kernel process, and the kernel stacks of all processes. Probably you'd also want anything recent from the kernel fault manager.
The current Illumos crash dump code does have an order for what gets written out, and it does put some of this stuff into the dump header, but as far as I can tell the dump header only gets written at the end. It's possible that you could create a version of this incremental dump approach by simply writing out the incomplete dump header every so often (appropriately marked with how it's incomplete). There's also a 'dump summary' that gets written at the end that appears to contain a bunch of this information; perhaps a preliminary copy could be written at the start of the dump, then overwritten at the end if the dump is complete. Generally what seems to take all the time (and space) with our dumps is the main page writing stuff, not a bunch of preliminary stuff, so I think Illumos could definitely write at least one chunk of useful information before it bogs down. And if this needs extra space in the dump device, I would gladly sacrifice a few megabytes to have such useful information always present.
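To make the idea concrete, here is a small illustrative sketch (in Python, not actual Illumos code; all of the names here are my own invention) of writing dump sections in priority order until either space or time runs out:

```python
import time

def progressive_dump(sections, write, space_left, deadline):
    # sections: (name, data) pairs ordered from most to least important,
    # e.g. panic message first, then recent kernel messages, then stacks.
    # Stop as soon as the next section won't fit or we're out of time.
    written = []
    for name, data in sections:
        if time.monotonic() > deadline or len(data) > space_left:
            break
        write(data)
        space_left -= len(data)
        written.append(name)
    return written
```

With this structure, even a dump that gets cut off early still contains the panic and the kernel stacks, which is exactly the property I'd like real crash dumps to have.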
(It appears that the Illumos kernel already keeps a lot of ZFS data memory out of kernel crash dumps, both for the ARC and for in-flight ZFS IO, so I'm not sure what memory the kernel is spending all of its time dumping in our case. Possibly we have a lot of ZFS metadata, which apparently does go into crash dumps. See the comments about crash dumps in abd.c. For the 'dump summary', see the dump_messages functions in dumpsubr.c.)
PS: All of this is sort of wishing from the sidelines, since our future is not with Illumos.
Some notes about kernel crash dumps in Illumos
On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).
We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).
The resulting Twitter discussion got me to take a look into the current state of the code for this in Illumos, and I wound up discovering some potentially interesting things. First off, dump settings are not auto-loaded or auto-saved by the kernel in some magical way; instead dumpadm saves all of your configuration in /etc/dumpadm.conf and then sets them during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read it. Next, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore writes compressed dumps; it has nothing to do with how the kernel writes out the crash dump to your configured crash device. In fact, dumpadm has no direct control over basically any of that process; if you want to change things about the kernel dump process, you need to set kernel variables through /etc/system (or on a live system with mdb).
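For example, the one such variable discussed below, dump_plat_mincpu, would be set with a fragment like this (an illustrative /etc/system fragment; the value here is made up):

```
* Illustrative /etc/system fragment. dump_plat_mincpu is a real kernel
* variable (see dumpsubr.c), but this value is just an example.
set dump_plat_mincpu = 16
```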
The kernel writes crash dumps in multiple steps. If your console shows the message 'dumping to <something>, offset NNN, contents: <...>', then you've at least reached the start of writing out the crash dump. If you see updates of the form 'dumping: MM:SS N% done', the kernel has reached the main writeout loop and is writing out pages of memory, perhaps excessively slowly. As far as I can tell from the code, crash dumps don't abort when they run out of space on the dump device; they keep processing things and just throw all of the work away.
As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.
(As a corollary of kernel panic dumps always compressing with at least lzjb, you probably should not have compression turned on on your dump zvol (which I believe is the default).)
I'm far from convinced that single threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might have problems even with modern fast hardware.
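As a rough back-of-the-envelope illustration (a Python sketch with made-up throughput numbers, since for obvious reasons I can't measure the real thing):

```python
def dump_time_minutes(ram_gib, dumped_fraction, compress_mib_per_s):
    # Rough estimate of crash dump time if single-threaded compression
    # is the bottleneck rather than disk write speed. All inputs are
    # guesses, not measured figures.
    data_mib = ram_gib * 1024 * dumped_fraction
    return data_mib / compress_mib_per_s / 60.0

# If single-threaded lzjb on a slow CPU manages, say, 150 MiB/s, then
# dumping half of 128 GiB of RAM takes over seven minutes of pure
# compression time, before any disk slowness enters the picture.
```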
If you examine the source of dumpsubr.c, you'll discover a tempting variable dump_timeout that's set to 120 (seconds) and described as 'timeout for dumping pages'. This comment is a little bit misleading, as usual; what it really means is 'timeout for dumping a single set of pages'. There is no limit on how long the kernel is willing to keep writing out pages for, provided that it makes enough progress within 120 seconds. In our case this is unfortunate, since we'd be willing to spend a few minutes to gather a bit of crash information but not anything like what a kernel dump appears to take on our machines.
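The distinction matters, so here is an illustrative Python sketch (not the kernel's actual logic) of the difference between a per-chunk timeout, which is what dump_timeout really is, and the overall time limit that I wish existed:

```python
import time

def write_dump(chunks, write, chunk_timeout=120.0, overall_timeout=None):
    # Per-chunk timeout: abort only if a single chunk stalls, which is
    # how Illumos's dump_timeout behaves. The overall_timeout parameter
    # is the hypothetical extra limit I'd like; Illumos has no equivalent.
    start = time.monotonic()
    for chunk in chunks:
        chunk_start = time.monotonic()
        write(chunk)
        if time.monotonic() - chunk_start > chunk_timeout:
            return "aborted: a chunk stalled"
        if overall_timeout is not None and time.monotonic() - start > overall_timeout:
            return "aborted: overall time limit"
    return "complete"
```

With only the per-chunk check, a dump that keeps making slow-but-steady progress can run for hours, which is exactly the behavior we see.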
(The good news is that if you run out of space on your dump device, the dump code is at least smart enough to not spend any more time trying to compress pages; it just throws them away right away. You might run out of space because you're taking a panic dump from a ZFS fileserver with 128 GB of RAM and putting it on an 8 GB dump zvol that is part of a rpool that lives on 80 GB SSDs, where a full-sized kernel dump almost certainly can't even be saved by compression.)
PS: To see that the default is still a single-threaded crash dump, you need to chase through the code to dumphdr.h and the various DUMP_PLAT_*_MINCPU definitions, all of which are set to 0. Due to how the code is structured, this disables multi-threaded dumps entirely.
Sidebar: The theoretical controls for multi-threaded dumps
If you set dump_plat_mincpu to something above 0, then if you have 'sufficiently more' CPUs than this, you will get parallel bzip2 compression; below that you will get parallel lzjb. Since parallel compression is disabled by default in Illumos, this may or may not actually still work, even if you don't run into any actual bugs of the sort that caused it to be disabled in the first place. Note that bzip2 is not fast.
The actual threshold of 'enough' depends on the claimed maximum transfer size of your disks. For dumping to zvols, it appears that this maximum transfer size is always 128 KB, which uses a code path where the breakpoint between parallel lzjb and parallel bzip2 is just dump_plat_mincpu; if you have that many CPUs or more, you get bzip2. This implies that you may want to set dump_plat_mincpu to a nice high number so that you get parallel lzjb all the time.
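Putting the selection logic together as I read it (a simplified Python sketch of my understanding of the 128 KB zvol code path, not the actual dumpsubr.c code):

```python
def dump_compression_mode(ncpus, dump_plat_mincpu):
    # dump_plat_mincpu == 0, the default via the DUMP_PLAT_*_MINCPU
    # definitions, disables multi-threaded dumps: single-threaded lzjb.
    if dump_plat_mincpu == 0:
        return "serial lzjb"
    # On the 128 KB maximum-transfer-size path used for zvols, having
    # dump_plat_mincpu or more CPUs selects parallel bzip2; fewer CPUs
    # selects parallel lzjb.
    if ncpus >= dump_plat_mincpu:
        return "parallel bzip2"
    return "parallel lzjb"
```

This is why setting dump_plat_mincpu to a number comfortably larger than your CPU count would, in theory, give you parallel lzjb all the time and never the much slower bzip2.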