Wandering Thoughts


'Scanned' versus 'issued' numbers for ZFS scrubs (and resilvers)

Sufficiently recent versions of ZFS have new 'zpool status' output during scrubs and resilvers. The traditional output looks like:

scan: scrub in progress since Sat Feb  9 18:30:40 2019
      125G scanned out of 1.74T at 1.34G/s, 0h20m to go
      0B repaired, 7.02% done

(As you can probably tell from the IO rate, this is a SSD-based pool.)

The new output adds an additional '<X> issued at <RATE>' note in the second line, and in fact you can get some very interesting output in it:

scan: scrub in progress since Sat Feb  9 18:36:33 2019
      215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total
      0B repaired, 12.80% done, 0 days 00:10:54 to go

Or (with just the important line):

      271G scanned at 910M/s, 14.5G issued at 48.6M/s, 271G total

In both cases, this claims to have 'scanned' the entire pool but has only 'issued' a much smaller amount of IO. As it turns out, this is a glaring clue as to what is going on, which is that these are the new sequential scrubs in action. Sequential scrubs (and resilvers) split the non-sequential process of scanning the pool into two sides, scanning through metadata to figure out what IOs to issue and then, separately, issuing the IOs after they have been sorted into order (I am pulling this from this presentation, via). A longer discussion of this is in the comment at the start of ZFS on Linux's dsl_scan.c.

This split is what the new 'issued' number is telling you about. In sequential scrubs and resilvers, 'scanned' is how much metadata and data ZFS has been able to consider and queue up IO for, while 'issued' is how much IO has been actively queued to vdevs. Note that it is not physical IO; instead it is progress through what 'zpool list' reports as ALLOC space, as covered in my entry on ZFS scrub rates and speeds.

(All of these pools I'm showing output from use mirrored vdevs, so the actual physical IO is twice the 'issued' figures.)

As we can see from these examples, it is possible for ZFS to completely 'scan' your pool before issuing much IO. This is generally going to require that your pool is relatively small and also that you have a reasonable amount of memory, because ZFS limits how much memory it will use for all of those lists of not yet issued IOs that it is sorting into order. Once your pool is fully scanned, the reported scan rate will steadily decay, because it's computed based on the total time the scrub or resilver has been running, not the amount of time that ZFS took to hit 100% scanned.

(In the current ZFS on Linux code, this memory limit appears to be a per-pool one. On the one hand this means that you can scan several pools at once without one pool limiting the others. On the other hand, this means that scanning multiple pools at once may use more memory than you're expecting.)
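One concrete way to see the relationship between the numbers is that the reported 'done' percentage tracks issued space against the total, not scanned space; 27.6G of 215G is 12.8%, matching the first example. Here is a small Python sketch that pulls these figures out of the status line; the parsing and unit handling are my own rough assumptions about the output format, not anything official:

```python
import re

# Rough multipliers for the unit suffixes that zpool status uses.
UNITS = {"B": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def parse_size(s):
    """Turn a string like '27.6G' into bytes."""
    m = re.match(r'([0-9.]+)([BKMGT])', s)
    num, unit = m.groups()
    return float(num) * UNITS[unit]

def scrub_progress(line):
    """Parse an 'X scanned at R, Y issued at R, Z total' status line
    and return the fraction of the pool's IO that has been issued."""
    m = re.search(r'([\d.]+[BKMGT]) scanned at [\d.]+[BKMGT]/s, '
                  r'([\d.]+[BKMGT]) issued at [\d.]+[BKMGT]/s, '
                  r'([\d.]+[BKMGT]) total', line)
    scanned, issued, total = (parse_size(x) for x in m.groups())
    return issued / total

line = "215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total"
print(round(scrub_progress(line) * 100, 1))  # → 12.8, matching '12.80% done'
```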

Sequential scrubs and resilvers are in FreeBSD 12 and will appear in ZFS on Linux 0.8.0 whenever that is released (ZoL is currently at 0.8.0-rc3). It doesn't seem to be in Illumos yet, somewhat to my surprise.

(This entry was sparked by reading this question to the BSD Now program, via BSD Now 281, which I stumbled over due to my Referer links.)

ZFSScrubScannedVsIssued written at 19:21:15; Add Comment


A bit of Sun's history that still lingers on in Illumos

The uname command (and system call) exists to give you various information about the machine you're on. For example, what Unix it runs, which is handy if you have scripts (or programs) that are run on multiple Unixes where you want to do different things.

(This is our situation with our two concurrent sets of fileservers, the OmniOS ones and the Linux ones.)

The result from 'uname -s', the name of the operating system, is pretty straightforward (unlike some of the other uname options; go ahead, try to guess what 'uname -i' is going to give you on a random Unix). On FreeBSD you get FreeBSD, on OpenBSD you get OpenBSD, on Linux you get Linux or, if you insist with 'uname -o', GNU/Linux. On OmniOS and in fact any Illumos system, well:

$ uname -s
SunOS

Wait, what?

Once upon a time there was Sun Microsystems, who made some of the first Unix workstations. Their Unix was a version of BSD Unix, and like basically every early Unix company they couldn't actually call it 'Unix' for various reasons. So they called it SunOS, and it had a storied history that is too long to cover here (especially SunOS 3.x and 4.x). It of course identified itself as 'SunOS' in various things, because that was its name.

In the early 1990s, Sun changed the name of their Unix from SunOS to Solaris at the same time as they replaced the code base with one based on System V Release 4 (which they had had a hand in creating). Okay, officially 'SunOS 5' was there as a component of this Solaris thing, but good luck finding much mention of that or very many people who considered 'SunOS 5' to be a continuation of SunOS. However, 'uname -s' (still) reported 'SunOS', possibly because of that marketing decision.

(I'm not sure if SunOS 3 or SunOS 4 had a uname command, since uname came from System V. By the way, this history of 'SunOS 5' being the base component of Solaris is probably why 'uname -r' reports the release of Illumos as '5.11' instead of '11'.)

Once the early versions of Solaris reported themselves to be 'SunOS', Sun was stuck with it in the name of backward compatibility. Scripts and programs that wanted to check for Solaris knew to check for an OS name of SunOS and then a 'uname -r' of 5.* (and as SunOS 4 faded away people stopped bothering with the second check); changing the reported operating system name would break them all. No one was going to do that, especially not Sun.
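The classic two-part check is simple enough to sketch; here it is as a pure Python function (the function name and the way you'd feed it `os.uname()` are my own illustration, not any standard API):

```python
def looks_like_solaris(sysname, release):
    """The classic check: an OS name of 'SunOS' plus a 5.x release
    means Solaris (or, these days, an Illumos distribution)."""
    return sysname == "SunOS" and release.startswith("5.")

# On a real system you would feed it the live values, e.g.:
#   u = os.uname()
#   looks_like_solaris(u.sysname, u.release)

print(looks_like_solaris("SunOS", "5.11"))  # → True (Illumos reports these)
print(looks_like_solaris("SunOS", "4.1"))   # → False (old BSD-based SunOS)
print(looks_like_solaris("Linux", "4.18"))  # → False
```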

When OpenSolaris spawned from Solaris, of course the 'uname -s' output had to stay the same. When OpenSolaris became Illumos, the same thing was true. And so today, our OmniOS machines cheerfully tell us that they're running SunOS, an operating system name that is now more than 30 years old. It's a last lingering trace of a company that changed the world.

(In Illumos, this is hard-coded in uts/common/os/vers.c.)

(I was reminded of all of this recently as I was changing one of our fileserver management scripts so that it would refuse to run on anything except our OmniOS fileservers. Checking 'uname -s' is really the correct way to do this, which caused me to actually run it on our OmniOS machines for the first time in a while.)

LingeringSunOSHistory written at 22:58:51; Add Comment


The potential risk to ZFS created by the shift in its userbase

The obvious conclusion we can draw from FreeBSD ZFS's shift to being based on ZFS on Linux is that the center of gravity of open source ZFS development has shifted to ZFS on Linux. FreeBSD ZFS is shifting its upstream because ZFS on Linux is increasingly where new development happens. A less precise but deeper conclusion is that in some sense, the ZFS userbase as a whole is increasingly shifting to Linux.

(It may be that the total number of people using ZFS on FreeBSD and Illumos is higher than the number using ZFS on Linux. But if so, the users of FreeBSD and Illumos ZFS don't seem to drive development in the way that happens with ZFS on Linux. There are many possible explanations for this, of course, because there are many factors involved.)

Unfortunately, I think that this shift creates risk due to a consequence of ZFS on Linux not being GPL-compatible, which is that working with the Linux kernel's ever-changing API could become enough of a problem that ZFS on Linux development could end up coming to a halt. If ZFS as a whole is increasingly dependent on ZFS on Linux development and ZFS on Linux development is partly a hostage to the not particularly friendly views of Linux kernel developers, a problem there affects not just ZFS on Linux but everyone.

At least some of the users of ZFS on Linux are OS-agnostic and are using Linux because they feel it is their best overall choice (that's certainly our view for our new fileservers). If ZoL ground to a halt and stopped being viable, they'd switch over to other OSes (FreeBSD for us). But the question there is whether they'd bring the development resources with them, so that people would do and fund more ZFS development on non-Linux platforms, or if some of the work that is currently being done to ZoL would basically just evaporate.

(One of the things I wonder about is if ZFS on Linux's existence outside the kernel has helped drive development of it. If you're interested in a new ZoL feature, you can develop or fund it on your own schedule, then probably deploy it even on old Linux kernels such as those found in the various 'long term support' Linuxes (because ZoL as a whole is careful to support them). In FreeBSD and Illumos, I think you're much more locked to the overall kernel version and thus the OS version; if you fund a new ZFS feature, you could wind up needing to deploy a whole new OS release to get what you want. If nothing else, this is a longer deployment cycle.)

Given this risk, I definitely hope that people keep doing ZFS work (and funding it) outside of ZFS on Linux. But unfortunately I'm not sure how likely that is. It's quite possible that there are plenty of other people like us, making a pragmatic choice that ZoL is currently good enough and means we can use Linux instead of having to build out and maintain another OS.

(Illumos is not viable for us today for various reasons. FreeBSD might have been, but we didn't try to evaluate it since Linux worked well enough and we already run lots of other Linux machines.)

ZFSUserbaseShiftRisk written at 00:33:54; Add Comment


Some things on ZFS's per-pool performance statistics

Somewhat to my surprise, I recently found out that ZFS has had basic per-pool activity and performance statistics for a while (they're old enough that they're in our version of OmniOS, which is not exactly current these days). On sufficiently modern versions of ZFS (currently only the development version of ZFS on Linux), you can even get a small subset of these per-pool stats for each separate dataset, which may be useful for tracking activity. To be clear, these are not the stats that are made visible through 'zpool iostat'; these are a separate set of stats that are visible through (k)stats, and which can be picked up and tracked by at least some performance metrics systems on some platforms.

(Specifically, I know that Prometheus's host agent can collect them from ZFS on Linux. In theory you could add collecting them on OmniOS to the agent, perhaps using my Go kstat package, but someone would have to do that programming work and I don't know if the Prometheus people would accept it. I haven't looked at other metrics host agents to see if they can collect this information on OmniOS or other Illumos systems. I suspect that no host agent can collect the more detailed 'zpool iostat' statistics because as far as I know there's no documented API to obtain them programmatically.)

The best description of the available stats is in the kstat(3kstat) manpage, in the description of 'I/O Statistics' (for ZFS on Linux, you want to see here). Most of the stats are relatively obvious, but there are two important things to note about them. First, as far as I know (and can tell), these are for direct IO to disk, not for user level IO. This shows up in particular for writes if you have any level of vdev redundancy. Since we use mirrors, the amount written is basically twice the user level write rate; if a user process is writing at 100 MB/sec, we see 200 MB/sec of writes. This is honest but a little bit confusing for keeping track of user activity.

The second is that the rtime, wtime, rlentime, and wlentime stats do not distinguish between read and write IO, but between 'run' and 'wait' IO. This is best explained in a comment from the Illumos kstat manpage:

A large number of I/O subsystems have at least two basic "lists" of transactions they manage: one for transactions that have been accepted for processing but for which processing has yet to begin, and one for transactions which are actively being processed (but not done). For this reason, two cumulative time statistics are defined here: pre-service (wait) time, and service (run) time.

I don't know enough about the ZFS IO queue code to be able to tell you if a ZFS IO being in the 'run' state only happens when it's actively submitted to the disk (or other device), or if ZFS has some internal division. The ZFS code does appear to consider IO 'active' at the same point as it makes it 'running', and based on things I've read about the ZIO scheduler I think this is probably at least close to 'it was issued to the device'.

(On Linux, 'issued to the device' really means 'put in the block IO system'. This may or may not result in it being immediately issued to the device, depending on various factors, including how much IO you're allowing ZFS to push to the device.)

If you have only moderate activity, it's routine to have little or no 'wait' (w) activity or time, with most of the overall time for request handling being in the 'run' time. You will see 'wait' time (and queue sizes) rise as your ZFS pool IO level rises, but before then you can have an interesting and pretty pattern where your average 'run' time is a couple of milliseconds or higher but your average 'wait' time is in the microseconds.

In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse or wuse fields, ie no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there's ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
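On ZFS on Linux these counters show up as a small text file per pool (in my understanding, /proc/spl/kstat/zfs/&lt;pool&gt;/io). A sketch of pulling them into a dict; the exact layout (one header line, one line of column names, one line of values) is my reading of the kstat I/O format, so treat it as an assumption, and the sample data below is invented:

```python
def parse_pool_io(text):
    """Parse the text of a ZFS on Linux per-pool 'io' kstat file
    into a dict of counter name -> value. Assumes the usual kstat
    I/O layout: a header line, column names, then values."""
    lines = text.strip().splitlines()
    names = lines[1].split()
    values = [int(v) for v in lines[2].split()]
    return dict(zip(names, values))

# An invented sample in the expected layout:
sample = """\
11 3 0x00 1 80 1234 5678
nread nwritten reads writes wtime wlentime wupdate rtime rlentime rupdate wcnt rcnt
1048576 2097152 100 50 1000 2000 3000 400000 800000 5000 0 1
"""
stats = parse_pool_io(sample)
print(stats["nread"], stats["rtime"])  # → 1048576 400000
```

To get rates or averages you would sample this twice and work with the deltas, since all of these are cumulative counters.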

For ZFS datasets, you currently get only reads, writes, nread, and nwritten. For datasets, the writes appear to be user-level writes, not low level disk IO writes, so they will track closely with the amount of data written at the user level (or at the level of an NFS server). As I write this here in early 2019, these per-dataset stats aren't in any released version of even ZFS on Linux, but I expect to see them start showing up in various places (such as FreeBSD) before too many years go by.

PS: I regret not knowing that these stats existed some time ago, because I probably would have hacked together something to look at them on our OmniOS machines, even though we never used 'zpool iostat' very much for troubleshooting for various reasons. In general, if you have multiple ZFS pools it's always useful to be able to see what the active things are at the moment.

ZFSPerPoolStats written at 00:19:31; Add Comment


Some notes on ZFS prefetch related stats

For reasons beyond the scope of this entry, I've recently been looking at ARC stats, in part using a handy analysis program by Richard Elling. This has gotten me looking again at ZFS (k)stats related to prefetching, which I touched on before in my entry on some basic ZFS ARC statistics and prefetching. So here are some notes on what I think these mean or might mean.

To be able to successfully prefetch at all, ZFS needs to recognize and predict your access pattern. The extent to which it can do this is visible in the ZFS DMU zfetchstats kstats; zfetchstats:hits is the number of reads that matched a prediction stream, while zfetchstats:misses is the number of reads that did not match one. If zfetchstats:hits is low, there are two possible reasons; you could have a mostly random IO pattern, or you could have too many different sequential streams reading from the same file(s) at once. In theory there is a kstat that counts 'you had too many streams for this file and I couldn't create a new one', zfetchstats:max_streams. In practice this seems to be useless and you can't really tell these cases apart, because as far as I can tell even random access to files creates ZFS prefetch streams.

Every file can have at most zfetch_max_streams streams (default 8), and even streams that have never matched any reads aren't removed for zfetch_min_sec_reap seconds (default 2). So when you start doing random reads to a new file, as far as I can tell your first 8 random reads will immediately create 8 DMU prefetch streams and then every read after that will still try to create a new one but fail because you've hit the maximum stream count for the file. Since the streams are maxed out, each new random read will increment both zfetchstats:misses (since it doesn't match any existing stream) and zfetchstats:max_streams (since the file has 8 streams). Every two seconds, your current streams expire and you get 8 new ones from the next 8 random reads you do.

(This theory matches the numbers I see when I produce a flood of random reads to a large file with ioping. Our ZFS fileservers do show a slowly growing difference between the two stats.)
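If you want to watch this ratio yourself, the arithmetic is simple; here is a hedged sketch (the counter values are invented, and as with all kstats you want to work with deltas between two samples rather than the raw cumulative counts):

```python
def zfetch_hit_pct(hits, misses):
    """Percentage of reads that matched a DMU prefetch stream."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# Invented zfetchstats:hits and zfetchstats:misses values,
# sampled twice some time apart; the deltas give recent behavior.
h1, m1 = 1_000_000, 250_000
h2, m2 = 1_060_000, 255_000
print(round(zfetch_hit_pct(h2 - h1, m2 - m1), 1))  # → 92.3
```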

As discussed in my previous entry, the ARC 'prefetch hits' statistics count only how many prefetch reads were found in the ARC instead of needing to be read from disk. A high prefetch ARC hit rate means that you're doing sequential reads of files that are already in the ARC and staying there (either because you've read them before or because you recently wrote them). A low prefetch ARC hit rate means that this isn't happening, but there are multiple reasons for this. Obviously, one cause is that your sequential re-reads are collectively too large for your ARC and so at least some of them are being evicted before you re-read them. Another cause is that you're mostly not re-reading things, at least not very soon; most of the time you read a file once and then move on.

If you know or believe that your workload should be in the ARC, a low ARC prefetch hit rate or more exactly a high ARC prefetch miss count is a sign that something is off, since it means that your prefetch reads are not finding things in the ARC that you expect to be there. A low ARC prefetch hit rate is not necessarily otherwise a problem.

I believe that there are situations where you will naturally get a low ARC prefetch hit rate. For example, if you perform a full backup of a number of ZFS filesystems with tar, I would expect a lot of ARC prefetch misses, since it's unlikely that you can fit all the data from all of your filesystems into ARC. And this is in fact the pattern we see on our ZFS fileservers during our Amanda backups. On the other hand, you should see a lot of ARC demand data hits, since prefetching itself should be very successful (and this is also the pattern we see).

ZFSPrefetchStatsNotes written at 23:56:26; Add Comment


FreeBSD ZFS will be changing to be based on ZFS on Linux

The big news today is contained in an email thread called The future of ZFS in FreeBSD (via any number of places), where the crucial summary is:

[...] This state of affairs has led to a general agreement among the stakeholders that I have spoken to that it makes sense to rebase FreeBSD's ZFS on [ZFS on Linux]. [...]

This is not a development that I expected, to put it one way. Allan Jude had some commentary on OpenZFS in light of this that's worth reading for interested parties. There's also a discussion of this with useful information on lobse.rs, including this comment from Joshua M. Clulow mentioning that Illumos developers are likely to be working to upstream desirable ZFS on Linux changes.

(I had also not known that Delphix is apparently moving away from Illumos.)

I've used ZFS on Linux for a fairly long time now, and we settled on it for our next generation of fileservers, but all during that time I was under the vague impression that ZFS on Linux was the red-headed stepchild of ZFS implementations. Finding out that it's actually a significant source of active development is a bit of a surprise. On the one hand, it's a pleasant surprise, and it definitely helps me feel happy about our choice of ZFS on Linux. On the other hand it makes me feel a bit sad about Illumos, because upstream ZFS development in Illumos now appears to be yet another thing that is getting run over by the Linux juggernaut. Even for Linux people, that is not entirely good news.

(And, well, we're part of that Linux juggernaut, since we chose ZFS on Linux instead of OmniOS CE. For good reasons, sadly.)

On the other hand, if this leads to more feature commonality and features becoming available sooner in FreeBSD, Illumos, and ZFS on Linux, it will be a definite win. Even if it simply causes everyone to wind up with an up to date test suite with shared tests for common features that run regularly, that's probably a general win (and certainly a good thing).

(One thing that it may do in the short term is propagate FreeBSD's support for ZFS sending TRIM commands to disks into ZFS on Linux, which would be useful for us in the long run.)

PS: The fundamental problem with ZFS on Linux remains that ZFS on Linux will almost certainly never be included in the Linux kernel. This means that it will always be a secondary filesystem on Linux. As a result I think it's not really a great thing if ZFS on Linux becomes the major source of ZFS development; the whole situation with Linux and ZFS is always going to be somewhat precarious, and things could always go wrong. It's simply healthier for ZFS overall if significant ZFS development continues on a Unix where ZFS is a first class citizen, ideally fully integrated into the kernel and fully supported.

ZFSFreeBSDChangesBase written at 00:17:23; Add Comment


Some basic ZFS ARC statistics and prefetching

I've recently been trying to understand some basic ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?

(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)

If you read the arcstat manpage or look at the raw ARC kstats, you'll very soon notice things like the 'prefetch hits percentage' arcstat field and the prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different from what you might expect.

The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.

(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)

All of this gives us four combinations of reads; demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:

  • The ARC demand hit rate (for data and metadata together) is probably the most important thing for whether the ARC is giving you good results, although this partly depends on the absolute volume of demand reads and a few other things. Demand misses mean that programs are waiting on disk IO.

  • The breakdown of *_miss kstats will tell you why ZFS is reading things from disk. You would generally like this to be prefetch reads instead of demand reads, because at least things aren't waiting on prefetch reads.

  • The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.

One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.
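In ZFS on Linux's arcstats, the four categories show up as demand_data_hits/_misses, demand_metadata_hits/_misses, and so on. A sketch of computing the breakdown and the demand hit rate from them; the counter values here are invented, and in practice you'd use deltas between two samples:

```python
def arc_read_breakdown(ks):
    """Given a dict of arcstats counters, return the (hits, misses)
    pair for each of the four read categories plus the overall
    demand hit percentage (data and metadata together)."""
    cats = {}
    for kind in ("demand_data", "demand_metadata",
                 "prefetch_data", "prefetch_metadata"):
        cats[kind] = (ks[kind + "_hits"], ks[kind + "_misses"])
    dhits = ks["demand_data_hits"] + ks["demand_metadata_hits"]
    dmisses = ks["demand_data_misses"] + ks["demand_metadata_misses"]
    total = dhits + dmisses
    demand_hit_pct = 100.0 * dhits / total if total else 0.0
    return cats, demand_hit_pct

# Invented sample counters:
ks = {"demand_data_hits": 900, "demand_data_misses": 100,
      "demand_metadata_hits": 450, "demand_metadata_misses": 50,
      "prefetch_data_hits": 30, "prefetch_data_misses": 300,
      "prefetch_metadata_hits": 20, "prefetch_metadata_misses": 40}
cats, pct = arc_read_breakdown(ks)
print(round(pct, 1))  # → 90.0
```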

Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:

  • demand_hit_predictive_prefetch counts demand reads that found data in the ARC from prefetch reads. This sounds exactly like what we want, but experimentally it doesn't seem to come anywhere near fully accounting for hits on prefetched data; I see low rates of it when I am also seeing a 100% demand hit rate for sequentially read data that was not previously in the ARC.

  • sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.

My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.

If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.

If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.

One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.

(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)

ZFSARCStatsAndPrefetch written at 23:17:12; Add Comment


Today I (re-)learned that top's output can be quietly system dependent

I'll start with a story that is the background. A few days ago I tweeted:

Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.

(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)

I was wrong about this being disk-limited, as it turned out, and then Allan Jude had the winning suggestion:

Try adding '-c aes128-gcm@openssh.com' to your SSH invocation.

See also: <pdf link>

(If you care about 10G+ SSH, you want to read that PDF.)

This made a huge difference, giving me basically 1G wire speeds for my ZFS transfers. But that difference made me scratch my head, because why was switching SSH ciphers making a difference when ssh wasn't CPU-limited in the first place? I came up with various theories and guesses, until today I had a sudden terrible suspicion. The result of testing and confirming that suspicion was another tweet:

Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.

(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)

You see, the reason that I had confidently known that SSH wasn't CPU-limited on the sending machine, which was one of our OmniOS fileservers, is that I had run top and seen that the ssh process was only using 25% of the CPU. Case closed.

Except that OmniOS top and Linux's top report CPU usage percentages differently. On Linux, CPU percentage is relative to a single CPU, so 25% is a quarter of one CPU, 100% is all of it, and over 100% is a multi-threaded program that is using up more than one CPU's worth of CPU time. On OmniOS, the version of top we're using comes from pkgsrc (in what is by now a very old version), and that version reports CPU percentage relative to all CPUs in the machine. Our OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually 'all of a single CPU'. In other words, I was completely wrong about the sending ssh not being CPU-limited. Since ssh was CPU limited after all, it's suddenly no surprise why switching ciphers sped things up to basically wire speed.
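The conversion that tripped me up is trivial once you know which convention your top uses; here it is as a sketch (the flag name is my own labeling, not anything from either top):

```python
def pct_of_one_cpu(reported_pct, ncpus, whole_machine=True):
    """Convert a top CPU figure into 'percent of a single CPU'.
    whole_machine=True for tops (like our old OmniOS pkgsrc one)
    that report percent of all CPUs combined; False for tops
    (like Linux's) that already report percent of one CPU."""
    return reported_pct * ncpus if whole_machine else reported_pct

# The ssh process on a 4-CPU OmniOS fileserver:
print(pct_of_one_cpu(25.0, 4))         # → 100.0: all of one CPU
print(pct_of_one_cpu(25.0, 4, False))  # → 25.0: a quarter of one CPU
```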

(Years ago I established that the old SunSSH that OmniOS was using back then was rather slow, but then later we upgraded to OpenSSH and I sort of thought that I could not worry about SSH speeds any more. Well, I was wrong. Of course, nothing can beat not doing SSH at all but instead using, say, mbuffer. Using mbuffer also means that you can deliberately limit your transfer bandwidth to leave some room for things like NFS fileservice.)

PS: There are apparently more versions than you might think. On the FreeBSD 10.4 machine I have access to, top reports CPU percentage in the same way Linux does (100% is a single-threaded process using all of one CPU). Although both the FreeBSD version and our OmniOS version say they're the William LeFebvre implementation and have similar version numbers, apparently they diverged significantly at some point, probably when people had to start figuring out how to make the original version of top deal with multi-CPU machines.

TopCPUPercentDifference written at 23:01:36; Add Comment


Some views on more flexible (Illumos) kernel crash dumps

In my notes about Illumos kernel crash dumps, I mentioned that we've now turned them off on our OmniOS fileservers. One of the reasons for this is that we're running an unsupported version of OmniOS, including the kernel. But even if we were running the latest OmniOS CE and had commercial support, we'd do the same thing (at least by default, outside of special circumstances). The core problem is that our needs conflict with what Illumos crash dumps want to give us right now.

The current implementation of kernel crash dumps basically prioritizes capturing complete information. This shows up in various ways in the implementation, starting with the assumption that if crash dumps are configured at all, you have set up enough disk space to hold the full crash dump level you've set in dumpadm; given that assumption, it's sensible not to bother checking whether the dump will fit, and to treat failure to fit as an unusual situation that isn't worth doing much special about. Another manifestation is that there is no overall time limit on how long the crash dump will run, which is perfectly sensible if the most important thing is to capture the crash dump for diagnosis.

But, well, the most important thing is not always to capture complete diagnostic information. Sometimes you need to get things back into service before too long, so what you really want is to capture as much information as possible while still returning to service in a certain amount of time. Sometimes you only have so much disk space available for crash dumps, and you would like to capture whatever information can fit in that disk space, and if not everything fits it would be nice if the most important things were definitely captured.

All of this makes me wish that Illumos kernel crash dumps wrote certain critical information immediately, at the start of the crash dump, and then progressively extended the information in the crash dump until they either ran out of space or ran out of time. What do I consider critical? My first approximation would be, in order, the kernel panic, the recent kernel messages, the kernel stack of the panicking kernel process, and the kernel stacks of all processes. Probably you'd also want anything recent from the kernel fault manager.

The current Illumos crash dump code does have an order for what gets written out, and it does put some of this stuff into the dump header, but as far as I can tell the dump header only gets written at the end. It's possible that you could create a version of this incremental dump approach by simply writing out the incomplete dump header every so often (appropriately marked as incomplete). There's also a 'dump summary' that gets written at the end that appears to contain a bunch of this information; perhaps a preliminary copy could be written at the start of the dump, then overwritten at the end if the dump is complete. Generally what seems to take all the time (and space) with our dumps is the main page writing stuff, not a bunch of preliminary stuff, so I think Illumos could definitely write at least one chunk of useful information before it bogs down. And if this needs extra space in the dump device, I would gladly sacrifice a few megabytes to have such useful information always present.

(It appears that the Illumos kernel already keeps a lot of ZFS data memory out of kernel crash dumps, both for the ARC and for in-flight ZFS IO, so I'm not sure what memory the kernel is spending all of its time dumping in our case. Possibly we have a lot of ZFS metadata, which apparently does go into crash dumps. See the comments about crash dumps in abd.c and zio.c. For the 'dump summary', see the dump_summary, dump_ereports, and dump_messages functions in dumpsubr.c.)

PS: All of this is sort of wishing from the sidelines, since our future is not with Illumos.

ImprovingCrashDumps written at 23:46:22


Some notes about kernel crash dumps in Illumos

I tweeted:

On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).

We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).

The resulting Twitter discussion got me to take a look into the current state of the code for this in Illumos, and I wound up discovering some potentially interesting things. First off, dump settings are not auto-loaded or auto-saved by the kernel in some magical way; instead dumpadm saves all of your configuration settings in /etc/dumpadm.conf and then sets them during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read its description of the -u argument.
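Inspecting or changing the saved configuration thus goes through dumpadm itself, which will rewrite /etc/dumpadm.conf for you. For instance (the flags are from the dumpadm manpage; the zvol name is just an example):

```shell
# Show the current crash dump configuration, as saved in
# /etc/dumpadm.conf and applied at boot by svc:/system/dumpadm:default.
dumpadm

# Save only kernel pages and turn off savecore-side compression.
dumpadm -c kernel -z off

# Point dumps at a different dump device (an example zvol name).
dumpadm -d /dev/zvol/dsk/rpool/dump
```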

Next, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore will write compressed dumps; it has nothing to do with how the kernel writes out the crash dump to your configured crash device. In fact, dumpadm has no direct control over basically any of that process; if you want to change things about the kernel dump process, you need to set kernel variables through /etc/system (or 'mdb -k').
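As a sketch of the two approaches, using dump_timeout (a variable I'll come back to later) purely as an example, not a recommendation to change it:

```shell
# Inspect a dump-related kernel variable on the live kernel;
# /D prints it as a 32-bit decimal value.
echo 'dump_timeout/D' | mdb -k

# Change it on the fly with mdb in kernel-write mode (-kw);
# the 0t prefix marks a decimal value.
echo 'dump_timeout/W 0t120' | mdb -kw

# Or set it permanently via a line in /etc/system, applied at boot:
#   set dump_timeout = 120
```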

The kernel writes crash dumps in multiple steps. If your console shows the message 'dumping to <something>, offset NNN, contents: <...>', then you've at least reached the start of writing out the crash dump. If you see updates of the form 'dumping: MM:SS N% done', the kernel has reached the main writeout loop and is writing out pages of memory, perhaps excessively slowly. As far as I can tell from the code, crash dumps don't abort when they run out of space on the dump device; they keep processing things and just throw all of the work away.

As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single-threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.

(As a corollary of kernel panic dumps always compressing with at least lzjb, you probably should not have compression turned on for your dump zvol (which I believe is the default).)
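Checking and fixing this is an ordinary ZFS property operation (assuming the usual 'rpool/dump' dump zvol; depending on your Illumos version, configuring a zvol as the dump device may already have forced compression off for you):

```shell
# See whether the dump zvol will try to compress already-compressed data:
zfs get compression rpool/dump

# Turn it off; lzjb-compressed dump data won't compress much further.
zfs set compression=off rpool/dump
```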

I'm far from convinced that single-threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might have problems even with modern fast hardware.

If you examine the source of dumpsubr.c, you'll discover a tempting variable dump_timeout that's set to 120 (seconds) and described as 'timeout for dumping pages'. This comment is a little bit misleading, as usual; what it really means is 'timeout for dumping a single set of pages'. There is no limit on how long the kernel is willing to keep writing out pages for, provided that it makes enough progress within 120 seconds. In our case this is unfortunate, since we'd be willing to spend a few minutes to gather a bit of crash information but not anything like what a kernel dump appears to take on our machines.

(The good news is that if you run out of space on your dump device, the dump code is at least smart enough to not spend any more time trying to compress pages; it just throws them away right away. You might run out of space because you're taking a panic dump from a ZFS fileserver with 128 GB of RAM and putting it on an 8GB dump zvol that is part of a rpool that lives on 80 GB SSDs, where a full-sized kernel dump almost certainly can't even be saved by savecore.)
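The mismatch in that parenthetical example is easy to put numbers on. A back-of-the-envelope sketch, using the sizes from above (and remembering that on a ZFS fileserver, 'kernel pages only' can still be most of RAM):

```shell
# How much compression would a dump of (nearly) all of RAM need in
# order to fit on the dump zvol at all?
ram_gb=128     # fileserver RAM
dump_gb=8      # dump zvol size

needed_ratio=$((ram_gb / dump_gb))
echo "needs ${needed_ratio}:1 compression to fit"
```

lzjb is nowhere near a 16:1 compressor on typical kernel memory, so running out of dump space here is the expected outcome, not a surprise.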

PS: To see that the default is still a single-threaded crash dump, you need to chase through the code to dumphdr.h and the various DUMP_PLAT_*_MINCPU definitions, all of which are set to 0. Due to how the code is structured, this disables multi-threaded dumps entirely.

Sidebar: The theoretical controls for multi-threaded dumps

If you set dump_plat_mincpu to something above 0, then if you have 'sufficiently more' CPUs than this, you will get parallel bzip2 compression; below that you will get parallel lzjb. Since parallel compression is disabled by default in Illumos, this may or may not actually still work, even if you don't run into any actual bugs of the sort that caused it to be disabled in the first place. Note that bzip2 is not fast.

The actual threshold of 'enough' depends on the claimed maximum transfer size of your disks. For dumping to zvols, it appears that this maximum transfer size is always 128 KB, which uses a code path where the breakpoint between parallel lzjb and parallel bzip2 is just dump_plat_mincpu; if you have that many CPUs or more, you get bzip2. This implies that you may want to set dump_plat_mincpu to a nice high number so that you get parallel lzjb all the time.
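If you wanted to experiment with this anyway (accepting the risks of deliberately-disabled code), the change might look like the following; the specific number is arbitrary, chosen only to be larger than any plausible CPU count:

```shell
# A threshold far above your real CPU count keeps you on the parallel
# lzjb side of the split, since bzip2 needs ncpus >= dump_plat_mincpu.
echo 'dump_plat_mincpu/W 0t1024' | mdb -kw

# Or persistently, via a line in /etc/system:
#   set dump_plat_mincpu = 1024
```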

IllumosCrashDumpNotes written at 01:37:56
