2016-02-17
The many load averages of Unix(es)
It turns out that the meaning of 'load average' on Unixes is rather more divergent than I thought it was. So here's the story as I know it.
In the beginning, by which I mean 3 BSD, the load average counted how many processes were runnable or in short term IO wait (in a decaying average). The BSD kernel computed this count periodically by walking over the process table; you can see this in for example 4.2BSD's vmtotal() function.
Unixes that were derived from 4 BSD carried this definition of load average forward, which primarily meant SunOS and Ultrix. Sysadmins using NFS back in those days got very familiar with the 'short term IO wait' part of load average, because if your NFS server stopped responding, all of your NFS clients would accumulate lots of processes in IO waits (which were no longer so short term) and their load averages would go skyrocketing to absurd levels.
(Technically the definition was not 'IO wait', it was 'any process that was sleeping with a non-interruptible priority'. In theory this was only processes in IO wait. Yes, this included processes waiting on NFS IO on NFS mounts marked intr; it's complicated.)
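To make the mechanics concrete, here is my minimal sketch of the classic scheme in C (with made-up simplified structures, not the actual 4.2BSD code): every few seconds, count the processes that are runnable or in a non-interruptible sleep, then fold that count into three exponentially decaying one, five, and fifteen minute averages.

    #include <math.h>

    #define NPROC    100   /* hypothetical fixed-size process table */
    #define PZERO    25    /* sleeps below this priority are non-interruptible */
    #define INTERVAL 5     /* seconds between samples */

    enum pstate { SRUN, SSLEEP, SIDL };
    struct proc { enum pstate p_stat; int p_pri; };
    struct proc proc_table[NPROC];

    double avenrun[3];     /* the 1, 5 and 15 minute load averages */

    /* Called every INTERVAL seconds: count 'active' processes by
     * walking the whole process table, then fold the count into
     * three exponentially decaying averages. */
    void loadav(void)
    {
        static const double periods[3] = { 60.0, 300.0, 900.0 };
        int i, nrun = 0;

        for (i = 0; i < NPROC; i++) {
            /* runnable, or in a non-interruptible ('short term
             * IO wait') sleep */
            if (proc_table[i].p_stat == SRUN ||
                (proc_table[i].p_stat == SSLEEP &&
                 proc_table[i].p_pri < PZERO))
                nrun++;
        }
        for (i = 0; i < 3; i++) {
            double cexp = exp(-(double)INTERVAL / periods[i]);
            avenrun[i] = avenrun[i] * cexp + nrun * (1.0 - cexp);
        }
    }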
When Linux implemented the load average (which it did very early, as 0.96c has it), it copied this traditional definition. Linux load average has been 'run queue plus (short term) IO wait' ever since, although the exact mechanics of how it was computed have changed over time to be more efficient.
(Once multiprocessor systems and large numbers of processes showed up, people soon worked out that 'iterate over the entire process table' was not necessarily a good idea.)
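The obvious alternative, and roughly what modern kernels do in some form, is to stop walking the process table entirely and instead keep a running count that gets adjusted whenever a process changes state. A sketch of the idea (an illustration only, not any particular kernel's actual code):

    #include <stdatomic.h>

    /* One shared count of 'active' processes (runnable plus whatever
     * sleepers the system counts); real kernels would likely use
     * per-CPU counters instead of a single atomic. */
    static atomic_int nr_active;

    /* Called from the scheduler at state transitions. */
    void proc_becomes_active(void)   { atomic_fetch_add(&nr_active, 1); }
    void proc_becomes_inactive(void) { atomic_fetch_sub(&nr_active, 1); }

    /* The periodic sampling code now just reads the count instead of
     * iterating over every process. */
    int sample_active_count(void)
    {
        return atomic_load(&nr_active);
    }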
When Sun executed the great SunOS 4 to Solaris transition, I'm not quite sure what happened to their definition of the load average. At least some sources claim that it was immediately redefined to drop IO waits (which would mean that an NFS client would maintain a low load average even when the NFS server went away). Exactly how Solaris counted up 'runnable processes' apparently changed somewhat in Solaris 10; in theory I think this is not supposed to affect the results materially. By Solaris 10 it seems definite that Solaris does not count processes in IO wait in the load average, and this has been carried forward into Illumos and derivatives.
(I looked at the Illumos source code very briefly and determined that it was complicated enough that it was too much work to understand it for this entry.)
The situation with the *BSDs is messy. I haven't thoroughly investigated historical source trees, but I can't imagine that 386BSD and then NetBSD people immediately changed the 4BSD definition of the load average to drop processes in IO wait. Certainly the FreeBSD 2.0 sources I have handy access to (via this Github repo) still count processes in IO wait. Then at some point things get very tangled and some of the available information I could find seems to be wrong (eg). The net result is that FreeBSD split apart from OpenBSD and NetBSD in load average calculations, and OpenBSD and NetBSD are somewhat divergent from each other.
As far as I can decode it, the current state of load average calculations on the three is:
- In FreeBSD, load average counts only runnable processes, not processes in IO wait. The count of runnable processes is maintained on the fly by the scheduler in code that I'm not going to try to link to.
- In NetBSD, kern/kern_synch.c's sched_pstats() function counts both runnable processes and all sleeping processes that have slept for less than one second so far (at least that's what I think l_slptime is counting).
- In OpenBSD, uvm/uvm_meter.c's uvm_loadav() function counts both runnable processes and sleeping processes that are in high priority IO wait and have slept for less than one second so far (assuming I understand p_slptime correctly). This is fewer sleeping processes than NetBSD seems to include; there's a sketch of my reading of the two conditions right after this list.
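Boiled down, my reading of the difference between the NetBSD and OpenBSD conditions looks like this, expressed in C with hypothetical simplified structures and a made-up threshold constant (the real code checks more process states than this, and I may well be wrong in detail):

    #define PZERO 22   /* stand-in threshold for 'high priority sleep' */

    enum { ST_RUN = 1, ST_SLEEP = 2 };

    struct lwp  { int l_stat, l_slptime; };         /* NetBSD-ish */
    struct proc { int p_stat, p_slptime, p_pri; };  /* OpenBSD-ish */

    /* NetBSD: runnable, or any sleep that has so far lasted under
     * a second. */
    int netbsd_counts(const struct lwp *l)
    {
        return l->l_stat == ST_RUN ||
               (l->l_stat == ST_SLEEP && l->l_slptime < 1);
    }

    /* OpenBSD: the short sleep must additionally be a high priority
     * (IO wait style) sleep; this is the extra restriction compared
     * to NetBSD. */
    int openbsd_counts(const struct proc *p)
    {
        return p->p_stat == ST_RUN ||
               (p->p_stat == ST_SLEEP && p->p_slptime < 1 &&
                p->p_pri <= PZERO);
    }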
(Don't ask me what Dragonfly BSD does here.)
This is all very messy and contradicts some things knowledgeable OpenBSD people have said. Mind you, they said them in 2009, but on the other hand I can't imagine that OpenBSD would have dropped and then restored counting processes in IO wait (and I can't find any sign of that in their CVS logs).
(I don't know what any other commercial Unixes do here, including Mac OS X. Energetic people are encouraged to do their own research.)
The real moral is that the exact definition of 'load average' is a mess today. If you think you care about load average, you should find out how much IO waiting and general sleeping it includes on your system, ideally via actual experimentation.
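For the runnable side, one crude experiment (a quick test program I'm sketching here, nothing official) is to fork a known number of CPU spinners and see whether your load average climbs by about that much:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Fork N busy-loop children and leave them running; if your
     * system's load average counts runnable processes the way you
     * expect, it should climb by roughly N over the next minutes. */
    int main(int argc, char **argv)
    {
        int i, n = (argc > 1) ? atoi(argv[1]) : 4;

        for (i = 0; i < n; i++) {
            if (fork() == 0)
                for (;;)    /* child: pure CPU spin */
                    ;
        }
        printf("started %d spinners; watch 'uptime' for a few minutes\n", n);
        pause();            /* parent idles; kill the children by hand */
        return 0;
    }

The IO wait side is much harder to test deliberately; you more or less need processes stuck in uninterruptible sleep, for example against an unresponsive NFS server. And remember to kill the spinners afterward.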
2016-02-08
Old Unix filesystems and byte order
It all started with a tweet by @JeffSipek:
illumos/solaris UFS don't use a fixed byte order. SPARC produces structs in BE, x86 writes them out in LE. I was happier before I knew this.
As they say, welcome to old time Unix filesystems. Solaris UFS is far from the only filesystem defined this way; in fact, most old time Unix filesystems are probably defined in host byte order.
Today this strikes us as crazy, but that's because we now exist in a quite different hardware environment than the old days had. Put simply, we now exist in a world where storage devices both can be moved between dissimilar systems and are. In fact, it's an even more radical world than that; it's a world where almost everyone uses the same few storage interconnect technologies and interconnects are common between all sorts of systems. Today we take it for granted that how we connect storage to systems is through some defined, vendor neutral specification that many people implement, but this was not at all the case originally.
(There are all sorts of storage standards: SATA, SAS, NVMe, USB, SD cards, and so on.)
In the beginning, storage was close to 100% system specific. Not only did you not think of moving a disk from a Vax to a Sun, you probably couldn't; the entire peripheral interconnect system was almost always different, from the disk-to-host cabling to the kind of backplane that the controller boards plugged into. Even as some common disk interfaces emerged, larger servers often stayed with faster proprietary interfaces and proprietary disks.
(SCSI is fairly old as a standard, but it was also a slow interface for a long time so it didn't get used on many servers. As late as the early 1990s it still wasn't clear that SCSI was the right choice.)
In this environment of system specific disks, it was no wonder that Unix kernel programmers didn't think about byte order issues in their on disk data structures. Just saying 'everything is in host byte order' was clearly the simplest approach, so that's what people by and large did. When vendors started facing potential bi-endian issues, they tried very hard to duck them (I think that this was one reason endian-switchable RISCs were popular designs).
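As a toy illustration of what 'everything is in host byte order' means for on-disk data (a made-up structure, nothing like real UFS code):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* A made-up 'superblock' with two fields. */
    struct toy_superblock {
        uint32_t magic;
        uint32_t block_count;
    };

    int main(void)
    {
        struct toy_superblock sb = { 0x11223344, 1000 };
        unsigned char buf[sizeof sb];
        size_t i;

        /* 'everything is in host byte order': just copy the struct
         * to the 'disk' buffer as-is. */
        memcpy(buf, &sb, sizeof sb);

        /* A little endian machine prints '44 33 22 11', a big endian
         * one '11 22 33 44'; the on-disk bytes differ even though the
         * logical contents are identical. A fixed endian filesystem
         * would instead convert every field explicitly (with htonl()
         * or the like) on each read and write. */
        for (i = 0; i < 4; i++)
            printf("%02x ", buf[i]);
        printf("\n");
        return 0;
    }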
In theory, vendors could have decided to define their filesystems as being in their current endianness before they introduced another architecture with a different endianness (here Sun, with SPARC, would have defined UFS as BE). In practice I suspect that no vendor wanted to go through filesystem code to make it genuinely fixed endian. It was just simpler to say 'UFS is in host byte order and you can't swap disks between SPARC Solaris and x86 Solaris'.
(Since vendors did learn, genuinely new filesystems were much more likely to be specified as having a fixed and host-independent byte order. But filesystems like UFS trace their roots back a very long way.)