
2015-04-03

Understanding the (original) meaning of Unix load average

Most everyone knows the load average, and almost every system administrator knows that it's not necessarily a useful measure today. The problem is that the load average merges two measurements: it counts both how many processes are trying to run and how many processes are currently waiting for IO to finish. This means that a high load average by itself tells you very little; do you have a lot of processes using the CPU, a lot of processes doing IO, a few processes doing very slow IO, or perhaps a bunch of processes waiting for an NFS server to come back to life?
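(If you want the load average programmatically instead of through uptime or top, there's getloadavg(3), which exists on both the BSDs and Linux and is declared in <stdlib.h>. Here's a minimal C sketch; the error handling is deliberately simple.)

    /* Print the 1, 5, and 15 minute load averages via getloadavg(3). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double loads[3];

        /* getloadavg() fills in up to three values and returns how
           many it actually obtained, or -1 on failure. */
        if (getloadavg(loads, 3) == -1) {
            fprintf(stderr, "getloadavg failed\n");
            return 1;
        }
        printf("load averages: %.2f %.2f %.2f\n",
               loads[0], loads[1], loads[2]);
        return 0;
    }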

As it happens, I think there is an explanation for what the load average is supposed to mean, and originally did mean, back in the early days of Unix. Put simply, it was a measure of how soon your process would get to run.
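(To make 'how soon' concrete: on a single-CPU machine that round-robins among its runnable processes, a newly ready process that finds a load of L ahead of it can expect to wait roughly L scheduling quanta before it first runs, and to get about 1/(L+1) of the CPU after that. That's only back-of-the-envelope arithmetic, but it's the sense in which a single number can answer 'how soon do I run?'.)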

To see how this makes sense, let's rewind time to the VAXes that 3BSD ran on when the load average was added to Unix. On those machines, two things were true: relative to CPU speed, IO was faster than it is now, and the CPU was simply slow in general, so that doing almost anything took appreciable compute time. This means that a process waiting on 'fast' disk IO would probably have its IO complete before your process did much computation, and it would then use enough CPU time dealing with the IO results that you would notice, even if its processing was relatively simple. So runnable processes are directly contending for the CPU right now, and 'busy' processes in IO wait will be contending for it before you can get very much done (and the kernel will soon be doing some amount of computing on their behalf). Both sorts of processes will delay yours, and so merging them together into a single 'load average' figure makes sense.

This breaks down (and broke down) as CPUs became much faster both in absolute terms and relative to IO. Today a process doing only basic IO processing uses only tiny amounts of CPU time, and a CPU-needing process of yours will hardly notice or be delayed by it. This makes the number of processes in IO wait basically meaningless as a predictor of how soon a ready process can run and how much of the CPU it will get; you can do a lot of computing before their slow IO completes, and when it does complete they often need almost no CPU time before they go back to waiting on IO again. There's almost no chance that a 'busy' process in IO wait will block your process from getting a CPU slice.
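(On a modern Linux machine you can look at the two populations separately and see this for yourself. What follows is a Linux-specific sketch, not anything portable: it counts processes in the runnable ('R') and uninterruptible IO-wait ('D') states by scanning /proc, assumes the usual /proc/<pid>/stat layout, and is deliberately naive about process names that contain a ')'.)

    /* Count runnable (R) and uninterruptible IO-wait (D) processes
       by scanning /proc on Linux. */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        int running = 0, diskwait = 0;

        if (proc == NULL) {
            perror("opendir /proc");
            return 1;
        }
        while ((de = readdir(proc)) != NULL) {
            char path[300], state;
            FILE *fp;

            /* Only numeric directory names are processes. */
            if (!isdigit((unsigned char)de->d_name[0]))
                continue;
            snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
            fp = fopen(path, "r");
            if (fp == NULL)
                continue;       /* the process may have just exited */
            /* Field 3 of /proc/<pid>/stat is the one-letter state;
               skip the pid and the '(comm)' field before it. */
            if (fscanf(fp, "%*d (%*[^)]) %c", &state) == 1) {
                if (state == 'R')
                    running++;
                else if (state == 'D')
                    diskwait++;
            }
            fclose(fp);
        }
        closedir(proc);
        printf("runnable: %d, in IO wait: %d\n", running, diskwait);
        return 0;
    }

Run this on a machine with an alarming load average and the split will usually tell you right away which of the situations from the start of this entry you're actually in.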

(As a side note, including some indicator of disk load in the 'load average' also makes a lot of sense in a memory-constrained environment where a great deal of what you type at your shell prompt requires reading things off disk, which is what early BSD VAXes usually were. A 100% idle CPU doesn't help you if you're waiting to read the test(1) binary in from disk in the face of 10 other processes all trying to do their own disk IO.)

unix/LoadAverageMeaning written at 04:04:02

