2007-11-13
Why vfork() got created (part 2)
Although fork() theoretically copies the address space of the parent
process to create the child's address space, Unix normally does not
actually make a real copy. Instead it marks everything as copy on
write and then waits for
either the parent or the child to dirty a page, whereupon it copies
only as much as it has to. This makes forks much faster, which is very
important since things in Unix fork a lot.
(Disclaimer: this is the traditional implementation for paged virtual memory. I don't know what V7 did, since it only had swapping.)
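As a small illustration of how the copy semantics look from a program's point of view (a minimal sketch, not specific to any particular Unix): the child's write dirties a page and so gets a private copy of it, and the parent never sees the change, even though nothing was physically copied at fork() time.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int value = 1;          /* lives in a page shared copy-on-write after fork() */
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: the write dirties the page, so the kernel copies it now. */
            value = 2;
            printf("child sees  %d\n", value);   /* prints 2 */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        /* Parent: its copy of the page was never touched by the child's write. */
        printf("parent sees %d\n", value);       /* prints 1 */
        return 0;
    }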
At least, that is the theory and the ideal; in this case it was not the practice.
The second and probably real reason that vfork()
exists is that when UC Berkeley was adding paged virtual memory to
Unix, they couldn't get copy on write working on their cheap Vax 11/750s (although it did work on the
higher end 11/780s). To avoid a fairly bad performance hit on what were
already low end machines, they added vfork() (which doesn't require
copy on write to run fast) and modified sh and csh to use it.
The specific problem was apparently bugs in the 750 microcode that caused writes to read-only pages in the stack to not fault correctly. One of the reasons the 750 was cheaper than the 780 was that it did a number of things in microcode that the 780 did in hardware, which explains why 780s didn't have this problem.
(My source for the details is a message from John Levine here.)
2007-11-11
Why vfork() got created (part 1)
The fork() system call presents a problem for strict virtual memory
overcommit. Because it duplicates a process's virtual
address space, strictly correct accounting requires that the child
process be instantly charged for however much committed address space the
parent process has; if this puts the system over the commit limit, the
fork() should fail.
At the same time, in practice most fork() child processes don't use
very much of the memory that they're being charged for; almost all the
time they touch only a few pages and then throw everything away by
calling exec(). Failing such a fork() because the strict accounting
says you should is very irritating to people; it is the system using
robot logic. But at the time of the fork(), the
system has no way of telling parsimonious 'good' child processes that
will promptly exec() from child processes that are going to stick
around and do a lot of work and use a lot of their committed address
space.
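For illustration, here is a sketch of that typical pattern (using 'ls -l' purely as a stand-in for whatever program is being started); the child touches almost nothing before it throws its entire copied address space away in the exec():

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();     /* the child is charged for the parent's full commitment */
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: touches a handful of pages at most, then replaces itself. */
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);         /* only reached if the exec fails */
        }
        waitpid(pid, NULL, 0);  /* parent carries on */
        return 0;
    }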
The answer (I will not call it a solution) is vfork(), which is a
fork() that doesn't require the kernel to charge the child process for
any committed address space. This allows large processes to spawn other
programs without running into artificial limits, at the cost of being a
hack itself.
(In order to make this work the child can't actually get any pages of
its own; instead it uses the parent's pages, and the parent process is
frozen until the child exits or exec()s.)
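This is why vfork() has to be used so carefully. A minimal sketch of the only safe pattern: the child does essentially nothing but exec() or _exit(), because it is borrowing the parent's pages and the parent is suspended until it does one or the other.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = vfork();    /* no address space is copied or charged */
        if (pid < 0) {
            perror("vfork");
            return 1;
        }
        if (pid == 0) {
            /* Child: running on the parent's pages; it must not modify
               anything (beyond the pid variable) or return, only exec or _exit. */
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);
        }
        /* The parent was frozen until the exec()/_exit(); now it resumes. */
        waitpid(pid, NULL, 0);
        return 0;
    }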
Actually this is a bit of a lie, because it is only half the reason
that vfork() exists. But that's another entry, because this one is
already long enough.
2007-11-07
Understanding the virtual memory overcommit issue
First, a definition: the committed address space is the total amount of virtual memory that the kernel might have to supply real memory pages for, either in swap space or in RAM. In other words, this is how much memory the kernel has committed to supplying to programs if they all decide to touch all of the memory they've requested from the kernel.
(This is less than the total amount of virtual memory used in the system, since some things, like program code and memory mapped files, don't need swap space.)
In the old days, how much committed address space a Unix kernel would give out was simple but limited: the amount of swap space you had. When people started moving beyond this, they ran into two issues:
- the kernel needs some amount of memory for itself in order to
operate.
- programs do not necessarily use all of the memory that they've
requested from the kernel, especially when the request is sort
of implicit (such as when a process
fork()s).
If we could ignore both issues, the committed address space the kernel should give out would be simple: the sum of physical memory plus swap space. Since we can't, the question is how much we should adjust that number for each issue. Unfortunately both issues are unpredictable and depend on what you're doing with your system and on how cautious you need to be about never hitting a situation where the kernel has overcommitted memory, so there is no universal answer, only heuristics and tuning knobs, and the various Unixes have wound up making different choices.
Note that these are choices. While people sometimes argue back and forth about them, the overall problem is a hard one and there is no universal right answer for what committed address space limit to use and how to behave in the face of overcommit.
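To make one of those knobs concrete (this is Linux's strict overcommit mode, vm.overcommit_memory=2, used here purely as an illustration, not as what every Unix does): the commit limit is swap space plus a tunable percentage of RAM.

    #include <stdio.h>

    int main(void) {
        /* Sketch of a Linux-style strict commit limit:
           CommitLimit = swap + overcommit_ratio% of RAM.
           The sizes here are made up for the example. */
        unsigned long long ram_bytes  = 2ULL << 30;   /* 2 GB of RAM */
        unsigned long long swap_bytes = 4ULL << 30;   /* 4 GB of swap */
        unsigned int overcommit_ratio = 50;           /* the Linux default */

        unsigned long long commit_limit =
            swap_bytes + ram_bytes * overcommit_ratio / 100;

        printf("commit limit: %llu bytes (%.1f GB)\n",
               commit_limit, commit_limit / (double)(1ULL << 30));
        return 0;
    }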
Sidebar: the results of running out
If the kernel runs into its limit on committed address space it starts
giving errors when asked to do operations that require more, so programs
stop being able to do things like malloc() memory or fork() or start
new processes with big writeable data areas. If the kernel discovers
that it has overcommitted itself it is generally forced to start killing
processes when they try to use pages of memory that the kernel can't
actually supply at the moment.
(Sometimes the kernel winds up in a worse situation, if for example it needs memory for its own use but can't get it. This can lock up an entire machine instead of just killing processes.)
Programmers and system administrators generally prefer the former to the
latter; it is a lot easier to cope with malloc() failing than random
processes getting abruptly killed. At the same time they want failures
to only happen when the system is genuinely out of memory, not when the
kernel is just being conservative.
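To make the first failure mode concrete, here is a minimal sketch (in C, under no particular Unix) of what coping with it looks like: you check the calls that can fail when the kernel refuses to commit more address space, and degrade instead of dying.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* malloc() hands back NULL instead of memory when the kernel
           refuses to commit more address space. */
        size_t want = 1UL << 30;              /* 1 GB, just as an example */
        char *buf = malloc(want);
        if (buf == NULL) {
            fprintf(stderr, "malloc of %zu bytes failed: %s\n",
                    want, strerror(errno));
            return 1;                         /* or shrink the request, shed work, ... */
        }

        /* fork() can fail the same way (ENOMEM or EAGAIN). */
        pid_t pid = fork();
        if (pid < 0) {
            fprintf(stderr, "fork failed: %s\n", strerror(errno));
            free(buf);
            return 1;
        }
        if (pid == 0)
            _exit(0);
        waitpid(pid, NULL, 0);
        free(buf);
        return 0;
    }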