2010-12-19
A tale of memory allocation failure
I have a message for developers, and not the one you might think. Here is my message:
Sometimes memory allocation failures are not because the system is out of memory.
There is a subset of programmers who do not really want to deal with memory allocation failures, and to be honest I can't entirely disagree with them; recovering from memory allocation failures in a complex program that also wants to use as little memory as possible is quite non-trivial and rather contorted. Apart from just exiting on the spot, these non-coping strategies generally involve either retrying the allocation (usually after a short delay) or failing the immediate operation without doing anything fundamental to change the program's state, so that it will almost immediately try to allocate more memory again.
(For instance, you might try to respond to network input being available by allocating some data structures and then reading the data in and processing it. When your initial allocation fails, you immediately drop back into the main loop without doing anything more, where you once again find that there's network input ready to be read and the whole thing repeats.)
These are sort of sensible coping strategies if the system is out of memory; either you're likely to get memory back soon, or the system is likely to crash. But, well, not all memory allocation failures are because the system is out of memory. And that's where my story comes in.
After upgrading our login servers to Ubuntu 10.04 recently, we started seeing a number of locked-up GNU Emacs processes; clearly abandoned by their users, they were sitting there burning CPU endlessly. Some work with strace showed that they were looping around trying to do memory allocations, which kept failing. Oh, and all the processes were 3 GB in size. On 32-bit machines.
(Technically they were just under 3 GB of address space; most of the tools I used to look at them rounded this up.)
Given that on conventional 32-bit x86 Linux kernels, processes can only have 3 GB of virtual address space, those allocations were never going to succeed. They were not failing because of any transient condition; they were failing because GNU Emacs had allocated all of the memory it ever could. Not doing something to fundamentally change the situation just sent the program into an endless loop.
(I'm honestly impressed that GNU Emacs managed to get so close to 3 GB of address space, given various holes in the address space.)