2007-09-10
A small drawback of 64-bit machines
It used to be that on a large-memory 32-bit compute server, no single process could run away and exhaust all of the machine's memory. On an eight or sixteen gigabyte machine, a process ran into the limit on per-process virtual address space (3 gigabytes or so at most) well before it could run the machine itself into the ground.
(On a large enough machine you could survive a couple of such processes.)
This is no longer true on large-memory 64-bit compute servers, as I noticed today; it is now possible for a single runaway process to take even a 32 gigabyte machine into an out-of-memory situation. I am now a bit nervous about what the kernel's OOM handling will do to us, since these are shared machines that can be running jobs for several people at once.
(Adding more swap space is probably not the solution.)
I have to say that the kernel OOM log messages are a beautiful case of messages being logged for developers instead of sysadmins. As a sysadmin, I would like a list of the top few processes by OOM score, with information like their start time, total memory usage, and their recent growth in memory usage if that information is available.
(And on machines with lots of CPUs, the kernel OOM messages get rather verbose. I hate to think what they will be like on our 16-core machine.)
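As a rough illustration of the sort of list I would like, here is a minimal Python sketch that ranks processes by the kernel's own OOM score. It assumes a Linux /proc that exposes `oom_score`, `comm`, and a `VmRSS` line in `status` for each process (`comm` in particular only exists on relatively recent kernels). The function names are my own inventions, and this is not what the kernel logs or how it chooses a victim; it also leaves out start time and recent growth.

```python
import os

def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return None

def read_rss_kb(status_path):
    """Pull VmRSS (in kB) out of a /proc/<pid>/status file; 0 if absent."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return 0

def top_oom_candidates(proc_root="/proc", count=5):
    """Return up to `count` (oom_score, pid, rss_kb, name) tuples, highest score first."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        score = read_first_line(os.path.join(base, "oom_score"))
        if score is None or not score.isdigit():
            continue  # process exited, or the file is unreadable
        name = read_first_line(os.path.join(base, "comm")) or "?"
        rss_kb = read_rss_kb(os.path.join(base, "status"))
        found.append((int(score), int(entry), rss_kb, name))
    found.sort(reverse=True)
    return found[:count]

if __name__ == "__main__":
    for score, pid, rss_kb, name in top_oom_candidates():
        print(f"{score:>8} {pid:>7} {rss_kb:>10} kB  {name}")
```

Run periodically from cron, something like this would at least leave a record of who was heading for trouble before the kernel stepped in.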