A problem with strict memory overcommit in practice
We've used Linux's strict overcommit mode on our compute servers for years to limit memory usage to the (large) amount of RAM on the machines, on the grounds that compute jobs that allocate tens or hundreds of GB of RAM generally intend to use it. Recently we had a series of incidents where compute machines were run out of memory, and unfortunately these incidents illustrated a real problem with relying on strict overcommit handling. You see, in practice, strict overcommit kills random processes. Oh, not literally, but the effect is generally the same.
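For reference, strict overcommit here means vm.overcommit_memory set to 2, where the kernel refuses to hand out address space beyond a fixed commit limit (swap plus vm.overcommit_ratio percent of RAM, or an absolute vm.overcommit_kbytes). A rough sketch of the knobs involved; the ratio value is only an illustrative number, not necessarily what we use:

    # strict ('never overcommit') accounting
    sysctl -w vm.overcommit_memory=2
    # commit limit = swap + this percentage of RAM (95 is just an example)
    sysctl -w vm.overcommit_ratio=95
    # the resulting limit and current usage are visible in /proc/meminfo
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo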
Quite a lot of programs need memory as they run, and when they can't get it there is often not very much they can do except exit. This is especially the case for shells and shell scripts; even if the shell can get enough memory to do its own internal work, the moment it tries to fork() and exec() some external program, it's going to fail, and there goes your shell script. All sorts of things can start failing, including things that shouldn't fail. Do you have a 'trap "rm $lockfile" EXIT' in your shell script? Well, rm isn't a builtin, so your lock file is probably not going away.
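To make the lock file case concrete, here is a minimal sketch of the pattern I mean (the lock file path is made up for illustration):

    #!/bin/sh
    lockfile=/var/run/ourjob.lock        # hypothetical lock file path

    # Remove the lock file when the script exits, however it exits.
    trap 'rm -f "$lockfile"' EXIT
    echo $$ > "$lockfile"

    # ... the real work, running various external programs ...

    # When the script exits, the trap has to fork() and exec() /bin/rm.
    # If the machine has hit its commit limit at that moment, the fork()
    # fails and the stale lock file is left behind.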
Strict overcommit is blind here; it denies more memory to any and all processes that want it after the limit is hit, no matter what they are, how big they are, or what their allocation pattern has been. And in practice the amount of available memory doesn't just go to zero and stay there; instead, some things will try to allocate memory, fail, and exit, releasing their current memory, which allows other programs to get more memory for a bit and maybe finish and exit, and then the whole cycle bounces around. This oscillation is what creates the randomness, where if you ask at the right time you get memory and get to survive but if you ask at the wrong time you fail and die.
In our environment, processes failing unpredictably turns out to be surprisingly disruptive to routine ongoing maintenance tasks, like password propagation and managing NFS mounts. I'd say that our scripts weren't designed for things exploding half-way through, but I'm not sure there's any way to design scripts to survive in a setting where any command and any internal Bourne shell operation might fail at any time.
(We are taking steps to mitigate some of these problems.)
Despite our recent experience with this, we're probably going to stick with strict overcommit on our compute servers; it has saved us in other situations, and there really isn't a better alternative. The OOM killer has its own problems and is probably the wrong answer for our compute servers, especially in its current form.
(There is an argument that can be made that the OOM killer's apparent current approach of 'kill the biggest process' is actually a fair one on a shared compute server, but I think it's at least questionable.)
PS: We saw this last fall, but at the time we didn't fully appreciate the potential problems or how hard they can be to deal with.