A problem with strict memory overcommit in practice

February 5, 2019

We've used Linux's strict overcommit mode on our compute servers for years to limit memory usage to the (large) amount of RAM on the machines, on the grounds that compute jobs that allocate tens or hundreds of GB of RAM generally intend to use it. Recently we had a series of incidents where compute machines were run out of memory, and unfortunately these incidents illustrated a real problem with relying on strict overcommit handling. You see, in practice, strict overcommit kills random processes. Oh, not literally, but the effect is generally the same.
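
(For concreteness, strict overcommit is Linux's vm.overcommit_memory mode 2, with the commit limit controlled by vm.overcommit_ratio or vm.overcommit_kbytes. The following is an illustrative sketch of the sort of settings involved, not our actual configuration:)

    # Refuse allocations once the commit limit is reached (strict accounting),
    # instead of overcommitting and relying on the OOM killer later.
    sysctl -w vm.overcommit_memory=2

    # The commit limit is swap plus this percentage of RAM; the 100 here is
    # purely illustrative.
    sysctl -w vm.overcommit_ratio=100

    # The resulting limit and the current commitment are visible in /proc/meminfo:
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo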

Quite a lot of programs need memory as they run, and when they can't get it there is often not very much they can do except exit. This is especially the case for shells and shell scripts; even if the shell can get enough memory to do its own internal work, the moment it tries to fork() and exec() some external program, it's going to fail, and there goes your shell script. All sorts of things can start failing, including things that shouldn't fail. Do you have a 'trap "rm $lockfile" EXIT' in your shell script? Well, rm isn't a builtin, so your lock file is probably not going away.
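
(As an illustration of how this bites, here is a sketch of the usual lock file cleanup idiom, with a purely hypothetical lock file path; the cleanup at the end is exactly what can quietly fail under memory pressure:)

    #!/bin/sh
    # Hypothetical lock file path, used purely for illustration.
    lockfile=/var/tmp/example.lock

    # The usual cleanup idiom: remove the lock file when the script exits.
    trap 'rm -f "$lockfile"' EXIT

    echo $$ > "$lockfile"

    # ... work that may still be running when the machine is out of memory ...

    # When the script exits, the shell must fork() and exec() /bin/rm to run
    # the trap, because rm is not a builtin. If that fork() is denied memory
    # under strict overcommit, the trap quietly does nothing and the lock
    # file is left behind.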

Strict overcommit is blind here; it denies more memory to any and all processes that want it after the limit is hit, no matter what they are, how big they are, or what their allocation pattern has been. And in practice the amount of available memory doesn't just go to zero and stay there; instead, some things will try to allocate memory, fail, and exit, releasing their current memory, which allows other programs to get more memory for a bit and maybe finish and exit, and then the whole cycle bounces around. This oscillation is what creates the randomness, where if you ask at the right time you get memory and get to survive but if you ask at the wrong time you fail and die.

In our environment, processes failing unpredictably turns out to be surprisingly disruptive to routine ongoing maintenance tasks, like password propagation and managing NFS mounts. I'd say that our scripts weren't designed for things exploding half-way through, but I'm not sure there's any way to design scripts to survive in a setting where any command and any internal Bourne shell operation might fail at any time.

(We are taking steps to mitigate some problems.)

Despite our recent experience with this, we're probably going to stick with strict overcommit on our compute servers; it has saved us in other situations, and there really isn't a better alternative. The OOM killer has its own problems and is probably the wrong answer for our compute servers, especially in its current form.

(There is an argument that can be made that the OOM killer's apparent current approach of 'kill the biggest process' is actually a fair one on a shared compute server, but I think it's at least questionable.)

PS: We saw this last fall, but at the time we didn't fully appreciate the potential problems or how hard they might be to deal with.
